---
## **Ceph Block Storage**
### **Post Mortem of OTG0054944**
-----------
**Dan van der Ster, IT-ST**
*WLCG Ops Coordination, 5 March 2020*
---
### Impact on Grid Services (1)
* The CERN Batch Service was unavailable for job submission. Longer-running jobs eventually failed after losing contact with the batch schedds for too long. ([OTG:0054948](https://cern.service-now.com/service-portal/view-outage.do?n=OTG0054948))
* The CVMFS stratum-0 was unavailable for some hours ([OTG:005496](https://cern.service-now.com/service-portal/view-outage.do?n=OTG005496))
* HammerCloud was degraded ([OTG:0054991](https://cern.service-now.com/service-portal/view-outage.do?n=OTG0054991))
* The WLCG monitoring infrastructure was degraded, recovering the backlog some hours later ([OTG:0054946](https://cern.service-now.com/service-portal/view-outage.do?n=OTG0054946))
----
### Impact on Grid Services (2)
* Multiple VO services, both local ones and those supporting the experiments' distributed production and analysis, were unavailable.
  * The impact on VO services was summarised in the weekly WLCG operations meeting.
* Other non-Grid services were also affected; they are listed on the IT-SSB Service Incident page.
  * Notably lxplus, AIADM, acron, some AFS volumes, GitLab and file access in Indico.
---
### Timeline of Feb 20, 2020
* 10:11 The Ceph Block Storage cluster detects 25% of its OSDs down.
  * OpenStack Cinder volume I/Os are stopped.
* 10:20 Investigations start:
  * Crashed `ceph-osd` daemons are unable to restart.
  * Log files show CRC errors in the `osdmap`.
* 12:30 Checking with the community: IRC + mailing list + bug tracker.
* 13:30 Problem + workaround understood (at a basic level).
* ~17:30 Critical power service restored.
* ~19:00 Main room service restored.
---
### Ceph Services at CERN
* Ceph provides the underlying storage for the OpenStack cloud and several other CERN services:
  * OpenStack block storage (1) and CephFS (2)
  * S3 object storage (3)
  * RADOS object storage for CASTOR (4) and CTA (5)
  * HPC filesystems (6) and storage for Kopano email (7, 8)
  * (each number (#) denotes a distinct cluster)
* The incident of Feb 20 affected cluster (1) only.
---
### Root Cause
* After one week of studying the outage in collaboration with upstream, a bug was found in the **LZ4 compression library**:
  * Versions < 1.8.2 can incorrectly compress data
    * in extremely rare conditions
    * involving fragmented input data buffers
    * compressed via the LZ4 streaming API (see the sketch below)
* Note: CentOS 7 ships with lz4 version 1.7.5.
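
For illustration, a minimal sketch of the implicated API usage: a logical input split across non-contiguous fragments, fed one by one through lz4's streaming API. This is not Ceph's compressor code, and on its own it does not trigger the corruption, which requires rare buffer layouts; fragment contents and sizes are purely illustrative.

```c
/* Sketch of the implicated lz4 streaming code path (illustrative).
 * Earlier fragments serve as dictionary for later ones; in liblz4
 * < 1.8.2 this path can, for rare buffer layouts, emit corrupt
 * compressed data. Build with: cc sketch.c -llz4 */
#include <lz4.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical fragmented input, standing in for a fragmented buffer list. */
    const char *fragments[] = { "osdmap fragment A, ", "fragment B, ", "fragment C" };
    const int nfrag = 3;

    LZ4_stream_t *stream = LZ4_createStream();  /* one context across all fragments */
    char dst[512];
    int off = 0;

    for (int i = 0; i < nfrag; i++) {
        int srclen = (int)strlen(fragments[i]);
        /* Prior source fragments must stay valid: they act as the dictionary. */
        int n = LZ4_compress_fast_continue(stream, fragments[i], dst + off,
                                           srclen, (int)sizeof(dst) - off, 1);
        if (n <= 0) {
            fprintf(stderr, "compression failed at fragment %d\n", i);
            LZ4_freeStream(stream);
            return 1;
        }
        off += n;  /* in real use, each block's size must be recorded for decompression */
    }
    printf("compressed %d fragments into %d bytes\n", nfrag, off);
    LZ4_freeStream(stream);
    return 0;
}
```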
----
* LZ4 compression was first enabled in this cluster in Dec 2019.
* On Feb 20, the compression bug corrupted the `osdmap` central cluster metadata, causing the daemons to throw a CRC error and exit.
* Compression has since been disabled (on Feb 25).
* Other data is not impacted, as it is compressed from large contiguous input buffers.
----
* We are highly confident in the root cause and workaround:
  * we can reproduce the identical bit corruption observed during the incident.
* Ceph and LZ4 developers are working with us on a final workaround for Ceph when running with a buggy LZ4.
---
### Next Steps
* The incident demonstrated the high impact of a block storage outage.
* IT-ST is discussing how to offer several availability zones (analogous to the CERN cloud AZs):
  * this would allow applications to be designed with higher levels of redundancy.
---
### Information for WLCG Ceph Users
* While lz4 is in question, it is advisable to disable compression or switch to an alternative algorithm (example below):
  * `bluestore_compression_mode = none`
  * `bluestore_compression_algorithm = snappy`
* snappy, zstd and zlib were all tested and do not exhibit the same `osdmap` corruption.
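
For example, via the centralized config database (a sketch assuming a Mimic-or-later cluster; on older releases, set the same options in the `[osd]` section of `ceph.conf` and restart the OSDs):

```bash
# Disable BlueStore compression entirely...
ceph config set osd bluestore_compression_mode none
# ...or keep compression but switch the algorithm away from lz4:
ceph config set osd bluestore_compression_algorithm snappy
```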
---
### References
* Ceph bug tracker: https://tracker.ceph.com/issues/39525
* Ceph pull request: https://github.com/ceph/ceph/pull/33584
* Technical post mortem: https://codimd.web.cern.ch/p/HJGnq3HVI#/
---
# ?
---
{"type":"slide","slideOptions":{"transition":"slide","theme":"cern5"},"slideNumber":true,"title":"2020-03-05 - Ceph Post Mortem","tags":"presentation, WLCG"}