---
## **Ceph Block Storage**
### **Post Mortem of OTG0054944**
-----------
**Dan van der Ster, IT-ST**
*WLCG Ops Coordination, 5 March 2020*
---
### Impact on Grid Services (1)
* The CERN Batch Service was unavailable for job submission. Longer-running jobs eventually failed after losing contact with the batch schedds for too long. ([OTG:0054948](https://cern.service-now.com/service-portal/view-outage.do?n=OTG0054948))
* The CVMFS stratum-0 was unavailable for some hours ([OTG:005496](https://cern.service-now.com/service-portal/view-outage.do?n=OTG005496))
* HammerCloud was degraded ([OTG:0054991](https://cern.service-now.com/service-portal/view-outage.do?n=OTG0054991))
* The WLCG monitoring infrastructure was degraded, recovering the backlog some hours later ([OTG:0054946](https://cern.service-now.com/service-portal/view-outage.do?n=OTG0054946))
----
### Impact on Grid Services (2)
* Multiple VO services, both local ones and those supporting the experiments' distributed production and analysis, were unavailable.
  * The impact on VO services was summarised in the weekly WLCG operations meeting.
* Other non-Grid services were also affected; they are listed on the IT-SSB Service Incident page.
  * Notably lxplus, AIADM, acron, some AFS volumes, GitLab and file access in Indico.
---
### Timeline of Feb 20, 2020
* 10:11 The Ceph Block Storage cluster detects 25% of its OSDs down.
  * OpenStack Cinder volume I/Os are stopped.
* 10:20 Investigations start:
  * Crashed `ceph-osd` daemons are unable to restart.
  * Log files show CRC errors in the `osdmap`.
* 12:30 Checking with the community: IRC + mailing list + bug tracker.
* 13:30 Problem + workaround understood (at a basic level).
* ~17:30 Critical power service restored.
* ~19:00 Main room service restored.
---
### Ceph Services at CERN
* Ceph provides the underlying storage for the OpenStack cloud and several other CERN services:
  * OpenStack block storage (1) and CephFS (2)
  * S3 object storage (3)
  * RADOS object storage for CASTOR (4) and CTA (5)
  * HPC filesystems (6) and storage for Kopano email (7, 8)
  * (each number (#) denotes a distinct cluster)
* The incident of Feb 20 affected cluster (1) only.
---
### Root Cause
* After one week of studying the outage in collaboration with upstream, a bug was found in the **LZ4 compression library**:
  * Versions < 1.8.2 can incorrectly compress data
    * in extremely rare conditions
    * involving fragmented input data buffers
    * compressed via the LZ4 streaming API (see the sketch below)
* Note: CentOS 7 ships with lz4 version 1.7.5.
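
For illustration, a minimal sketch of the implicated API usage: a logical input split across non-contiguous fragments, fed one by one through lz4's streaming API. This is not Ceph's compressor code, and on its own it does not trigger the corruption, which requires rare buffer layouts; fragment contents and sizes are purely illustrative.

```c
/* Sketch of the implicated lz4 streaming code path (illustrative).
 * Earlier fragments serve as dictionary for later ones; in liblz4
 * < 1.8.2 this path can, for rare buffer layouts, emit corrupt
 * compressed data. Build with: cc sketch.c -llz4 */
#include <lz4.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical fragmented input, standing in for a fragmented buffer list. */
    const char *fragments[] = { "osdmap fragment A, ", "fragment B, ", "fragment C" };
    const int nfrag = 3;

    LZ4_stream_t *stream = LZ4_createStream();  /* one context across all fragments */
    char dst[512];
    int off = 0;

    for (int i = 0; i < nfrag; i++) {
        int srclen = (int)strlen(fragments[i]);
        /* Prior source fragments must stay valid: they act as the dictionary. */
        int n = LZ4_compress_fast_continue(stream, fragments[i], dst + off,
                                           srclen, (int)sizeof(dst) - off, 1);
        if (n <= 0) {
            fprintf(stderr, "compression failed at fragment %d\n", i);
            LZ4_freeStream(stream);
            return 1;
        }
        off += n;  /* in real use, each block's size must be recorded for decompression */
    }
    printf("compressed %d fragments into %d bytes\n", nfrag, off);
    LZ4_freeStream(stream);
    return 0;
}
```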
----
* LZ4 compression was first enabled in this cluster in Dec 2019.
* On Feb 20, the compression bug corrupted the `osdmap` central cluster metadata, causing the daemons to throw a CRC error and exit.
* Compression has since been disabled (on Feb 25).
* Other data is not impacted, as it is compressed from large contiguous input buffers.
----
* We are highly confident in the root cause and workaround:
  * we can reproduce the identical bit corruption observed during the incident.
* Ceph and LZ4 developers are working with us on a final workaround for Ceph when running with a buggy LZ4.
---
### Next Steps
* The incident demonstrated the high impact of a block storage outage.
* IT-ST is discussing how to offer several availability zones (analogous to the CERN cloud AZs):
  * this would allow applications to be designed with higher levels of redundancy.
---
### Information for WLCG Ceph Users
* While lz4 is in question, it is advisable to disable compression or switch to an alternative algorithm (example below):
  * `bluestore_compression_mode = none`
  * `bluestore_compression_algorithm = snappy`
* snappy, zstd and zlib were all tested and do not exhibit the same `osdmap` corruption.
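
For example, via the centralized config database (a sketch assuming a Mimic-or-later cluster; on older releases, set the same options in the `[osd]` section of `ceph.conf` and restart the OSDs):

```bash
# Disable BlueStore compression entirely...
ceph config set osd bluestore_compression_mode none
# ...or keep compression but switch the algorithm away from lz4:
ceph config set osd bluestore_compression_algorithm snappy
```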
---
### References
* Ceph bug tracker: https://tracker.ceph.com/issues/39525
* Ceph pull request: https://github.com/ceph/ceph/pull/33584
* Technical post mortem: https://codimd.web.cern.ch/p/HJGnq3HVI#/
---
# ?
---
{"type":"slide","slideOptions":{"transition":"slide","theme":"cern5"},"slideNumber":true,"title":"2020-03-05 - Ceph Post Mortem","tags":"presentation, WLCG"}