System testing EOS-CTA using Kubernetes

## System testing service developments using

<img src="https://codimd.web.cern.ch/uploads/upload_1251b647b5911190debf6f1753f9adb7.svg" align="left" class="plain" width=40%><img src="https://codimd.web.cern.ch/uploads/upload_448ce34807f29c178f209b22655923d0.svg" align="right" class="plain" width=26%>

### Update on EOS + CTA Continuous Integration system

[Julien Leduc](mailto:julien.leduc@cern.ch) from **IT ST**orage group

[CERN](http://www.cern.ch)

---

## CERN Tape Archive?

<img src="https://codimd.web.cern.ch/uploads/upload_1bfa8c1cf4315b22ccc431657760f820.png" class="plain" size=60%>

<ul>
  <li class="fragment">CTA is glued to the back of EOS</li>
  <li class="fragment">EOS manages CTA tape files as replicas</li>
  <li class="fragment">CTA contains a catalogue of all tape files</li>
  <li class="fragment">More on CTA tomorrow morning!</li>
</ul>

---

## CTA + EOS developments

Tightly coupled software ⇒ tightly coupled developments

Extensive and systematic testing is paramount to limit regressions

----

## CTA + EOS integration tests (*What?*)

<ul>
  <li class="fragment">Complex situation:</li>
  <ul class="fragment">
    <li>2 distinct software projects</li>
    <ul><li>relying on specific shared developments (xrootd...)</li></ul>
    <li>Several external dependencies per instance: 1 database, 1 tape library, 1 objectstore</li>
  </ul>
</ul>

----

## CTA + EOS integration tests (*Constraints?*)

<ul class="fragment">
  <li>I hate repetitive tasks and I am impatient</li>
  <ul>
    <li>no manual operation → CI</li>
    <li>make it fast</li>
  </ul>
  <li>Other possible use cases?</li>
</ul>

---

## Kubernetes EOS CTA generic instance

<ul>
  <li>Implement a framework based on a single generic docker image.</li>
  <li>Use Kubernetes to build an EOS CTA instance out of it.</li>
  <li>Flexible enough to accommodate any supported resource (database, objectstore, tape library).</li>
  <li>Part of the CTA code repository: CI tests evolve with the tested code.</li>
</ul>

----

## Basic Kubernetes concepts

<img
src="https://codimd.web.cern.ch/uploads/upload_3d760e405412c241100463317f6fe22e.svg" class="plain" width=60%>

----

## EOS CTA generic instance

<img src="https://codimd.web.cern.ch/uploads/upload_a91aec66ccd367d5ec7b43d5ec4252f3.svg" class="plain" height=60%>

---

## USE CASE 1: CTA CI

Implemented in CERN Gitlab instance:

- Implements kubernetes framework on a gitlab runner.
- Resources:
  - external Oracle DB instance
  - external Ceph objectstore
  - MHVTL
- When the instance is ready, run a test that copies 10k files to EOSCTA with `xrdcp`, deletes the disk copies and retrieves the files from tape.

---

## CTA CI

<img src="https://codimd.web.cern.ch/uploads/upload_bcc557bb2e08504c8592f027c6a20837.png" class="fragment plain" width=90%>

<ul>
  <li class="fragment">Build software: CTA RPMs available as artifacts</li>
  <li class="fragment">Build and publish a generic Docker image in gitlab registry</li>
  <ul>
    <li class="fragment">Contains all required versioned software (artifacts) and access to versioned software cache repository for dependencies</li>
  </ul>
  <li class="fragment">Run system tests in single VM <code>kubernetes</code> cluster (specific gitlab-runner)</li>
</ul>

----

<img src="https://codimd.web.cern.ch/uploads/upload_91e651e87a78b4f38c230419a8bafd89.svg" class="plain" width=80%>

----

## Some statistics

3000+ pipelines run since the CI was put in place

<img src="https://codimd.web.cern.ch/uploads/upload_78563856e04bc9a64950c6f151aabe6f.png" class="plain" width=80%>

----

## Bonus: Nightly EOS regression tests

Every night a Gitlab schedule runs these steps:

- run the standard archival test
- upgrade EOS to the latest dev-tagged release
- run the standard archival test against the new EOS version

This allows CTA developers to catch EOS regressions that impact CTA-specific workflows.
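
----

The three nightly steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `nightly_eos_regression` and the two callables `run_archival_test` and `upgrade_eos` are hypothetical placeholders, not part of the CTA CI code; they stand in for whatever the Gitlab schedule actually runs.

```python
from typing import Callable


def nightly_eos_regression(
    run_archival_test: Callable[[], bool],  # hypothetical: returns True when the test passes
    upgrade_eos: Callable[[], None],        # hypothetical: upgrades EOS in the running instance
) -> str:
    """Run test, upgrade EOS, run test again, and classify the outcome."""
    # Step 1: baseline run against the EOS version currently deployed.
    if not run_archival_test():
        # The test already fails before the upgrade, so the failure
        # cannot be attributed to the new EOS release.
        return "baseline-failure"

    # Step 2: upgrade EOS to the release under evaluation.
    upgrade_eos()

    # Step 3: same test again, now against the upgraded EOS.
    if not run_archival_test():
        return "eos-regression"
    return "ok"


if __name__ == "__main__":
    # Stub callables so the sketch runs without any EOS CTA instance.
    print(nightly_eos_regression(lambda: True, lambda: None))
```

Running the same test before and after the upgrade is what lets a failure in the second run be attributed to the new EOS version rather than to CTA itself.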

---

## USE CASE 2: CTA developers

<ul>
  <li>Runs entirely on the developer's laptop:
    <ul>
      <li>Implements kubernetes framework in a VirtualBox CentOS VM</li>
    </ul>
  </li>
  <li>Offline resources: local sqlite DB, local file based objectstore, MHVTL</li>
</ul>

When the instance is ready, run a specific developer test.

----

## Strengths

<ul>
  <li>Quickly deploys a disposable local EOS CTA instance.</li>
  <li>Much shorter learning curve for newcomers, who can focus on their work.</li>
  <ul><li>Best deployment practices included.</li></ul>
  <li>Successfully used for:</li>
  <ul>
    <li class="fragment">Objectstore developments</li>
    <li class="fragment">Database catalogue backend developments (<code>mysql</code>, <code>postgres</code>)</li>
  </ul>
  <li class="fragment">Developers improve CI code for me.</li>
</ul>

---

## USE CASE 3: CTA PPS stress tests

Initially implemented in a dedicated `Puppet`-managed PPS instance to reach 2 GB/s:

<ul>
  <li class="fragment">1 MGM</li>
  <li class="fragment">3 FSTs (750TB of storage)</li>
  <li class="fragment">1 CTA frontend</li>
  <li class="fragment">8 tape servers and associated tape drives for BW stress tests</li>
  <li class="fragment">3 VTL tape servers for rate stress tests</li>
</ul>

----

## PPS instance stress tests

Many issues with VM/Puppet approach:

<ul>
  <li class="fragment">Code changes in EOS CTA often require error-prone manual Puppet manifest changes or manual reconfiguration.</li>
  <li class="fragment">Even with extensive use of <code>rundeck</code>, deploying a CTA release still takes several hours.</li>
</ul>

----

## PPS instance stress tests

Low turnover leads to:

<ul>
  <li class="fragment">Less testing...</li>
  <li class="fragment">More code changes between 2 tests: more deployment errors, more regressions.</li>
  <li class="fragment">Time-consuming PPS babysitting...</li>
  <li class="fragment">Log collection/monitoring of PPS? O(kHz) events/machine?</li>
  <li class="fragment">Reproducibility?</li>
  <li class="fragment">Deployment best practices????
Best case: obscure devops documentation...</li>
</ul>

----

## Here comes the *Beefy system*

<ul>
  <li>Implements kubernetes framework on one hyperconverged server with 16 SSDs:
    <ul>
      <li>Plenty of IOPS for VTL rate tests</li>
      <li>Plenty of bandwidth to model a sizable CTA instance (10 tape servers, 6 FSTs...)</li>
    </ul>
  </li>
  <li>Resources: Oracle DB instance, Ceph objectstore, MHVTL</li>
</ul>

When the instance is ready, run a beefy CI test that copies **1M files to EOS CTA** with `xrdcp`, deletes the disk copies and retrieves the files from tape.

----

<img src="https://codimd.web.cern.ch/uploads/upload_b53ffe846c1d564ea0d8ec67cb4512c3.svg" class="plain" width=80%>

---

## Beefy system stress tests

<ul>
  <li class="fragment">Fast turnover makes it possible to quickly reproduce a bug again and again under various conditions:</li>
  <ul>
    <li class="fragment">Fully automated: will go into the CD step.</li>
    <li class="fragment">Fully reproducible.</li>
  </ul>
  <li class="fragment">Allowed me to successfully track down an exponential performance degradation regression</li>
  <ul>
    <li class="fragment">Identified, fixed and tested in 3 days; it had been there for 2 months.</li>
  </ul>
  <li class="fragment">Allowed me to identify a bug in the frontend that killed queuing performance.</li>
</ul>

---

## Real-life issue tracking/fixing

![](https://codimd.web.cern.ch/uploads/upload_55b530bddd60498a9226964f7e1e696c.png)

----

![](https://codimd.web.cern.ch/uploads/upload_092141e5f70dbd2b32cc0f8078f78566.png)

----

![](https://codimd.web.cern.ch/uploads/upload_d989ea36ae93ddba5f25c1f605ce1210.png)

----

![](https://codimd.web.cern.ch/uploads/upload_7e155aeb5828f47ff0e30db9a91afea7.png)

----

![](https://codimd.web.cern.ch/uploads/upload_89da98dfb686dea44536acf909cd46fd.png)

---

<h1>THE END?</h1>

<ul>
  <li class="fragment">Very powerful approach that addresses and federates all our development/testing use cases</li>
  <li class="fragment">Fast, flexible, isolated and self-contained in the software repository</li>
  <li class="fragment">Reproducible development
environment that allows regression and performance tests</li>
</ul>

<h2 class="fragment">TO DO</h2>

<ul>
  <li class="fragment">Automatic log analysis</li>
  <li class="fragment">Bandwidth performance tests</li>
  <li class="fragment">Evaluate possible production use ☺</li>
</ul>