## System testing service developments using

<img src="https://codimd.web.cern.ch/uploads/upload_1251b647b5911190debf6f1753f9adb7.svg" align="left" class="plain" width=40%><img src="https://codimd.web.cern.ch/uploads/upload_448ce34807f29c178f209b22655923d0.svg" align="right" class="plain" width=26%>

### Update on EOS + CTA Continuous Integration system

[Julien Leduc](mailto:julien.leduc@cern.ch) from **IT ST**orage group

[CERN](http://www.cern.ch)

---

## CERN Tape Archive?

<img src="https://codimd.web.cern.ch/uploads/upload_1bfa8c1cf4315b22ccc431657760f820.png" class="plain" size=60%>

<ul>
<li class="fragment">CTA is glued to the back of EOS</li>
<li class="fragment">EOS manages CTA tape files as replicas</li>
<li class="fragment">CTA contains a catalogue of all tape files</li>
<li class="fragment">More on CTA tomorrow morning!</li>
</ul>

---

## CTA + EOS developments

<p class="fragment">Tightly coupled software <span class="fragment">&rArr; <span style="color:red;">tightly coupled developments</span></span></p>

<p class="fragment"><span class="fragment highlight-blue">Extensive and systematic testing is paramount to limit regressions</span></p>

----

## CTA + EOS integration tests (*What?*)

<ul>
<li class="fragment">Complex situation:</li>
<ul class="fragment">
<li><b style="color:red;">2 distinct software projects</b></li>
<ul><li><b style="color:red;">relying on specific shared developments (xrootd...)</b></li></ul>
<li><b style="color:blue;">Several external dependencies</b> per instance: 1 database, 1 tape library, 1 objectstore</li>
</ul>
</ul>

----

## CTA + EOS integration tests (*Constraints?*)

<ul class="fragment">
<li>I hate <b style="color:red;">repetitive tasks</b> and I am <b style="color:blue;">impatient</b></li>
<ul>
<li>no manual operation <span class="fragment highlight-red">&rarr; CI</span></li>
<li>make it <span class="fragment highlight-blue">fast</span></li>
</ul>
<li><span class="fragment highlight-red">Other possible use cases?</span></li>
</ul>

---

## Kubernetes EOS CTA generic instance

<ul>
<li>Implement a framework based on a <span class="fragment highlight-red">single generic Docker image</span>.</li>
<li>Use <span class="fragment highlight-blue">Kubernetes</span> to build an EOS CTA instance out of it.</li>
<li>Flexible enough to <span class="fragment highlight-red">accommodate any supported resource</span> (database, objectstore, tape library).</li>
<li>Part of the CTA code repository: <span class="fragment highlight-red">CI tests evolve with the tested code</span>.</li>
</ul>

----

## Basic Kubernetes concepts

<img src="https://codimd.web.cern.ch/uploads/upload_3d760e405412c241100463317f6fe22e.svg" class="plain" width=60%>

----

## EOS CTA generic instance

<img src="https://codimd.web.cern.ch/uploads/upload_a91aec66ccd367d5ec7b43d5ec4252f3.svg" class="plain" height=60%>

---

## USE CASE 1: CTA CI

Implemented in the CERN Gitlab instance:
- Implements the Kubernetes framework on a gitlab runner.
- Resources:
    - external Oracle DB instance
    - external Ceph objectstore
    - MHVTL
- Once the instance is ready, a test copies 10k files to EOSCTA with `xrdcp`, deletes the disk copies and retrieves the files from tape.
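
The archival step of such a test can be sketched in Python as a dry run that only builds the `xrdcp <source> <destination>` command lines. The EOS endpoint, directory and file names below are hypothetical placeholders, not the ones used by the actual CI test, and the deletion and retrieval steps are left out.

```python
from typing import List


def build_archive_commands(n_files: int, source: str, dest_dir_url: str) -> List[List[str]]:
    """Return one `xrdcp <source> <destination>` argv list per test file."""
    base = dest_dir_url.rstrip("/")
    # Zero-pad the index so that destination names sort naturally.
    width = len(str(n_files - 1)) if n_files > 0 else 1
    return [
        ["xrdcp", source, f"{base}/test_{i:0{width}d}"]
        for i in range(n_files)
    ]


# Hypothetical source file and EOS destination, for illustration only.
commands = build_archive_commands(
    10_000,
    "/tmp/testfile",
    "root://eos.example.org//eos/test/archive",
)
print(len(commands))
print(" ".join(commands[0]))
```

Each argv list could then be handed to `subprocess.run` on a host where `xrdcp` is installed.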
---

## CTA CI

<img src="https://codimd.web.cern.ch/uploads/upload_bcc557bb2e08504c8592f027c6a20837.png" class="fragment plain" width=90%>

<ul>
<li class="fragment">Build software: CTA RPMs available as <b style="color:blue;">artifacts</b></li>
<li class="fragment">Build and publish a <b style="color:red;">generic Docker image</b> in gitlab registry</li>
<ul>
<li class="fragment">Contains <b style="color:blue;">all required versioned software (artifacts)</b> and access to <b style="color:red">versioned software cache repository</b> for dependencies</li>
</ul>
<li class="fragment">Run <b style="color:red;">system tests</b> in single VM <code>kubernetes</code> cluster (specific gitlab-runner)</li>
</ul>

----

<img src="https://codimd.web.cern.ch/uploads/upload_91e651e87a78b4f38c230419a8bafd89.svg" class="plain" width=80%>

----

## Some statistics

3000+ pipelines have run since the CI was put in place

<img src="https://codimd.web.cern.ch/uploads/upload_78563856e04bc9a64950c6f151aabe6f.png" class="plain" width=80%>

----

## Bonus: Nightly EOS regression tests

Every night a Gitlab schedule runs these steps:
- run the standard archival test
- upgrade EOS to the latest dev tagged release
- run the standard archival test against the new EOS version

This allows CTA developers to catch EOS regressions that impact CTA specific workflows.

---

## USE CASE 2: CTA developers

<ul>
<li>Entirely runs on <span class="fragment highlight-blue">developer laptop</span>:
<ul>
<li>Implements the Kubernetes framework in a <span class="fragment highlight-red">Virtualbox CentOS VM</span></li>
</ul></li>
<li>Offline resources: <span class="fragment highlight-blue">local sqlite DB, local file based objectstore, MHVTL</span></li>
</ul>

Once the instance is ready, run a specific developer test.
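
As described on the slides, the CI and developer use cases deploy the same framework and differ in where it runs and in which resources back the instance. A minimal Python sketch of that idea follows; the profile names, field names and string values are hypothetical and do not reflect the configuration format used in the CTA repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResources:
    """Backing resources of one EOS CTA instance (hypothetical field names)."""
    database: str
    objectstore: str
    tape_library: str


# Resource sets taken from the slides, keyed by illustrative profile names.
PROFILES = {
    "ci": InstanceResources("oracle", "ceph", "mhvtl"),
    "developer": InstanceResources("sqlite", "file", "mhvtl"),
}


def is_offline(resources: InstanceResources) -> bool:
    """True when neither the database nor the objectstore is an external service."""
    return resources.database == "sqlite" and resources.objectstore == "file"


for name, resources in PROFILES.items():
    print(name, resources, "offline" if is_offline(resources) else "external")
```

Keeping the resource choice in one small structure is what lets the same generic image serve both setups.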
----

## Strengths

<ul>
<li>Quickly deploys a <span class="fragment highlight-red">disposable local EOS CTA instance</span>.</li>
<li>Much <span class="fragment highlight-blue">shorter learning curve for newcomers</span>, who can focus on their work.</li>
<ul><li><span class="fragment highlight-red">Best deployment practices included</span>.</li></ul>
<li>Successfully used for:</li>
<ul>
<li class="fragment">Objectstore developments</li>
<li class="fragment">Database catalogue backend developments (<code>mysql</code>, <code>postgres</code>)</li>
</ul>
<li class="fragment">Developers improve CI code for me.</li>
</ul>

---

## USE CASE 3: CTA PPS stress tests

Initially implemented in a dedicated `Puppet` managed PPS instance to reach 2GB/s:

<ul>
<li class="fragment">1 MGM</li>
<li class="fragment">3 FSTs (750TB of storage)</li>
<li class="fragment">1 CTA frontend</li>
<li class="fragment">8 tape servers and associated tape drives for BW stress tests</li>
<li class="fragment">3 VTL tape servers for rate stress tests</li>
</ul>

----

## PPS instance stress tests

Many issues with the VM/Puppet approach:

<ul>
<li class="fragment">Code changes in EOS CTA often require error-prone <b style="color:red;">manual Puppet manifest changes or manual reconfiguration</b>.</li>
<li class="fragment">Even with extensive use of <code>rundeck</code>, deploying a CTA release still takes several hours.</li>
</ul>

----

## PPS instance stress tests

Low turnover leads to:

<ul>
<li class="fragment">Less testing...</li>
<li class="fragment">More code changes between 2 tests: more deployment errors, more regressions.</li>
<li class="fragment">Time-consuming PPS babysitting...</li>
<li class="fragment">Log collection/monitoring of PPS? O(kHz) events/machine?</li>
<li class="fragment">Reproducibility?</li>
<li class="fragment">Deployment best practices????
Best case: obscure devops documentation...</li>
</ul>

----

## Here comes the *Beefy system*

<ul>
<li>Implements the Kubernetes framework on <span class="fragment highlight-blue">one hyperconverged server</span> with <span class="fragment highlight-red">16 SSDs</span>:
<ul>
<li><span class="fragment highlight-red">Plenty of IOPS</span> for VTL rate tests</li>
<li><span class="fragment highlight-red">Plenty of bandwidth</span> to model a sizable CTA instance (10 tape servers, 6 FSTs...)</li>
</ul></li>
<li>Resources: <span class="fragment highlight-blue">Oracle DB instance, Ceph objectstore, MHVTL</span></li>
</ul>

Once the instance is ready, a beefy CI test copies **1M files to EOS CTA** with `xrdcp`, deletes the disk copies and retrieves the files from tape.

----

<img src="https://codimd.web.cern.ch/uploads/upload_b53ffe846c1d564ea0d8ec67cb4512c3.svg" class="plain" width=80%>

---

## Beefy system stress tests

<ul>
<li class="fragment">Fast turnover that makes it possible to quickly reproduce a bug again and again in various conditions:</li>
<ul>
<li class="fragment">Fully automated: <b style="color:red;">will go in CD step</b>.</li>
<li class="fragment">Fully reproducible.</li>
</ul>
<li class="fragment">Allowed me to successfully track down an exponential performance degradation regression</li>
<ul>
<li class="fragment">Identified, fixed and tested in 3 days: <b style="color:red;">had been there for 2 months</b>.</li>
</ul>
<li class="fragment">Allowed me to identify a bug in the frontend that killed queuing performance.</li>
</ul>

---

## Real life Issue tracking/fixing

![](https://codimd.web.cern.ch/uploads/upload_55b530bddd60498a9226964f7e1e696c.png)

----

![](https://codimd.web.cern.ch/uploads/upload_092141e5f70dbd2b32cc0f8078f78566.png)

----

![](https://codimd.web.cern.ch/uploads/upload_d989ea36ae93ddba5f25c1f605ce1210.png)

----

![](https://codimd.web.cern.ch/uploads/upload_7e155aeb5828f47ff0e30db9a91afea7.png)

----

![](https://codimd.web.cern.ch/uploads/upload_89da98dfb686dea44536acf909cd46fd.png)

---
<h1>THE END?</h1> <ul> <li class="fragment">Very powerful approach <font color="blue">addresses and federates all our development/testing use cases</font></li> <li class="fragment">Fast, flexible, isolated and self contained in software repository</li> <li class="fragment"><font color="red">Reproducible development environment</font> that allows regression and performance tests</li> </ul> <h2 class="fragment">TO DO</h2> <ul> <li class="fragment">Automatic log analysis</li> <li class="fragment">Bandwidth performance tests</li> <li class="fragment">Evaluate possible production use &#x263A;</li> </ul>
{"title":"System testing EOS-CTA using Kubernetes","description":"Presentation of CERN Tape Archive CI","slideOptions":{"transition":"slide","theme":"white"}}