## System testing service developments using

<img src="https://codimd.web.cern.ch/uploads/upload_1251b647b5911190debf6f1753f9adb7.svg" align="left" class="plain" width=30%><img src="https://codimd.web.cern.ch/uploads/upload_448ce34807f29c178f209b22655923d0.svg" align="right" class="plain" width=19%>

### EOS + CTA Continuous Integration system

[Julien Leduc](mailto:julien.leduc@cern.ch) from **IT ST**orage group

[CERN](http://www.cern.ch)

---

## Data archiving at CERN

<ul>
<li class="fragment">Ad aeternum storage</li>
<li class="fragment">7 tape libraries, 83 tape drives, 20k tapes</li>
<li class="fragment">Current use: <b style="color:dodgerblue;">330 PB</b></li>
<li class="fragment">Current capacity: <b style="color:coral;">0.7 EB</b></li>
<li class="fragment"><b style="color:red;">Exponentially growing</b></li>
</ul>

<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_95716d3602c009e301c880b0afd4225a.png" data-background-size="80%" -->

---

<h2>Data Archiving at CERN <span class="fragment"><i style="color:blue;">Evolution</i></span></h2>

<ul>
<li class="fragment">EOS + tapes...</li>
<ul>
<li class="fragment">EOS is CERN's strategic storage platform</li>
<li class="fragment">tape is the strategic long-term archive medium</li>
</ul>
<li class="fragment">EOS + tapes = <span class="fragment" style="color:red;">&hearts;</span></li>
<ul>
<li class="fragment">Meet CTA: CERN Tape Archive</li>
<li class="fragment">Streamline data paths, software and infrastructure</li>
</ul>
</ul>

---

## CERN Tape Archive?
<img src="https://codimd.web.cern.ch/uploads/upload_eac32c76dde5a45191434a90d54a4d5a.png" class="plain" width=60%>

<ul>
<li class="fragment">CTA is glued to the back of EOS</li>
<li class="fragment">EOS manages CTA tape files as replicas</li>
<li class="fragment">CTA contains a catalogue of all tape files</li>
</ul>

---

## CTA + EOS developments

<p class="fragment">Tightly coupled software <span class="fragment">&rArr; <span style="color:red;">tightly coupled developments</span></span></p>

<p class="fragment"><span class="fragment highlight-blue">Extensive and systematic testing is paramount to limit regressions</span></p>

----

## CTA + EOS integration tests (*What?*)

<ul>
<li class="fragment">Complex situation:</li>
<ul class="fragment">
<li><b style="color:red;">2 distinct software projects</b></li>
<ul><li><b style="color:red;">relying on specific shared developments (xrootd...)</b></li></ul>
<li><b style="color:blue;">Several external dependencies</b> per instance: 1 database, 1 tape library, 1 objectstore</li>
</ul>
</ul>

----

## CTA + EOS integration tests (*Constraints?*)

<ul class="fragment">
<li>I hate <b style="color:red;">repetitive tasks</b> and I am <b style="color:blue;">impatient</b></li>
<ul>
<li>no manual operation <span class="fragment highlight-red">&rarr; CI</span></li>
<li>make it <span class="fragment highlight-blue">fast</span></li>
</ul>
<li><span class="fragment highlight-red">Other possible use cases?</span></li>
</ul>

---

## Kubernetes EOS CTA generic instance

<ul>
<li>Implement a framework based on a <span class="fragment highlight-red">single generic docker image</span>.</li>
<li>Use <span class="fragment highlight-blue">Kubernetes</span> to build an EOS CTA instance out of it.</li>
<li>Flexible enough to <span class="fragment highlight-red">accommodate any supported resource</span> (database, objectstore, tape library).</li>
<li>Part of the CTA code repository: <span class="fragment highlight-red">CI tests are evolving with the tested
code</span>.</li>
</ul>

----

## Basic Kubernetes concepts

<img src="https://codimd.web.cern.ch/uploads/upload_3d760e405412c241100463317f6fe22e.svg" class="plain" width=60%>

----

## EOS CTA generic k8s instance

<img src="https://codimd.web.cern.ch/uploads/upload_fc9e6f74e0b135d7b4f6438ed8d64e0e.svg" class="plain" height=60%>

---

## USE CASE 1: CTA CI

Implemented in the CERN GitLab instance:

- Implements the Kubernetes framework on a custom GitLab runner.
- When the instance is ready, runs a test that `xrdcp`s 10k files to EOSCTA, deletes the disk copies and retrieves the files from tape.

----

## CTA gitlab CI

<img src="https://codimd.web.cern.ch/uploads/upload_444ab29e58c8a175bdd3abeff584e86d.png" class="fragment plain" width=90%>

----

<img src="https://codimd.web.cern.ch/uploads/upload_91e651e87a78b4f38c230419a8bafd89.svg" class="plain" width=80%>

----

## Some statistics

3000+ pipelines have run since the CI was put in place

<img src="https://codimd.web.cern.ch/uploads/upload_78563856e04bc9a64950c6f151aabe6f.png" class="plain" width=80%>

----

## Bonus: Nightly EOS regression tests

Every night a GitLab schedule runs these steps:

- run the standard archival test
- upgrade EOS to the latest dev tagged release
- run the standard archival test against the new EOS version

<span class="fragment highlight-red">This allows CTA developers to catch EOS regressions that impact CTA specific workflows.</span>

---

## USE CASE 2: CTA developers

<ul>
<li>Runs entirely on <span class="fragment highlight-blue">a disconnected developer laptop</span>:
<ul>
<li>Implements the Kubernetes framework in a <span class="fragment highlight-red">VirtualBox CentOS VM</span></li>
</ul>
</li>
<li>Offline resources: <span class="fragment highlight-blue">local Postgres instance, local file based objectstore, MHVTL</span></li>
</ul>

When the instance is ready, run a specific developer test.
----

## Strengths

<ul>
<li>Quickly deploys a <span class="fragment highlight-red">disposable local EOS CTA instance</span>.</li>
<li>Much <span class="fragment highlight-blue">shorter learning curve for newcomers</span>, who can focus on their work.</li>
<ul><li><span class="fragment highlight-red">Best deployment practices included</span>.</li></ul>
<li>Successfully used for:</li>
<ul>
<li class="fragment">Objectstore developments</li>
<li class="fragment">Database catalogue backend developments (`MySQL`, `Postgres`)</li>
</ul>
</ul>

---

## USE CASE 3: CTA stress tests

Implemented in a dedicated `Puppet` managed PPS instance sized to reach 2GB/s:

<ul>
<li class="fragment">1 MGM</li>
<li class="fragment">3 FSTs (750TB of storage)</li>
<li class="fragment">1 CTA frontend <i>VM</i></li>
<li class="fragment">8 tape servers and associated tape drives for BW stress tests</li>
<li class="fragment">3 VTL tape servers for rate stress tests <i>VMs</i></li>
</ul>

----

### PPS instance stress tests <i>issues</i>

<ul>
<li class="fragment">Requires a lot of machines: <b style="color:crimson;">12 machines and 4 VMs</b> for 2GB/s.</li>
<li class="fragment">Puppet is good at managing production services, but not so convenient if <b style="color:dodgerblue;">all manifests change between releases</b>.</li>
<li class="fragment">CTA CI to Puppet is a <b style="color:crimson;">manual, error-prone step.</b></li>
<li class="fragment">Extensive use of `rundeck` to deploy and clean up, but a <b style="color:crimson;">CTA release still requires several hours</b>.</li>
</ul>

----

### PPS instance stress tests <i>impact</i>

Low turnover stress environment:

<ul>
<li class="fragment">Less testing...</li>
<li class="fragment">More code changes between tests: more deployment errors, more regressions.</li>
<li class="fragment">Wasted human hours babysitting PPS...</li>
<li class="fragment">Log collection/monitoring of PPS?
O(kHz) events/machine?</li>
<li class="fragment">Reproducibility?</li>
<li class="fragment">Deployment best practices???? Best case: obscure devops documentation...</li>
<li class="fragment">Stress test code versioning???</li>
</ul>

---

## Here comes Santa!

<img src="https://codimd.web.cern.ch/uploads/upload_3f0cbd71fb908391309e4aa00f40cf74.svg" class="plain" height=60%>

<span class="fragment"><b style="color:dodgerblue;">4GB/s full duplex internal bandwidth</b> <i>up to 6Gb/s simplex</i></span>

<span class="fragment">Initially I thought: <i>nice toy but Santa was a bit short on the network connectivity...</i></span>

----

## Hyperconverged server usage? CI stress tests of course!

<ul>
<li>Implements the EOSCTA k8s framework on a <span class="fragment highlight-blue">hyperconverged server</span>:
<ul>
<li><span class="fragment highlight-red">Plenty of IOPS</span> for file rate tests</li>
<li><span class="fragment highlight-red">Plenty of bandwidth </span> to simulate a sizable CTA instance (10 tape servers, 6 disk servers...)</li>
</ul>
</li>
</ul>

When the instance is ready, run a CI stress test that `xrdcp`s **1M files to EOSCTA**, deletes the disk copies and retrieves the files from tape.
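
----

The stress test cycle above can be sketched in Python. This is a minimal sketch, not the actual CTA CI script: `archive_commands`, `run_cycle` and the destination URL are hypothetical names, the only command assumed is the basic `xrdcp <source> <destination>` form, and the steps that drop the disk copy and retrieve from tape are left as caller-supplied functions because the slides do not name the commands used.

```python
import posixpath
import subprocess
from typing import Callable, List, Sequence

Command = List[str]


def archive_commands(local_files: Sequence[str], destination_url: str) -> List[Command]:
    """Build one `xrdcp <source> <destination>` command per local file."""
    base = destination_url.rstrip("/")
    return [["xrdcp", path, f"{base}/{posixpath.basename(path)}"] for path in local_files]


def run_cycle(
    local_files: Sequence[str],
    destination_url: str,
    drop_disk_copy: Callable[[str], None],      # hypothetical hook: remove the disk replica
    retrieve_from_tape: Callable[[str], None],  # hypothetical hook: recall the file from tape
    run: Callable[[Command], object] = lambda cmd: subprocess.run(cmd, check=True),
) -> List[str]:
    """Archive every file, then drop each disk copy, then retrieve each file.

    Returns the remote URLs that went through the full cycle.
    """
    commands = archive_commands(local_files, destination_url)
    for command in commands:
        run(command)
    remote_urls = [command[2] for command in commands]
    for url in remote_urls:
        drop_disk_copy(url)
    for url in remote_urls:
        retrieve_from_tape(url)
    return remote_urls
```

Passing a recording function as `run` gives a dry run that needs no EOSCTA instance. A real test would presumably also wait for archival to complete before dropping the disk copies; that step is omitted here.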
----

<img src="https://codimd.web.cern.ch/uploads/upload_b53ffe846c1d564ea0d8ec67cb4512c3.svg" class="plain" width=80%>

---

## Beefy system stress tests

<ul>
<li class="fragment">Fast turnover that makes it possible to quickly reproduce a bug again and again in various conditions:</li>
<ul>
<li class="fragment">Easy to automate: <b style="color:crimson;">will go in the CD step</b>.</li>
<li class="fragment">Fully reproducible.</li>
</ul>
<li class="fragment">Allows performance regressions to be tracked down and fixed efficiently</li>
<ul>
<li class="fragment">Illustration with an exponential performance degradation: identified, fixed and tested in 3 days, <b style="color:red;">had been there for 2 months</b>.</li>
</ul>
</ul>

---

## Real life issue tracking/fixing

![](https://codimd.web.cern.ch/uploads/upload_55b530bddd60498a9226964f7e1e696c.png)

----

![](https://codimd.web.cern.ch/uploads/upload_092141e5f70dbd2b32cc0f8078f78566.png)

----

![](https://codimd.web.cern.ch/uploads/upload_d989ea36ae93ddba5f25c1f605ce1210.png)

----

![](https://codimd.web.cern.ch/uploads/upload_7e155aeb5828f47ff0e30db9a91afea7.png)

----

![](https://codimd.web.cern.ch/uploads/upload_89da98dfb686dea44536acf909cd46fd.png)

---

## USE CASE 4: EOS in-memory to QuarkDB NS

Not a simple jump, but achieved quickly and with confidence

```mermaid
gantt
dateFormat YYYY-MM-DD
title QuarkDB integration in CTA
section Play
first contact @EOS WS :a1, 2018-02-04, 2d
section CI
QuarkDB NS support added in CI :after a1 , 3d
QuarkDB stress tests and debugging in CI: 2018-02-25, 2d
QuarkDB stress tests and debugging in CI: 2018-03-03, 3d
section Releases
QuarkDB fixes for CTA in EOS 4.4.27: crit, 2018-03-07, 1d
QuarkDB OK in CTA v0.0-173: crit, 2018-03-08, 1d
```

Working closely with EOS developers.
----

<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_5d9829c67e93d31af306b734ef8a971d.png" data-background-size="80%" -->

---

<h1>THE END?</h1>

<ul>
<li class="fragment">Very powerful approach: <font color="dodgerblue">addresses and federates all our development/testing use cases</font></li>
<li class="fragment">Fast, flexible, isolated and self-contained in the software repository</li>
<li class="fragment"><font color="crimson">Reproducible development environment</font> that allows regression and performance tests</li>
</ul>

----

## More to do

<ul>
<li class="fragment"><font color="dodgerblue">Automate Continuous deployment</font> in GitLab for stress tests?</li>
<li class="fragment"><font color="dodgerblue">Performance regression tests</font></li>
<li class="fragment"><font color="crimson">Bandwidth performance tests @4GB/s</font> on Beefy system</li>
<li class="fragment"><font color="crimson">CASTOR namespace ingestion tests</font>. Double it in CI stress instance?</li>
<li class="fragment">Add <font color="dodgerblue">full experiment data workflows for T0</font> in CI</li>
<li class="fragment">Evaluate possible production use &#x263A;</li>
</ul>

<b class="fragment">THANK YOU FOR YOUR ATTENTION!</b>

---

{%vimeo 322128118 %}
{"title":"ITTF - System testing EOS-CTA using Kubernetes","description":"Presentation of CERN Tape Archive CI strategy","slideOptions":{"transition":"slide","theme":"white"}}