## System testing service developments using
<img src="https://codimd.web.cern.ch/uploads/upload_1251b647b5911190debf6f1753f9adb7.svg" align="left" class="plain" width=40%><img src="https://codimd.web.cern.ch/uploads/upload_448ce34807f29c178f209b22655923d0.svg" align="right" class="plain" width=26%>
### Update on EOS + CTA Continuous Integration system
[Julien Leduc](mailto:julien.leduc@cern.ch) from **IT ST**orage group [CERN](http://www.cern.ch)
---
## CERN Tape Archive?
<img src="https://codimd.web.cern.ch/uploads/upload_1bfa8c1cf4315b22ccc431657760f820.png" class="plain" size=60%>
<ul>
<li class="fragment">CTA is glued to the back of EOS</li>
<li class="fragment">EOS manages CTA tape files as replicas</li>
<li class="fragment">CTA contains a catalogue of all tape files</li>
<li class="fragment">More on CTA tomorrow morning!</li>
</ul>
---
## CTA + EOS developments
<p class="fragment">Tightly coupled software <span class="fragment">⇒ <span style="color:red;">tightly coupled developments</span></span></p>
<p class="fragment"><span class="fragment highlight-blue">Extensive and systematic testing is paramount to limit regressions</span></p>
----
## CTA + EOS integration tests (*What?*)
<ul>
<li class="fragment">Complex situation:</li>
<ul class="fragment">
<li><b style="color:red;">2 distinct software projects</b></li>
<ul><li><b style="color:red;">relying on specific shared developments (xrootd...)</b></li></ul>
<li><b style="color:blue;">Several external dependencies</b> per instance: 1 database, 1 tape library, 1 objectstore</li>
</ul>
</ul>
----
## CTA + EOS integration tests (*Constraints?*)
<ul class="fragment">
<li>I hate <b style="color:red;">repetitive tasks</b> and I am <b style="color:blue;">impatient</b></li>
<ul>
<li>no manual operation <span class="fragment highlight-red">→ CI</span></li>
<li>make it <span class="fragment highlight-blue">fast</span></li>
</ul>
<li><span class="fragment highlight-red">Other possible use cases?</span></li>
</ul>
---
## Kubernetes EOS CTA generic instance
<ul>
<li>Implement a framework based on a <span class="fragment highlight-red">single generic docker image</span>.</li>
<li>Use <span class="fragment highlight-blue">Kubernetes</span> to build an EOS CTA instance out of it.</li>
<li>Flexible enough to <span class="fragment highlight-red">accommodate any supported resource</span> (database, objectstore, tape library).</li>
<li>Part of CTA code repository: <span class="fragment highlight-red">CI tests are evolving with the tested code</span>.</li>
</ul>
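<p>As a rough illustration of what "accommodate any supported resource" could mean in practice, the sketch below models an instance description as a small validated record. The class name <code>InstanceConfig</code>, its field names and the backend identifiers are hypothetical; only the three resource kinds and the backends named in these slides come from the talk.</p>

```python
from dataclasses import dataclass

# Backends mentioned in this talk; the identifiers themselves are illustrative.
SUPPORTED = {
    "database": {"oracle", "sqlite", "mysql", "postgres"},
    "objectstore": {"ceph", "file"},
    "library": {"mhvtl", "tape_library"},
}


@dataclass(frozen=True)
class InstanceConfig:
    """Hypothetical description of one EOS CTA test instance."""

    database: str
    objectstore: str
    library: str

    def __post_init__(self):
        # Reject any backend that is not in the supported set for its kind.
        for kind, allowed in SUPPORTED.items():
            value = getattr(self, kind)
            if value not in allowed:
                raise ValueError(f"unsupported {kind}: {value!r}")


# Two setups that differ only in the resources they plug in.
ci = InstanceConfig(database="oracle", objectstore="ceph", library="mhvtl")
dev = InstanceConfig(database="sqlite", objectstore="file", library="mhvtl")
```

The point of the design is that the instance definition stays the same and only the resource selection changes between environments.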
----
## Basic Kubernetes concepts
<img src="https://codimd.web.cern.ch/uploads/upload_3d760e405412c241100463317f6fe22e.svg" class="plain" width=60%>
----
## EOS CTA generic instance
<img src="https://codimd.web.cern.ch/uploads/upload_a91aec66ccd367d5ec7b43d5ec4252f3.svg" class="plain" height=60%>
---
## USE CASE 1: CTA CI
Implemented in the CERN Gitlab instance:
- Implements the kubernetes framework on a gitlab runner.
- Resources:
  - external Oracle DB instance
  - external Ceph objectstore
  - MHVTL
- When the instance is ready, run a test that copies 10k files to EOS CTA with `xrdcp`, deletes the disk copies and retrieves the files from tape.
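
The archive and retrieve cycle of that test can be sketched against an in-memory stand-in. This is not the actual CI script: `FakeEosCta`, its methods and the file paths are hypothetical, and the toy model archives every file to "tape" immediately instead of going through a real queue.

```python
import hashlib


class FakeEosCta:
    """In-memory stand-in for an EOS CTA instance (illustration only)."""

    def __init__(self):
        self.disk = {}  # path -> bytes currently on disk
        self.tape = {}  # path -> bytes archived on tape

    def write(self, path, data):
        # Writing lands on disk and, in this toy model, is archived at once.
        self.disk[path] = data
        self.tape[path] = data

    def drop_disk_copy(self, path):
        # Only drop the disk replica once a tape copy exists.
        if path not in self.tape:
            raise RuntimeError(f"{path} is not on tape yet")
        del self.disk[path]

    def retrieve(self, path):
        # Bring the file back from tape to disk.
        self.disk[path] = self.tape[path]
        return self.disk[path]


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def archive_retrieve_cycle(instance, n_files):
    """Write n_files, drop their disk copies, retrieve them, count mismatches."""
    expected = {}
    for i in range(n_files):
        path = f"/test/file_{i:05d}"
        data = f"payload {i}".encode()
        instance.write(path, data)
        expected[path] = checksum(data)
    for path in expected:
        instance.drop_disk_copy(path)
    assert not instance.disk  # every disk copy is gone before retrieval
    mismatches = 0
    for path, digest in expected.items():
        if checksum(instance.retrieve(path)) != digest:
            mismatches += 1
    return mismatches


mismatches = archive_retrieve_cycle(FakeEosCta(), 10_000)
```

Comparing checksums after retrieval is one plausible pass criterion; the real test may check something different.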
---
## CTA CI
<img src="https://codimd.web.cern.ch/uploads/upload_bcc557bb2e08504c8592f027c6a20837.png" class="fragment plain" width=90%>
<ul>
<li class="fragment">Build software: CTA RPMs available as <b style="color:blue;">artifacts</b></li>
<li class="fragment">Build and publish a <b style="color:red;">generic Docker image</b> in gitlab registry</li>
<ul>
<li class="fragment">Contains <b style="color:blue;">all required versioned software (artifacts)</b> and access to <b style="color:red">versioned software cache repository</b> for dependencies</li>
</ul>
<li class="fragment">Run <b style="color:red;">system tests</b> in single VM <code>kubernetes</code> cluster (specific gitlab-runner)</li>
</ul>
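<p>The dependency between the three stages can be sketched as a chain of functions that hand artifacts to each other. This is a toy model, not the real pipeline definition: the function names, the artifact keys and the placeholder RPM and image names are all hypothetical.</p>

```python
def build_rpms(artifacts):
    # Stand-in for the build job: the resulting RPMs are kept as artifacts.
    return {**artifacts, "rpms": ["cta-placeholder.rpm"]}


def build_image(artifacts):
    # The generic image is built from the RPM artifacts of the previous stage.
    if not artifacts.get("rpms"):
        raise RuntimeError("no RPM artifacts to install in the image")
    return {**artifacts, "image": "registry.example/cta:placeholder"}


def system_tests(artifacts):
    # System tests need the published image to start the instance.
    if "image" not in artifacts:
        raise RuntimeError("no image to deploy")
    return {**artifacts, "tests_passed": True}


def run_pipeline(stages):
    """Run the stages in order, passing the accumulated artifacts along."""
    artifacts = {}
    for stage in stages:
        artifacts = stage(artifacts)
    return artifacts


result = run_pipeline([build_rpms, build_image, system_tests])
```

Running a stage without its predecessor raises immediately, which mirrors the fact that each stage consumes what the previous one produced.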
----
<img src="https://codimd.web.cern.ch/uploads/upload_91e651e87a78b4f38c230419a8bafd89.svg" class="plain" width=80%>
----
## Some statistics
3000+ pipelines have run since the CI was put in place
<img src="https://codimd.web.cern.ch/uploads/upload_78563856e04bc9a64950c6f151aabe6f.png" class="plain" width=80%>
----
## Bonus: Nightly EOS regression tests
Every night a Gitlab schedule runs these steps:
- run standard archival test
- upgrade EOS to latest dev tagged release
- run standard archival test against the new EOS version

This allows CTA developers to catch EOS regressions that impact CTA-specific workflows.
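
The logic of the nightly run can be sketched as follows. Both callables are hypothetical placeholders: `run_archival_test()` is assumed to return `True` on success and `upgrade_eos()` to install the latest dev-tagged EOS release. The verdict strings are illustrative too.

```python
def nightly_eos_regression(run_archival_test, upgrade_eos):
    """Return a verdict string for one nightly run."""
    if not run_archival_test():
        # Failure before the upgrade: it cannot be blamed on the new EOS.
        return "baseline failure"
    upgrade_eos()
    if not run_archival_test():
        # Passed before, fails after: the EOS upgrade is the prime suspect.
        return "EOS regression"
    return "ok"


# Simulated run in which the upgraded EOS breaks the archival test.
state = {"eos": "current"}


def fake_upgrade():
    state["eos"] = "latest-dev"


def fake_test():
    return state["eos"] != "latest-dev"


verdict = nightly_eos_regression(fake_test, fake_upgrade)
```

Running the same test before and after the upgrade is what separates an EOS regression from a failure that was already present.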
---
## USE CASE 2: CTA developers
<ul>
<li>Entirely runs on <span class="fragment highlight-blue">developer laptop</span>:
<ul>
<li>Implements kubernetes framework in a <span class="fragment highlight-red">Virtualbox CentOS VM</span></li>
</ul>
<li>Offline resources: <span class="fragment highlight-blue">local sqlite DB, local file based objectstore, MHVTL</span></li>
</ul>
When the instance is ready, run a specific developer test.
----
## Strengths
<ul>
<li>Quickly deploys a <span class="fragment highlight-red">disposable local EOS CTA instance</span>.</li>
<li>Much <span class="fragment highlight-blue">shorter learning curve for newcomers</span>, who can focus on their work.</li>
<ul><li><span class="fragment highlight-red">Best deployment practices included</span>.</li></ul>
<li>Successfully used for:</li>
<ul>
<li class="fragment">Objectstore developments</li>
<li class="fragment">Database catalogue backend developments (<code>mysql</code>, <code>postgres</code>)</li>
</ul>
<li class="fragment">Developers improve CI code for me.</li>
</ul>
---
## USE CASE 3: CTA PPS stress tests
Initially implemented in a dedicated `Puppet`-managed PPS instance to reach 2 GB/s:
<ul>
<li class="fragment">1 MGM</li>
<li class="fragment">3 FSTs (750 TB of storage)</li>
<li class="fragment">1 CTA frontend</li>
<li class="fragment">8 tape servers and associated tape drives for BW stress tests</li>
<li class="fragment">3 VTL tape servers for rate stress tests</li>
</ul>
----
## PPS instance stress tests
Many issues with VM/Puppet approach:
<ul>
<li class="fragment">Code changes in EOS CTA often require error-prone <b style="color:red;">manual Puppet manifest changes or manual reconfiguration</b>.</li>
<li class="fragment">Even with extensive use of <code>rundeck</code>, deploying a CTA release still takes several hours.</li>
</ul>
----
## PPS instance stress tests
Low turnover leads to:
<ul>
<li class="fragment">Less testing...</li>
<li class="fragment">More code changes between 2 tests: more deployment errors, more regressions.</li>
<li class="fragment">Time-consuming PPS babysitting...</li>
<li class="fragment">Log collection/monitoring of PPS? O(kHz) events/machine?</li>
<li class="fragment">Reproducibility?</li>
<li class="fragment">Deployment best practices???? Best case: obscure devops documentation...</li>
</ul>
----
## Here comes the *Beefy system*
<ul>
<li>Implements kubernetes framework on <span class="fragment highlight-blue">one hyperconverged server</span> with <span class="fragment highlight-red">16 SSDs</span>:
<ul>
<li><span class="fragment highlight-red">Plenty of IOPS</span> for VTL rate tests</li>
<li><span class="fragment highlight-red">Plenty of bandwidth</span> to model a sizable CTA instance (10 tape servers, 6 FSTs...)</li>
</ul>
<li>Resources: <span class="fragment highlight-blue">Oracle DB instance, Ceph objectstore, MHVTL</span></li>
</ul>
When the instance is ready, run a beefy CI test that copies <b>1M files to EOS CTA</b> with <code>xrdcp</code>, deletes the disk copies and retrieves the files from tape.
----
<img src="https://codimd.web.cern.ch/uploads/upload_b53ffe846c1d564ea0d8ec67cb4512c3.svg" class="plain" width=80%>
---
## Beefy system stress tests
<ul>
<li class="fragment">Fast turnover makes it possible to quickly reproduce a bug again and again under various conditions:</li>
<ul>
<li class="fragment">Fully automated: <b style="color:red;">will go in the CD step</b>.</li>
<li class="fragment">Fully reproducible.</li>
</ul>
<li class="fragment">Allowed me to successfully track down an exponential performance degradation regression</li>
<ul>
<li class="fragment">Identified, fixed and tested in 3 days: <b style="color:red;">it had been there for 2 months</b>.</li>
</ul>
<li class="fragment">Allowed me to identify a bug in the frontend that killed queuing performance.</li>
</ul>
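<p>One simple way to spot this kind of degradation from repeated runs is to look at how the time per batch grows from one batch to the next. The sketch below is an illustration only, not the analysis actually used: the function names, the sample durations and the 1.2 threshold are assumptions.</p>

```python
from statistics import geometric_mean


def growth_factor(durations):
    """Geometric mean of consecutive duration ratios (1.0 means stable)."""
    ratios = [later / earlier for earlier, later in zip(durations, durations[1:])]
    return geometric_mean(ratios)


def is_degrading(durations, threshold=1.2):
    # A factor persistently above 1 means each batch is slower than the last,
    # which is the signature of exponential degradation.
    return growth_factor(durations) > threshold


stable = [10.0, 10.1, 9.9, 10.0]   # seconds per batch, made-up values
degrading = [1.0, 2.0, 4.0, 8.0]   # each batch takes twice as long

stable_flag = is_degrading(stable)
degrading_flag = is_degrading(degrading)
```

A fast, reproducible test instance makes this kind of measurement cheap to repeat after each candidate fix.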
---
## Real-life issue tracking/fixing
![](https://codimd.web.cern.ch/uploads/upload_55b530bddd60498a9226964f7e1e696c.png)
----
![](https://codimd.web.cern.ch/uploads/upload_092141e5f70dbd2b32cc0f8078f78566.png)
----
![](https://codimd.web.cern.ch/uploads/upload_d989ea36ae93ddba5f25c1f605ce1210.png)
----
![](https://codimd.web.cern.ch/uploads/upload_7e155aeb5828f47ff0e30db9a91afea7.png)
----
![](https://codimd.web.cern.ch/uploads/upload_89da98dfb686dea44536acf909cd46fd.png)
---
<h1>THE END?</h1>
<ul>
<li class="fragment">Very powerful approach <font color="blue">addresses and federates all our development/testing use cases</font></li>
<li class="fragment">Fast, flexible, isolated and self-contained in the software repository</li>
<li class="fragment"><font color="red">Reproducible development environment</font> that allows regression and performance tests</li>
</ul>
<h2 class="fragment">TO DO</h2>
<ul>
<li class="fragment">Automatic log analysis</li>
<li class="fragment">Bandwidth performance tests</li>
<li class="fragment">Evaluate possible production use ☺</li>
</ul>
{"title":"System testing EOS-CTA using Kubernetes","description":"Presentation of CERN Tape Archive CI","slideOptions":{"transition":"slide","theme":"white"}}