## System testing service developments using
<img src="https://codimd.web.cern.ch/uploads/upload_1251b647b5911190debf6f1753f9adb7.svg" align="left" class="plain" width=30%><img src="https://codimd.web.cern.ch/uploads/upload_448ce34807f29c178f209b22655923d0.svg" align="right" class="plain" width=19%>
### EOS + CTA Continuous Integration system
[Julien Leduc](mailto:julien.leduc@cern.ch) from **IT ST**orage group [CERN](http://www.cern.ch)
---
## Data archiving at CERN
<ul>
<li class="fragment">Ad aeternum storage</li>
<li class="fragment">7 tape libraries, 83 tape drives, 20k tapes</li>
<li class="fragment">Current use: <b style="color:dodgerblue;">330 PB</b></li>
<li class="fragment">Current capacity: <b style="color:coral;">0.7 EB</b></li>
<li class="fragment"><b style="color:red;">Exponentially growing</b></li>
</ul>
<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_95716d3602c009e301c880b0afd4225a.png" data-background-size="80%" -->
---
<h2>Data Archiving at CERN <span class="fragment"><i style="color:blue;">Evolution</i></span></h2>
<ul>
<li class="fragment">EOS + tapes...</li>
<ul>
<li class="fragment">EOS is CERN's strategic storage platform</li>
<li class="fragment">tape is the strategic long-term archive medium</li>
</ul>
<li class="fragment">EOS + tapes = <span class="fragment" style="color:red;">♥</span></li>
<ul>
<li class="fragment">Meet CTA: CERN Tape Archive</li>
<li class="fragment">Streamline data paths, software and infrastructure</li>
</ul>
</ul>
---
## CERN Tape Archive?
<img src="https://codimd.web.cern.ch/uploads/upload_eac32c76dde5a45191434a90d54a4d5a.png" class="plain" width=60%>
<ul>
<li class="fragment">CTA is glued to the back of EOS</li>
<li class="fragment">EOS manages CTA tape files as replicas</li>
<li class="fragment">CTA contains a catalogue of all tape files</li>
</ul>
---
## CTA + EOS developments
<p class="fragment">Tightly coupled software <span class="fragment">⇒ <span style="color:red;">tightly coupled developments</span></span></p>
<p class="fragment"><span class="fragment highlight-blue">Extensive and systematic testing is paramount to limit regressions</span></p>
----
## CTA + EOS integration tests (*What?*)
<ul>
<li class="fragment">Complex situation:</li>
<ul class="fragment">
<li><b style="color:red;">2 distinct software projects</b></li>
<ul><li><b style="color:red;">relying on specific shared developments (xrootd...)</b></li></ul>
<li><b style="color:blue;">Several external dependencies</b> per instance: 1 database, 1 tape library, 1 objectstore</li>
</ul>
</ul>
----
## CTA + EOS integration tests (*Constraints?*)
<ul class="fragment">
<li>I hate <b style="color:red;">repetitive tasks</b> and I am <b style="color:blue;">impatient</b></li>
<ul>
<li>no manual operation <span class="fragment highlight-red">→ CI</span></li>
<li>make it <span class="fragment highlight-blue">fast</span></li>
</ul>
<li><span class="fragment highlight-red">Other possible use cases?</span></li>
</ul>
---
## Kubernetes EOS CTA generic instance
<ul>
<li>Implement a framework based on a <span class="fragment highlight-red">single generic docker image</span>.</li>
<li>Use <span class="fragment highlight-blue">Kubernetes</span> to build an EOS CTA instance out of it.</li>
<li>Flexible enough to <span class="fragment highlight-red">accommodate any supported resource</span> (database, objectstore, tape library).</li>
<li>Part of CTA code repository: <span class="fragment highlight-red">CI tests are evolving with the tested code</span>.</li>
</ul>
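----
A minimal sketch of the "single generic image" idea, not the actual CTA framework: every pod of the instance is generated from the same image and only its role differs. The image name, the role names and the `ROLE` environment variable are hypothetical, chosen for illustration.
```python
import json

# Hypothetical image and role names, for illustration only.
GENERIC_IMAGE = "registry.example.org/eoscta/generic:latest"
ROLES = ["mgm", "fst", "ctafrontend", "tpsrv"]


def pod_manifest(role, image=GENERIC_IMAGE):
    """Return a minimal Kubernetes Pod manifest (as a dict) for one role."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": role, "labels": {"app": "eoscta", "role": role}},
        "spec": {
            "containers": [
                {
                    "name": role,
                    "image": image,
                    # Assumption: the container reads ROLE to decide what to start.
                    "env": [{"name": "ROLE", "value": role}],
                }
            ],
            "restartPolicy": "Never",
        },
    }


def build_instance(roles=ROLES, image=GENERIC_IMAGE):
    """Build one manifest per role, all sharing the same generic image."""
    return [pod_manifest(role, image) for role in roles]


if __name__ == "__main__":
    for manifest in build_instance():
        print(json.dumps(manifest))
```
Each printed JSON document could be handed to Kubernetes as a pod definition; swapping a resource (database, objectstore, tape library) would then only change configuration, not the image.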
----
## Basic Kubernetes concepts
<img src="https://codimd.web.cern.ch/uploads/upload_3d760e405412c241100463317f6fe22e.svg" class="plain" width=60%>
----
## EOS CTA generic k8s instance
<img src="https://codimd.web.cern.ch/uploads/upload_fc9e6f74e0b135d7b4f6438ed8d64e0e.svg" class="plain" height=60%>
---
## USE CASE 1: CTA CI
Implemented in CERN Gitlab instance:
- Implements the Kubernetes framework on a custom GitLab runner.
- When the instance is ready, runs a test that copies 10k files to EOSCTA with `xrdcp`, deletes the disk copies and retrieves the files from tape.
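----
The shape of this test can be sketched as follows. This is an illustration under stated assumptions, not the real CI script: `FakeArchive` is a hypothetical in-memory stand-in for an EOSCTA instance, and the path layout is invented. In the real test the archive step would be an `xrdcp` to EOSCTA.
```python
import hashlib


class FakeArchive:
    """Hypothetical in-memory stand-in for an EOSCTA instance."""

    def __init__(self):
        self.disk = {}
        self.tape = {}

    def archive(self, path, data):
        # Stands in for the xrdcp to EOSCTA followed by the write to tape.
        self.disk[path] = data
        self.tape[path] = data

    def drop_disk_copy(self, path):
        # Refuse to drop the disk replica if no tape copy exists.
        if path not in self.tape:
            raise RuntimeError(f"{path} is not on tape yet")
        del self.disk[path]

    def retrieve(self, path):
        # Stands in for a retrieve request bringing the file back to disk.
        self.disk[path] = self.tape[path]
        return self.disk[path]


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def run_archive_retrieve_test(backend, n_files):
    """Archive n_files, drop the disk copies, retrieve them from tape.

    Returns the list of paths whose retrieved content does not match.
    """
    expected = {}
    for i in range(n_files):
        path = f"/eos/test/file_{i:07d}"  # hypothetical path layout
        data = f"payload-{i}".encode()
        backend.archive(path, data)
        expected[path] = checksum(data)
    for path in expected:
        backend.drop_disk_copy(path)
    return [p for p, c in expected.items() if checksum(backend.retrieve(p)) != c]


if __name__ == "__main__":
    failures = run_archive_retrieve_test(FakeArchive(), 10_000)
    print(f"{len(failures)} files failed the archive/retrieve round trip")
```
Keeping the driver independent of the backend means the same three steps can run against a fake, a developer instance or the CI instance.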
----
## CTA gitlab CI
<img src="https://codimd.web.cern.ch/uploads/upload_444ab29e58c8a175bdd3abeff584e86d.png" class="fragment plain" width=90%>
----
<img src="https://codimd.web.cern.ch/uploads/upload_91e651e87a78b4f38c230419a8bafd89.svg" class="plain" width=80%>
----
## Some statistics
3000+ pipelines have run since the CI was put in place
<img src="https://codimd.web.cern.ch/uploads/upload_78563856e04bc9a64950c6f151aabe6f.png" class="plain" width=80%>
----
## Bonus: Nightly EOS regression tests
Every night a Gitlab schedule runs these steps:
- run standard archival test
- upgrade EOS to latest dev tagged release
- run standard archival test against the new EOS version
<span class="fragment highlight-red">This allows CTA developers to catch EOS regressions that impact CTA specific workflows.</span>
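----
The nightly logic above can be sketched as a small driver. The function and result names are hypothetical; the two callables stand in for the standard archival test and the EOS upgrade step.
```python
def nightly_regression_check(run_archival_test, upgrade_eos):
    """Run the test, upgrade EOS, run the test again and classify the result.

    run_archival_test: callable returning True when the test passes.
    upgrade_eos: callable upgrading EOS to the latest dev tagged release.
    """
    if not run_archival_test():
        # The test already fails before the upgrade: not attributable to new EOS.
        return "baseline-failure"
    upgrade_eos()
    if not run_archival_test():
        # Passed before the upgrade, fails after: the new EOS version is suspect.
        return "eos-regression"
    return "ok"


if __name__ == "__main__":
    # Stubs simulating an EOS upgrade that breaks the archival workflow.
    state = {"eos": "current"}

    def fake_test():
        return state["eos"] == "current"

    def fake_upgrade():
        state["eos"] = "latest-dev"

    print(nightly_regression_check(fake_test, fake_upgrade))
```
Running the same test before and after the upgrade is what separates an EOS regression from a failure that was already present in CTA.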
---
## USE CASE 2: CTA developers
<ul>
<li>Entirely runs on <span class="fragment highlight-blue">a disconnected developer laptop</span>:
<ul>
<li>Implements kubernetes framework in a <span class="fragment highlight-red">Virtualbox CentOS VM</span></li>
</ul>
<li>Offline resources: <span class="fragment highlight-blue">local Postgres instance, local file based objectstore, MHVTL</span></li>
</ul>
When the instance is ready, run a specific developer test.
----
## Strengths
<ul>
<li>Quickly deploys a <span class="fragment highlight-red">disposable local EOS CTA instance</span>.</li>
<li>Much <span class="fragment highlight-blue">shorter learning curve for newcomers</span>, who can focus on their work.</li>
<ul><li><span class="fragment highlight-red">Best deployment practices included</span>.</li></ul>
<li>Successfully used for:</li>
<ul>
<li class="fragment">Objectstore developments</li>
<li class="fragment">Database catalogue backend developments (<code>MySQL</code>, <code>Postgres</code>)</li>
</ul>
</ul>
---
## USE CASE 3: CTA stress tests
Implemented in a dedicated `Puppet` managed PPS instance sized to reach 2GB/s:
<ul>
<li class="fragment">1 MGM</li>
<li class="fragment">3 FSTs (750TB of storage)</li>
<li class="fragment">1 CTA frontend <i>VM</i></li>
<li class="fragment">8 tape servers and associated tape drives for BW stress tests</li>
<li class="fragment">3 VTL tape servers for rate stress tests <i>VMs</i></li>
</ul>
----
### PPS instance stress tests <i>issues</i>
<ul>
<li class="fragment">Requires a lot of machines: <b style="color:crimson;">12 machines and 4 VMs</b> for 2GB/s.</li>
<li class="fragment">Puppet is good at managing production services, but not so convenient when <b style="color:dodgerblue;">all manifests change between releases</b>.</li>
<li class="fragment">Going from CTA CI to Puppet is a <b style="color:crimson;">manual, error-prone step.</b></li>
<li class="fragment">Extensive use of <code>rundeck</code> to deploy and clean up, but <b style="color:crimson;">a CTA release still requires several hours</b>.</li>
</ul>
----
### PPS instance stress tests <i>impact</i>
Low turnover stress environment:
<ul>
<li class="fragment">Less testing...</li>
<li class="fragment">More code changes between tests: more deployment errors, more regressions.</li>
<li class="fragment">Wasted human hours babysitting PPS...</li>
<li class="fragment">Log collection/monitoring of PPS? O(kHz) events/machine?</li>
<li class="fragment">Reproducibility?</li>
<li class="fragment">Deployment best practices???? Best case: obscure devops documentation...</li>
<li class="fragment">Stress test code versioning???</li>
</ul>
---
## Here comes Santa!
<img src="https://codimd.web.cern.ch/uploads/upload_3f0cbd71fb908391309e4aa00f40cf74.svg" class="plain" height=60%>
<span class="fragment"><b style="color:dodgerblue;">4GB/s full duplex internal bandwidth</b> <i>up to 6Gb/s simplex</i></span>
<span class="fragment">Initially I thought: <i>nice toy but Santa was a bit short on the network connectivity...</i></span>
----
## Hyperconverged server usage?
CI stress tests of course!
<ul>
<li>Implements EOSCTA k8s framework on <span class="fragment highlight-blue">hyperconverged server</span>:
<ul>
<li><span class="fragment highlight-red">Plenty of IOPS</span> for file rate tests</li>
<li><span class="fragment highlight-red">Plenty of bandwidth </span> to simulate a sizable CTA instance (10 tape servers, 6 disk servers...)</li>
</ul>
</ul>
When the instance is ready, run a CI stress test that copies **1M files to EOSCTA** with `xrdcp`, deletes the disk copies and retrieves the files from tape.
----
<img src="https://codimd.web.cern.ch/uploads/upload_b53ffe846c1d564ea0d8ec67cb4512c3.svg" class="plain" width=80%>
---
## Beefy system stress tests
<ul>
<li class="fragment">Fast turnover that makes it possible to quickly reproduce a bug again and again in various conditions:</li>
<ul>
<li class="fragment">Easy to automate: <b style="color:crimson;">will go in CD step</b>.</li>
<li class="fragment">Fully reproducible.</li>
</ul>
<li class="fragment">Makes it possible to efficiently track down and fix performance regressions</li>
<ul>
<li class="fragment">Illustration with an exponential performance degradation: identified, fixed and tested in 3 days, <b style="color:red;">had been present for 2 months</b>.</li>
</ul>
</ul>
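----
One way such a degradation could be flagged automatically is sketched below. This is an assumption about how a check might look, not the method used for the issue above: it fits a straight line to the logarithm of the per-batch file rate and flags the run when the rate decays faster than a chosen fraction per batch. The function names and the 5% threshold are hypothetical.
```python
import math


def log_rate_slope(rates):
    """Least-squares slope of log(rate) against the batch index.

    rates: strictly positive file rates (e.g. files/s), one per batch.
    A constant slope of log(q) corresponds to rate_i = rate_0 * q**i.
    """
    n = len(rates)
    if n < 2:
        raise ValueError("need at least two batches")
    ys = [math.log(r) for r in rates]
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def degrades_exponentially(rates, max_decay_per_batch=0.05):
    """True when the fitted rate shrinks by more than max_decay_per_batch per batch."""
    return log_rate_slope(rates) < math.log(1 - max_decay_per_batch)


if __name__ == "__main__":
    steady = [1000, 990, 1010, 1000]
    decaying = [1000 * 0.5 ** i for i in range(6)]
    print("steady run flagged:", degrades_exponentially(steady))
    print("decaying run flagged:", degrades_exponentially(decaying))
```
With a fast, reproducible stress instance, a check of this kind can run on every stress test instead of waiting for someone to notice the slowdown.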
---
## Real life Issue tracking/fixing
![](https://codimd.web.cern.ch/uploads/upload_55b530bddd60498a9226964f7e1e696c.png)
----
![](https://codimd.web.cern.ch/uploads/upload_092141e5f70dbd2b32cc0f8078f78566.png)
----
![](https://codimd.web.cern.ch/uploads/upload_d989ea36ae93ddba5f25c1f605ce1210.png)
----
![](https://codimd.web.cern.ch/uploads/upload_7e155aeb5828f47ff0e30db9a91afea7.png)
----
![](https://codimd.web.cern.ch/uploads/upload_89da98dfb686dea44536acf909cd46fd.png)
---
## USE CASE 4: EOS in-memory to QuarkDB NS
Not a simple jump but achieved quickly and with confidence
```mermaid
gantt
dateFormat YYYY-MM-DD
title QuarkDB integration in CTA
section Play
first contact @EOS WS :a1, 2018-02-04, 2d
section CI
QuarkDB NS support added in CI :after a1 , 3d
QuarkDB stress tests and debugging in CI: 2018-02-25, 2d
QuarkDB stress tests and debugging in CI: 2018-03-03, 3d
section Releases
QuarkDB fixes for CTA in EOS 4.4.27: crit, 2018-03-07, 1d
QuarkDB OK in CTA v0.0-173: crit, 2018-03-08, 1d
```
Working closely with EOS developers.
----
<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_5d9829c67e93d31af306b734ef8a971d.png" data-background-size="80%" -->
---
<h1>THE END?</h1>
<ul>
<li class="fragment">Very powerful approach <font color="dodgerblue">addresses and federates all our development/testing use cases</font></li>
<li class="fragment">Fast, flexible, isolated and self contained in software repository</li>
<li class="fragment"><font color="crimson">Reproducible development environment</font> that allows regression and performance tests</li>
</ul>
----
## More to do
<ul>
<li class="fragment"><font color="dodgerblue">Automate Continuous deployment</font> in gitlab for stress tests?</li>
<li class="fragment"><font color="dodgerblue">Performance regression tests</font></li>
<li class="fragment"><font color="crimson">Bandwidth performance tests @4GB/s</font> on Beefy system</li>
<li class="fragment"><font color="crimson">CASTOR namespace ingestion tests</font>. Double it in CI stress instance?</li>
<li class="fragment">Add <font color="dodgerblue">full experiment data workflows for T0</font> in CI</li>
<li class="fragment">Evaluate possible production use ☺</li>
</ul>
<b class="fragment">THANK YOU FOR YOUR ATTENTION!</b>
---
{%vimeo 322128118 %}
{"title":"ITTF - System testing EOS-CTA using Kubernetes","description":"Presentation of CERN Tape Archive CI strategy","slideOptions":{"transition":"slide","theme":"white"}}