---
title: Let's repack tape faster
description: Presentation of CERN tape repack architecture
disqus: hackmd-gm
slideOptions:
  transition: slide
  theme: white
---
# Let's <img src="https://codimd.web.cern.ch/uploads/upload_ab4df4a384042cbc7e0ac758700e8ee5.png" style="border: none;background: none;box-shadow:none" height="300"> faster
## <span style="color: dodgerblue">cheap, future-proof <span style="color: crimson">REPACK</span> infrastructure</span>
[Julien Leduc](mailto:julien.leduc@cern.ch)
---
## Why <span style="color: crimson">REPACK</span>?
> “Dear tape operations, can you make some room in the tape libraries? More data is coming soon!”
>
> PS: keep everything you have, we may need to read it back, thanks!
>
> The experiments
---
## How to <span style="color: crimson">REPACK</span>?
We know how to <span><!-- .element: class="fragment highlight-blue" -->move data from disks to tapes and back</span>.
Repack is <span><!-- .element: class="fragment highlight-green" -->easy</span>: just <span><!-- .element: class="fragment highlight-red" -->move the data from tapes to disks and back</span> (onto higher-density tapes).
----
### <span style="color: crimson">REPACK v2.0</span> architecture
```graphviz
graph hierarchy {
nodesep=1 // increases the separation between nodes
node [color=Red, fontname=Courier, shape=box] //All nodes will have this shape and colour
edge [color=Blue, label="10Gb/s"] //All the lines look like this
Router [shape=circle]
Router--{SwitchDisk} [label="3x40Gb/s", fontsize=15, style=bold]
Router--{SwitchTape} [label="7x20Gb/s", fontsize=15, style=bold]
subgraph cluster_level1{
label="Repack disk infrastructure\n3x13 diskservers"
color=dodgerblue
fontcolor=dodgerblue
SwitchDisk
disk1 [color=black, shape=cylinder]
diskXX [color=black, shape=cylinder]
disk20 [color=black, shape=cylinder]
disksrv01
disksrvXX--{disk1 diskXX disk20} [label=""]
SwitchDisk--{disksrv01 disksrvXX}
}
subgraph cluster_level2{
label="Tape infrastructure\n7x10 tapeservers"
color=crimson
fontcolor=crimson
SwitchTape
SwitchTape--{tpsrv01 tpsrvXX}
{rank=same; tpsrv01 tpsrvXX} // Put them on the same level
tape [color=black, shape=Msquare]
tpsrvXX--tape [label="360MB/s"]
}
}
```
----
### <span style="color: crimson">REPACK v2.0</span> <span style="color: dodgerblue">pros</span> and <span style="color: sienna">cons</span>
<span><!-- .element: class="fragment" data-fragment-index="1" -->- <span style="color: dodgerblue">Repack is **just another** *medium* disk instance</span></span>
<span><!-- .element: class="fragment" data-fragment-index="2" -->- <span style="color: dodgerblue">Repack disks can be **specifically optimized**</span></span>
<span><!-- .element: class="fragment" data-fragment-index="3" -->- <span style="color: sienna">Repack is **yet another** disk instance</span></span>
<span><!-- .element: class="fragment" data-fragment-index="4" -->- <span style="color: sienna">**Hard on network**</span></span>
---
## <span style="color: crimson">REPACK</span> v3.0
<span><!-- .element: class="fragment" data-fragment-index="1" -->- <span style="color: dodgerblue">Limit the cost of additional hardware</span>: no additional servers, no additional network infrastructure</span>
<span><!-- .element: class="fragment" data-fragment-index="2" -->- <span style="color: dodgerblue">Remove network bottleneck</span>: repack cache fast and close to tapes</span>
#### <span><!-- .element: class="fragment" data-fragment-index="3" -->Use the SSDs in our tapeservers!</span>
----
### <span style="color: crimson">REPACK v3.0</span> architecture mockup
```graphviz
graph hierarchy {
nodesep=0.5 // increases the separation between nodes
node [color=Red, fontname=Courier, shape=box] //All nodes will have this shape and colour
edge [color=Blue, label="10Gb/s"] //All the lines look like this
Switch--{tpsrv01 tpsrvXX tpsrv70}
{rank=same; tpsrv01 tpsrvXX tpsrv70}
tape [color=black, shape=Msquare]
SSDs [color=black, shape=cylinder]
tpsrvXX--SSDs [label="", style=bold]
tpsrvXX--tape [label="360MB/s"]
}
```
----
### <span style="color: crimson">REPACK v3.0</span> architecture in practice
**Will it work?**
<span><!-- .element: class="fragment" data-fragment-index="1" -->- <span style="color: dodgerblue">Will this be fast enough?</span></span>
<span><!-- .element: class="fragment" data-fragment-index="2" -->- <span style="color: dodgerblue">SSD life expectancy?</span></span>
---
### Some microbenchmarks: <span style="color: orangered">*SPEED*</span>
On four independent SSDs:
| # streams | Read speed (MB/s) | Write speed (MB/s) |
| --- | --- | --- |
| 1 | 548 | 500 |
| 2 | 1040 | 1000 |
| 3 | **1580** | **1280** |
| 4 | **1580** | **1280** |
<span><!-- .element: class="fragment" data-fragment-index="2" --><span style="color: crimson">**We are hitting a bottleneck inside the machine**</span></span>
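
One way to read the table is to compare each row against perfect linear scaling from the single-stream figure. Below is a minimal Python sketch using only the numbers above; the helper names `scaling_efficiency` and `saturation_point` are made up for this illustration.

```python
# Measured aggregate throughput (MB/s) from the table above, keyed by stream count.
READ_MBPS = {1: 548, 2: 1040, 3: 1580, 4: 1580}
WRITE_MBPS = {1: 500, 2: 1000, 3: 1280, 4: 1280}


def scaling_efficiency(measured):
    """Return measured / (streams * single-stream speed) for each stream count."""
    single = measured[1]
    return {n: mbps / (n * single) for n, mbps in measured.items()}


def saturation_point(measured):
    """Return the first stream count after which adding a stream gains nothing."""
    counts = sorted(measured)
    for prev, cur in zip(counts, counts[1:]):
        if measured[cur] <= measured[prev]:
            return prev
    return None


if __name__ == "__main__":
    for name, data in (("read", READ_MBPS), ("write", WRITE_MBPS)):
        eff = scaling_efficiency(data)
        print(name, {n: round(e, 2) for n, e in eff.items()},
              "saturates at", saturation_point(data), "streams")
```

Both directions stop scaling at three streams, which is consistent with a shared limit inside the server rather than a per-SSD limit.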
----
### Some microbenchmarks: <span style="color: orangered">*SPEED*</span>
<img src="https://codimd.web.cern.ch/uploads/upload_72fc95ac7027ad2344cd850ff1c4a407.png" style="border: none;background: none;box-shadow:none" height=600>
----
### Some microbenchmarks: <span style="color: orangered">*SPEED*</span>
Unbalanced systems: <span><!-- .element: class="fragment highlight-red" -->single-CPU system design</span>.
```graphviz
graph hierarchy {
nodesep=1 // increases the separation between nodes
node [color=Red, fontname=Courier, shape=box]
edge [color=Blue, label=""]
CPUBUS
subgraph cluster_level1{
CPU1 [shape=circle]
label="NUMA node 1"
color=dodgerblue
fontcolor=dodgerblue
Memory1 [label="{<f0>Memory|<f1> 32GB}" shape=Mrecord color=black]
SATA1 [label="{<f0>SATA|<f1> 4 SSDs}" shape=Mrecord color=black]
Ethernet1 [label="{<f0>Ethernet|<f1> 2 NIC@1Gb/s\n 2 NIC@10Gb/s}" shape=Mrecord color=black]
HBA1 [label="{<f0>FC HBA|<f1> 1 drive}" shape=Mrecord color=black]
CPU1--{Memory1 SATA1 Ethernet1 HBA1} [label=""]
}
subgraph cluster_level2{
CPU2 [shape=circle]
label="NUMA node 2"
color=tomato
fontcolor=tomato
Memory2 [label="{<f0>Memory|<f1> 32GB}" shape=Mrecord color=black]
SATA2 [label="{<f0>SATA|<f1> -- }" shape=Mrecord color=black]
CPU2--{Memory2 SATA2} [label=""]
}
CPUBUS--{CPU1 CPU2} [style=bold]
}
```
----
### Some microbenchmarks: <span style="color: orangered">*SPEED*</span>
<span><!-- .element: class="fragment highlight-red" data-fragment-index="1" -->SATA is half-duplex</span> and in practice:
$$
\sum_{i \in \mathrm{SSDs}} \left(\mathrm{ReadSpeed}_{i} + \mathrm{WriteSpeed}_{i}\right) = 1.6\ \mathrm{GB/s}
$$
Speed is not optimal because of the unbalanced SATA topology of the servers, <span><!-- .element: class="fragment highlight-green" data-fragment-index="2" -->but it is good enough</span>.
<span><!-- .element: class="fragment" data-fragment-index="3" --><img src="https://i.imgur.com/6DASMR5.gif" style="border: none;background: none;box-shadow:none" height="300"></span>
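
To see why 1.6 GB/s can be good enough, compare it with what one tape drive can ask of the local SSDs. The Python sketch below assumes a worst case in which the SSDs absorb one stream at drive speed while serving another at the same speed. That traffic pattern is an assumption for illustration, not a measurement.

```python
SATA_BUDGET_MBPS = 1600   # shared read + write budget across the 4 SSDs (from the formula above)
TAPE_DRIVE_MBPS = 360     # tape drive speed from the architecture diagrams


def ssd_headroom(budget_mbps, write_mbps, read_mbps):
    """Return the unused share of the SATA budget (MB/s) for a given read + write load."""
    return budget_mbps - (write_mbps + read_mbps)


# Assumed worst case: SSDs are written at drive speed and read back at drive speed at once.
headroom = ssd_headroom(SATA_BUDGET_MBPS, TAPE_DRIVE_MBPS, TAPE_DRIVE_MBPS)
print(f"load: {2 * TAPE_DRIVE_MBPS} MB/s, headroom: {headroom} MB/s")
```

Under that assumption the SSDs carry 720 MB/s, leaving more than half of the budget unused.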
---
### Will it age well :wine_glass: or not :fish: ?
SSDs are made of cells that <span><!-- .element: class="fragment highlight-red" data-fragment-index="1" -->wear out with write cycles</span>.
Model | *Samsung MZ7LM960HCHP-00003*
--- | ---
MTBF | 2 000 000 hours *(228 years)*
TBW | 1 400 TB *(WAF=1)*
<span><!-- .element: class="fragment" data-fragment-index="2" -->Write endurance of the whole infrastructure: **392 PBW** (70 tapeservers × 4 SSDs × 1.4 PBW)</span>
<span><!-- .element: class="fragment" data-fragment-index="3" --><span style="color: green">**Good for a few repacks...**</span></span>
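
The endurance figure can be checked from the per-drive rating and the fleet size shown in the diagrams. The Python sketch below assumes every repacked byte is written exactly once to the SSDs (WAF=1) and takes the 100 PB campaign size from the to-do list; the function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365

TAPESERVERS = 70          # 7 x 10 tapeservers in the architecture diagram
SSDS_PER_SERVER = 4       # from the NUMA topology diagram
TBW_PER_SSD_TB = 1400     # vendor endurance rating at WAF=1
MTBF_HOURS = 2_000_000
REPACK_SIZE_PB = 100      # campaign size from the to-do list


def fleet_endurance_pb(servers, ssds_per_server, tbw_tb):
    """Total data (PB) the whole SSD fleet can absorb before reaching its rated TBW."""
    return servers * ssds_per_server * tbw_tb / 1000


endurance_pb = fleet_endurance_pb(TAPESERVERS, SSDS_PER_SERVER, TBW_PER_SSD_TB)
full_repacks = endurance_pb / REPACK_SIZE_PB
mtbf_years = MTBF_HOURS / HOURS_PER_YEAR

print(f"fleet endurance: {endurance_pb:.0f} PBW")
print(f"100 PB repacks within rated endurance: {full_repacks:.2f}")
print(f"MTBF: {mtbf_years:.0f} years")
```

A higher write amplification factor would shrink the number of repacks proportionally.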
---
<!-- .slide: data-transition="fade-out" -->
### <span style="color: crimson">REPACK v3.0</span> architecture
```graphviz
graph hierarchy {
nodesep=0.5 // increases the separation between nodes
node [color=Red, fontname=Courier, shape=box] //All nodes will have this shape and colour
edge [color=Blue, label="10Gb/s"] //All the lines look like this
Router [shape=circle]
Router--{Switch1 Switch2} [label="60Gb/s", fontsize=15, style=bold]
Switch1--{tpsrv01 tpsrvXX tpsrv40}
Switch2--{tpsrv41 tpsrvYY tpsrv70}
{rank=same; tpsrv01 tpsrvXX tpsrv40 tpsrv41} // Put them on the same level
tape [color=black, shape=Msquare]
SSDs [color=black, shape=cylinder]
tpsrvXX--SSDs [label="1.6GB/s"]
tpsrvXX--tape [label="360MB/s"]
}
```
----
<!-- .slide: data-transition="fade-in" -->
### <span style="color: crimson">REPACK v3.1</span> architecture
```graphviz
graph hierarchy {
nodesep=0.5 // increases the separation between nodes
node [color=Red, fontname=Courier, shape=box] //All nodes will have this shape and colour
edge [color=Blue, label="10Gb/s"] //All the lines look like this
Router [shape=circle]
Router--{Switch1 Switch2} [label="60Gb/s", fontsize=15, style=bold]
Switch1--{tpsrv01 tpsrvXX tpsrv35}
Switch2--{tpsrv36 tpsrvYY tpsrv70}
Switch1--Switch2 [color=crimson, style=bold, label="120Gb/s"]
{rank=same; Switch1 Switch2}
{rank=same; tpsrv01 tpsrvXX tpsrv35 tpsrv36} // Put them on the same level
tape [color=black, shape=Msquare]
SSDs [color=black, shape=cylinder]
tpsrvXX--SSDs [label="1.6GB/s"]
tpsrvXX--tape [label="360MB/s"]
}
```
We have 2 stackable [Brocade ICX 7750-48C switches](http://www.brocade.com/content/html/en/configuration-guide/fastiron-08030b-switchstackingguide/GUID-4117E358-1F97-4E3F-85A7-F3082C0CB904.html).
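
A rough way to compare the v3.0 and v3.1 layouts is to ask how much traffic could cross between the switches if every drive behind one switch streamed at full speed from SSDs behind the other. The Python sketch below uses the link speeds and server counts from the diagrams; the worst-case traffic pattern and the function name are assumptions for illustration.

```python
TAPE_DRIVE_MBPS = 360  # per-drive speed from the diagrams


def cross_traffic_gbps(servers_on_one_side, drive_mbps=TAPE_DRIVE_MBPS):
    """Worst-case inter-switch traffic (Gb/s): every drive on one side pulls remote data."""
    return servers_on_one_side * drive_mbps * 8 / 1000


# v3.0: 40 + 30 servers, switches only reach each other through the router (60 Gb/s uplinks).
v30_demand = cross_traffic_gbps(40)
# v3.1: 35 + 35 servers, switches joined directly by a 120 Gb/s link.
v31_demand = cross_traffic_gbps(35)

print(f"v3.0: {v30_demand:.1f} Gb/s needed vs 60 Gb/s available")
print(f"v3.1: {v31_demand:.1f} Gb/s needed vs 120 Gb/s available")
```

Under this assumption the direct link in v3.1 covers the worst case, while the router uplinks in v3.0 do not.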
---
### <span style="color: crimson">REPACK v2.0</span> monitoring
Lemon and network service traffic:
<span><!-- .element: class="fragment" data-fragment-index="1" --><img src="https://codimd.web.cern.ch/uploads/upload_7e20e767b51972d2a6cd726f99ff73ec.png" style="border: none;background: none;box-shadow:none" height="300"></span>
<span><!-- .element: class="fragment highlight-red" data-fragment-index="2" -->5-minute resolution; requires a fast `switch:port` :left_right_arrow: `tape drive` mental translation...</span>
----
<style>
@import url('https://fonts.googleapis.com/css?family=Metal+Mania');
</style>
### <span style="color: crimson">REPACK v3.x</span> monitoring
<span style="color: lightslategray; font-family: 'Metal Mania', cursive;"><!-- .element: class="fragment" data-fragment-index="1" -->This is </span><span style="font-size: 80px; color: lightslategray; font-family: 'Metal Mania', cursive; text-shadow: 4px 4px 4px crimson;"><!-- .element: class="fragment" data-fragment-index="1" -->SystemTap</span>
<span><!-- .element: class="fragment highlight-blue" data-fragment-index="2" -->Real-time kernel device driver metrics, sampled every second.</span>
Collects **all bandwidth metrics**:
- <span><!-- .element: class="fragment highlight-blue" data-fragment-index="3" -->tape drive read/write rate, IO time</span>
- <span><!-- .element: class="fragment highlight-red" data-fragment-index="4" -->SSDs read/write rates</span>
- <span><!-- .element: class="fragment highlight-red" data-fragment-index="5" -->network in/out rates</span> per process per protocol
----
### <span style="color: crimson">REPACK v3.x</span> monitoring
Relies on <span style="color: lightslategray; font-family: 'Metal Mania', cursive; text-shadow: 4px 4px 4px crimson;">SystemTap</span> instrumentation:
- <span><!-- .element: class="fragment highlight-red" data-fragment-index="1" -->kind of bad situation in SLC6</span> (wrong headers...)
- <span><!-- .element: class="fragment highlight-red" data-fragment-index="2" -->several broken metrics in CC7</span>; I reported one bug to Red Hat (<span><!-- .element: class="fragment highlight-blue" data-fragment-index="3" -->fixed in 7.4</span>)
- <span><!-- .element: class="fragment highlight-red" data-fragment-index="4" -->I need to report another minor bug...</span>
<span><!-- .element: class="fragment" data-fragment-index="5" --><span style="color: dodgerblue">**Production grade finally!!**</span></span>
----
### <span style="color: crimson">REPACK v3.x</span> monitoring
<a href="https://meter-cta.web.cern.ch/dashboard/db/perf" target="_blank"><img src="https://codimd.web.cern.ch/uploads/upload_ddc52d33a9a67e0fdfa95d2d189c0699.png" style="border: none;background: none;box-shadow:none" height=600></a>
---
## ToDos
- [x] Think about the repack architecture
- [x] Make sure it works
- [x] Draw nice graphs
- [ ] Repack 100PB
<img src="https://media.giphy.com/media/Pr3ll8LR4ZCgg/giphy.gif" style="border: none;background: none;box-shadow:none" height=100%>