---
title: Let's repack tape faster
description: Presentation of CERN tape repack architecture
disqus: hackmd-gm
slideOptions:
  transition: slide
  theme: white
---

# Let's <img src="https://codimd.web.cern.ch/uploads/upload_ab4df4a384042cbc7e0ac758700e8ee5.png" style="border: none;background: none;box-shadow:none" height="300"> faster

## <span style="color: dodgerblue">cheap, future-proof <span style="color: crimson">REPACK</span> infrastructure</span>

[Julien Leduc](mailto:julien.leduc@cern.ch)

---

## Why <span style="color: crimson">REPACK</span>?

> “Dear tape operations, can you make some room in the tape libraries: more data is coming soon!”
>
> PS: keep everything you have, we may need to read it back, thanks!
>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The experiments

---

## How to <span style="color: crimson">REPACK</span>?

We know how to <span><!-- .element: class="fragment highlight-blue" -->move data from disks to tapes and back</span>.

Repack is <span><!-- .element: class="fragment highlight-green" -->easy</span>: just <span><!-- .element: class="fragment highlight-red" -->move the data from tapes to disks and back</span> (onto high-density tapes).


----

### <span style="color: crimson">REPACK v2.0</span> architecture

```graphviz
graph hierarchy {
  nodesep=1 // increases the separation between nodes
  node [color=Red, fontname=Courier, shape=box] // all nodes will have this shape and colour
  edge [color=Blue, label="10Gb/s"] // all the lines look like this

  Router [shape=circle]
  Router--{SwitchDisk} [label="3x40Gb/s", fontsize=15, style=bold]
  Router--{SwitchTape} [label="7x20Gb/s", fontsize=15, style=bold]

  subgraph cluster_level1{
    label="Repack disk infrastructure\n3x13 diskservers"
    color=dodgerblue
    fontcolor=dodgerblue
    SwitchDisk
    disk1 [color=black, shape=cylinder]
    diskXX [color=black, shape=cylinder]
    disk20 [color=black, shape=cylinder]
    disksrv01
    disksrvXX--{disk1 diskXX disk20} [label=""]
  }

  subgraph cluster_level2{
    label="Tape infrastructure\n7x10 tapeservers"
    color=crimson
    fontcolor=crimson
    SwitchTape
    SwitchTape--{tpsrv01 tpsrvXX}
    SwitchDisk--{disksrv01 disksrvXX }
    {rank=same; tpsrv01 tpsrvXX} // put them on the same level
    tape [color=black, shape=Msquare]
    tpsrvXX--tape [label="360MB/s"]
  }
}
```

----

### <span style="color: crimson">REPACK v2.0</span> <span style="color: dodgerblue">pros</span> and <span style="color: sienna">cons</span>

<span><!-- .element: class="fragment" data-fragment-index="1" -->- <span style="color: dodgerblue">Repack is **just another** *medium* disk instance</span></span>

<span><!-- .element: class="fragment" data-fragment-index="2" -->- <span style="color: dodgerblue">Repack disks can be **specifically optimized**</span></span>

<span><!-- .element: class="fragment" data-fragment-index="3" -->- <span style="color: sienna">Repack is **yet another** disk instance</span></span>

<span><!-- .element: class="fragment" data-fragment-index="4" -->- <span style="color: sienna">**Hard on network**</span></span>

---

## <span style="color: crimson">REPACK</span> v3.0

<span><!-- .element: class="fragment" data-fragment-index="1" -->- <span style="color: dodgerblue">Limit the cost of additional hardware</span>:
no additional servers, no additional network infrastructure</span>

<span><!-- .element: class="fragment" data-fragment-index="2" -->- <span style="color: dodgerblue">Remove network bottleneck</span>: repack cache fast and close to tapes</span>

#### <span><!-- .element: class="fragment" data-fragment-index="3" -->Use the SSDs in our tapeservers!</span>

----

### <span style="color: crimson">REPACK v3.0</span> architecture mockup

```graphviz
graph hierarchy {
  nodesep=0.5 // increases the separation between nodes
  node [color=Red, fontname=Courier, shape=box] // all nodes will have this shape and colour
  edge [color=Blue, label="10Gb/s"] // all the lines look like this
  Switch--{tpsrv01 tpsrvXX tpsrv70}
  {rank=same; tpsrv01 tpsrvXX tpsrv70}
  tape [color=black, shape=Msquare]
  SSDs [color=black, shape=cylinder]
  tpsrvXX--SSDs [label="", style=bold]
  tpsrvXX--tape [label="360MB/s"]
}
```

----

### <span style="color: crimson">REPACK v3.0</span> architecture in practice

**Will it work?**

<span><!-- .element: class="fragment" data-fragment-index="1" -->- <span style="color: dodgerblue">will this be fast enough?</span></span>

<span><!-- .element: class="fragment" data-fragment-index="2" -->- <span style="color: dodgerblue">SSD life expectancy?</span></span>

---

### Some microbenchmarks: <span style="color: orangered">*SPEED*</span>

On four independent SSDs:

| # streams | Read speed (MB/s) | Write speed (MB/s) |
| --- | --- | --- |
| 1 | 548 | 500 |
| 2 | 1040 | 1000 |
| 3 | **1580** | **1280** |
| 4 | **1580** | **1280** |

<span><!-- .element: class="fragment" data-fragment-index="2" --><span style="color: crimson">**We are hitting a bottleneck inside the machine**</span></span>

----

### Some microbenchmarks: <span style="color: orangered">*SPEED*</span>

<img src="https://codimd.web.cern.ch/uploads/upload_72fc95ac7027ad2344cd850ff1c4a407.png" style="border: none;background: none;box-shadow:none" height=600>

----

### Some microbenchmarks: <span style="color: orangered">*SPEED*</span>

Unbalanced systems: <span><!--
.element: class="fragment highlight-red" -->single-CPU system design</span>.

```graphviz
graph hierarchy {
  nodesep=1 // increases the separation between nodes
  node [color=Red, fontname=Courier, shape=box]
  edge [color=Blue, label=""]
  CPUBUS

  subgraph cluster_level1{
    CPU1 [shape=circle]
    label="NUMA node 1"
    color=dodgerblue
    fontcolor=dodgerblue
    Memory1 [label="{<f0>Memory|<f1> 32GB}" shape=Mrecord color=black]
    SATA1 [label="{<f0>SATA|<f1> 4 SSDs}" shape=Mrecord color=black]
    Ethernet1 [label="{<f0>Ethernet|<f1> 2 NIC@1Gb/s\n 2 NIC@10Gb/s}" shape=Mrecord color=black]
    HBA1 [label="{<f0>FC HBA|<f1> 1 drive}" shape=Mrecord color=black]
    CPU1--{Memory1 SATA1 Ethernet1 HBA1} [label=""]
  }

  subgraph cluster_level2{
    CPU2 [shape=circle]
    label="NUMA node 2"
    color=tomato
    fontcolor=tomato
    Memory2 [label="{<f0>Memory|<f1> 32GB}" shape=Mrecord color=black]
    SATA2 [label="{<f0>SATA|<f1> -- }" shape=Mrecord color=black]
    CPU2--{Memory2 SATA2} [label=""]
  }

  CPUBUS--{CPU1 CPU2} [style=bold]
}
```

----

### Some microbenchmarks: <span style="color: orangered">*SPEED*</span>

<span><!-- .element: class="fragment highlight-red" data-fragment-index="1" -->SATA is half-duplex</span> and in practice:

$$
\sum_{i \in SSDs}(ReadSpeed_{i} + WriteSpeed_{i}) = 1.6\ GB/s
$$

Speed is not optimal because of the unbalanced SATA topology of the server <span><!-- .element: class="fragment highlight-green" data-fragment-index="2" -->but it is good enough</span>.

<span><!-- .element: class="fragment" data-fragment-index="3" --><img src="https://i.imgur.com/6DASMR5.gif" style="border: none;background: none;box-shadow:none" height="300"></span>

---

### Will it age well :wine_glass: or not :fish: ?

SSDs are made of cells that <span><!-- .element: class="fragment highlight-red" data-fragment-index="1" -->wear out with write cycles</span>.

| Model | *Samsung MZ7LM960HCHP-00003* |
| --- | --- |
| MTBF | 2 000 000 hours *(228 years)* |
| TBW | 1 400 TB *(WAF=1)* |

<span><!-- .element: class="fragment" data-fragment-index="2" -->Infrastructure write endurance: **392 PBW** (70 tapeservers × 4 SSDs × 1 400 TB)</span>

<span><!-- .element: class="fragment" data-fragment-index="3" --><span style="color: green">**Good for a few repacks...**</span></span>

---

<!-- .slide: data-transition="fade-out" -->

### <span style="color: crimson">REPACK v3.0</span> architecture

```graphviz
graph hierarchy {
  nodesep=0.5 // increases the separation between nodes
  node [color=Red, fontname=Courier, shape=box] // all nodes will have this shape and colour
  edge [color=Blue, label="10Gb/s"] // all the lines look like this
  Router [shape=circle]
  Router--{Switch1 Switch2} [label="60Gb/s", fontsize=15, style=bold]
  Switch1--{tpsrv01 tpsrvXX tpsrv40}
  Switch2--{tpsrv41 tpsrvYY tpsrv70}
  {rank=same; tpsrv01 tpsrvXX tpsrv40 tpsrv41} // put them on the same level
  tape [color=black, shape=Msquare]
  SSDs [color=black, shape=cylinder]
  tpsrvXX--SSDs [label="1.6GB/s"]
  tpsrvXX--tape [label="360MB/s"]
}
```

----

<!-- .slide: data-transition="fade-in" -->

### <span style="color: crimson">REPACK v3.1</span> architecture

```graphviz
graph hierarchy {
  nodesep=0.5 // increases the separation between nodes
  node [color=Red, fontname=Courier, shape=box] // all nodes will have this shape and colour
  edge [color=Blue, label="10Gb/s"] // all the lines look like this
  Router [shape=circle]
  Router--{Switch1 Switch2} [label="60Gb/s", fontsize=15, style=bold]
  Switch1--{tpsrv01 tpsrvXX tpsrv35}
  Switch2--{tpsrv36 tpsrvYY tpsrv70}
  Switch1--Switch2 [color=crimson, style=bold, label="120Gb/s"]
  {rank=same; Switch1 Switch2}
  {rank=same; tpsrv01 tpsrvXX tpsrv35 tpsrv36} // put them on the same level
  tape [color=black, shape=Msquare]
  SSDs [color=black, shape=cylinder]
  tpsrvXX--SSDs [label="1.6GB/s"]
  tpsrvXX--tape [label="360MB/s"]
}
```

We have 2 stackable [Brocade 7750-48C
switches](http://www.brocade.com/content/html/en/configuration-guide/fastiron-08030b-switchstackingguide/GUID-4117E358-1F97-4E3F-85A7-F3082C0CB904.html)

---

### <span style="color: crimson">REPACK v2.0</span> monitoring

Lemon and network service traffic:

<span><!-- .element: class="fragment" data-fragment-index="1" --><img src="https://codimd.web.cern.ch/uploads/upload_7e20e767b51972d2a6cd726f99ff73ec.png" style="border: none;background: none;box-shadow:none" height="300"></span>

<span><!-- .element: class="fragment highlight-red" data-fragment-index="2" -->5-minute resolution, requires fast `switch:port` :left_right_arrow: `tape drive` mental translation...</span>

----

<style>
@import url('https://fonts.googleapis.com/css?family=Metal+Mania');
</style>

### <span style="color: crimson">REPACK v3.x</span> monitoring

<span style="color: lightslategray; font-family: 'Metal Mania', cursive;"><!-- .element: class="fragment" data-fragment-index="1" -->This is </span><span style="font-size: 80px; color: lightslategray; font-family: 'Metal Mania', cursive; text-shadow: 4px 4px 4px crimson;"><!-- .element: class="fragment" data-fragment-index="1" -->SystemTap</span>

<span><!-- .element: class="fragment highlight-blue" data-fragment-index="2" -->Real-time kernel device driver metrics, every second.</span>

Collects **all bandwidth metrics**:

- <span><!-- .element: class="fragment highlight-blue" data-fragment-index="3" -->tape drive read/write rate, IO time</span>
- <span><!-- .element: class="fragment highlight-red" data-fragment-index="4" -->SSDs read/write rates</span>
- <span><!-- .element: class="fragment highlight-red" data-fragment-index="5" -->network in/out rates</span> per process per protocol

----

### <span style="color: crimson">REPACK v3.x</span> monitoring

Relies on <span style="color: lightslategray; font-family: 'Metal Mania', cursive; text-shadow: 4px 4px 4px crimson;">SystemTap</span> instrumentation:

- <span><!-- .element: class="fragment highlight-red"
data-fragment-index="1" -->kind of bad situation in SLC6</span> (wrong headers...)
- <span><!-- .element: class="fragment highlight-red" data-fragment-index="2" -->several broken metrics in CC7</span>, I reported one bug to Red Hat (<span><!-- .element: class="fragment highlight-blue" data-fragment-index="3" -->fixed in 7.4</span>)
- <span><!-- .element: class="fragment highlight-red" data-fragment-index="4" -->I need to report another minor bug...</span>

<span><!-- .element: class="fragment" data-fragment-index="5" --><span style="color: dodgerblue">**Production grade finally!!**</span></span>

----

### <span style="color: crimson">REPACK v3.x</span> monitoring

<a href="https://meter-cta.web.cern.ch/dashboard/db/perf" target="_blank"><img src="https://codimd.web.cern.ch/uploads/upload_ddc52d33a9a67e0fdfa95d2d189c0699.png" style="border: none;background: none;box-shadow:none" height=600></a>

---

## ToDos

- [x] Think about the repack architecture
- [x] Make sure it works
- [x] Draw nice graphs
- [ ] Repack 100PB

<img src="https://media.giphy.com/media/Pr3ll8LR4ZCgg/giphy.gif" style="border: none;background: none;box-shadow:none" height=100%>
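
---

### Back-of-the-envelope check

A minimal Python sketch of the arithmetic behind the speed and aging slides, using only figures quoted in the deck. The assumptions are mine, not the author's: the fleet is 70 tapeservers with 4 SSDs each, every repacked byte is written to an SSD exactly once (WAF=1), and in the worst case a server buffers one tape stream on its SSDs while feeding another at full drive speed. The function names are hypothetical.

```python
# Figures quoted in the slides.
TAPESERVERS = 70          # tpsrv01..tpsrv70 in the architecture graphs
SSDS_PER_SERVER = 4       # "4 SSDs" on the SATA controller of NUMA node 1
TBW_PER_SSD_TB = 1400     # datasheet endurance in TB written, at WAF=1
TAPE_DRIVE_MB_S = 360     # tape drive streaming rate
SSD_BUDGET_MB_S = 1600    # measured read + write total across the SSDs


def fleet_endurance_pb() -> float:
    """Total SSD write endurance of the fleet, in PB written."""
    return TAPESERVERS * SSDS_PER_SERVER * TBW_PER_SSD_TB / 1000


def repacks_supported(repack_size_pb: float) -> float:
    """How many repacks of the given size fit in the fleet endurance.

    Assumes each repacked byte is written to an SSD exactly once.
    """
    return fleet_endurance_pb() / repack_size_pb


def ssd_headroom_mb_s() -> int:
    """SSD bandwidth left per server in the assumed worst case:
    one tape stream written to the SSDs while another is read back.
    """
    return SSD_BUDGET_MB_S - 2 * TAPE_DRIVE_MB_S


if __name__ == "__main__":
    print(f"Fleet endurance: {fleet_endurance_pb():.0f} PBW")
    print(f"100 PB repacks supported: {repacks_supported(100):.2f}")
    print(f"SSD headroom per server: {ssd_headroom_mb_s()} MB/s")
```

Under these assumptions the fleet endurance matches the 392 PBW quoted earlier, a 100 PB repack uses about a quarter of it, and the SSDs keep 880 MB/s of headroom per server.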