# <img src="https://codimd.web.cern.ch/uploads/upload_45a14e417e9a8ade007f06e7b9420356.png" style="border: none;background: none;box-shadow:none"> initial deployments
[Julien Leduc](mailto:julien.leduc@cern.ch)
---
## Data archiving at CERN
<ul>
<li class="fragment">Ad aeternum storage</li>
<li class="fragment">7 tape libraries, 83 tape drives, 20k tapes</li>
<li class="fragment">Current use: <b style="color:dodgerblue;">330 PB</b></li>
<li class="fragment">Current capacity: <b style="color:coral;">0.7 EB</b></li>
<li class="fragment"><b style="color:red;">Exponentially growing</b></li>
</ul>
<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_95716d3602c009e301c880b0afd4225a.png" data-background-size="80%" -->
---
<h2>Data Archiving at CERN <span class="fragment"><i style="color:blue;">Evolution</i></span></h2>
<ul>
<li class="fragment">EOS + tapes...</li>
<ul>
<li class="fragment">EOS is CERN's strategic storage platform</li>
<li class="fragment">tape is the strategic long-term archive medium</li>
</ul>
<li class="fragment">EOS + tapes = <span class="fragment" style="color:red;">♥</span></li>
<ul>
<li class="fragment">Meet CTA: CERN Tape Archive</li>
<li class="fragment">Streamline data paths, software and infrastructure</li>
</ul>
</ul>
---
<h2>EOS+CTA <span class="fragment"><i style="color:blue;">Deployment</i></span></h2>
----
<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_d361eb4b4ad42029bd3d998a1600cfa0.png" data-background-size="70%" -->
----
<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_d2d164112f95cfd9fa22d4532281323e.png" data-background-size="70%" -->
----
<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_7d8fb723c75a802eb77a6e53037afe26.png" data-background-size="70%" -->
---
<h2>EOS+CTA <span class="fragment"><i style="color:blue;">Architecture</i></span></h2>
----
<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_eac32c76dde5a45191434a90d54a4d5a.png" data-background-size="70%" -->
---
<h2>EOS+CTA <span class="fragment"><i style="color:blue;">Timeline</i></span></h2>
----
<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_0ae96233cb49710754263e2d780a20b6.svg" data-background-size="100%" -->
---
<h2>EOS+CTA <span class="fragment"><i style="color:blue;">Dev&amp;Oper</i></span></h2>
<p class="fragment">
Tightly coupled software <span class="fragment">⇒ <span style="color:red;">tightly coupled developments</span></span>
</p>
<p class="fragment">
<span class="fragment highlight-blue">Extensive and systematic testing is paramount to limit regressions</span>
</p>
<p class="fragment">
<span class="fragment highlight-blue">Extensive monitoring</span> in place to <span class="fragment highlight-blue">ease debugging</span> and <span class="fragment highlight-red">target high performance from day 1</span>
</p>
----
<!-- .slide: data-background="https://codimd.web.cern.ch/uploads/upload_0e38a1afc20ff3b7ce635b01826a4b84.png" data-background-size="70%" -->
----
## <span style="color: dodgerblue">For more information</span>
Come to my CERN IT Technical Forum presentation on 8 March 2019:
[System testing service developments using Docker and Kubernetes: EOS + CTA use case](https://indico.cern.ch/e/CERN-ITTF-2019-03-08)
---
# <span style="color: dodgerblue">CTA</span> VS <span style="color: crimson">experiment data transfers</span>
----
## ATLAS stage in
Several tests were conducted with the ATLAS DDM team using Rucio and FTS.
- 2 stage-in tests of 200TB each
- ~90k files of 2.6GB each archived to tape
- sub-optimal EOS instance (2 slow disk servers)
----
## ATLAS stage in
<img src="https://codimd.web.cern.ch/uploads/upload_dfa6cf2e22f47bff0ff9f705a6fbe419.png" class="plain">
<img src="https://codimd.web.cern.ch/uploads/upload_8d18a04f89dfd4626a3c073a48f6717e.png" class="plain">
----
## ATLAS stage out
a.k.a. the *tape carousel* test, which took place in October 2018:
- 3 × EOS disk servers (~3 × 260TB of raw JBOD space)
- 6-10 × T10KD tape drives
- 90k files retrieved from EOSCTAATLASPPS (tape) to EOSATLAS by Rucio through FTS
----
## ATLAS stage out
<img src="https://codimd.web.cern.ch/uploads/upload_cdff0f357f4522aabad54db96a12de84.png" class="plain">
----
## ATLAS stage out
<img src="https://codimd.web.cern.ch/uploads/upload_f08082d31f8d0839404ca282d05d7fa7.png" class="plain">
----
## ATLAS stage out DDM
<img src="https://codimd.web.cern.ch/uploads/upload_5a6394a3c1efa419f01d3c548edbb60e.png" class="plain">
<span class="fragment"><b style="color:crimson;">500MB/s of sustained performance per 288TB of disk...</b></span>
---
## Run3 T0 archive architecture
The 4 LHC experiments will write at <span class="fragment"><b style="color:dodgerblue;">60GB/s to the archival system.</b></span>
<span class="fragment">Scaling the current `eosctaatlaspps` setup (500MB/s per 288TB, i.e. 120 units for 60GB/s) would require approximately</span> <span class="fragment">$288\,\mathrm{TB} \times 2 \times 60 = 34.5\,\mathrm{PB}$ of disk storage.</span>
<span class="fragment"><b style="color:crimson;">This means ~70PB of 2-replica disk storage!</b></span>
<span class="fragment">Even with next-generation disk servers (1PB of raw disk delivering 4GB/s), this is still </span><span class="fragment"><b style="color:crimson;">30PB of disk storage.</b></span>
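The scaling above works out as follows; a quick back-of-the-envelope sketch, where the 500MB/s per 288TB unit is the figure from the DDM test earlier:

```python
# Back-of-the-envelope sizing check for the Run3 T0 archive buffer.
# Input figure from the ATLAS DDM test: 500 MB/s sustained per 288 TB of disk.
target_rate_gbs = 60    # combined LHC write rate into the archive, GB/s
unit_rate_gbs = 0.5     # 500 MB/s per 288 TB disk-server unit
unit_size_tb = 288      # disk capacity per unit, TB

units = target_rate_gbs / unit_rate_gbs     # 120 units needed
raw_pb = units * unit_size_tb / 1000        # ~34.5 PB of raw disk
replicated_pb = 2 * raw_pb                  # ~69 PB with 2 replicas

# Next-generation servers: 1 PB of raw disk delivering 4 GB/s each
nextgen_pb = 2 * (target_rate_gbs / 4) * 1  # 30 PB with 2 replicas

print(units, raw_pb, replicated_pb, nextgen_pb)
```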
----
## Run3 T0 archive architecture <span style="color: crimson">*evolution*</span>
A small but fast cache close to the tape drives, sized to hold $x$ hours of data traffic.
Files are aggressively removed from the buffer to free up space.
<span class="fragment">From Rucio's point of view, the CERN EOSCTA endpoint is <b style="color:crimson;">tape only</b></span>.
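As an illustration, such a cache can be sized from the ingest rate; a sketch, where 60GB/s is the Run3 figure from the previous slide and $x$ is a free parameter:

```python
def cache_size_tb(hours: float, rate_gbs: float = 60.0) -> float:
    """Capacity (TB) needed to hold `hours` of ingest at `rate_gbs` GB/s."""
    return rate_gbs * 3600 * hours / 1000

# e.g. a 4-hour buffer at the full Run3 rate
print(cache_size_tb(4))  # 864.0 TB
```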
----
## ✅ to EOSCTA = ✅ on Tape
Why is it so important for an archival endpoint?
<ul>
<li class="fragment">data integrity checked during write (Logical Block Protection)</li>
<li class="fragment">long-term stable medium</li>
</ul>
<span class="fragment"><b style="color:crimson;">Data preservation on tape is a difficult enough topic.</b></span>
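With Logical Block Protection, the drive verifies a per-block CRC as data is written. A minimal illustration of the idea (using Python's `zlib.crc32` rather than the CRC32C that tape drives actually use):

```python
import zlib

def blocks_with_crc(data: bytes, block_size: int = 256 * 1024):
    """Yield (block, crc) pairs; the drive recomputes each CRC on write."""
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        yield block, zlib.crc32(block)

payload = b"detector event data " * 50000
for block, crc in blocks_with_crc(payload):
    # A mismatch here would mean the block was corrupted in flight
    assert zlib.crc32(block) == crc
print("all blocks verified")
```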
----
## Archival
```mermaid
sequenceDiagram
participant Experiment
participant FTS
participant EOS
participant EOSCTA
participant Tape
Experiment->>FTS: archive(file)
activate EOS
FTS->>EOSCTA: xrdcp EOS:file
EOS->>+EOSCTA: file
loop until timeout
FTS->>EOSCTA: file backup_bit ?
alt backup_bit=1
activate Tape
EOSCTA->>FTS: file on tape
FTS->>Experiment: file archival OK
EOSCTA->>-EOSCTA: delete file
deactivate Tape
else backup_bit=0
activate EOSCTA
EOSCTA-xFTS: file NOT on tape
FTS->>EOSCTA: delete file
deactivate EOSCTA
FTS-xExperiment: file archival FAILED
end
end
deactivate EOS
```
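The FTS polling loop in the diagram can be sketched as follows (a sketch only: `check_on_tape` is a hypothetical callable standing in for FTS's actual backup-bit query against EOSCTA):

```python
import time

def wait_for_archive(check_on_tape, timeout_s=600.0, poll_s=30.0):
    """Poll the backup bit until the file is safely on tape or we time out.

    `check_on_tape` is a hypothetical callable returning True once
    backup_bit=1; FTS performs the equivalent query against EOSCTA.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_on_tape():
            return True        # file on tape -> archival OK
        time.sleep(poll_s)
    return False               # timeout -> archival FAILED
```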
----
## Retrieval
```mermaid
sequenceDiagram
participant Experiment
participant FTS
participant EOS
participant EOSCTA
participant Tape
Experiment->>FTS: retrieve(file)
activate Tape
FTS->>EOSCTA: xrdfs prepare file
loop until timeout
FTS->>EOSCTA: file online ?
alt online_bit=1
Tape->>+EOSCTA: file
EOSCTA->>FTS: file is online
FTS->>EOS: xrdcp EOSCTA:file
EOSCTA->>+EOS: file
FTS->>Experiment: file retrieval OK
EOSCTA->>-EOSCTA: delete file
deactivate EOS
else online_bit=0
EOSCTA-xFTS: file is NOT online
FTS-xExperiment: file retrieval FAILED
end
end
deactivate Tape
```
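The retrieval flow above, sketched with the three operations injected as hypothetical callables (`request_recall` stands in for `xrdfs prepare`, `is_online` for the online-bit poll, `copy_out` for the `xrdcp` to EOS):

```python
import time

def retrieve(request_recall, is_online, copy_out, timeout_s=3600.0, poll_s=60.0):
    """Sketch of the FTS retrieval flow; all three callables are stand-ins."""
    request_recall()                    # FTS -> EOSCTA: xrdfs prepare file
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:  # loop until timeout
        if is_online():                 # online_bit = 1 ?
            copy_out()                  # xrdcp EOSCTA:file -> EOS
            return True                 # file retrieval OK
        time.sleep(poll_s)
    return False                        # file retrieval FAILED
```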
---
# <span style="color: dodgerblue">CTA</span> & <span style="color: crimson">Rucio</span>
## <span style="color:crimson">ATLAS & CMS</span>
- Working with respective Rucio teams
- PPS instances are <span style="color:blue">up and running</span>
- <span style="color:crimson">will be upgraded next week</span>
- More capacity will be moved to CTA
{"title":"2nd RUCIO Community Workshop CTA initial deployments","description":"2019 RUCIO Community Workshop presentation of CTA initial deployment","slideOptions":{"transition":"slide","theme":"white"}}