# Hands-~~on~~ off session: CTA Logging and Monitoring

By: Pablo Oliver Cortés, on behalf of the CTA Team

## CTA Architecture

![](https://codimd.web.cern.ch/uploads/upload_b28a94109e55ccff316aa0c720fc48ca.png)

EOSCTA
- Point of interaction of the user with CTA
- Namespace management
- File transfers

CTA Frontend
- Handles requests: archive, retrieve, ...
- Serves cta-admin command requests. At CERN we have a dedicated frontend for operations/monitoring, to avoid interfering with user requests.

Tape Servers
- Run CTA's tape daemon: file transfers, transfer reporting.
- Connected to physical tape drives.

CatalogueDB (Oracle, PostgreSQL and SQLite)
- Contains all the information of the system.
- [Link to schema](https://gitlab.cern.ch:8443/cta/cta-catalogue-schema/-/tree/main?ref_type=heads)

SchedulerDB (Ceph ObjectStore, VFS)
- Persistent job queue

## Information at our disposal

### CTA Frontend

- Log files:
    - /var/log/cta/cta-frontend.log
- Archival request example:

```
[1710356878.367037000] Mar 13 20:07:58.367037 frontend01.cern.ch cta-frontend: LVL="INFO" PID="15045" TID="15134" MSG="In WorkflowEvent::WorkflowEvent(): received event." user="eosctabackup@frontend01" eventType="CREATE" eosInstance="eosctabackup" diskFilePath="/eos/ctabackup/archive/backup/back/DUMP/voldump.u1" diskFileId="4316428878"
[1710356878.369693000] Mar 13 20:07:58.369693 frontend01.cern.ch cta-frontend: LVL="INFO" PID="15045" TID="15134" MSG="Checked request and got next archive file ID" user="eosctabackup@frontend01" instanceName="eosctabackup" username="uuuafs" usergroup="def-cg" storageClass="backup_afs" fileId="4760361441" catalogueTime="0.002525" schedulerDbTime="0.002525"
[1710356878.370003000] Mar 13 20:07:58.370003 frontend01.cern.ch cta-frontend: LVL="INFO" PID="15045" TID="15134" MSG="In WorkflowEvent::processCREATE(): assigning new archive file ID."
user="eosctabackup@frontend01" diskFileId="4316428878" diskFilePath="/eos/ctabackup/archive/backup/back/DUMP/voldump.u1" fileId="4760361441" schedulerTime="0.002714" [1710356881.722505000] Mar 13 20:08:01.722505 frontend01.cern.ch cta-frontend: LVL="INFO" PID="15045" TID="15286" MSG="In WorkflowEvent::WorkflowEvent(): received event." user="eosctabackup@frontend01" eventType="CLOSEW" eosInstance="eosctabackup" diskFilePath="/eos/ctabackup/archive/backup/back/DUMP/voldump.u1" diskFileId="4316428878" [1710356881.736463000] Mar 13 20:08:01.736463 frontend01.cern.ch cta-frontend: LVL="INFO" PID="15045" TID="15286" MSG="In OStoreDB::queueArchive(): recorded request for queueing (enqueueing posted to thread pool)." user="eosctabackup@frontend01" requesterInstance="eosctafst0212.cern.ch:fst.1106" fileId="4760361441" jobObject="ArchiveRequest-Frontend-frontend01.cern.ch-15045-20240313-15:52:38-0-73546" diskInstance="eosctabackup" diskFilePath="/eos/ctabackup/archive/backup/back/DUMP/voldump.u1" diskFileId="4316428878" agentReferencingTime="0.005399" insertionTime="0.001735" taskPostingTime="0.000085" taskQueueSize="1" totalTime="0.007219" [1710356881.738858000] Mar 13 20:08:01.738858 frontend01.cern.ch cta-frontend: LVL="INFO" PID="15045" TID="15286" MSG="Queued archive request" user="eosctabackup@frontend01" requesterInstance="eosctafst0212.cern.ch:fst.1106" instanceName="eosctabackup" storageClass="backup_afs" diskFileID="4316428878" fileSize="415424" fileId="4760361441" tapePool1="r_backup_afs_1" tapePool2="r_backup_afs_2" policyName="afs" policyArchiveMinAge="14400" policyArchivePriority="250" diskFilePath="/eos/ctabackup/archive/backup/back/DUMP/voldump.u1" diskFileOwnerUid="49089" diskFileGid="2766" archiveReportURL="eosQuery://eosc[...]cta_archive_file_id=4760361441" archiveErrorReportURL="eosQuery://eosc[...]file_id=4760361441&mgm.errmsg=" creationHost="frontend01" creationTime="1710356881" creationUser="eosctabackup" requesterName="uuuafs" requesterGroup="def-cg" 
srcURL="root://eosctaba[...].u1?eos.lfn=fxid:101477a4e" catalogueTime="0.003304" schedulerDbTime="0.007400" checksumType="ADLER32" checksumValue="3bb98e2e"
[1710356881.772018000] Mar 13 20:08:01.772018 frontend01.cern.ch cta-frontend: LVL="INFO" PID="15045" TID="16462" MSG="In OStoreDB::queueArchive_bottomHalf(): Finished enqueueing request." jobObject="ArchiveRequest-Frontend-frontend01.cern.ch-15045-20240313-15:52:38-0-73546" fileId="4760361441" diskInstance="eosctabackup" diskFilePath="/eos/ctabackup/archive/backup/back/DUMP/voldump.u1" diskFileId="4316428878" creationAndRelockTime="0.001487" totalQueueingTime="0.022536" totalCommitTime="0.002796" totalQueueUnlockTime="0.002736" ownerResetTime="0.000110" lockReleaseTime="0.001095" agentOwnershipResetTime="0.004759" totalTime="0.035519"
```

- cta-admin command example:

```text
[1710596971.379417000] Mar 16 14:49:31.379417 ctaproductionfrontend02.cern.ch cta-frontend: LVL="INFO" PID="386" TID="592" MSG="In Scheduler::authorizeAdmin(): success." user="tape-local@ctaproductionfrontend02" catalogueTime="0.001362"
[1710596972.777156000] Mar 16 14:49:32.777156 ctaproductionfrontend02.cern.ch cta-frontend: LVL="INFO" PID="386" TID="592" MSG="In OStoreDB::getMountInfoNoLock(): success." user="tape-local@ctaproductionfrontend02" rootFetchNoLockTime="0.004438" fetchMountInfoTime="1.392979"
[1710596974.677978000] Mar 16 14:49:34.677978 ctaproductionfrontend02.cern.ch cta-frontend: LVL="INFO" PID="386" TID="592" MSG="In Scheduler::getQueuesAndMountSummaries(): success." user="tape-local@ctaproductionfrontend02" catalogueVidToLogicalLibraryTime="0.104853" schedulerDbTime="1.398815" catalogueGetTapePoolTotalTime="1.509710" catalogueGetVoTotalTime="0.018451" catalogueGetTapesTotalTime="0.208910" catalogueDefaultRepackVoTime="0.001172"
[1710596974.681058000] Mar 16 14:49:34.681058 ctaproductionfrontend02.cern.ch cta-frontend: LVL="INFO" PID="386" TID="592" MSG="In RequestMessage::process(): Admin command succeeded."
user="tape-local@ctaproductionfrontend02" command="showqueues" status="success" adminTime="3.301448"
```

Most of the ERRORs you will see in the logs are benign, e.g. errors while executing cta-admin commands for a user because a tape is in a repack state, an invalid VID, etc. The important and critical ones are restarts of the frontend.

### EOSCTA

- Log files:
    - /var/log/eos/mgm/xrdlog.mgm
    - /var/log/eos/mgm/WFE.log
    - /var/log/eos/mgm/error.log
    - /var/log/eos/fst.port_number/xrdlog.fst
    - /var/log/eos/cta-fst-gcd/cta-fst-gcd.log
- Extended attributes
    - Directory attributes:

```
sys.acl="u:42703:wxp+d,u:30065:rwxp,u:98119:rwxp,z:!u!d,u:0:+u"
sys.archive.storage_class="na62dual"
sys.attr.link="/eos/ctapublic/proc/cta/workflowarchive_na62"
sys.eos.btime="1682807730.436278815"
sys.forced.checksum="adler"
sys.link.workflow.sync::abort_prepare.default="proto"
sys.link.workflow.sync::archive_failed.default="proto"
sys.link.workflow.sync::archived.default="proto"
sys.link.workflow.sync::closew.default="proto"
sys.link.workflow.sync::closew.retrieve_written="proto"
sys.link.workflow.sync::create.default="proto"
sys.link.workflow.sync::delete.default="proto"
sys.link.workflow.sync::evict_prepare.default="proto"
sys.link.workflow.sync::prepare.default="proto"
sys.link.workflow.sync::retrieve_failed.default="proto"
```

Apart from ACLs, it is important to keep in mind that EOS also respects UNIX permissions. This means that the owner of a file is free to do whatever they want with it, even if the ACL explicitly disallows it.

Example: some months ago a user contacted us because he was not able to list the contents of a directory, although he could list the contents of the parent and child directories. The ACLs were correctly configured and identical in all directories. The problem was that the credentials he was trying to use had expired and the client tool was falling back to UNIX authentication; at CERN, EOSCTA instances map every UNIX authentication attempt to the `nobody` user.
This is how the UNIX permissions were configured for the directories:

```
d1::t0 drwxr-xr-+ 1 ams xv 9125623336431965 Jun 15 2015 Data
d1::t0 dr-xr-x--+ 1 ams va 450671241502720 Oct 22 2019 FRAMES
```

The directory the user could list through UNIX had the read flag enabled for `others`, allowing the `nobody` user account to list that directory.

- File attributes:

```
sys.archive.error=""
sys.archive.file_id="4761655599"
sys.archive.storage_class="na62"
sys.cta.archive.objectstore.id=""
sys.eos.btime="1710712602.393969824"
sys.fs.tracking="+70/70"
sys.utrace="3d4a11b6-e4a9-11ee-8762-40a6b71da894"
sys.vtrace="[Sun Mar 17 22:56:42 2024] uid:42703[na62cdr] gid:1338[vl] tident:fts3.12342:691@fts-daq-004 name:na62cdr dn: prot:gsi app: host:fts-daq-004.cern.ch domain:cern.ch geo: sudo:0"
```

- mgm logs

EOS also generates extra log files per type of log, e.g. WFE-only logs or error-only logs, among many others.

```
240317 22:31:52 time=1710711112.289318 func=IdMap level=INFO logid=static..............................
unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f1f883fe700 source=Mapping:994 tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=gsi sec.name="na62cdr" sec.host="fts-daq-005.cern.ch" sec.vorg="" sec.grps="" sec.role="" sec.info="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=na62cdr/CN=719989/CN=Robot: NA62 CDR" sec.app="" sec.tident="fts3.1833208:646@fts-daq-005" vid.uid=42703 vid.gid=1338 sudo=0 gateway=0 240317 22:31:52 time=1710711112.289392 func=chksum level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f1f883fe700 source=Chksum:98 tident=<single-exec> sec=gsi uid=42703 gid=1338 name=na62cdr geo="" path=/eos/ctapublic/archive/na62/offline/data/2022A/v3.6.5/filter/na62_2022.011930.HADROP_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011930.v3.6.5-01_f.v3.6.5-01.011930-1481-1483_HADROP.root.1 240317 22:31:52 time=1710711112.289485 func=chksum level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f1f883fe700 source=Chksum:189 tident=<single-exec> sec=gsi uid=42703 gid=1338 name=na62cdr geo="" checksum="!adler32 463df811" 240317 22:31:52 time=1710711112.290481 func=IdMap level=INFO logid=static.............................. unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f20a2be5700 source=Mapping:994 tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=gsi sec.name="na62cdr" sec.host="fts-daq-005.cern.ch" sec.vorg="" sec.grps="" sec.role="" sec.info="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=na62cdr/CN=719989/CN=Robot: NA62 CDR" sec.app="" sec.tident="fts3.1833208:629@fts-daq-005" vid.uid=42703 vid.gid=1338 sudo=0 gateway=0 240317 22:31:52 time=1710711112.291509 func=SendProtoWFRequest level=INFO logid=static.............................. 
unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f2099d9e700 source=WFE:2778 tident= sec=(null) uid=99 gid=99 name=- geo="" protoWFEndPoint="ctaproductionfrontend.cern.ch:10955" protoWFResource="/ctafrontend" fullPath="/eos/ctapublic/archive/na62/offline/data/2022A/v3.6.5/filter/na62_2022.011930.HADROP_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011930.v3.6.5-01_f.v3.6.5-01.011930-0638-0640_HADROP.root.1" event="sync::create" timeSpentMs=3 msg="Sent SSI protocol buffer request" 240317 22:31:52 time=1710711112.291858 func=open level=INFO logid=c51dcd98-e4a5-11ee-ac1c-40a6b71da894 unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f2099d9e700 source=XrdMgmOfsFile:3136 tident=fts3.19086:598@fts-daq-004 sec=gsi uid=42703 gid=1338 name=na62cdr geo="" msg="workflow trigger returned" retc=0 errno=0 240317 22:31:52 time=1710711112.292067 func=open level=INFO logid=c51dcd98-e4a5-11ee-ac1c-40a6b71da894 unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f2099d9e700 source=XrdMgmOfsFile:3201 tident=fts3.19086:598@fts-daq-004 sec=gsi uid=42703 gid=1338 name=na62cdr geo="" op=write path=/eos/ctapublic/archive/na62/offline/data/2022A/v3.6.5/filter/na62_2022.011930.HADROP_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011930.v3.6.5-01_f.v3.6.5-01.011930-0638-0640_HADROP.root.1 info=oss.asize=388897166&tpc.dlg=eospublic.cern.ch:1094&tpc.dlgon=0&tpc.key=0b1dd63b418b4a8e65f76148&tpc.lfn=//eos/experiment/na62/data/offline/data/2022A/v3.6.5/filter/na62_2022.011930.HADROP_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011930.v3.6.5-01_f.v3.6.5-01.011930-0638-0640_HADROP.root.1&tpc.spr=root&tpc.src=p06636710m35016.cern.ch:1095&tpc.stage=copy&tpc.str=1&tpc.tpr=root&xrd.gsiusrpxy=/tmp/x509up_h4371142295655456483_70a057f967b0264d 
redirection=eosctafst0122.cern.ch?&cap.sym=<...>&cap.msg=<...>&mgm.logid=c51dcd98-e4a5-11ee-ac1c-40a6b71da894&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=10474aa7b&mgm.event=sync::closew&mgm.workflow=default&mgm.instance=eosctapublic&mgm.owner_uid=42703&mgm.owner_gid=1338&mgm.requestor=na62cdr&mgm.requestorgroup=vl&mgm.attributes=c3lzLmFyY2hpdmUuZmlsZV9pZD00NzYxNjUwNTIyOzs7c3lzLmFyY2hpdmUuc3RvcmFnZV9jbGFzcz1uYTYyOzs7c3lzLmVvcy5idGltZT0xNzEwNzExMTEyLjI4NzIxOTkyOQ== xrd_port=1116 http_port=8116
```

User mapping example:

```
240317 23:28:41 time=1710714521.582776 func=open level=INFO logid=b535cc66-e4ad-11ee-a9c8-40a6b71da894 unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f19223c5700 source=XrdMgmOfsFile:3201 tident=fts3.2657:636@fts-daq-002 sec=gsi uid=42703 gid=1338 name=na62cdr geo="" op=write path=/eos/ctapublic/archive/na62/offline/data/2022A/v3.6.5/filter/na62_2022.011932.1TPNN_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011932.v3.6.5-01_f.v3.6.5-01.011932-1054-1056_1TPNN.root.1 info=oss.asize=1067786209&tpc.dlg=eospublic.cern.ch:1094&tpc.dlgon=0&tpc.key=115121b90ef50a6165f76e99&tpc.lfn=//eos/experiment/na62/data/offline/data/2022A/v3.6.5/filter/na62_2022.011932.1TPNN_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011932.v3.6.5-01_f.v3.6.5-01.011932-1054-1056_1TPNN.root.1&tpc.spr=root&tpc.src=p06636710u43698.cern.ch:1095&tpc.stage=copy&tpc.str=1&tpc.tpr=root&xrd.gsiusrpxy=/tmp/x509up_h4371142295655456483_70a057f967b0264d redirection=eosctafst0204.cern.ch?&cap.sym=<...>&cap.msg=<...>&mgm.logid=b535cc66-e4ad-11ee-a9c8-40a6b71da894&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=10474d0f7&mgm.event=sync::closew&mgm.workflow=default&mgm.instance=eosctapublic&mgm.owner_uid=42703&mgm.owner_gid=1338&mgm.requestor=na62cdr&mgm.requestorgroup=vl&mgm.attributes=c3lzLmFyY2hpdmUuZmlsZV9pZD00NzYxNjYxNzA2Ozs7c3lzLmFyY2hpdmUuc3RvcmFnZV9jbGFzcz1uYTYyOzs7c3lzLmVvcy5idGltZT0xNzEwNzE0NTIxLjU3NTA1NDg3OA== xrd_port=1112 http_port=8112
240317 23:28:41 time=1710714521.582874 func=open level=INFO
logid=b535cc66-e4ad-11ee-a9c8-40a6b71da894 unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f19223c5700 source=XrdMgmOfsFile:3228 tident=fts3.2657:636@fts-daq-002 sec=gsi uid=42703 gid=1338 name=na62cdr geo="" path=/eos/ctapublic/archive/na62/offline/data/2022A/v3.6.5/filter/na62_2022.011932.1TPNN_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011932.v3.6.5-01_f.v3.6.5-01.011932-1054-1056_1TPNN.root.1 open:rt=8.68 io:bw=inf io:sched=0 io:type=buffered io:prio=default io:redirect=eosctafst0204.cern.ch:1112 duration=8.677ms timing=IdMap=0.110ms fid::fetch=0.132ms fid::fetched=0.001ms authorize=0.139ms authorized=0.001ms path-computed=0.074ms container::fetched=0.205ms write::begin=0.109ms write::end=0.448ms Policy::begin=0.004ms Policy::end=0.104ms Scheduler::FilePlacement=0.177ms Scheduler::FilePlaced=0.045ms end=7.128ms Open=8.677ms 240317 23:28:41 157742 secgsi_XrdOucGMap::dn2user: mapping DN '/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=na62cdr/CN=719989/CN=Robot: NA62 CDR' to 'na62cdr' 240317 23:28:41 157742 cryptossl_X509::CertType: certificate has 3 extensions 240317 23:28:41 157742 XrootdXeq: fts3.3274:687@fts-daq-006 pub IP64 login as na62cdr 240317 23:28:41 time=1710714521.612183 func=IdMap level=INFO logid=static.............................. 
unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f19026c7700 source=Mapping:994 tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=gsi sec.name="na62cdr" sec.host="fts-daq-006.cern.ch" sec.vorg="" sec.grps="" sec.role="" sec.info="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=na62cdr/CN=719989/CN=Robot: NA62 CDR" sec.app="" sec.tident="fts3.3274:687@fts-daq-006" vid.uid=42703 vid.gid=1338 sudo=0 gateway=0 240317 23:28:41 time=1710714521.612277 func=open level=INFO logid=b53b9952-e4ad-11ee-b089-40a6b71da894 unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f19026c7700 source=XrdMgmOfsFile:549 tident=fts3.3274:687@fts-daq-006 sec=gsi uid=42703 gid=1338 name=na62cdr geo="" op=write trunc=0 path=/eos/ctapublic/archive/na62/offline/data/2022A/v3.6.5/filter/na62_2022.011932.1TPNN_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011932.v3.6.5-01_f.v3.6.5-01.011932-1060-1062_1TPNN.root.1 info=oss.asize=1058863357&tpc.dlg=eospublic.cern.ch:1094&tpc.dlgon=0&tpc.key=18dc84e812040cca65f76e99&tpc.lfn=//eos/experiment/na62/data/offline/data/2022A/v3.6.5/filter/na62_2022.011932.1TPNN_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011932.v3.6.5-01_f.v3.6.5-01.011932-1060-1062_1TPNN.root.1&tpc.spr=root&tpc.src=st-096-hh1519b1.cern.ch:1095&tpc.stage=copy&tpc.str=1&tpc.tpr=root&xrd.gsiusrpxy=/tmp/x509up_h4371142295655456483_70a057f967b0264d 240317 23:28:41 time=1710714521.612534 func=open level=INFO logid=b53b9952-e4ad-11ee-b089-40a6b71da894 unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f19026c7700 source=XrdMgmOfsFile:698 tident=fts3.3274:687@fts-daq-006 sec=gsi uid=42703 gid=1338 name=na62cdr geo="" msg="rewrote symlinks" sym-path=/eos/ctapublic/archive/na62/offline/data/2022A/v3.6.5/filter/na62_2022.011932.1TPNN_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011932.v3.6.5-01_f.v3.6.5-01.011932-1060-1062_1TPNN.root.1 realpath=/eos/ctapublic/archive/na62/offline/data/2022A/v3.6.5/filter/na62_2022.011932.1TPNN_p.v3.6.5-01_f.v3.6.5-01/na62_2022.011932.v3.6.5-01_f.v3.6.5-01.011932-1060-1062_1TPNN.root.1 240317 23:28:41 
time=1710714521.613043 func=open level=INFO logid=b53b9952-e4ad-11ee-b089-40a6b71da894 unit=mgm@eosctafst0203.cern.ch:1094 tid=00007f19026c7700 source=XrdMgmOfsFile:1095 tident=fts3.3274:687@fts-daq-006 sec=gsi uid=42703 gid=1338 name=na62cdr geo="" acl=1 r=0 w=1 wo=0 egroup=0 shared=0 mutable=1 facl=0
```

- fst logs

```
240315 22:40:33 time=1710711633.656232 func=ProcessCapOpaque level=INFO logid=fbe0294c-e4a6-11ee-a420-40a6b71da894 unit=fst@eosctafst0203.cern.ch:1110 tid=00007f7813319700 source=XrdFstOfsFile:2750 tident=cta.13006:63@tpsrv432 sec=(null) uid=99 gid=99 name=(null) geo="" capability=&tapeenabled=1&mgm.access=create&mgm.ruid=0&mgm.rgid=0&mgm.uid=99&mgm.gid=99&mgm.path=/eos/ctapublic/archive/na62/data/raw/2017/007956/na62raw_1502229302-02-007956-0370.dat&mgm.manager=eosctafst0203.cern.ch:1094&mgm.fid=616e4e16&mgm.cid=1634596379&mgm.sec=sss|daemon|tpsrv432.cern.ch||daemon|||cta-taped&mgm.lid=1048594&mgm.bookingsize=3183405396&mgm.targetsize=3183405396&mgm.fsid=39&mgm.url0=root://eosctafst0203.cern.ch:1110//&mgm.fsid0=39&cap.valid=1710715233
240315 22:40:33 time=1710711633.656268 func=open level=INFO logid=fbe0294c-e4a6-11ee-a420-40a6b71da894 unit=fst@eosctafst0203.cern.ch:1110 tid=00007f7813319700 source=XrdFstOfsFile:256 tident=cta.13006:63@tpsrv432 sec=(null) uid=0 gid=0 name=nobody geo="" ns_path=/eos/ctapublic/archive/na62/data/raw/2017/007956/na62raw_1502229302-02-007956-0370.dat fst_path=/data1110/00027e85/616e4e16
240315 22:40:33 time=1710711633.656340 func=open level=INFO logid=fbe0294c-e4a6-11ee-a420-40a6b71da894 unit=fst@eosctafst0203.cern.ch:1110 tid=00007f7813319700 source=XrdFstOfsFile:515 tident=cta.13006:63@tpsrv432 sec=(null) uid=0 gid=0 name=nobody geo="" fst_path=/data1110/00027e85/616e4e16 open-mode=300 create-mode=4180 layout-name=replica oss-opaque=&mgm.lid=1048594&mgm.bookingsize=3183405396
240315 22:40:33 time=1710711633.656365 func=fileOpen level=INFO logid=fbe0859a-e4a6-11ee-85a8-40a6b71da894 unit=fst@eosctafst0203.cern.ch:1110
tid=00007f7813319700 source=LocalIo:70 tident=<service> sec= uid=0 gid=0 name= geo="" flags=300 path=/data1110/00027e85/616e4e16 240315 22:40:33 time=1710711633.656634 func=open level=INFO logid=fbe0294c-e4a6-11ee-a420-40a6b71da894 unit=fst@eosctafst0203.cern.ch:1110 tid=00007f7813319700 source=XrdFstOfsFile:783 tident=cta.13006:63@tpsrv432 sec=(null) uid=0 gid=0 name=nobody geo="" open-duration=0.580ms path='/eos/ctapublic/archive/na62/data/raw/2017/007956/na62raw_1502229302-02-007956-0370.dat' fxid=616e4e16 path::print=0.227ms creation::barrier=0.056ms layout::exists=0.004ms clone::fst=0.009ms layout::open=0.013ms layout::opened=0.153ms get::localfmd=0.000ms resync::localfmd=0.060ms layout::stat=0.004ms full::mutex=0.000ms layout::fallocate=0.001ms layout::fallocated=0.023ms fileio::object=0.024ms open::accounting=0.004ms end=0.002ms open=0.580ms 240315 22:40:33 time=1710711633.656649 func=stat level=INFO logid=fbe0294c-e4a6-11ee-a420-40a6b71da894 unit=fst@eosctafst0203.cern.ch:1110 tid=00007f7813319700 source=XrdFstOfsFile:1308 tident=cta.13006:63@tpsrv432 sec= uid=0 gid=0 name=nobody geo="" path=/eos/ctapublic/archive/na62/data/raw/2017/007956/na62raw_1502229302-02-007956-0370.dat fxid=616e4e16 size=0 mtime=1710711633.119064512 240315 22:40:35 time=1710711635.449610 func=_close level=INFO logid=f67e7904-e4a6-11ee-a1d9-40a6b71da894 unit=fst@eosctafst0203.cern.ch:1110 tid=00007f7827bbd700 source=XrdFstOfsFile:1548 tident=cta.8123:43@tpsrv313.ipv6 sec= uid=0 gid=0 name=nobody geo="" viaDelete=0 writeDelete=0 240315 22:40:35 time=1710711635.449725 func=VerifyChecksum level=INFO logid=f67e7904-e4a6-11ee-a1d9-40a6b71da894 unit=fst@eosctafst0203.cern.ch:1110 tid=00007f7827bbd700 source=XrdFstOfsFile:3623 tident=cta.8123:43@tpsrv313.ipv6 sec= uid=0 gid=0 name=nobody geo="" (write) checksum type: adler checksum hex: 5da98059 requested-checksum hex: -none- 240315 22:40:35 time=1710711635.451155 func=_close level=INFO logid=static.............................. 
unit=fst@eosctafst0203.cern.ch:1110 tid=00007f7827bbd700 source=XrdFstOfsFile:2036 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="log=f67e7904-e4a6-11ee-a1d9-40a6b71da894&path=/eos/ctapublic/archive/na62/data/raw/2017/007968/na62raw_1502373680-02-007968-0553.dat&fstpath=/data1110/00027eaa/6173d75e&ruid=0&rgid=0&td=cta.8123:43@tpsrv313.ipv6&host=eosctafst0203.cern.ch&lid=1048594&fid=1634981726&fsid=39&ots=1710711624&otms=628&cts=1710711635&ctms=451&nrc=0&nwc=1764&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=3082171748&wb_min=406884&wb_max=2097152&wb_sigma=495053.33&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=27.38&iot=10822.914&idt=7860.055&lrt=0.000&lrvt=0.000&lwt=2960.728&ot=0.605&ct=1.526&rt=0.00&rvt=0.00&wt=1921.70&osize=0&csize=3082171748&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=sss&sec.name=daemon&sec.host=tpsrv313.ipv6.cern.ch&sec.vorg=&sec.grps=daemon&sec.role=&sec.info=&sec.app=cta-taped"
240315 22:40:35 time=1710711635.451184 func=_close level=INFO logid=f67e7904-e4a6-11ee-a1d9-40a6b71da894 unit=fst@eosctafst0203.cern.ch:1110 tid=00007f7827bbd700 source=XrdFstOfsFile:2254 tident=cta.8123:43@tpsrv313.ipv6 sec= uid=0 gid=0 name=nobody geo="" msg="notify" event="sync::closew" workflow="retrieve_written"
240315 22:40:35 time=1710711635.451801 func=_close level=INFO logid=f67e7904-e4a6-11ee-a1d9-40a6b71da894 unit=fst@eosctafst0203.cern.ch:1110 tid=00007f7827bbd700 source=XrdFstOfsFile:2267 tident=cta.8123:43@tpsrv313.ipv6 sec= uid=0 gid=0 name=nobody geo="" msg="done close" rc=0 errc=0
```

- cta-fst-gcd

An independent, configurable script that tries to keep the EOSCTA buffer space free, based on file age and the free space left on the instance.
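The eviction policy can be sketched as follows. This is a simplified illustration of the decision logic, not the actual cta-fst-gcd code; the threshold names and default values are assumptions, while the flag names mirror the `gc_age_reached`, `absolute_max_age_reached` and `should_free_space` fields visible in the garbage collector's log entries:

```python
# Simplified sketch of the cta-fst-gcd eviction decision (illustrative only):
# a disk replica in the EOSCTA buffer is garbage-collected when it is old
# enough AND either the instance is running low on free space or the replica
# has exceeded an absolute maximum age. Threshold values are assumptions.

def should_evict(file_age_s: float,
                 free_bytes: int,
                 gc_age_s: float = 7200,              # minimum age before eviction (assumed)
                 absolute_max_age_s: float = 604800,  # evict regardless of space (assumed)
                 min_free_bytes: int = 10 * 2**40) -> bool:
    gc_age_reached = file_age_s >= gc_age_s
    absolute_max_age_reached = file_age_s >= absolute_max_age_s
    should_free_space = free_bytes < min_free_bytes
    return gc_age_reached and (should_free_space or absolute_max_age_reached)

# Mirrors the example log entries: absolute_max_age_reached=True,
# should_free_space=False, gc_age_reached=True -> replica is evicted.
print(should_evict(file_age_s=30 * 86400, free_bytes=50 * 2**40))  # True
```

This matches the `stagerrm` log lines below, where files are evicted because `absolute_max_age_reached=True` even though `should_free_space=False`.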
```
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Number of local file systems is 15"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 0: path=/data1101 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 1: path=/data1102 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 2: path=/data1103 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 3: path=/data1104 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 4: path=/data1105 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 5: path=/data1106 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 6: path=/data1107 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 7: path=/data1108 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 8: path=/data1109 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 9: path=/data1110 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 10: path=/data1111 eos_space=retrieve"
2024/03/15 22:24:06.877000
eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 11: path=/data1112 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 12: path=/data1113 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 13: path=/data1114 eos_space=retrieve"
2024/03/15 22:24:06.877000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="Local file system 14: path=/data1115 eos_space=retrieve"
2024/03/15 22:24:07.264000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="stagerrm: sub_dir=/data1103/00027d86, fxid=6147666e, bytes_required_before=0, file_size_bytes=2483327168, absolute_max_age_reached=True, should_free_space=False, gc_age_reached=True"
2024/03/15 22:24:08.067000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="stagerrm: sub_dir=/data1107/00069db7, fxid=10270c5c8, bytes_required_before=0, file_size_bytes=1824522240, absolute_max_age_reached=True, should_free_space=False, gc_age_reached=True"
2024/03/15 22:24:08.566000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="stagerrm: sub_dir=/data1110/00027d87, fxid=614778f9, bytes_required_before=0, file_size_bytes=2629750152, absolute_max_age_reached=True, should_free_space=False, gc_age_reached=True"
2024/03/15 22:24:08.994000 eosctafst0203.cern.ch INFO cta-fst-gcd:LVL="INFO" PID="268741" TID="268741" MSG="stagerrm: sub_dir=/data1112/00027d87, fxid=6147870b, bytes_required_before=0, file_size_bytes=2717230524, absolute_max_age_reached=True, should_free_space=False, gc_age_reached=True"
```

### Tape server

- Log files:
    - /var/log/cta/cta-taped.log
    - /var/log/cta/cta-rmcd.log
- RMCD logs

```
03/17 22:26:25 10800 rmc_srv_unmount: returns 0
03/17 22:26:56 10800 rmc_srv_mount: RMC92 -
mount request by 1000,33 from localhost
03/17 22:26:56 10800 rmc_srv_mount: RMC98 - mount L96171/0 on drive 6
03/17 22:27:18 10800 rmc_srv_mount: returns 0
03/17 22:50:22 10800 rmc_srv_unmount: RMC92 - unmount request by 1000,33 from localhost
03/17 22:50:22 10800 rmc_srv_unmount: RMC98 - unmount L96171 6 0
03/17 22:50:24 10800 rmc_srv_unmount: returns 0
03/17 23:00:14 10800 rmc_srv_mount: RMC92 - mount request by 1000,33 from localhost
03/17 23:00:14 10800 rmc_srv_mount: RMC98 - mount L88559/0 on drive 6
03/17 23:00:37 10800 rmc_srv_mount: returns 0
```

- Taped logs

```
[1706590865.645965000] Jan 30 06:01:05.645965 tpsrv455.cern.ch cta-taped: LVL="INFO" PID="64076" TID="5056" MSG="Opened disk file for writing" thread="DiskWrite" tapeDrive="I4600213" tapeVid="I40083" mountId="1149771" vo="REPACK" tapePool="r_public_delete" threadCount="10" threadID="9" fileId="1305936161" dstURL="root://eosctarepack//eos/ctarepack/production/I40083/000339824?oss.asize=2426880" fSeq="339824" actualURL="root://eosctafst0228.cern.ch:1103//eos/ctarepack/production/I40083/000339824?cap.msg=Zpo8+SjZLPsd8WV8s8wBDNDVbuGOfTiH9gEh2zezG1ZPG/+Uj+DTWF0HQS6Vfr0tI99+tQ/xjpL7SlSgGahjlD2PY89hgZZ3Wf9ZezlyWGjjhQBTrYgcr323ciP1lT4+OTX65vFDXXcpS2RTXPjBAz2/VfoFU9nlhN0iCSiFIps3LE9lLtsek3jPI8vEFeaAKnJXSEWU8d0gXkvIYlG62piMUY3YcJGK1TogrfZlTYTlCMA4BQxhfbFXwBMTScOwbiBYfa2bhwHxVFz85CsDjv4mVn5tFjVsgq6Rq3oT/W/hgi4BvBbtHfnkANasHSIlgMBNRQfYNk25rMYWLpiOPDsfJWfVAIWfQTAcpekvImBx6XdJ7Gg5oPqvTZE7E72NeNHSjA8xjH4hO1PNP6a0iaQqdKVilH0WwxpWrJQN7tPhTMCF+4HOHaoiBge3b3i45lmlRC/yZUKOnTV1JrBX2y+1TGWTH6sH&cap.sym=Au5Et6hdMMQNV7co47hUYoqBn7o=&eos.clientinfo=zbase64:MDAwMDAwNjl4nBXIUQqAIBBF0a20ArH6KIRZTI1vUAiVcSLaffl37q0NJaiRd9s+5RrOh3KRoc4JkfygvQ103iJQxDGa5koRctyXjf53VrARamc7pJtflt0xtDhOYZ79+gGnBiSx&mgm.id=10682c460&mgm.logid=92bdbb7c-bf2c-11ee-853a-40a6b71daa4c&mgm.replicahead=0&mgm.replicaindex=0&oss.asize=2426880&xrdcl.requuid=7b04a664-acc8-4297-bf5f-005d56349c0f"
[1706590865.653473000] Jan 30 06:01:05.653473 tpsrv455.cern.ch
cta-taped: LVL="INFO" PID="64076" TID="5056" MSG="File successfully transfered to disk" thread="DiskWrite" tapeDrive="I4600213" tapeVid="I40083" mountId="1149771" vo="REPACK" tapePool="r_public_delete" threadCount="10" threadID="9" fileId="1305936161" dstURL="root://eosctarepack//eos/ctarepack/production/I40083/000339824?oss.asize=2426880" fSeq="339824" actualURL="root://eosctafst0228.cern.ch:1103//eos/ctarepack/production/I40083/000339824?cap.msg=Zpo8+SjZLPsd8WV8s8wBDNDVbuGOfTiH9gEh2zezG1ZPG/+Uj+DTWF0HQS6Vfr0tI99+tQ/xjpL7SlSgGahjlD2PY89hgZZ3Wf9ZezlyWGjjhQBTrYgcr323ciP1lT4+OTX65vFDXXcpS2RTXPjBAz2/VfoFU9nlhN0iCSiFIps3LE9lLtsek3jPI8vEFeaAKnJXSEWU8d0gXkvIYlG62piMUY3YcJGK1TogrfZlTYTlCMA4BQxhfbFXwBMTScOwbiBYfa2bhwHxVFz85CsDjv4mVn5tFjVsgq6Rq3oT/W/hgi4BvBbtHfnkANasHSIlgMBNRQfYNk25rMYWLpiOPDsfJWfVAIWfQTAcpekvImBx6XdJ7Gg5oPqvTZE7E72NeNHSjA8xjH4hO1PNP6a0iaQqdKVilH0WwxpWrJQN7tPhTMCF+4HOHaoiBge3b3i45lmlRC/yZUKOnTV1JrBX2y+1TGWTH6sH&cap.sym=Au5Et6hdMMQNV7co47hUYoqBn7o=&eos.clientinfo=zbase64:MDAwMDAwNjl4nBXIUQqAIBBF0a20ArH6KIRZTI1vUAiVcSLaffl37q0NJaiRd9s+5RrOh3KRoc4JkfygvQ103iJQxDGa5koRctyXjf53VrARamc7pJtflt0xtDhOYZ79+gGnBiSx&mgm.id=10682c460&mgm.logid=92bdbb7c-bf2c-11ee-853a-40a6b71daa4c&mgm.replicahead=0&mgm.replicaindex=0&oss.asize=2426880&xrdcl.requuid=7b04a664-acc8-4297-bf5f-005d56349c0f" readWriteTime="0.004860" checksumingTime="0.000811" waitDataTime="0.000009" waitReportingTime="0.000014" checkingErrorTime="0.000001" openingTime="0.002454" closingTime="0.001670" transferTime="0.009809" totalTime="0.009819" dataVolume="2426880" globalPayloadTransferSpeedMBps="247.161625" diskPerformanceMBps="247.413600" openRWCloseToTransferTimeRatio="0.915894" [1706590865.657914000] Jan 30 06:01:05.657914 tpsrv455.cern.ch cta-taped: LVL="INFO" PID="64076" TID="5056" MSG="Opened disk file for writing" thread="DiskWrite" tapeDrive="I4600213" tapeVid="I40083" mountId="1149771" vo="REPACK" tapePool="r_public_delete" threadCount="10" threadID="9" fileId="1498901148" 
dstURL="root://eosctarepack//eos/ctarepack/production/I40083/000339825?oss.asize=47998082" fSeq="339825" actualURL="root://eosctafst0014.cern.ch:1109//eos/ctarepack/production/I40083/000339825?cap.msg=Zpo8+SjZLPsd8WV8s8wBDNDVbuGOfTiH9gEh2zezG1ZPG/+Uj+DTWF0HQS6Vfr0tI99+tQ/xjpL7SlSgGahjlD2PY89hgZZ3Wf9ZezlyWGjjhQBTrYgcr323ciP1lT4+OTX65vFDXXcpS2RTXPjBAz2/VfoFU9nl+ExlHWtU0HIwWH/XHphysQwLJ8GrJaoKRRxVCh97GltIhgpNKc1vJDJTWyZorYE1HlJFBIxEdHqDzb7tDu/QlEnvrU6QEnwRPYafLc/NpkhzNwX0C1mrV+Vuh2iQfk8VgHeURH8vbk2mWbfA/WQxGvbKoymqNkzpsdsomoOBrZaQtWjMrFrGEh6cpigj+gZqN4dm039PpiRmElczFTJI5N1Bvf3NFWnbI0VMQxt+vSq55IBbf1ppvC1OGp1+R3bUoaky3PJDc0S0jxdpzj8ROuh69hWRl7A400R2ic0/B5t82NbW/w3Ujppk5LafpMlA&cap.sym=Au5Et6hdMMQNV7co47hUYoqBn7o=&eos.clientinfo=zbase64:MDAwMDAwNjl4nBXIUQqEMAxF0a24gpLCgFjIYjR9wcLQljQi7l77d+5tHTWZM4V1W0pLx82l6tSQE5lp0p8OPi5VGPIc3UrjDN2vv8/+djGIM9oQ33U4UfwFgdUgZ4qRthenAySx&mgm.id=10682c461&mgm.logid=92bf41fe-bf2c-11ee-853a-40a6b71daa4c&mgm.replicahead=0&mgm.replicaindex=0&oss.asize=47998082&xrdcl.requuid=bd9b63ae-f76d-4859-abd9-312e0187adcc" [1706590865.678396000] Jan 30 06:01:05.678396 tpsrv455.cern.ch cta-taped: LVL="INFO" PID="64076" TID="5045" MSG="File successfully read from tape" thread="TapeRead" tapeDrive="I4600213" tapeVid="I40083" mountId="1149771" vo="REPACK" tapePool="r_public_delete" mediaType="3592JC7T" logicalLibrary="IBM460" mountType="Retrieve" labelFormat="0000" vendor="IBM" capacityInBytes="7000000000000" fileId="634947377" BlockId="30770464" fSeq="340291" dstURL="root://eosctarepack//eos/ctarepack/production/I40083/000340291?oss.asize=50040595" isRepack="1" isVerifyOnly="0" positionTime="0.002494" readWriteTime="0.151360" waitFreeMemoryTime="0.000011" waitReportingTime="0.000058" transferTime="0.151429" totalTime="0.153907" dataVolume="50040595" headerVolume="480" driveTransferSpeedMBps="325.138395" payloadTransferSpeedMBps="325.135276" LBPMode="LBP_On" repackFilesCount="1" repackBytesCount="50040595" userFilesCount="0" userBytesCount="0" 
verifiedFilesCount="0" verifiedBytesCount="0" checksumType="ADLER32" checksumValue="15fa402f"
[1706590865.680837000] Jan 30 06:01:05.680837 tpsrv455.cern.ch cta-taped: LVL="INFO" PID="64076" TID="5045" MSG="Successfully positioned for reading" thread="TapeRead" tapeDrive="I4600213" tapeVid="I40083" mountId="1149771" vo="REPACK" tapePool="r_public_delete" mediaType="3592JC7T" logicalLibrary="IBM460" mountType="Retrieve" labelFormat="0000" vendor="IBM" capacityInBytes="7000000000000" fileId="1550685928" BlockId="30770664" fSeq="340292" dstURL="root://eosctarepack//eos/ctarepack/production/I40083/000340292?oss.asize=6778880" isRepack="1" isVerifyOnly="0"
[1706590865.713775000] Jan 30 06:01:05.713775 tpsrv455.cern.ch cta-taped: LVL="INFO" PID="64076" TID="5045" MSG="File successfully read from tape" thread="TapeRead" tapeDrive="I4600213" tapeVid="I40083" mountId="1149771" vo="REPACK" tapePool="r_public_delete" mediaType="3592JC7T" logicalLibrary="IBM460" mountType="Retrieve" labelFormat="0000" vendor="IBM" capacityInBytes="7000000000000" fileId="1550685928" BlockId="30770664" fSeq="340292" dstURL="root://eosctarepack//eos/ctarepack/production/I40083/000340292?oss.asize=6778880" isRepack="1" isVerifyOnly="0" positionTime="0.002409" readWriteTime="0.032733" waitFreeMemoryTime="0.000002" waitReportingTime="0.000016" transferTime="0.032751" totalTime="0.035144" dataVolume="6778880" headerVolume="480" driveTransferSpeedMBps="192.902345" payloadTransferSpeedMBps="192.888687" LBPMode="LBP_On" repackFilesCount="1" repackBytesCount="6778880" userFilesCount="0" userBytesCount="0" verifiedFilesCount="0" verifiedBytesCount="0" checksumType="ADLER32" checksumValue="fe3be773"
```

### SchedulerDB

To inspect the SchedulerDB, a dump of the desired object in the ObjectStore is generated.
In general, log lines from the different CTA components that relate to ObjectStore operations contain a reference to the associated object; the metadata of files queued for archival or retrieval also contains the metadata of the corresponding object.

Tool: `cta-objectstore-dump-object $OBJECT_NAME`

- Root entry example

```text
Object store path: rados://cta-production@tapecta:cta-production
Object name: root
Header dump:
{
  "type": "RootEntry_t",
  "version": "0",
  "owner": "",
  "backupowner": "",
  "payload": "..."
}
Body dump:
{
  "archiveQueueToTransferForUserPointers": [
    {
      "address": "ArchiveQueueToTransferForUser-r_cms_fam-Frontend-ctaproductionfrontend01.cern.ch-15045-20240313-15:52:38-0-1467100",
      "name": "r_cms_fam"
    },
    ...
  ],
  "archiveQueueFailedPointers": [
    {
      "address": "ArchiveQueueFailed-r_backup_zvalt2-Maintenance-tpsrv682.cern.ch-8917-20230727-14:17:03-0-2",
      "name": "r_backup_zvalt2"
    },
  ],
  "archiveQueueToReportForUserPointers": [],
  "archiveQueueToTransferForRepackPointers": [
    {
      "address": "ArchiveQueueToTransferForRepack-r_cms_fam-Maintenance-tpsrv323.cern.ch-17385-20240314-17:23:51-0-24",
      "name": "r_cms_fam"
    },
  ],
  "archiveQueueToReportToRepackForSuccessPointers": [],
  "archiveQueueToReportToRepackForFailurePointers": [],
  "retrieveQueueToTransferForUserPointers": [
    {
      "address": "RetrieveQueueToTransferForUser-L96474-Frontend-ctaproductionfrontend01.cern.ch-15045-20240313-15:52:38-0-164636",
      "vid": "L96474"
    },
  ],
  "retrieveQueueToReportToRepackForSuccessPointers": [],
  "retrieveQueueToReportToRepackForFailurePointers": [],
  "retrieveQueueToTransferForRepackPointers": [],
  "driveregisterpointer": {
    "address": "DriveRegister-cta-objectstore-initialize-ctaproductionfrontend01.cern.ch-25378-20210215-11:11:16-0-1",
    "log": {
      "username": "user0",
      "host": "systemhost",
      "time": "1613383876"
    }
  },
  "agentregisterpointer": {
    "address": "AgentRegister-cta-objectstore-initialize-ctaproductionfrontend01.cern.ch-25378-20210215-11:11:16-0-0",
    "log": {
      "username": "user0",
      "host": "systemhost",
      "time": "1613383876"
    }
  },
  "repackindexpointer": {
    "address": "RepackIndex-cta-objectstore-create-missing-repack-index-ctaproductionfrontend02.cern.ch-23769-20240314-17:23:50-0-0"
  },
  "repackrequestspendingqueuepointer": {
    "address": "RepackQueuePending-Frontend-ctaproductionfrontend02.cern.ch-24020-20240314-12:13:07-0-257"
  },
  "repackrequeststoexpandqueuepointer": {
    "address": "RepackQueueToExpand-Maintenance-tpsrv005.cern.ch-21986-20240314-17:23:50-0-0"
  },
  "agentregisterintent": "",
  "schedulerlockpointer": {
    "address": "SchedulerGlobalLock-cta-objectstore-initialize-ctaproductionfrontend01.cern.ch-25378-20210215-11:11:16-0-2",
    "log": {
      "username": "user0",
      "host": "systemhost",
      "time": "1613383876"
    }
  }
}
```

- Archive queue example

```text
Object name: ArchiveQueueToTransferForUser-r_ams_frames_1-Frontend-ctaproductionfrontend01.cern.ch-19591-20231002-15:49:49-0-2700508
Header dump:
{
  "type": "ArchiveQueue_t",
  "version": "0",
  "owner": "root",
  "backupowner": "root",
  "payload": "..."
}
Body dump:
{
  "tapepool": "r_ams_frames_1",
  "archivequeueshards": [
    {
      "address": "ArchiveQueueShard-r_ams_frames_1-Frontend-ctaproductionfrontend01.cern.ch-19591-20231002-15:49:49-0-2700509",
      "shardjobscount": "1",
      "shardbytescount": "67691356160"
    }
  ],
  "prioritymap": [
    {
      "value": "250",
      "count": "1"
    }
  ],
  "minarchiverequestagemap": [
    {
      "value": "14400",
      "count": "1"
    }
  ],
  "mountpolicynamemap": [
    {
      "value": "ams",
      "count": "1"
    }
  ],
  "archivejobstotalsize": "67691356160",
  "archivejobscount": "1",
  "oldestjobcreationtime": "1697000946",
  "mapsrebuildcount": "0",
  "youngestjobcreationtime": "1697000946"
}
```

## Monitoring Overview

Monitoring plots are generated by:

- Parsing the logs of the different components: logs are parsed with Fluentd, ingested into InfluxDB and consumed by Grafana.
- Leveraging the `cta-admin` command to check the status of the system, automated with Rundeck.

#### Time for Live Monitoring overview!
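The log-parsing side of the pipeline works because CTA log lines are structured: a bracketed epoch timestamp, a syslog-style preamble, then `key="value"` pairs. A minimal Python sketch of such a parser (illustrative only; the production setup uses Fluentd, not this code):

```python
import re

# CTA log lines start with an epoch timestamp in brackets, a syslog-style
# date, the host and the daemon name, followed by key="value" pairs.
PREAMBLE = re.compile(
    r'^\[(?P<epoch>[\d.]+)\]\s+'      # [1710356878.367037000]
    r'(?P<date>\w+ +\d+ [\d:.]+)\s+'  # Mar 13 20:07:58.367037
    r'(?P<host>\S+)\s+'               # frontend01.cern.ch
    r'(?P<daemon>[\w-]+):\s+'         # cta-frontend:
)
KV = re.compile(r'(\w+)="([^"]*)"')

def parse_cta_log_line(line: str) -> dict:
    """Split a CTA log line into preamble fields plus all key="value" pairs."""
    m = PREAMBLE.match(line)
    fields = m.groupdict() if m else {}
    fields.update(dict(KV.findall(line)))
    return fields

# Frontend log line from the example above (shortened):
line = ('[1710356878.367037000] Mar 13 20:07:58.367037 frontend01.cern.ch '
        'cta-frontend: LVL="INFO" PID="15045" TID="15134" '
        'MSG="In WorkflowEvent::WorkflowEvent(): received event." '
        'eventType="CREATE" diskFileId="4316428878"')
print(parse_cta_log_line(line)["eventType"])  # CREATE
```

Once each line is a flat dictionary, fields like `totalTime` or `dataVolume` can be turned into InfluxDB points and plotted in Grafana.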
## Use Cases

### User/Repack stepping on each other's feet

The main dashboard shows an archive request that is not progressing (no mounts), despite its age being > 4h and its volume being big enough to trigger a mount.

![](https://codimd.web.cern.ch/uploads/upload_4bb7e093d16eab2c0793e76be2cd9786.png)

Show queues monitoring:

![](https://codimd.web.cern.ch/uploads/upload_868cdfe581349d3281910dbef017d852.png)

NA62 setup:

- The tapepools r_na62_1 and r_na62_2 form a dual-copy setup.
- NOTE: We don't report the archival as succeeded until the file reaches both tape pools -> this can be problematic if an experiment is flushing to tape during data taking!
- 4 drives for writing
- 2 partial tapes for each tapepool

What is going on? No tapes available for writing? Let's list the non-full tapes:

```
[root@ctaproductionfrontend02 ~]# cta-admin tape ls -t r_na62_1 -f false
vid media type vendor library tapepool vo encryption key name capacity occupancy last fseq full from castor state state reason label drive label time last w drive last w time w mounts last r drive last r time r mounts c.user c.host c.time m.user m.host m.time comment
I50244 3592JD15T IBM IBM455 r_na62_1 NA62 - 15.0T 13.6T 23955 false true ACTIVE - CASTOR 1970-01-01 01:00 tpsrv009 2021-02-18 09:20 41 I4550833 2022-09-27 00:31 34 CASTOR CASTOR 2021-03-17 17:22 tapeops ctaproductionfrontend02 2022-09-27 03:00 -
I55959 3592JD15T FUJIFILM IBM360 r_na62_1 NA62 - 15.0T 20.9T 15530 false true ACTIVE - CASTOR 1970-01-01 01:00 tpsrv031 2021-02-04 10:35 38 I3600524 2022-04-30 03:37 41 CASTOR CASTOR 2021-03-17 17:22 vlado ctaproductionfrontend02 2022-05-19 11:30 -
I75647 3592JE20T FUJIFILM (DAI/9669750) IBM360 r_na62_1 NA62 - 20.0T 24.0T 8433 false false ACTIVE - I3600921 2023-05-23 14:19 I3600542 2023-07-04 11:48 2 - - 0 vlado ctaproductionfrontend02 2023-05-22 12:29 tape-local ctaproductionfrontend02 2023-07-03 15:33 -
I76492 3592JE20T FUJIFILM (DAI/9669750) IBM360 r_na62_1 NA62 - 20.0T 9.0T 4016 false false ACTIVE - I3600922 2023-06-06 20:03 I3601413 2023-07-04 11:49 2 - - 0 vlado ctaproductionfrontend02 2023-05-22 12:30 tape-local ctaproductionfrontend02 2023-07-04 03:33 -
```

As we can see, there are 2 non-full tapes we can write to. To how many tapes should we be able to write simultaneously?

```
[root@ctaproductionfrontend02 ~]# cta-admin tp ls
name vo #tapes #partial #phys files size used avail use% encrypt supply c.user c.host c.time m.user m.host m.time comment
r_na62_1 NA62 428 2 7581099 8.1P 12.0P 0 147.5% false supply_IBM513 CASTOR CASTOR 2021-03-17 17:22 tapeops ctaproductionfrontend02 2023-07-05 10:12 1st copy of the NA62 dual copy data
```

Let's check the queues:

```
[root@ctaproductionfrontend02 ~]# cta-admin sq | grep -e vo -e na62
type tapepool vo library vid files queued data queued oldest youngest priority min age read max drives write max drives cur. mounts cur. files cur. data tapes capacity files on tapes data on tapes full tapes writable tapes
ArchiveForUser r_na62_1 NA62 - - 1962 6.5T 22299 2 250 14400 4 4 0 0 0 7.1P 6860862 10.5P 527 4
ArchiveForUser r_na62_2 NA62 - - 33 74.9G 406 2 250 14400 4 4 1 0 0 7.3P 6884792 10.5P 629 3
ArchiveForRepack r_na62_1 NA62 - - 8019 19.3T 29606 74 50 14400 4 4 2 7914 14.9T 7.1P 6860862 10.5P 527 4
```

There is an ongoing repack with 2 archive mounts on the `r_na62_1` pool. The archive request has a higher priority than the repack request, BUT the archive request arrived while the repack request was already ongoing.

Solution: stop the ongoing repacks. Alternatively, change the number of partial tapes, but this has side effects on data spread.
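The "age > 4h or volume big enough" trigger condition from this use case can be sketched as a simplified predicate. This is illustrative only, not CTA's actual scheduler code; the `min_bytes`/`min_files` thresholds are hypothetical values, not CTA defaults:

```python
def should_trigger_mount(files_queued: int, bytes_queued: int,
                         oldest_age_s: int, min_request_age_s: int,
                         min_bytes: int = 500_000_000_000,
                         min_files: int = 10_000) -> bool:
    """Simplified mount-trigger rule: mount once the oldest queued request
    exceeds the mount policy's minimum age, or the queue is already large
    enough on its own. Thresholds here are illustrative, not CTA defaults."""
    if oldest_age_s >= min_request_age_s:
        return True
    return bytes_queued >= min_bytes or files_queued >= min_files

# NA62 user archive queue from the `cta-admin sq` output above:
# 1962 files, 6.5 TB queued, oldest request 22299 s, policy min age 14400 s.
print(should_trigger_mount(1962, int(6.5e12), 22299, 14400))  # True
```

The predicate holds for the NA62 queue, which is exactly the puzzle in this use case: the mount criteria were satisfied, yet no user mount happened, because the repack mounts were already consuming the pool's write-drive quota.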
##### Related

- [Tape Supply Mechanism Tool](https://gitlab.cern.ch/cta/cta-operations-utilities/-/tree/master/tools/pip/poolsupply?ref_type=heads)
- [Tape Lifecycle, Slides, CTA Workshop 2023](https://indico.cern.ch/event/1227241/contributions/5335996/attachments/2634308/4559795/EOS_workshop_2023___Tape_lifecycle%20(6).pdf)
- [Scheduling System](https://eoscta.docs.cern.ch/scheduling/cta-scheduling/)

NOTE: In the past, repack would use the VO read/write drive quota, which could also generate similar issues. Fixed with the new dedicated repack VO.

### Forgotten staged files :'(

The DUNE experiment requested a retrieve of 50 TB.

![](https://codimd.web.cern.ch/uploads/upload_c84da620193af89cd3888cdb8bac727e.png)

The user requested a copy from different endpoints through RUCIO; the EOS spaces got full and RUCIO fetched the remaining files from a different place.

![](https://codimd.web.cern.ch/uploads/upload_9c47fa89362143eba1b81522894c87e5.png)

This is a shared instance, and other users' requests were delayed until all the requests were consumed:

![](https://codimd.web.cern.ch/uploads/upload_f1a919fe7094e5677222426724880682.png)

### Misconfigured FSTs

Sometimes, in specific circumstances, e.g. when an experiment needs to flush/retrieve huge volumes of data in a short period of time, we change the setup and modify the parameters of the instances. When the situation is over, operators might forget to reset some of the previously modified parameters.

During summer 2023 ALICE needed to flush ~150 TB of data, and increasing the number of drives was not improving the situation...

![](https://codimd.web.cern.ch/uploads/upload_950bb0a6fe16ad6cb33609fbb6597fa1.png)

It turned out that 3 FSTs were not properly configured for writing.
### ObjectStore problems

One day, while checking the monitoring, we started seeing:

![](https://codimd.web.cern.ch/uploads/upload_48bddd48f0173c9df4a9ba606c300b52.png)

```
[1689669841.820067000] Jul 18 10:44:01.820067 tpsrv013.cern.ch cta-taped: LVL="ERROR" PID="22056" TID="22056" MSG="Failed to getFilesToRecall" thread="MainThread" tapeDrive="I4550843" tapeVid="I59840" mountId="876080" transactionId="876080" requestedBytes="1125899906842624" requestedFiles="4000" message="In BackendRados::read(): considering empty object (name=RetrieveQueueShard-I59840-DriveProcess-I4550844-tpsrv012.cern.ch-19119-20230716-16:22:43-0-16) as non-existing."
[1689669841.820557000] Jul 18 10:44:01.820557 tpsrv013.cern.ch cta-taped: LVL="WARN" PID="22056" TID="22056" MSG="Aborting recall mount startup: empty mount" thread="MainThread" tapeDrive="I4550843" tapeVid="I59840" mountId="876080"
[1689669841.820822000] Jul 18 10:44:01.820822 tpsrv013.cern.ch cta-taped: LVL="WARN" PID="22056" TID="22056" MSG="Notified client of end session with error" thread="MainThread" tapeDrive="I4550843" tapeVid="I59840" mountId="876080" errorMessage="Aborted: empty recall mount" MountTransactionId="876080"
[1689669841.845654000] Jul 18 10:44:01.845654 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="22056" TID="22056" MSG="RecallMemoryManager destruction : all memory blocks have been deleted" thread="MainThread" tapeDrive="I4550843" tapeVid="I59840" mountId="876080"
[1689669841.850316000] Jul 18 10:44:01.850316 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="22056" TID="22056" MSG="In Agent::removeAndUnregisterSelf(): Removed agent object."
agentObject="DriveProcess-I4550843-tpsrv013.cern.ch-22056-20230718-10:43:48-0"
[1689669842.059235000] Jul 18 10:44:02.059235 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="24394" TID="24394" MSG="In signal handler, received SIGCHLD and propagations to other handlers" signal="Child exited" senderPID="22056" senderUID="0"
[1689669842.059686000] Jul 18 10:44:02.059686 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="24394" TID="24394" MSG="Handler received SIGCHILD. Propagating to all handlers." SubprocessName="signalHandler"
[1689669842.060606000] Jul 18 10:44:02.060606 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="24394" TID="24394" MSG="Propagated SIGCHILD." SubprocessName="signalHandler"
[1689669842.060917000] Jul 18 10:44:02.060917 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="24394" TID="24394" MSG="Drive subprocess exited. Will spawn a new one." tapeDrive="I4550843" pid="22056" exitCode="0"
[1689669842.061137000] Jul 18 10:44:02.061137 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="24394" TID="24394" MSG="Tape session finished" wasTapeMounted="0" mountTime="0.000000" positionTime="0.000000" waitInstructionsTime="0.000000" waitFreeMemoryTime="0.000000" waitDataTime="0.000000" waitReportingTime="0.000000" checksumingTime="0.000000" readWriteTime="0.000000" flushTime="0.000000" unloadTime="0.000000" unmountTime="0.000000" encryptionControlTime="0.000000" transferTime="0.000000" totalTime="0.043944" deliveryTime="0.043944" drainingTime="0.000000" dataVolume="0" filesCount="0" headerVolume="0" payloadTransferSpeedMBps="0.000000" driveTransferSpeedMBps="0.000000" repackFilesCount="0" userFilesCount="0" verifiedFilesCount="0" repackBytesCount="0" userBytesCount="0" verifiedBytesCount="0" errorMessage="Aborted: empty recall mount" mountId="876080" mountType="RETRIEVE" tapeDrive="I4550843" status="failure"
[1689669842.062247000] Jul 18 10:44:02.062247 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="24394" TID="24394" MSG="Propagated SIGCHILD."
SubprocessName="drive:I4550843"
[1689669842.062534000] Jul 18 10:44:02.062534 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="24394" TID="24394" MSG="Propagated SIGCHILD." SubprocessName="maintenanceHandler"
[1689669842.062744000] Jul 18 10:44:02.062744 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="24394" TID="24394" MSG="Subprocess handler requested forking" SubprocessName="drive:I4550843"
[1689669842.062950000] Jul 18 10:44:02.062950 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="24394" TID="24394" MSG="Subprocess handler will fork" SubprocessName="drive:I4550843"
[1689669842.063157000] Jul 18 10:44:02.063157 tpsrv013.cern.ch cta-taped: LVL="INFO" PID="22137" TID="22137" MSG="In child process. Running child." SubprocessName="drive:I4550843"
```

The same error appears on multiple tape servers:

```
[1689670403.932734000] Jul 18 10:53:23.932734 tpsrv013.cern.ch cta-taped: LVL="ERROR" PID="24618" TID="24618" MSG="Failed to getFilesToRecall" thread="MainThread" tapeDrive="I4550843" tapeVid="I59840" mountId="876229" transactionId="876229" requestedBytes="1125899906842624" requestedFiles="4000" message="In BackendRados::read(): considering empty object (name=RetrieveQueueShard-I59840-DriveProcess-I4550844-tpsrv012.cern.ch-19119-20230716-16:22:43-0-16) as non-existing."
```

The tape referenced in the ObjectStore object name exists:

```
[root@ctaproductionfrontend02 ~]# cta-admin sq | sed '1p;/I59840/!d'
type tapepool vo library vid files queued data queued oldest youngest priority min age read max drives write max drives cur. mounts cur. files cur. data tapes capacity files on tapes data on tapes full tapes writable tapes
Retrieve r_atlas_raw ATLAS IBM455 I59840 4 8.6G 155019 155019 150 14400 10 24 0 0 0 15.0T 9693 15.5T 1 0
```

There is a queue object for that tape, and it points to a shard:

```
[root@ctaproductionfrontend02 ~]# cta-objectstore-dump-object RetrieveQueueToTransferForUser-I59840-DriveProcess-I4550844-tpsrv012.cern.ch-19119-20230716-16:22:43-0-15 | sed -n '/Body dump:/,$p' | sed '1d' | jq '.retrievequeueshards'
[
  {
    "address": "RetrieveQueueShard-I59840-DriveProcess-I4550844-tpsrv012.cern.ch-19119-20230716-16:22:43-0-16",
    "shardjobscount": "4",
    "shardbytescount": "8569007976",
    "minfseq": "26",
    "maxfseq": "285"
  }
]
```

The shard itself does not exist:

```
[root@ctaproductionfrontend02 ~]# cta-objectstore-dump-object RetrieveQueueShard-I59840-DriveProcess-I4550844-tpsrv012.cern.ch-19119-20230716-16:22:43-0-16 | sed -n '/Body dump:/,$p' | sed '1d' | jq '.'
Failed to dump object: In BackendRados::read(): considering empty object (name=RetrieveQueueShard-I59840-DriveProcess-I4550844-tpsrv012.cern.ch-19119-20230716-16:22:43-0-16) as non-existing.
/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6b) [0x7f55ab3ea84b]
/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::string const&, bool)+0x84) [0x7f55ab3ebdf4]
/lib64/libctaobjectstore.so.0(cta::exception::NoSuchObject::NoSuchObject(std::string const&)+0x37) [0x7f55aac7c6eb]
/lib64/libctaobjectstore.so.0(cta::objectstore::BackendRados::read(std::string const&)+0x1eb) [0x7f55aac828d1]
cta-objectstore-dump-object(cta::objectstore::ObjectOps<cta::objectstore::serializers::GenericObject, (cta::objectstore::serializers::ObjectType)1000>::getHeaderFromObjectStore()+0x54) [0x413a78]
cta-objectstore-dump-object(cta::objectstore::ObjectOps<cta::objectstore::serializers::GenericObject, (cta::objectstore::serializers::ObjectType)1000>::fetchBottomHalf()+0x2f) [0x413879]
cta-objectstore-dump-object() [0x413259]
cta-objectstore-dump-object() [0x40ed9e]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f55a8376555]
cta-objectstore-dump-object() [0x40e619]
```

Cause: a problem occurred when a tape server tried to remove the agent; it failed to read the priority from the object, even though the object was fine in the ObjectStore.

Solution: delete the queue and dereference it from the index using `cta-objectstore-dereference-removed-queues`.
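The manual check above (dump the queue, extract `retrievequeueshards`, then try to dump each shard) can be partially scripted. A hedged Python sketch that parses a queue's body dump and returns the shard addresses to verify; it assumes the JSON layout shown above, and `cta-objectstore-dump-object` must still be run separately for each address:

```python
import json

def shard_addresses(queue_body_dump: str) -> list:
    """Extract the shard object names from a retrieve-queue body dump, so
    each one can be fed to cta-objectstore-dump-object for verification."""
    body = json.loads(queue_body_dump)
    return [shard["address"] for shard in body.get("retrievequeueshards", [])]

# Body dump of the retrieve queue from the incident above:
dump = '''{
  "retrievequeueshards": [
    {
      "address": "RetrieveQueueShard-I59840-DriveProcess-I4550844-tpsrv012.cern.ch-19119-20230716-16:22:43-0-16",
      "shardjobscount": "4",
      "shardbytescount": "8569007976"
    }
  ]
}'''
for address in shard_addresses(dump):
    print(address)
```

Running `cta-objectstore-dump-object` on each printed address and flagging the "considering empty object ... as non-existing" errors reproduces the diagnosis of this use case.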