# Postmortem [OTG0153169](https://cern.service-now.com/service-portal?id=outage&n=OTG0153169) - CERNTS connection issues
## Short description of the incident
**Date**: 16/11/2024
**Authors**: Mario Rey Regulez
**Status**: Complete
**Summary**: CERNTS connection issues on Saturday morning, November 16.
**Impact**:
- Summary of impact: since Friday 15 November in the afternoon, new connections to CERNTS could fail
- Number of affected users: based on historic data ~40 users were affected (out of 2500 CERNTS users)
- Length of time users were affected: 20h (highly visible for 1h)
- What functionality was affected: connection to CERNTS public Terminal Services cluster
## Detailed description of the root cause of the incident:
- A few CERNTS nodes were lost during the November 2024 monthly test-patching ([OTG0153018](https://cern.service-now.com/service-portal?id=outage&n=OTG0153018)) and the debugging performed from Wednesday to Friday prior. The service was running at slightly reduced capacity, which did not pose a risk. Some of the lost nodes were recreated.
- On Friday at noon, ESET Server Security was found to be the root cause of those node losses, and it was decided to remove it from all Windows servers as a preventive measure ([OTG0153165](https://cern.service-now.com/service-portal?id=outage&n=OTG0153165)).
- To ensure antivirus protection on the Windows Terminal Infrastructure, nodes required a reboot ([OTG0153166](https://cern.service-now.com/service-portal?id=outage&n=OTG0153166)) to re-enable the previous antivirus software.
- The recreated nodes were put back in service but, due to a human oversight, were not fully configured (TLS certificate missing).
- Half of the nodes were rebooted on Saturday at 3am to minimise user impact. Idle sessions were closed.
- When users reconnected on Saturday morning they were sent to the unconfigured nodes, which rejected them.
### What were the origins of failure?
An ESET Server Security Detection Engine update was bricking Windows Server 2016 nodes, and admins acted quickly to remove the product from all CERN servers to avoid a larger problem. A human oversight during those actions put incompletely configured CERNTS nodes into production, which rendered the service unusable.
### Why do we think this happened?
The service manager, who had been working for three days to avoid a larger problem, missed a step in a procedure. There was also a blind spot in our monitoring tools.
### Trigger: the initial trigger of the incident
Larger problem with ESET on Windows Servers breaking them during monthly patching: [OTG0153165](https://cern.service-now.com/service-portal?id=outage&n=OTG0153165).
### Detection: short description of how the incident was detected
The service manager, checking the service outside working hours following the scheduled reboot, noticed reports from affected users on the "DownForEveryoneOrJustMe" channel.
### Steps taken to diagnose, assess, and resolve:
#### What actions were taken?
1. Checked monitoring to find an obvious problem.
2. Enabled the cernts-homeless.cern.ch cluster (used for Business Continuity; same configuration but with local, non-persistent user profiles) so users had a workaround. The workaround was announced in the OTG, by email and on Mattermost.
3. Logged into the Remote Desktop Broker to check the status of the Remote Desktop deployment overall.
4. Checked the Microsoft SQL database status.
5. Logged into several nodes from different clusters to assess the impact.
6. Logged into all CERNTS nodes to identify the problematic ones and check general status (ESET removal, Windows Defender installation, services running, firewall status, RDP certificate...).
7. Removed from the Load Balancer those nodes that were identified to be problematic.
8. Rebooted saturated nodes that were attracting all the traffic - existing and new users were blocked on them.
9. Prepared and executed a configuration script to set the missing certificate on all CERNTS nodes.
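The configuration script itself is not reproduced here. As a rough illustration of the step it automates, the sketch below binds a certificate thumbprint to the RDP listener via the `Win32_TSGeneralSetting` WMI class, a standard mechanism on Windows Server; whether the actual script used exactly this mechanism is an assumption, and the thumbprint shown is a placeholder.

```python
"""Illustrative sketch only; not the script used during the incident."""
import subprocess


def bind_rdp_certificate(thumbprint: str) -> None:
    """Point the RDP listener at the certificate with the given SHA-1 thumbprint."""
    # Writes the thumbprint into Win32_TSGeneralSetting, the WMI class that
    # controls the certificate used by the Remote Desktop listener.
    subprocess.run(
        [
            "wmic",
            r"/namespace:\\root\cimv2\TerminalServices",
            "PATH", "Win32_TSGeneralSetting",
            "Set", f'SSLCertificateSHA1Hash="{thumbprint}"',
        ],
        check=True,
    )


if __name__ == "__main__":
    # Placeholder thumbprint; a real script would look it up in the local
    # certificate store before binding it to the listener.
    bind_rdp_certificate("0123456789ABCDEF0123456789ABCDEF01234567")
```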
#### Which were effective?
Action 2 for quickly setting an emergency workaround.
Actions 3 and 5 to assess the impact.
Actions 1 and 6 to diagnose.
Actions 7, 8 and 9 to restore full CERNTS availability.
#### Which were detrimental?
None
### Timeline of activity, including communication
From the immediate action that triggered the connection issue:
Friday 2024/11/15
[12:19] Publication of [OTG0153166](https://cern.service-now.com/service-portal?id=outage&n=OTG0153166) to announce a TS-wide reboot to remove ESET and install Windows Defender.
[12:21] Email sent to dedicated cluster administrators informing them about the scheduled reboots.
[Afternoon] Orchestration of the reboot of half of the TS park for Saturday 2024/11/16 at 3am.
[Afternoon] The 8 nodes (out of 30) lost during test-patching were recreated and added back to production.
[Afternoon] Some new connections may already have been reaching the misconfigured nodes, but a retry was likely to work. Impact was low as user sessions were already established.
Saturday 2024/11/16
[03:15] Half of CERNTS nodes were rebooted as scheduled, successfully removing ESET and reinstalling Windows Defender. Sessions were closed.
[10:27] Service manager receives an email warning him personally about the issue.
[10:43] Roberto Divia (ALICE) reports the issue on DownForEveryoneOrJustMe on Mattermost.
[10:50] Mario Rey (Windows Terminal Services admin) checks for alarms; there are none. He checks Mattermost and email, sees the user reports and starts assessing and diagnosing. He publishes the OTG and replies on Mattermost and by email.
[11:10] `cernts-homeless` becomes available for all CERNTS users (previously it was only for `asdf-members`, as agreed some time ago with the BCDR admins).
[11:30] Unconfigured nodes are removed from production and stuck nodes are rebooted. CERNTS becomes operational at half capacity. OTG closed (although work continued).
[11:50] Unconfigured nodes are correctly configured and validated. They are put back in production and CERNTS recovers normal capacity.
### What went well:
- Pre-prepared BC cluster allowed users to regain access sooner.
- Knowledge of the deployment and well-documented procedures made recovery quick.
- Admin picked up Mattermost messages quickly.
### What went wrong:
- Timing: ESET issue required an urgent operation during a critical patching session.
- Taking preventive measures to avoid a larger issue with ESET led us to overlook the incomplete configuration.
- Blind spot in our monitoring: it checked the validity of certificates in the certificate store, but not whether the certificate was correctly set for RDP (a possible external probe is sketched below).
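As an illustration of how that blind spot could be covered, the sketch below probes a node externally: it sends the standard RDP negotiation request on port 3389, completes the TLS handshake and reports the certificate the node presents. This is a minimal sketch, not the CERNTS monitoring implementation; the host name is a placeholder and a production check would also parse the certificate (expiry, subject) and raise an alarm when none is configured for RDP.

```python
"""
Minimal sketch of an external RDP certificate probe (not the CERNTS monitoring
code). It performs the standard RDP negotiation asking for TLS-based security,
then completes a TLS handshake and returns the certificate the node presents.
"""
import socket
import ssl

# TPKT + X.224 Connection Request carrying an RDP Negotiation Request with
# requestedProtocols = TLS | CredSSP (both start with a TLS handshake).
RDP_NEG_REQ = bytes([
    0x03, 0x00, 0x00, 0x13,                      # TPKT header, total length 19
    0x0E, 0xE0, 0x00, 0x00, 0x00, 0x00, 0x00,    # X.224 Connection Request
    0x01, 0x00, 0x08, 0x00,                      # RDP_NEG_REQ: type, flags, length
    0x03, 0x00, 0x00, 0x00,                      # requestedProtocols = 0x3
])


def probe_rdp_certificate(host: str, port: int = 3389, timeout: float = 5.0) -> bytes:
    """Return the DER-encoded certificate presented by the RDP listener."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(RDP_NEG_REQ)
        sock.recv(1024)  # X.224 Connection Confirm; TLS starts right after it

        # We only want to inspect the certificate, so the handshake is not
        # failed on trust errors; a real check would decode and validate it.
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)


if __name__ == "__main__":
    # Placeholder host name; real CERNTS node names are not listed here.
    der = probe_rdp_certificate("cernts-node.example.cern.ch")
    print(f"certificate presented: {len(der)} bytes of DER")
```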
### Where we got lucky:
- The ESET problem only affected devices after the monthly patches and a reboot.
- It only affected Windows Server 2016 nodes with the Remote Desktop role, Internet connectivity and OpenStack virtualization.
### Resolution
- Immediate workaround: `cernts-homeless` was enabled for all CERNTS users.
- Short-term fix: removed problematic nodes from alias.
- Long-term fix: re-configured the problematic nodes.
### Follow up Action Items
- Fix monitoring blind spot (detection).
- Improve the trigger for load balancer expulsion (prevention); a sketch follows after this list.
- Follow up on the tickets with Microsoft and ESET to understand the root cause ([OTG0153165](https://cern.service-now.com/service-portal?id=outage&n=OTG0153165)).
- Update procedure to highlight testing newly added nodes (prevention).
- Improve BC cluster visibility and isolation (mitigation).
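As a sketch of the load-balancer expulsion item above, the loop below removes a node from the load-balanced alias only after a number of consecutive failed probes. Both helpers are hypothetical placeholders: `node_is_healthy` stands for the site-specific probe (for example the RDP certificate check sketched earlier) and `remove_node_from_alias` for the actual alias management tooling.

```python
"""
Hedged sketch of a stricter expulsion trigger for the load-balanced alias.
The helper callables are hypothetical placeholders, not existing CERN tooling.
"""
from collections import defaultdict
from typing import Callable, Iterable

FAILURES_BEFORE_EXPULSION = 3  # assumed threshold, to be tuned


def expel_unhealthy_nodes(
    nodes: Iterable[str],
    node_is_healthy: Callable[[str], bool],
    remove_node_from_alias: Callable[[str], None],
    failure_counts: dict[str, int] | None = None,
) -> dict[str, int]:
    """Run one probe round and expel nodes that keep failing."""
    counts = defaultdict(int, failure_counts or {})
    for node in nodes:
        if node_is_healthy(node):
            counts[node] = 0  # healthy again: reset its failure streak
        else:
            counts[node] += 1
            if counts[node] >= FAILURES_BEFORE_EXPULSION:
                remove_node_from_alias(node)
    return dict(counts)  # carry the streaks over to the next probe round
```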
### Supporting information:
- [CERNTS usage during the incident](https://monit-grafana.cern.ch/d/dc9c5b21-deed-49ab-8472-0893fc853d34/windows-terminal-servers?orgId=83&from=1731625200000&to=1731884399000&viewPanel=11)