Senior Service Engineer - Livesite Manager - CTJ - Poly
Microsoft
United States, Virginia, Reston
Oct 22, 2025
Overview

Do you want to work on the cutting edge of distributed systems and high-scale storage? Do you want to work on a meaningful and impactful project and make a difference to the U.S. government and country? Azure Storage for air-gapped clouds (AGC) is a foundational part of Azure and is entrusted with storing exabytes of data for everything from the virtual hard disks that back Azure Virtual Machines to customer blobs to SQL Server databases to OneDrive content, all while providing industry-leading availability and durability for that data.

We are recruiting for a Senior Service Engineer - Livesite Operations. This individual will lead critical production operations for Azure Storage Core services, driving observability, automation, and engineer readiness to ensure operational reliability and resilience at hyperscale.

What is Livesite?

At Microsoft, Livesite refers to a customer-first, production-focused mindset and set of practices aimed at keeping services always up and healthy. It includes incident management, proactive monitoring, automation, and continuous improvement to minimize customer impact and reduce Time-to-Mitigation (TTM) for critical issues. Livesite work spans the entire lifecycle:
* Pre-Incident: Monitoring, alerting, and preventive measures.
* During Incident: Rapid mitigation and communication under high-pressure conditions.
* Post-Incident: Retrospectives, repair tracking, and amplifying learnings to prevent recurrence.

Why Join Us?

You'll play a pivotal role in ensuring the reliability of Azure Storage services that power mission-critical workloads, including premium AI scenarios. Your work will directly impact customer trust and accelerate innovation across Microsoft's cloud platform.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities

* Lead tracking and resolution of incidents, drive improvements for time-to-mitigation and operational reliability metrics, and drive repair items.
* Manage Incident Manager (IM) and Core rotations, including onboarding, livesite support processes, and readiness for new on-call engineers.
* Develop and coordinate training programs to ensure engineer preparedness for livesite responsibilities.
* Define and implement observability standards and practices for Core services.
* Partner with engineering teams to enhance telemetry, alerting, AI automation, and dashboards to reduce manual overhead and improve operational efficiency.
* Drive parity efforts to align with security and compliance standards.
* Work with engineering leads to develop support plans for new services onboarded into the AGC.
* Embody our culture and values.