
SRE Observability SLO Engineer

GE Vernova

🌍 Anywhere 🏠 Remote ⏱ Part-time 💼 Mid-level 🗓 4 days ago

Job Description Summary

GE Vernova's GridOS Platform Engineering team is building the next generation of SaaS reliability for critical energy infrastructure. The Observability & SLO Engineer is the eyes and ears of the GridOS SRE team. In this role you will build and own the full telemetry stack, from instrumentation standards to SLO dashboards to synthetic monitors, that gives GE Vernova and its utility customers real-time confidence in the reliability of mission-critical energy management systems. This is a cyclical, high-impact position: you will drive an intensive initial ramp to establish v1.0 observability coverage across all customer environments, then shift into an ongoing improvement cadence aligned to new product releases and customer onboarding.

Job Description

Roles and Responsibilities

Telemetry Standards & Architecture

Implement organization-wide telemetry standards covering metrics, logs, and distributed traces across all GridOS SaaS services.
Implement metrics collection for Kubernetes-hosted services (EKS/Rancher), including pod-level, namespace-level, and cluster-level metrics.
Working with the SRE Lead and SRE Platform Engineers, help define and implement data retention policies, cardinality budgets, and telemetry cost controls to keep observability economically sustainable.
Publish and maintain an Observability Runbook library covering onboarding, alert tuning, and dashboard standards for Platform SRE and Production DevOps teams.

SLO Definition, Tooling & Governance

Partner with product engineering, Platform SRE, and customer stakeholders to define meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs) per product and customer tier.
Build and maintain SLO tooling: error budget burn-rate alerts, burn-rate dashboards, and automated SLO compliance reports.
Govern the SLO review cycle: facilitate monthly SLO reviews, identify reliability risks early, and drive prioritization of reliability work with the SRE Lead.
Translate SLOs...
