
Manager, Site Reliability Engineer - DGX Cloud

NVIDIA

🌍 Asia 🏠 Remote ⏱ Part-time 💼 Manager 🗓 4 weeks ago

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

At NVIDIA, we are building the future of AI and high-performance computing, and our cloud platforms are at the core of this transformation. As a Senior Manager of SRE, you will play a pivotal role in shaping the operational excellence of these critical services, leading a talented team of SREs to build robust systems, automate operations, and drive a culture of continuous improvement.

What you'll be doing:

Recruit, develop, and inspire a team of Site Reliability Engineers, fostering a strong culture of collaboration, ownership, and technical excellence. Provide mentorship, guidance, and career development opportunities to help your team grow.
Establish and enforce SRE standard practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, and robust incident management processes. Drive continuous improvement in system reliability, availability, and performance.
Collaborate closely with engineering and product teams to design, build, and deploy highly scalable, fault-tolerant, and performant cloud services. Champion architecture reviews and ensure operational considerations are embedded from inception.
Lead initiatives to eliminate toil by driving automation across the entire service lifecycle, including provisioning, deployment, monitoring, inci...
