Nvidia
Site Reliability Engineer - Hardware Infrastructure
Found: January 16, 2026
This role is based in Santa Clara, CA.
Compensation:
$168,000 - $333,500/year
Responsibilities:
- Develop and support guidelines for incident management, planned maintenance, and blameless postmortems.
- Assist teams in responding to high-severity incidents and driving root cause analysis.
- Define reliability and supportability metrics, Service Level Objectives, and error budgets.
- Apply automation and generative AI/agentic solutions to enhance customer support.
- Guide teams on establishing sustainable on-call and operational standards.
Requirements:
- Degree in Computer Science or related field, or equivalent experience.
- 8+ years of experience in SRE, DevOps, or Production Engineering.
- Strong understanding of SRE principles and experience with fault-tolerant systems.
- Experience in Python, Go, Perl, or Ruby.
- Hands-on experience with observability platforms such as Prometheus and Grafana.