Nvidia
Senior Platform and EngOps Engineer - Cluster Operations
Found: Today
This role is based in Santa Clara, CA.
Compensation:
Salary range is $176,000 - $333,500/year depending on level.
Responsibilities:
- Develop automated tools for deploying and maintaining GPU clusters.
- Implement DevOps tools for software updates and cluster monitoring.
- Troubleshoot daily cluster issues to ensure optimal performance.
- Manage software and firmware updates for clusters.
- Collaborate with engineering teams across time zones.
Requirements:
- BS or MS in Computer Science or related field, or equivalent experience.
- 8+ years of experience in cluster administration.
- Expertise in Ansible, Python, and Shell Scripting.
- Deep understanding of operating systems and computer networks.
- Proficient in Linux fundamentals.
Preferred Qualifications:
- Familiarity with resource scheduling managers like Slurm.
- Experience with GPU-focused hardware and software.