Job Overview
About the role
Own the platform that powers our accelerator cloud. Your scope spans bare-metal provisioning, multi-tenant Kubernetes, SLURM scheduling, control planes, and the automation and observability that keep thousands of compute nodes running as a single production system.
What you'll do
• Build the control plane and APIs that unify our compute fleet
• Own provisioning and lifecycle from rack bring-up to node retirement
• Operate the scheduling layer for training and inference workloads
• Architect multi-tenancy: isolation, quota, fairness, and accounting
• Build automation that eliminates manual operations
• Drive reliability, observability, and incident response across the fleet
What you'll need
• BS in CS, EE, or a related field, or equivalent experience
• 5+ years in infrastructure, platform, or backend engineering
• Advanced software engineering skills in Rust, Go, or Python
• Deep understanding of Linux, storage, and distributed systems
• Experience with workload schedulers: SLURM, Kubernetes scheduling, or equivalent
• Expertise with automation tooling: Terraform, Ansible, Helm
• Experience architecting multi-tenant systems
• Production SRE experience: on-call, incident response, observability
What we offer
• Top-tier compensation structured to recognize and retain the best talent
• Meaningful equity
• Comprehensive medical, dental, vision, life, and disability insurance
• Parental leave for all new parents, including those who become parents through adoption or surrogacy
• Flexible PTO
• Paid holidays
• Relocation support
Equal Employment Opportunity
We're an Equal Opportunity Employer and do not discriminate on the basis of any protected status under applicable law.