Staff ML Infrastructure Engineer
San Francisco
USA
$300,000/year
Permanent
Artificial Intelligence
Location: United States (West Coast preferred, remote considered)
About the Company
We are a fast-growing AI company building next-generation large language models at scale. Our mission is to bring powerful, reliable AI systems into production environments used by thousands of customers. We value technical excellence, deep collaboration, and engineers who thrive on solving real-world problems.
Role Overview
We are seeking a Staff / Principal ML Infrastructure Engineer to lead the design, deployment, and scaling of our large language model infrastructure. This role sits at the intersection of machine learning, systems engineering, and platform design, enabling teams to train, serve, and monitor models efficiently and reliably.
This is not a prompt engineering role – it is focused on building robust, production-grade ML infrastructure and operational pipelines.
Responsibilities
- Design, implement, and maintain high-performance infrastructure for training and serving LLMs
- Optimize model pipelines for efficiency, latency, and cost at scale
- Collaborate with ML researchers, platform engineers, and product teams to deploy models safely into production
- Build monitoring, alerting, and tooling to ensure reliability and observability of large-scale ML systems
- Evaluate and integrate new frameworks, tools, and architectures to improve ML workflows
- Provide technical leadership and mentorship to other engineers on the team
Qualifications
- 7+ years of software engineering experience, including 3+ years building production ML systems
- Deep experience with distributed training and inference frameworks (e.g., PyTorch, JAX, TensorFlow)
- Familiarity with model serving technologies and orchestration (e.g., Triton, Ray, Kubernetes)
- Strong understanding of GPU/TPU infrastructure, performance optimization, and scalability challenges
- Proven experience solving reliability, latency, and cost trade-offs in production ML systems
- Excellent collaboration, communication, and problem-solving skills
- Experience mentoring or leading engineering teams is a plus
Why You’ll Enjoy This Role
- Work on cutting-edge LLM infrastructure at scale
- Influence the design of systems that power real-world AI applications
- Collaborate with some of the most talented engineers in AI
- Flexible work arrangements and competitive compensation
Darwin Recruitment is acting as an Employment Agency in relation to this vacancy.
Reece Waldon