AI Infrastructure Engineer
Our client, an early-stage, AI-driven startup in the defense industry, is hiring an AI Infrastructure Engineer to join their team in California. The successful candidate will design and scale the foundation of their model training and deployment ecosystem to enable their vision-language-action models to learn from massive real-world datasets and operate seamlessly across both edge and cloud environments.
Responsibilities
- Design and implement pipelines to ingest, transform and store petabytes of multimodal data from their robotic and operator systems.
- Develop tools for dataset exploration, curation, versioning and quality monitoring.
- Build and maintain distributed training infrastructure for large-scale multimodal and foundation model training, both in the cloud and on-premises.
- Implement orchestration workflows to launch, track and debug large-scale model runs.
- Identify and resolve bottlenecks in compute, memory, storage and network performance.
- Collaborate with AI, autonomy and systems teams to support real-time and mission-critical applications.
- Maintain observability and reliability tools for training and inference pipelines.
- Stay up to date with best practices in MLOps, distributed training frameworks and AI infrastructure at scale.
Skillset
- Bachelor’s degree or higher in Computer Science, Electrical Engineering or a related technical field.
- Minimum of 3 years of experience in ML infrastructure, MLOps or large-scale data systems.
- Proven experience with distributed training frameworks (e.g. PyTorch DDP, DeepSpeed, Ray) and workflow orchestration tools (e.g. Kubernetes, Airflow, or equivalents).
- Strong proficiency in Python and hands-on experience with cloud-native infrastructure (AWS, GCP or Azure).
- Solid understanding of data engineering concepts, including ETL pipelines, object storage, data versioning and metadata management.
- Familiarity with containerization technologies (Docker, Kubernetes) and monitoring systems (Prometheus, Grafana).
- Experience optimizing GPU cluster utilization, scaling training jobs and profiling model performance.
- Experience with edge-deployed ML systems, federated training or robotic data collection pipelines is a plus.
- Must have legal authorization to work in the U.S.; certain responsibilities may involve access to export-controlled information.
Benefits
- Salary: $160K – $220K DOE. Exceptional candidates may be considered for higher compensation.
- Performance Bonus.
- Equity.
- Medical, dental and vision insurance.