Staff Platform Engineer

California

Development - Java Backend

Permanent

Our client, a growing AI-driven startup, is hiring a Staff Platform Engineer to join the team in California. The successful candidate will design and scale the company's core infrastructure, driving projects that improve system reliability, boost developer efficiency and elevate operational standards within its talented engineering team.

Responsibilities

Design and build scalable backend systems that support complex data labeling workflows, enable real-time collaboration, and incorporate large language model (LLM) capabilities.
Lead infrastructure initiatives focused on deployment, observability and performance optimization within a cloud-native environment.
Develop secure, modular APIs and services that seamlessly integrate with both internal tools and customer-facing applications.
Work closely with engineers, product managers and data operations teams to advance the company’s platform features and capabilities.
Establish and maintain observability best practices – including logging, metrics, tracing and health monitoring – to deliver an exceptional developer experience.
Guide architectural decisions involving service decomposition, asynchronous processing, scaling strategies and high-availability systems.
Take ownership of the reliability and performance of critical services, including distributed task routing, model integrations and data pipelines.
Mentor engineering team members in creating scalable, maintainable systems using contemporary engineering best practices.

Skillset

Minimum of 8 years’ experience in backend or platform engineering, preferably within fast-growing SaaS or AI infrastructure companies.
Proficient in Python programming, with hands-on experience using frameworks such as FastAPI.
Strong expertise in cloud platforms (AWS, GCP or Azure), including containerization and orchestration technologies such as Docker and Kubernetes, and Infrastructure as Code tools such as Terraform.
Experienced with SQL databases (e.g. PostgreSQL) and event-driven messaging systems (e.g. Kafka, RabbitMQ, Redis Streams).
Well-versed in CI/CD pipelines, automated testing and deployment methodologies for large-scale production environments.
Demonstrated ability to troubleshoot complex distributed systems, identify performance bottlenecks and build robust, resilient architectures.
Comfortable working in early-stage, dynamic environments, taking initiative to identify and solve high-impact challenges.
Experience integrating large language models (LLMs), working with ML/AI infrastructure or contributing to data labeling platforms is a plus.