Site Reliability Engineer

Dunstable, District Of Columbia

Permanent

Our client is a Series A startup in the Generative AI space, and they are hiring a Site Reliability Engineer to join the team. Backed by one of the leading venture capital firms in the industry, this is an exciting opportunity to join a SaaS company that is revolutionizing its industry.

Responsibilities:

  • Perform root cause analysis to identify and resolve system or application issues in a timely and effective manner

  • Design and implement a broad range of automated tests to ensure system reliability and performance

  • Build scalable and cost-effective observability patterns in Datadog or other monitoring providers

  • Monitor and analyze SLIs to ensure adherence to SLAs and SLOs

  • Collaborate with development and operations teams to improve system reliability and developer experience

  • Develop and maintain monitoring and alerting systems to proactively address issues

  • Implement best practices for incident management and disaster recovery

  • Plan and implement capacity upgrades, ensuring scalability and performance

  • Define, monitor, and manage SLAs, ensuring service levels meet or exceed expectations

  • Ensure systems comply with security and regulatory requirements

Skillset:

  • Experience with Kubernetes and Helm

  • Expertise in observability and monitoring tools such as Prometheus, Grafana, Datadog, or the ELK stack

  • Experience in Azure cloud

  • Strong understanding of microservices architecture, including Postgres and AI systems

  • Expertise in automated testing frameworks and tools

  • Experience with monitoring and analytics tools to track SLIs, SLAs, and SLOs

  • Excellent problem-solving skills, attention to detail, and a tenacious attitude

  • Proficiency in programming languages such as TypeScript and Python

  • Strong scripting skills in Bash, PowerShell, or similar

  • Understanding of networking principles and experience with network troubleshooting

This is a full-time, remote position and is open only to US citizens due to potential security clearance requirements.

Benefits:

  • Salary: $140k – $175k

  • Stock options

  • Benefits package

Interested? Apply now via the link below or email your resume directly to matthew@alldus.com for consideration.
