Job Description
We are seeking a hands-on Site Reliability Engineer (SRE) / AI Platform DevOps Engineer to own infrastructure provisioning, CI/CD automation, telemetry pipelines, and production deployment for AI-powered services, agents, and orchestration systems.
This is an SRE-heavy, infrastructure-first role focused on ensuring that AI systems running in production are:
Reliable
Observable
Scalable
Secure
Cost-efficient
Safe to deploy and operate
You will play a critical role in building and maintaining the platform foundation that enables AI services to run safely and efficiently at scale.
Key Responsibilities
1. Infrastructure Provisioning & Automation
Design and manage cloud infrastructure using Infrastructure as Code (Terraform or similar)
Provision and maintain Kubernetes clusters ...