| Description: |
Senior DevOps Engineer - AI & Cloud Infrastructure
Type: Permanent / Full-Time (Employment or Contract considered)
Location: Remote or Hybrid
Time Zones: UK, Europe, North America-friendly
The Opportunity
We're working with a high-growth tech start-up building a next-generation AI cloud platform, focused on fast, reliable inference for large language models and other compute-intensive workloads. The platform combines modern cloud infrastructure, Kubernetes, GPU clusters, and developer-first tooling to support mission-critical AI systems operating across multiple regions.
They're now looking for a Senior DevOps Engineer to take ownership of the infrastructure backbone - someone who enjoys operating complex systems at scale and working closely with infrastructure, ML, and product engineering teams.
What You'll Be Doing
AI Cloud Infrastructure
- Design, build, and operate highly available, secure infrastructure supporting AI inference, fine-tuning, and data processing workloads
- Manage multi-region Kubernetes clusters, including GPU-heavy environments
- Implement autoscaling strategies across heterogeneous compute fleets
Infrastructure as Code & Automation
- Own and evolve infrastructure-as-code using tools such as Terraform, Helm, and similar
- Automate provisioning of compute, networking, and storage
- Build tooling to spin environments up and down for experiments, benchmarks, and customer deployments
CI/CD & Release Engineering
- Design and maintain CI/CD pipelines across backend, infrastructure, and ML components
- Implement safe deployment strategies (e.g. blue/green, canary releases)
- Partner with engineers to improve build speed, test reliability, and deployment confidence
Observability, Reliability & SRE
- Build and operate observability stacks (metrics, logging, tracing)
- Define and monitor SLOs / SLAs for latency, availability, and reliability
- Create runbooks, playbooks, and incident response processes for production systems
Security & Best Practices
- Implement best practices around secrets management, access control, and network security
- Support secure, multi-tenant environments for enterprise customers
- Help foster a culture of operational excellence, ownership, and reliability
What They're Looking For
Essential
- 4-8+ years' experience in DevOps, SRE, Platform, or Infrastructure Engineering
- Strong experience running production systems on major cloud platforms (AWS, GCP, or Azure)
- Deep hands-on experience with Kubernetes in production
- Strong Infrastructure-as-Code skills (Terraform or equivalent)
- Proficiency in at least one scripting or programming language (e.g. Python, Go, Bash)
- Solid understanding of networking, security fundamentals, and distributed systems
- Proven experience building reliable, observable, automated systems
Nice to Have
- Experience supporting GPU-based workloads or ML infrastructure
- Exposure to AI / ML platforms, inference systems, or data pipelines
- Familiarity with modern CI/CD tooling and GitOps approaches
- Experience with observability tooling (metrics, logs, tracing)
- Background in cloud platforms, AI infrastructure, or high-scale SaaS environments
Why Join
- Work on core infrastructure powering cutting-edge AI systems
- High impact and ownership over architecture and tooling decisions
- Collaboration with senior engineers across infrastructure, ML, and product
- Competitive compensation, equity, and long-term growth potential
- Flexible remote / hybrid working
 |