Overview
This contract role focuses on building and operating the infrastructure that supports production machine learning systems at scale. You’ll work on the platforms that keep ML and LLM-powered services running reliably in live environments, with responsibility for performance, stability, and smooth deployment into existing products. This is a hands-on engineering role centred on production systems, not research or experimentation.
Responsibilities
• Design and operate production ML platforms supporting live applications
• Build and maintain Kubernetes-based environments for model inference and pipelines
• Support language-model workloads including retrieval-based systems and model serving
• Implement and maintain CI/CD pipelines for ML deployments
• Monitor platform and model performance, including latency, drift, and resource usage
• Work closely with engineering and product teams to support delivery
• Ensure systems meet reliability, security, and operational standards
Required experience
• Strong Python experience in production environments
• Hands-on experience running ML workloads on Kubernetes
• Ownership of ML systems from deployment through monitoring and maintenance
• CI/CD experience for data or ML platforms
• Observability tooling experience (Prometheus, Grafana, or similar)
• Cloud platform experience (AWS or Azure)
Nice to have
• Experience supporting LLM-based systems in production
• GPU-backed workloads or performance optimisation
• Infrastructure-as-code exposure (Terraform or similar)