DevOps Engineer
Added 2 days ago
- England, London, City of London
- Full Time, Permanent
- £85,000 per annum, negotiable
Job Description:
DevOps Engineer - Reinforcement Learning Platforms
We are seeking an experienced DevOps Engineer to help build and scale a web-based platform for reinforcement learning (RL) training and RLOps. You will design, implement, and maintain the cloud infrastructure, CI/CD pipelines, and deployment systems that support large-scale RL workloads.
Responsibilities
* Design and manage scalable cloud infrastructure for high-performance RL training and distributed environments
* Build and optimise CI/CD pipelines for open-source and enterprise components
* Implement containerisation and orchestration using Docker and Kubernetes
* Develop Infrastructure as Code solutions (Terraform, CloudFormation, Pulumi)
* Implement monitoring, logging, and alerting for distributed ML systems
* Collaborate with ML teams on resource optimisation and cost efficiency
* Apply security best practices, manage access controls, and ensure compliance
* Automate operational tasks: backups, disaster recovery, maintenance
* Support GPU clusters and distributed compute resources for RL workloads
* Maintain availability and performance of production ML systems
Requirements
* Degree in Computer Science/Engineering or 3+ years of DevOps/infrastructure experience
* Strong background with AWS, GCP, or Azure, including ML/AI workloads
* Proficiency with Docker, Kubernetes, and ML-focused orchestration
* Experience with Terraform/CloudFormation/Pulumi and configuration management
* Solid understanding of CI/CD tools (GitHub Actions, GitLab CI, Jenkins)
* Knowledge of monitoring/observability tools (Prometheus, Grafana, OpenObserve)
* Experience with GPU infrastructure and distributed ML compute frameworks
* Familiarity with MLOps tools and model lifecycle management
* Strong scripting skills (Python, Bash)
* Understanding of cloud networking, security, and database fundamentals
* Experience with HPC environments or schedulers is a plus
* Strong problem-solving and communication skills
Compensation & Benefits
* Stock options
* 30 days’ holiday plus bank holidays
* Flexible and remote working options
* Enhanced parental leave
* £500 annual learning and development budget
* Pension scheme
* Regular socials and quarterly gatherings
* Bike-to-Work scheme
Job number 3193729
Company Details:
Matchtech
Company size: 250–499 employees
Industry: Other
We’re motivated by our mission to bridge the STEM skills gap. Since our doors opened in 1984, Matchtech has grown to become one of the UK’s ...