Site Reliability Engineer job description

A Site Reliability Engineer is a professional who acts as a bridge between development and IT operations, taking on operational tasks to ensure the efficient functioning of computer systems. They are responsible for monitoring, automating, and improving the reliability, performance, and availability of software systems.

Use this Site Reliability Engineer job description to advertise your vacancies and find qualified candidates. Feel free to modify responsibilities and requirements based on your needs.

What is a Site Reliability Engineer?

A Site Reliability Engineer is a professional who plays a crucial role in maintaining the reliability and performance of computer systems in an organization. They bridge the gap between development and IT operations by taking on operational tasks and responsibilities typically handled by operations teams.

What does a Site Reliability Engineer do?

A Site Reliability Engineer is responsible for monitoring, automating, and improving the reliability, performance, and availability of software systems in an organization. They work on tasks such as preventing incidents, managing infrastructure, building effective monitoring systems, and ensuring the smooth operation of computer systems.

Site Reliability Engineer responsibilities include:

  • Working on-call shifts and preventing incidents before they happen
  • Running our infrastructure with Chef, Ansible, Terraform, GitLab CI/CD, and Kubernetes
  • Building monitoring that alerts on symptoms rather than on outages

Job brief

We are looking for a Site Reliability Engineer to join our team and develop software systems and automated solutions that support our organization's operations.

Site Reliability Engineer responsibilities include monitoring computer systems and building alerts for the operational issues those systems can experience.

Ultimately, you will work with our IT team to ensure our organization can continue to deliver products and services reliably across our computer systems.

Responsibilities

  • Administer production jobs
  • Interpret debugging information
  • “Drain” traffic away from a cluster
  • Roll back a bad software push
  • Block or rate-limit unwanted traffic
  • Bring up additional serving capacity
  • Use the monitoring systems (for alerting and dashboards)

Requirements and skills

  • Proven work experience as a Site Reliability Engineer or similar role
  • Ability to collaborate and communicate asynchronously
  • Habit of documenting all the things so you don’t need to learn the same thing twice
  • An enthusiastic, go-for-it attitude
  • Relevant training and/or certifications as a Site Reliability Engineer
