Author

Juan Andrés Zeni

Juan Andrés is the SRE Studio Lead at Moove It, where he works on the design and implementation of fault-tolerant, cost-efficient, and resilient cloud architectures for agile software projects.

Site Reliability Engineering (SRE) is one of the fastest-growing technology roles today. According to LinkedIn, SRE experts are currently ranked #21 on the list of the fastest-growing U.S. jobs. This demand reflects organizations' growing need for highly qualified people who can help them accelerate their development cycles while also improving site reliability and security. In this article, I want to explore the role of SRE and the reasons it has become so important.

What are the key responsibilities of Site Reliability Engineering?

Site reliability engineering (SRE) is a set of practices for keeping systems reliable and maintainable. An SRE team defines best practices, automation, and metrics, and uses them to solve problems such as a site slowing to the point of user frustration. The team aims to balance reliability with feature velocity.

It’s a discipline that incorporates aspects of software development and applies them to problems and tasks in infrastructure and operations.

Some examples of the questions an SRE team is responsible for answering include:

  • Deployment. How are changes published? How are changes brought to the server? How do you avoid service drops during a deployment?
  • Setup. How do you install the local development environment and new servers? Is it manual?
  • Automation. Which tasks in the development process are done manually and which can be automated?
  • Access control. How do you control access to resources in protected environments? Who provides and revokes permissions? Is that permission temporary or permanent?
  • Configuration. How is configuration, including secrets (API keys, passwords, etc.), handled in remote environments? How is it modified in each deployment?
  • Security. How do you ensure security awareness in all development work and promote it within teams?
  • Monitoring. How are services monitored in production environments? How does the team find out about availability problems?
  • Error management. How is the team made aware of errors in development? How are these errors handled when they occur in large numbers?
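
To make the monitoring and error management questions more concrete, here is a minimal Python sketch of one way a team might measure availability from health check results and condense a flood of errors into a short summary. It is an illustration only, not the tooling used at Moove It: the `ProbeResult` shape, the `checkout` service name, the error labels, and the 99% target are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one health check against a service (hypothetical shape)."""
    service: str
    ok: bool
    error: str = ""  # short error label, set only when ok is False


def availability(results: list[ProbeResult]) -> float:
    """Fraction of probes that succeeded; 1.0 when there is no data yet."""
    if not results:
        return 1.0
    return sum(r.ok for r in results) / len(results)


def summarize_errors(results: list[ProbeResult]) -> list[tuple[str, int]]:
    """Group failures by label so many identical errors become one line each."""
    counts = Counter(r.error for r in results if not r.ok)
    return counts.most_common()


def should_alert(results: list[ProbeResult], target: float = 0.99) -> bool:
    """Alert when measured availability falls below the target (assumed value)."""
    return availability(results) < target


# Example data: 97 successful probes and 3 failures for a made-up service.
results = [ProbeResult("checkout", True)] * 97 + [
    ProbeResult("checkout", False, "timeout"),
    ProbeResult("checkout", False, "timeout"),
    ProbeResult("checkout", False, "http_500"),
]

print(f"availability: {availability(results):.2%}")
print("alert:", should_alert(results))
print("errors:", summarize_errors(results))
```

Grouping failures by label is one simple answer to the "large numbers" question: the team sees two kinds of problem with a count each, rather than one notification per failed request.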

 

Putting it into action: What does the Site Reliability Engineering team do?

SRE teams are in charge of many different tasks. To understand more about the role, we decided to sit down with some of the engineers in Moove It’s SRE Studio. The studio works with our clients, helping them to improve the productivity of their own development teams, while also maximizing the reliability of large-scale systems.

Currently they are working with several of our clients to:

  • Define or develop tools that make the development process more efficient, reliable, and secure. This also involves helping our clients' senior executives decide on tools for operations tasks.
  • Manage and automate infrastructure. With most organizations shifting to cloud-first environments, the SRE Studio has defined processes for automating the creation and installation of instances in cloud services.
  • Provide support for problems in production environments. From issues with continuous deployment to ensuring appropriate capacity planning, the SRE Studio works to ensure our clients can deploy without any issues.
  • Define internal policies, both within Moove It and in client technology organizations. In addition to defining these policies, the team audits projects to maintain or increase the quality of solutions and to assess whether the policies are working.
  • Promote the reuse of knowledge and work. The studio isn’t always responsible for building the entire solution, but it is key in communicating technology solutions and identifying what can be converted into a reusable asset. This means we’re constantly increasing our development speed, as we don’t always need to develop from scratch. A key part of this also involves generating documentation and training on tools that can be useful for teams.

 

A typical Site Reliability Engineering implementation roadmap

The future of Site Reliability Engineering at Moove It

We understand the importance of our SRE Studio here at Moove It. That’s why we’re incorporating the SRE team into more and more of our projects. We believe this is an essential part of our process because it lets our team define the infrastructure and operations aspects of a solution from day one. It means that every organization we work with can benefit from the SRE team’s deep expertise, while the ability to create assets and reuse work also drives greater efficiency within Moove It.

 

Want to know more about our work in site reliability engineering? Check out our brand new SRE Studio website!
