Site Reliability Engineering - course 65,000 rub. from Slurm, training, Date January 1, 2024.
Miscellaneous / / November 29, 2023
FOR PEOPLE
An SRE engineer can be either an operations engineer or a developer. During the intensive course, you will practice a lot, and the skills and knowledge you gain can be adapted and implemented in any field.
FOR BUSINESS
SRE solves the same problems as DevOps: it speeds up the release of new features and improves processes within the team. But the main task of SRE is to ensure the stability and reliability of services, ruling out situations where users complain about failures while the engineers' graphs are all green.
We are building:
Our training site consists of several microservices. It aggregates data on showtimes, prices and available seats from all cinemas, displays movie announcements, and lets you select a cinema, showtime, hall and seat, then book and pay for tickets.
We will formulate SLO, SLI, SLA indicators for this site, develop an architecture and infrastructure that will support them, set up monitoring and alerting.
Developer errors, infrastructure failures, an influx of visitors, and DoS attacks all push the service out of its SLOs.
We analyze stability, error budget, testing practice, management of interruptions and operational load.
An incident has occurred. The payment processing service is down. How do you act to restore functionality in the shortest possible time?
We organize the work of the emergency response team: involving colleagues, notifying stakeholders, setting priorities. We train to work under pressure and tight time constraints.
Let's look at the site from an SRE point of view. We analyze incidents (their causes and the course of their resolution). We make decisions to prevent them from recurring: we improve monitoring and change the architecture, the approach to development and operation, and the regulations. We automate processes.
— We have dozens of built infrastructures and hundreds of written CI/CD pipelines,
— Certified Kubernetes Administrator,
— Author of several courses on Kubernetes and DevOps,
— Regular speaker at Russian and international IT conferences.
DAY 1: AMA kick-off session
We will discuss the goals and objectives of the course, explain what SRE is, and split participants into teams.
Access to 2 theoretical topics opens:
Topic 1: Monitoring
- Why is monitoring needed?
- Percentiles
- Alerting
- Observability
Topic 2: SRE Theory
- SLO, SLI, SLA
- Durability
- Error budget
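To make the link between an SLO and its error budget concrete, here is a minimal Python sketch of the arithmetic. The function name `error_budget`, the 99.9% target and the request counts are illustrative assumptions, not values from the course.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Compute the SLI and error-budget figures for one measurement window.

    slo is the target share of successful requests, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The error budget is the number of failures the SLO still tolerates.
    allowed_failures = (1.0 - slo) * total_requests
    return {
        "sli": 1.0 - failed_requests / total_requests,   # measured success ratio
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_spent": failed_requests / allowed_failures,  # 1.0 means exhausted
    }


# Hypothetical window: one million requests, 250 of them failed.
report = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(report)
```

Under these assumed numbers the budget is about 1,000 failed requests per window, and 250 failures spend roughly a quarter of it.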
DAY 2: analysis of practices and cases
Practice: Making a basic dashboard and setting up the necessary alerts
Practice: Adding SLO/SLI + alerts to the dashboard
Practice: First system load
Case 1 solution: downstream dependency.
In a large system, there are many interdependent services, and they do not always work equally well. It’s especially annoying when your service is in order, but the neighboring one, on which you depend, periodically goes down.
The training project will find itself in exactly these conditions, and you will make sure it still delivers the highest possible quality.
DAY 3: AMA session, questions answered
Access to the 2nd theoretical module opens:
Solving problems with the environment and architecture
The second module is built around solving two cases: upstream dependency and architectural problems. Speakers will talk about incident management, rules for the fire brigade, and working with postmortems, and will provide templates that you can use in your team.
Topic 3: Incident Management
- Resilience Engineering
- How a fire brigade is formed
- How effective is your team during an incident?
- 7 rules for an incident leader
- 5 rules for a firefighter
- HiPPO - highest paid person's opinion. Communications Leader
Topic 4: War room tools and alert management.
Best practices of other companies in organizing incident management.
DAY 4: analysis of practices and cases
Solution to case 2: upstream dependency.
It's one thing when you depend on a service with a low SLO. It's another matter when your own service is that kind of dependency for other parts of the system. This happens when the success criteria are not aligned: for example, you respond to a request within a second and count it as a success, but the dependent service waits only 500 ms and gives up with an error.
In this case, we will discuss the importance of harmonizing metrics and learn to look at quality through the eyes of the client.
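The mismatch described above can be shown with a few lines of Python. This is only a sketch: the latency samples are invented for illustration, and the two thresholds simply restate the one-second and 500 ms figures from the example.

```python
SERVER_SUCCESS_THRESHOLD_S = 1.0  # the service counts replies within 1 s as successes
CLIENT_TIMEOUT_S = 0.5            # the dependent service gives up after 500 ms


def success_ratio(latencies_s, threshold_s):
    """Share of requests that finished within the given threshold."""
    ok = sum(1 for latency in latencies_s if latency <= threshold_s)
    return ok / len(latencies_s)


# Made-up response times in seconds for ten requests.
latencies = [0.12, 0.30, 0.45, 0.62, 0.80, 0.95, 0.20, 0.70, 0.40, 0.55]

server_view = success_ratio(latencies, SERVER_SUCCESS_THRESHOLD_S)
client_view = success_ratio(latencies, CLIENT_TIMEOUT_S)

print(f"SLI as the service sees it: {server_view:.0%}")
print(f"SLI as the client sees it:  {client_view:.0%}")
```

The same ten requests look fully successful from the service side, while the client sees half of them as timeouts. Measuring the SLI against the client's threshold closes that gap.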
Solution to case 3: problems with the database.
The database can also be a source of problems. For example, if you do not monitor replication lag, the replica will fall behind and the application will return stale data. Debugging such cases is especially difficult: one moment the data is inconsistent, a few seconds later it is consistent again, and it is not clear what caused the problem.
Through the case, you will feel all the pain of debugging and learn how to prevent such problems.
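One common way to guard against stale reads is to check each replica's lag before sending reads to it. The sketch below assumes the lag has already been measured somewhere else; `ReplicaStatus`, `pick_read_targets` and the 5-second threshold are hypothetical names and values, not part of the course materials.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    name: str
    lag_seconds: float  # how far the replica is behind the primary


MAX_LAG_SECONDS = 5.0  # hypothetical freshness threshold


def pick_read_targets(replicas, max_lag=MAX_LAG_SECONDS):
    """Return replicas fresh enough to serve reads, or fall back to the primary."""
    fresh = [r.name for r in replicas if r.lag_seconds <= max_lag]
    # If every replica is too far behind, read from the primary instead
    # of returning outdated data.
    return fresh or ["primary"]


replicas = [ReplicaStatus("replica-1", 0.4), ReplicaStatus("replica-2", 42.0)]
print(pick_read_targets(replicas))
```

The same lag value is a natural candidate for a dashboard panel and an alert, so that a lagging replica is noticed before users see old data.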
Practice: We write a postmortem on the previous case and discuss it with the speakers.
DAY 5: AMA session, questions answered
AMA session and answers to questions on previous topics.
Access to the 3rd theoretical module opens:
Traffic shielding and canary releases
In the third module we will work through a case dedicated to a problem with the environment (with a detailed analysis of Health Checking). We will also go step by step through how to implement SRE in a company and learn from the experience of the companies where the speakers of the intensive work.
Topic 5: Health Checking
- Health Check in Kubernetes
- Is our service still alive?
- Exec probes
- InitialDelaySeconds
- Secondary Health Port
- Sidecar Health Server
- Headless Probe
- Hardware Probe
Topic 6: Deployment methods
Topic 7: SRE project onboarding
Large companies often form a separate SRE team, which takes on the services of other departments for support. But not every service is ready to be accepted for support. We'll tell you what requirements it must meet. Speakers will also share their experience, how they implemented SRE and what mistakes they made.
DAY 6: analysis of practices and cases
Solution to case 4: there is a problem with the environment, it is impossible to buy tickets.
The job of a health check is to detect a broken service and block traffic to it. If you think it is enough to request the service's root path and receive a response, you are mistaken: even if the service responds, this does not guarantee that it works, because problems may arise in its environment.
Through this case, you will learn how to configure a correct health check and keep traffic from going where it cannot be processed.
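As a rough illustration of a health check that looks beyond "the process answers", here is a minimal Python sketch. It assumes the service can test each piece of its environment with a small function; the name `readiness` and the `database` and `payments` checks are hypothetical placeholders.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every environment check and map the result to an HTTP status code.

    Returns 200 only if all checks pass, otherwise 503 so that the
    load balancer or orchestrator stops routing traffic to this instance.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


# Hypothetical checks: a real service would ping its database,
# payment gateway and so on instead of returning constants.
status, details = readiness({
    "database": lambda: True,
    "payments": lambda: False,
})
print(status, details)
```

Here the service itself is up, yet the check reports 503 because a dependency it needs is failing. Which dependencies belong in such a check is a design decision, since an overly strict check can take healthy instances out of rotation.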
Summing up