Site Reliability Engineering: training course from Slurm, 65,000 rub. Date: January 1, 2024.

Miscellaneous / by admin / November 29, 2023

FOR PEOPLE
An SRE engineer can be either an operations engineer or a developer. During the intensive course, you will practice a lot, and the skills and knowledge you gain can be adapted and implemented in any field.

FOR BUSINESS
SRE solves the same problems as DevOps: it speeds up the release of new features and improves processes within the team. But the main task of SRE is to ensure the stability and reliability of services, and to rule out situations where users complain about failures while the engineers' dashboards are all green.

We are building:
Our training site consists of several microservices. It aggregates data on showtimes, prices and available seats from all cinemas, shows movie announcements, and lets you select a cinema, showtime, hall and seat, then book and pay for tickets.

We will formulate SLO, SLI, SLA indicators for this site, develop an architecture and infrastructure that will support them, set up monitoring and alerting.

Developer errors, infrastructure failures, an influx of visitors, and DoS attacks all degrade the SLOs.

We analyze stability, error budget, testing practice, management of interruptions and operational load.

An incident has occurred: the payment processing service is down. How do you act to restore service in the shortest possible time?

We organize the work of the emergency response team: involving colleagues, notifying stakeholders, setting priorities. We train to work under pressure in extremely limited time conditions.

Let's look at the site from an SRE point of view. We analyze incidents: what caused them and how they were resolved. We make decisions to prevent them from recurring: we improve monitoring and change the architecture, the approach to development and operation, and the regulations. We automate processes.

- Has built dozens of infrastructures and written hundreds of CI/CD pipelines,
- Certified Kubernetes Administrator,
- Author of several courses on Kubernetes and DevOps,
- Regular speaker at Russian and international IT conferences.

DAY 1: AMA kick-off session

We will discuss the goals and objectives of the course, explain what SRE is, and split the participants into teams.

Two theory topics open:

Topic 1: Monitoring

Why is monitoring needed?
Percentiles
Alerting
Observability
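
To illustrate why percentiles appear in the monitoring topic, here is a minimal sketch showing how a single slow request distorts the mean while percentiles still describe what most users see. The sample latencies and the `percentile` helper (nearest-rank method) are assumptions made for this example, not material from the course.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds; one request is an outlier.
latencies_ms = [120, 95, 110, 105, 100, 130, 90, 115, 2500, 98]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p90_ms = percentile(latencies_ms, 90)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} p50={p50_ms} p90={p90_ms} p99={p99_ms}")
```

The mean lands well above what nine out of ten requests experienced, which is why alerts and SLIs are usually defined on percentiles rather than averages.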

Topic 2: SRE Theory

SLO, SLI, SLA
Durability
Error budget
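
As a worked illustration of the error budget idea, the sketch below derives a budget from an SLO target. The function names, the 99.9% target, the 30-day window and the request counts are assumptions chosen for the example.

```python
def error_budget(slo, total_requests, failed_requests):
    """Return the budget implied by an availability SLO over a request window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    allowed_failures = (1 - slo) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": allowed_failures - failed_requests,
        "consumed_fraction": failed_requests / allowed_failures,
    }

def allowed_downtime_minutes(slo, window_days=30):
    """Downtime that a time-based availability SLO tolerates in the window."""
    return (1 - slo) * window_days * 24 * 60

# Hypothetical month: 1,000,000 requests, 250 of them failed, SLO of 99.9%.
budget = error_budget(0.999, 1_000_000, 250)
downtime = allowed_downtime_minutes(0.999)

print(f"allowed failures: {budget['allowed_failures']:.0f}")
print(f"budget consumed:  {budget['consumed_fraction']:.0%}")
print(f"allowed downtime: {downtime:.1f} min per 30 days")
```

When the consumed fraction approaches 100%, a team would typically slow down releases and spend time on reliability work instead.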

DAY 2: analysis of practices and cases

Practice: Making a basic dashboard and setting up the necessary alerts

Practice: Adding SLO/SLI + alerts to the dashboard

Practice: Putting the first load on the system

Solution to case 1: downstream dependency.

In a large system, there are many interdependent services, and they do not always work equally well. It’s especially annoying when your service is in order, but the neighboring one, on which you depend, periodically goes down.

The training project will find itself in exactly these conditions, and you will make sure it still delivers the highest possible quality of service.
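
The course text does not say which techniques are used in this case. One common option for tolerating a flaky dependency is to serve the last known good answer when the dependency fails. The sketch below shows that idea; `CachedFallback` and `flaky_price_service` are hypothetical names invented for the example.

```python
class CachedFallback:
    """Wrap a dependency call and fall back to the last good value on failure."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._last_good = {}

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            # Dependency failed: serve stale data if we have any, else re-raise.
            if key in self._last_good:
                return self._last_good[key], "stale"
            raise
        self._last_good[key] = value
        return value, "fresh"

# Hypothetical dependency that fails on every second call.
_calls = {"count": 0}

def flaky_price_service(cinema_id):
    _calls["count"] += 1
    if _calls["count"] % 2 == 0:
        raise ConnectionError("price service is down")
    return {"cinema_id": cinema_id, "price": 350}

prices = CachedFallback(flaky_price_service)
first = prices.get("cinema-1")   # dependency answers
second = prices.get("cinema-1")  # dependency fails, cached value is served
print(first, second)
```

Whether stale data is acceptable depends on the data: a slightly old price list may be fine, while stale seat availability probably is not.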

DAY 3: AMA session, questions answered

Access to the 2nd theoretical module opens:

Solving problems with the environment and architecture

The second module is built around solving two cases: upstream dependency and architectural problems. Speakers will talk about incident management, rules for the fire brigade, and working with postmortems, and will provide templates that you can use in your team.

Topic 3: Incident Management

Resilience Engineering
How a fire brigade is formed
How effective is your team during an incident?
7 rules for an incident leader
5 rules for a firefighter
HiPPO - highest paid person's opinion. Communications Leader

Topic 4: War room tools and alert management.

Best practices of other companies in organizing incident management.

DAY 4: analysis of practices and cases

Solution to case 2: upstream dependency.

It's one thing when you depend on a service with a low SLO. It's another matter when your service is that unreliable dependency for other parts of the system. This happens if the evaluation criteria are not consistent: for example, you respond to a request within a second and consider it a success, but the dependent service waits only 500 ms and gives up with an error.

In the case, we will discuss the importance of harmonizing metrics and learn to look at quality through the eyes of the client.
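
The mismatch described in this case can be shown with a few lines of arithmetic. In the sketch below, the same set of response times is scored twice: once against the server's own one-second success criterion and once against the caller's 500 ms timeout. The latency values and the `success_rate` helper are assumptions made for the example.

```python
def success_rate(latencies_ms, threshold_ms):
    """Fraction of requests that completed within the given threshold."""
    if not latencies_ms:
        raise ValueError("no requests")
    ok = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return ok / len(latencies_ms)

# Hypothetical response times of our service, in milliseconds.
latencies_ms = [120, 300, 450, 480, 520, 700, 900, 950, 200, 990]

server_view = success_rate(latencies_ms, 1000)  # our own success criterion
client_view = success_rate(latencies_ms, 500)   # the caller's timeout

print(f"server sees {server_view:.0%} success, client sees {client_view:.0%}")
```

The service reports every request as a success, while the caller experiences half of them as errors, which is why the SLI threshold has to be agreed with the consumers of the service.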

Solution to case 3: problems with the database.

The database can also be a source of problems. For example, if you do not monitor replication lag, the replica will fall behind and the application will return stale data. Moreover, debugging such cases is especially difficult: one moment the data is inconsistent, a few seconds later it is consistent again, and it is not clear what caused the problem.

Through the case, you will feel all the pain of debugging and learn how to prevent such problems.
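
One way to guard against stale reads is to measure how far the replica is behind the primary and stop reading from it when the delay exceeds a limit. The sketch below assumes the two timestamps are already available from the database; the function names, the one-second limit and the sample timestamps are hypothetical.

```python
def lag_seconds(primary_commit_ts, replica_applied_ts):
    """Delay between the last commit on the primary and the last change
    applied on the replica, in seconds (never negative)."""
    return max(0.0, primary_commit_ts - replica_applied_ts)

def pick_read_target(replica_lag, max_lag=1.0):
    """Route reads to the replica only while its lag is known and acceptable."""
    if replica_lag is None:
        # Lag is unknown (for example, monitoring is broken): be conservative.
        return "primary"
    return "replica" if replica_lag <= max_lag else "primary"

# Hypothetical timestamps in seconds since the epoch.
healthy_lag = lag_seconds(1_700_000_000.4, 1_700_000_000.1)
stale_lag = lag_seconds(1_700_000_012.0, 1_700_000_000.0)

print(pick_read_target(healthy_lag), pick_read_target(stale_lag))
```

The same lag value is also a natural metric to export and alert on, so that stale data is noticed before users report it.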

Practice: We write a postmortem on the previous case and discuss it with the speakers.

DAY 5: AMA session, questions answered

AMA session and answers to questions on previous topics.

Access to the 3rd theoretical module opens:

Traffic shielding and canary releases

In the third module we will analyze a case dedicated to a problem with the environment (with a detailed look at health checking). We will also go step by step through how to implement SRE in a company and learn from the experience of the companies where the speakers of the intensive work.

Topic 5: Health Checking

Health Check in Kubernetes
Is our service still alive?
Exec probes
InitialDelaySeconds
Secondary Health Port
Sidecar Health Server
Headless Probe
Hardware Probe

Topic 6: Deployment methods

Topic 7: SRE project onboarding

Large companies often form a separate SRE team, which takes on the services of other departments for support. But not every service is ready to be accepted for support. We'll tell you what requirements it must meet. Speakers will also share how they implemented SRE and what mistakes they made.

DAY 6: analysis of practices and cases

Solution to case 4: there is a problem with the environment, it is impossible to buy tickets.

The task of a health check is to detect a broken service and block traffic to it. If you think it is enough to send a request to the root path of the service and receive a response, you are mistaken: even if the service responds, this does not guarantee that it works, because problems may arise in its environment.

Through this case, you will learn how to configure the correct Healthcheck and not allow traffic to go where it cannot be processed.
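
A minimal sketch of a health check that looks beyond "the process answers": it runs a set of dependency checks and reports the service as unhealthy if any of them fails or raises. The `health_status` function and the check names are assumptions for this example, and the real checks (database ping, payment gateway call) are replaced with stubs.

```python
def health_status(checks):
    """Run every check; return (HTTP status code, per-check results).

    A check passes only if it returns a truthy value without raising.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results

def broken_payment_gateway():
    raise TimeoutError("payment gateway did not answer")

# Stub checks: the process is up, but part of its environment is not.
checks = {
    "process": lambda: True,
    "database": lambda: True,
    "payments": broken_payment_gateway,
}

status, results = health_status(checks)
print(status, results)
```

Returning 503 lets the load balancer or orchestrator stop sending traffic to this instance. In practice the checks need short timeouts, and it is worth deciding which dependencies are critical enough to take an instance out of rotation.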

Summarizing
