Course Content
1
Prometheus: Instrumenting Services and Scraping Metrics
5 lessons
- 1. Distinguish Counter, Gauge, Histogram, and Summary Metric Types in a Live Service (30 min)
- 2. Deploy Prometheus and Configure Scrape Jobs Against the Order API (28 min)
- 3. Query Service Health with PromQL: rate(), sum(), and histogram_quantile() (28 min)
- 4. Harden Scrape Config with Relabeling and Service Discovery (30 min)
- 5. Integrate the Prometheus Stack and Validate the Instrumentation Lab (32 min)
2
Grafana: Building Operations Dashboards and Visual Alerting
5 lessons
- 1. Connect Grafana to Prometheus and Build Your First Panel (28 min)
- 2. Construct a Golden-Signals Dashboard: Latency, Traffic, Errors, Saturation (30 min)
- 3. Configure Grafana Alerts with Thresholds and Notification Channels (28 min)
- 4. Version Dashboards as Code and Eliminate Alert Fatigue (30 min)
- 5. Integrate Grafana Stack and Stage the Dashboard Lab (32 min)
3
Alertmanager: Routing, Silencing, and On-Call Workflows
5 lessons
- 1. Write Prometheus Alerting Rules with for, Labels, and Annotations (28 min)
- 2. Route Alerts by Severity and Team with Alertmanager (28 min)
- 3. Suppress Noise with Inhibition Rules and Scheduled Silences (28 min)
- 4. Define SLOs and Implement Multi-Window Burn-Rate Alerting (30 min)
- 5. Integrate the Full Alerting Pipeline and Validate On-Call Flow (32 min)
4
ELK Stack: Centralized Logging and Log-Based Alerting
5 lessons
5
Distributed Tracing: OpenTelemetry and Jaeger for Request Flows
5 lessons
- 1. Instrument a Service with OpenTelemetry SDK and Auto-Instrumentation (30 min)
- 2. Deploy Jaeger and Propagate Trace Context Across Services (28 min)
- 3. Find Latency Bottlenecks by Reading Trace Waterfalls in Jaeger (28 min)
- 4. Correlate Traces, Metrics, and Logs with Shared Trace IDs (30 min)
- 5. Integrate Tracing into Your Full Observability Stack (32 min)
6
SRE Practices: Incident Response, Postmortems, and Reliability Engineering
5 lessons
- 1. Diagnose a Live Incident Using Metrics, Logs, and Traces Together (30 min)
- 2. Define SLIs, SLOs, and Error Budgets That Drive Engineering Decisions (30 min)
- 3. Run the Incident Command Workflow: Roles, Severity, and Communication (30 min)
- 4. Write a Blameless Postmortem with Timeline and Action Items (28 min)
- 5. Integrate the Full Observability and SRE Workflow in a Capstone Lab (35 min)