Building an Observability Stack
How we built a production-grade monitoring system with Grafana, VictoriaMetrics, vmagent, and structured logging.
Flying Blind
3 AM. API failing intermittently. Logs telling us nothing useful. We knew something was wrong—we just couldn't see it.
That incident changed everything. We needed more than logs. We needed to understand our systems in real-time.
Observability isn't optional. It's survival.
The Stack
Four components working together: vmagent scrapes metrics from services. VictoriaMetrics stores them efficiently. Grafana visualizes and alerts. Vector + Seq handle logs.
Each piece is replaceable. vmagent speaks Prometheus protocol. VictoriaMetrics accepts standard PromQL. The architecture is modular by design.
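The whole thing runs as a handful of containers. A minimal sketch of how the pieces connect (image tags, ports, and file paths here are illustrative, not our exact deployment):

```yaml
# docker-compose sketch -- service names, tags, and paths are illustrative
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics
    command: ["-retentionPeriod=12"]          # months of retention
    ports: ["8428:8428"]                      # Prometheus-compatible query API

  vmagent:
    image: victoriametrics/vmagent
    command:
      - "-promscrape.config=/etc/vmagent/scrape.yml"
      - "-remoteWrite.url=http://victoriametrics:8428/api/v1/write"
    volumes: ["./scrape.yml:/etc/vmagent/scrape.yml:ro"]

  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]

  vector:
    image: timberio/vector:latest-alpine
    volumes:
      - ./vector.toml:/etc/vector/vector.toml:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro   # required by the docker_logs source

  seq:
    image: datalust/seq
    environment: ["ACCEPT_EULA=Y"]
    ports: ["8081:80"]                        # UI; Vector ships logs to 5341 inside the network
```

The -remoteWrite.url flag is what wires the collector to storage; everything else finds each other by service name.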
Why VictoriaMetrics
Prometheus is the industry standard. It works. But VictoriaMetrics offered compelling advantages for our use case.
10× compression means months of retention at a fraction of the cost. Queries that took seconds complete in milliseconds. Single-binary deployment simplifies operations.
Same PromQL. Better efficiency. Simpler ops.
vmagent: The Collector
Scrapes /metrics endpoints from services and pushes to storage via remote write. Prometheus-compatible configuration.
```yaml
scrape_configs:
  - job_name: 'api-server'
    scrape_interval: 10s
    static_configs:
      - targets: ['api:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: 'web-app'
    scrape_interval: 15s
    metrics_path: '/api/metrics'
    static_configs:
      - targets: ['web:3000']
```
Scrape Configuration
job_name groups related targets for labeling and organization.
scrape_interval controls collection frequency. 10-15s is typical for most services.
relabel_configs transform labels before storage. Essential for organizing high-cardinality data.
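A sketch of the kind of relabeling we lean on: dropping a per-request label that would otherwise explode series counts, and stamping every sample with its environment (label names here are illustrative):

```yaml
metric_relabel_configs:
  # Drop the per-request ID label so every request doesn't mint a new series
  - action: labeldrop
    regex: request_id
  # Stamp every series from this job with a static environment label
  - target_label: env
    replacement: production
```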
Prometheus Exposition
Services expose metrics at /metrics in Prometheus text format. Human-readable, easy to debug, universally supported.
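A labeled counter, for instance, shows up as a few plain-text lines (values illustrative):

```text
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 10273
http_requests_total{method="POST",status="500"} 3
```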
Counters for totals (requests, errors). Gauges for current values (connections, memory). Histograms for distributions (latency percentiles).
```rust
use metrics::{counter, histogram};

/// Record one handled HTTP request: bump the request counter and
/// observe the latency histogram, both labeled by method.
pub fn record_request(method: &str, status: u16, duration: f64) {
    counter!("http_requests_total",
        "method" => method.to_string(),
        "status" => status.to_string()
    ).increment(1);

    histogram!("http_request_duration_seconds",
        "method" => method.to_string()
    ).record(duration);
}
```
Grafana: Dashboards
Transforms raw metrics into understanding. Not just graphs—answers to the questions you'll ask during incidents.
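Because VictoriaMetrics answers the standard Prometheus query API, Grafana treats it as an ordinary Prometheus data source. A provisioning sketch (the URL assumes the service name from the compose sketch above):

```yaml
# provisioning/datasources/victoriametrics.yml -- path is illustrative
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true
```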
The Art of PromQL
Error Rate (%)
Percentage of 5xx responses over total requests in the last 5 minutes.
```promql
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
```
P99 Latency
99th percentile response time from histogram buckets.
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
Alert on Symptoms
Alerts are a double-edged sword. Too few and you miss problems. Too many and you train yourself to ignore them.
Our philosophy: alert on symptoms, not causes. "Error rate above 5%" is actionable. "CPU above 80%" might be normal during peak load.
Tier by severity. Critical pages on-call. Warning creates tickets. Info logs for trend analysis.
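We manage our alerts in Grafana, but the shape is easiest to show as a Prometheus-style rule; treat this as a sketch (rule name, duration, and routing labels are illustrative, only the expression is ours):

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m                    # must hold before firing, so blips don't page anyone
        labels:
          severity: critical       # critical pages on-call
        annotations:
          summary: "5xx error rate above 5% for 5 minutes"
```

The threshold expressions we watch: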
```promql
# Error rate above 5%
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05

# P99 latency above 500ms
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5

# Request rate below 10 req/s -- traffic has dropped off
sum(rate(http_requests_total[5m])) < 10
```
Beyond Application Metrics
Applications are just one piece. True observability requires visibility into network infrastructure, third-party services, and distributed traces.
Network Infrastructure
Export metrics from network devices—switches, access points, gateways. Track client counts, throughput, DPI stats without polling SNMP.
Edge & CDN
Pull analytics from edge providers. Request volumes, cache hit ratios, threat mitigation stats. Visibility into traffic before it reaches origin.
OpenTelemetry
Vendor-neutral standard for traces, metrics, and logs. Instrument once, export anywhere. OTEL Collector bridges ecosystems.
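A Collector sketch that would bridge OTLP-instrumented services into the same VictoriaMetrics instance (assumes the contrib distribution, which ships the prometheusremotewrite exporter; the endpoint matches the compose sketch above):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```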
Security Monitoring
Track authentication events, failed logins, certificate expiry. Integrate with identity providers for session analytics.
DevOps Pipeline
Git repository stats, CI/CD pipeline durations, artifact storage usage. Connect deployment frequency to production health.
Data Infrastructure
Database connection pools, query latencies, replication lag. Cache hit rates, queue depths, storage utilization.
Vector + Seq
```toml
[sources.docker_logs]
type = "docker_logs"
include_containers = ["api", "web"]

[transforms.parse_json]
type = "remap"
inputs = ["docker_logs"]
source = '''
. = parse_json!(.message)
.timestamp = now()
'''

[sinks.seq]
type = "http"
inputs = ["parse_json"]
uri = "http://seq:5341/api/events/raw"
encoding.codec = "json"
```
Metrics tell you what's happening. Logs tell you why. Our logging stack complements metrics perfectly.
Vector aggregates logs from containers, parses them into structured format, and forwards to Seq for search and correlation.
When Grafana shows error spikes, jump directly to Seq with the relevant time range and service filter.
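For the parse_json transform to have something to parse, services have to write JSON to stdout in the first place. A minimal sketch with the tracing crates (assumes tracing-subscriber with its json feature enabled; field names are illustrative):

```rust
use tracing::info;

fn main() {
    // One JSON object per log line on stdout; Docker captures it,
    // Vector's docker_logs source picks it up, and parse_json does the rest.
    tracing_subscriber::fmt().json().init();

    info!(method = "GET", path = "/healthz", status = 200, "request completed");
}
```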
The Outcome
We no longer get surprised by production issues. Problems are visible before users report them. Deployments include metric verification. On-call rotations are calmer.
The 3 AM incident that started this journey couldn't happen today. That's the promise of observability: not just knowing something is wrong, but understanding enough to fix it fast.