DevOps · Monitoring · Infrastructure

Building an Observability Stack

How we built a production-grade monitoring system with Grafana, VictoriaMetrics, vmagent, and structured logging.

12 min read
Metrics Pipeline: Scrape → Transform → Store → Query
The Problem

Flying Blind

3 AM. API failing intermittently. Logs telling us nothing useful. We knew something was wrong—we just couldn't see it.

That incident changed everything. We needed more than logs. We needed to understand our systems in real-time.

Observability isn't optional. It's survival.

Architecture

The Stack

Four components working together: vmagent scrapes metrics from services. VictoriaMetrics stores them efficiently. Grafana visualizes and alerts. Vector + Seq handle logs.

Each piece is replaceable. vmagent speaks Prometheus protocol. VictoriaMetrics accepts standard PromQL. The architecture is modular by design.

Next.js App / Rust API → vmagent → VictoriaMetrics → Grafana
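To make the wiring concrete, here is a minimal Compose sketch of the metrics path (image tags, ports, and mounted paths are illustrative assumptions, not our exact manifests; the logging services are omitted):

docker-compose.yml
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    command:
      - "-storageDataPath=/storage"
      - "-retentionPeriod=1"            # retention in months
    volumes:
      - vmdata:/storage
    ports:
      - "8428:8428"

  vmagent:
    image: victoriametrics/vmagent:latest
    command:
      - "-promscrape.config=/etc/vmagent/scrape_configs.yaml"
      - "-remoteWrite.url=http://victoriametrics:8428/api/v1/write"
    volumes:
      - ./scrape_configs.yaml:/etc/vmagent/scrape_configs.yaml:ro

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"                     # host port is an arbitrary choice

volumes:
  vmdata:

Swapping any component means editing one service block, which is exactly what the modular design buys.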
Storage Efficiency (30-day retention)
Prometheus: ~50 GB
VictoriaMetrics: ~5 GB
~10× compression advantage
Storage

Why VictoriaMetrics

Prometheus is the industry standard. It works. But VictoriaMetrics offered compelling advantages for our use case.

10× compression means months of retention at a fraction of the cost. Queries that took seconds complete in milliseconds. Single-binary deployment simplifies operations.

Same PromQL. Better efficiency. Simpler ops.

Collection

vmagent: The Collector

Scrapes /metrics endpoints from services and pushes to storage via remote write. Prometheus-compatible configuration.

🎯 Target Discovery: vmagent reads scrape configs
📡 HTTP Scrape: GET /metrics every 10s
📊 Parse Metrics: extract Prometheus format
💾 Remote Write: push to VictoriaMetrics
📈 Query & Visualize: Grafana PromQL dashboards
scrape_configs.yaml
scrape_configs:
  - job_name: 'api-server'
    scrape_interval: 10s
    static_configs:
      - targets: ['api:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: 'web-app'
    scrape_interval: 15s
    metrics_path: '/api/metrics'
    static_configs:
      - targets: ['web:3000']

Scrape Configuration

job_name groups related targets for labeling and organization.

scrape_interval controls collection frequency. 10-15s is typical for most services.

relabel_configs transform labels before storage. Essential for organizing high-cardinality data.
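For example, relabel rules can normalize instance names or discard labels whose cardinality gets out of hand. A sketch (the `path` label and the dropped metric prefix are illustrative assumptions, not from our config):

scrape_configs.yaml
  - job_name: 'api-server'
    scrape_interval: 10s
    static_configs:
      - targets: ['api:8080']
    metric_relabel_configs:
      # Drop a hypothetical per-request `path` label if its cardinality explodes
      - regex: "path"
        action: labeldrop
      # Discard metric families we never query (prefix is hypothetical)
      - source_labels: [__name__]
        regex: "debug_.*"
        action: drop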

Format

Prometheus Exposition

Services expose metrics at /metrics in Prometheus text format. Human-readable, easy to debug, universally supported.

Counters for totals (requests, errors). Gauges for current values (connections, memory). Histograms for distributions (latency percentiles).

metrics.rs
use metrics::{counter, histogram};

/// Record one handled HTTP request: count it and track its latency.
pub fn record_request(method: &str, status: u16, duration: f64) {
    // Count every request by method and status code.
    counter!("http_requests_total",
        "method" => method.to_owned(),
        "status" => status.to_string()
    ).increment(1);

    // Track the latency distribution for percentile queries.
    histogram!("http_request_duration_seconds",
        "method" => method.to_owned()
    ).record(duration);
}
GET /api/metrics → 200 OK
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/users",status="200"} 1547
http_requests_total{method="POST",path="/api/users",status="201"} 342
http_requests_total{method="GET",path="/api/users",status="500"} 3

# HELP http_request_duration_seconds HTTP request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 1203
http_request_duration_seconds_bucket{le="0.05"} 1687
http_request_duration_seconds_bucket{le="+Inf"} 1892
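The Rust snippet above covers counters and histograms; gauges follow the same pattern for point-in-time values. A sketch with the same `metrics` crate (the connection-pool gauge is a hypothetical example, not from our codebase):

metrics.rs
use metrics::gauge;

/// Report the current number of active DB connections as a gauge.
pub fn record_pool_size(pool: &'static str, active: usize) {
    gauge!("db_connections_active", "pool" => pool).set(active as f64);
}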
Visualization

Grafana: Dashboards

Transforms raw metrics into understanding. Not just graphs—answers to the questions you'll ask during incidents.

Dashboard overview: Requests/sec (+12%) · P99 Latency 42 ms (-8%) · Error Rate 0.12% (+0.02%) · Uptime 99.97%, with 24-hour panels for HTTP requests/sec and response latency (ms).
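Because VictoriaMetrics accepts standard PromQL, Grafana connects to it as an ordinary Prometheus data source. A provisioning sketch (the file name and service URL assume a default single-node setup on port 8428):

provisioning/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true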
Query Language

The Art of PromQL

Error Rate (%)

Percentage of 5xx responses over total requests in the last 5 minutes.

Code
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

P99 Latency

99th percentile response time from histogram buckets.

Code
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m]))
by (le)
)
Alerting

Alert on Symptoms

Alerts are a double-edged sword. Too few and you miss problems. Too many and you train yourself to ignore them.

Our philosophy: alert on symptoms, not causes. "Error rate above 5%" is actionable. "CPU above 80%" might be normal during peak load.

Tier by severity. Critical pages on-call. Warning creates tickets. Info logs for trend analysis.

CRITICAL · High Error Rate · OK
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05

WARNING · Elevated Latency · OK
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5

INFO · Low Request Rate · PENDING
sum(rate(http_requests_total[5m])) < 10
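If you prefer managing alerts as code rather than in Grafana's UI, the same expressions drop into a Prometheus-style rules file (vmalert evaluates this format against VictoriaMetrics). A sketch for the critical tier; the `for` duration and labels are illustrative:

alerts.yaml
groups:
  - name: api-symptoms
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m                     # condition must hold before paging
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% of traffic for 5 minutes"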
Collectors

Beyond Application Metrics

Applications are just one piece. True observability requires visibility into network infrastructure, third-party services, and distributed traces.

📡 Network Infrastructure

Export metrics from network devices—switches, access points, gateways. Track client counts, throughput, DPI stats without polling SNMP.

Unpoller · sFlow-RT · SNMP Exporter
🌐 Edge & CDN

Pull analytics from edge providers. Request volumes, cache hit ratios, threat mitigation stats. Visibility into traffic before it reaches origin.

Cloudflare Exporter · Fastly · Akamai
🔗 OpenTelemetry

Vendor-neutral standard for traces, metrics, and logs. Instrument once, export anywhere. OTEL Collector bridges ecosystems.

otelcol-contrib · Jaeger · Tempo
🔒 Security Monitoring

Track authentication events, failed logins, certificate expiry. Integrate with identity providers for session analytics.

Auth exporters · Cert checks · WAF logs
📦 DevOps Pipeline

Git repository stats, CI/CD pipeline durations, artifact storage usage. Connect deployment frequency to production health.

Gitea/GitHub exporters · Jenkins · ArgoCD
🗄️ Data Infrastructure

Database connection pools, query latencies, replication lag. Cache hit rates, queue depths, storage utilization.

PostgreSQL · Redis · RabbitMQ exporters
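Each of these collectors ends up as just another scrape job for vmagent. A sketch for two of the data-infrastructure exporters (target hostnames are assumptions; the ports are the exporters' conventional defaults):

scrape_configs.yaml
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']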
vector.toml
[sources.docker_logs]
type = "docker_logs"
include_containers = ["api", "web"]

[transforms.parse_json]
type = "remap"
inputs = ["docker_logs"]
source = '''
. = parse_json!(.message)
.timestamp = now()
'''

[sinks.seq]
type = "http"
inputs = ["parse_json"]
uri = "http://seq:5341/api/events/raw"
encoding.codec = "json"
Logging

Vector + Seq

Metrics tell you what's happening. Logs tell you why. Our logging stack complements metrics perfectly.

Vector aggregates logs from containers, parses them into structured format, and forwards to Seq for search and correlation.

When Grafana shows error spikes, jump directly to Seq with the relevant time range and service filter.
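For `parse_json!` in the Vector transform to have something to parse, services need to emit JSON logs in the first place. In a Rust service that can be a one-line subscriber setup (a sketch using the `tracing` ecosystem; the crate features and log fields are assumptions):

main.rs
use tracing::info;

fn main() {
    // One JSON object per line on stdout, which Vector's docker_logs source picks up.
    // Requires tracing-subscriber with the "json" feature enabled.
    tracing_subscriber::fmt().json().init();

    info!(path = "/api/users", status = 200, "request handled");
}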

The Outcome

We no longer get surprised by production issues. Problems are visible before users report them. Deployments include metric verification. On-call rotations are calmer.

The 3 AM incident that started this journey couldn't happen today. That's the promise of observability: not just knowing something is wrong, but understanding enough to fix it fast.