DevOps · Monitoring · Infrastructure

Building an Observability Stack

How we built a production-grade monitoring system with Grafana, VictoriaMetrics, vmagent, and structured logging.

12 min read
Metrics Pipeline: Scrape → Transform → Store → Query
The Problem

Flying Blind

3 AM. API failing intermittently. Logs telling us nothing useful. We knew something was wrong—we just couldn't see it.

That incident changed everything. We needed more than logs. We needed to understand our systems in real-time.

Observability isn't optional. It's survival.

Architecture

The Stack

Four components working together: vmagent scrapes metrics from services. VictoriaMetrics stores them efficiently. Grafana visualizes and alerts. Vector + Seq handle logs.

Each piece is replaceable. vmagent speaks Prometheus protocol. VictoriaMetrics accepts standard PromQL. The architecture is modular by design.

Next.js App / Rust API → vmagent → VictoriaMetrics → Grafana
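To make the wiring concrete, here is a minimal Compose sketch of the metrics path (image tags, ports, and mounted paths are illustrative assumptions, not our exact manifests; the logging services are omitted):

docker-compose.yml
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    command:
      - "-storageDataPath=/storage"
      - "-retentionPeriod=1"            # retention in months
    volumes:
      - vmdata:/storage
    ports:
      - "8428:8428"

  vmagent:
    image: victoriametrics/vmagent:latest
    command:
      - "-promscrape.config=/etc/vmagent/scrape_configs.yaml"
      - "-remoteWrite.url=http://victoriametrics:8428/api/v1/write"
    volumes:
      - ./scrape_configs.yaml:/etc/vmagent/scrape_configs.yaml:ro

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"                     # host port is an arbitrary choice

volumes:
  vmdata:

Swapping any component means editing one service block, which is exactly what the modular design buys.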
Storage Efficiency (30-day retention)
Prometheus: ~50 GB
VictoriaMetrics: ~5 GB
~10× compression advantage
Storage

Why VictoriaMetrics

Prometheus is the industry standard. It works. But VictoriaMetrics offered compelling advantages for our use case.

10× compression means months of retention at a fraction of the cost. Queries that took seconds complete in milliseconds. Single-binary deployment simplifies operations.

Same PromQL. Better efficiency. Simpler ops.

Collection

vmagent: The Collector

Scrapes /metrics endpoints from services and pushes to storage via remote write. Prometheus-compatible configuration.

🎯 Target Discovery: vmagent reads scrape configs
📡 HTTP Scrape: GET /metrics every 10s
📊 Parse Metrics: extract Prometheus format
💾 Remote Write: push to VictoriaMetrics
📈 Query & Visualize: Grafana PromQL dashboards
scrape_configs.yaml
scrape_configs:
  - job_name: 'api-server'
    scrape_interval: 10s
    static_configs:
      - targets: ['api:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: 'web-app'
    scrape_interval: 15s
    metrics_path: '/api/metrics'
    static_configs:
      - targets: ['web:3000']

Scrape Configuration

job_name groups related targets for labeling and organization.

scrape_interval controls collection frequency. 10-15s is typical for most services.

relabel_configs transform labels before storage. Essential for organizing high-cardinality data.
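For example, relabel rules can normalize instance names or discard labels whose cardinality gets out of hand. A sketch (the `path` label and the dropped metric prefix are illustrative assumptions, not from our config):

scrape_configs.yaml
  - job_name: 'api-server'
    scrape_interval: 10s
    static_configs:
      - targets: ['api:8080']
    metric_relabel_configs:
      # Drop a hypothetical per-request `path` label if its cardinality explodes
      - regex: "path"
        action: labeldrop
      # Discard metric families we never query (prefix is hypothetical)
      - source_labels: [__name__]
        regex: "debug_.*"
        action: drop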

Format

Prometheus Exposition

Services expose metrics at /metrics in Prometheus text format. Human-readable, easy to debug, universally supported.

Counters for totals (requests, errors). Gauges for current values (connections, memory). Histograms for distributions (latency percentiles).

metrics.rs
use metrics::{counter, histogram};

/// Record one handled HTTP request: count it and track its latency.
pub fn record_request(method: &str, status: u16, duration: f64) {
    // Count every request by method and status code.
    counter!("http_requests_total",
        "method" => method.to_owned(),
        "status" => status.to_string()
    ).increment(1);

    // Track the latency distribution for percentile queries.
    histogram!("http_request_duration_seconds",
        "method" => method.to_owned()
    ).record(duration);
}
GET /api/metrics → 200 OK
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/users",status="200"} 1547
http_requests_total{method="POST",path="/api/users",status="201"} 342
http_requests_total{method="GET",path="/api/users",status="500"} 3

# HELP http_request_duration_seconds HTTP request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 1203
http_request_duration_seconds_bucket{le="0.05"} 1687
http_request_duration_seconds_bucket{le="+Inf"} 1892
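The Rust snippet above covers counters and histograms; gauges follow the same pattern for point-in-time values. A sketch with the same `metrics` crate (the connection-pool gauge is a hypothetical example, not from our codebase):

metrics.rs
use metrics::gauge;

/// Report the current number of active DB connections as a gauge.
pub fn record_pool_size(pool: &'static str, active: usize) {
    gauge!("db_connections_active", "pool" => pool).set(active as f64);
}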
Visualization

Grafana: Dashboards

Transforms raw metrics into understanding. Not just graphs—answers to the questions you'll ask during incidents.

Dashboard overview: Requests/sec (+12%) · P99 Latency 42 ms (-8%) · Error Rate 0.12% (+0.02%) · Uptime 99.97%, with 24-hour panels for HTTP requests/sec and response latency (ms).
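Because VictoriaMetrics accepts standard PromQL, Grafana connects to it as an ordinary Prometheus data source. A provisioning sketch (the file name and service URL assume a default single-node setup on port 8428):

provisioning/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true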
Query Language

The Art of PromQL

Error Rate (%)

Percentage of 5xx responses over total requests in the last 5 minutes.

Code
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

P99 Latency

99th percentile response time from histogram buckets.

Code
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m]))
by (le)
)
Alerting

Alert on Symptoms

Alerts are a double-edged sword. Too few and you miss problems. Too many and you train yourself to ignore them.

Our philosophy: alert on symptoms, not causes. "Error rate above 5%" is actionable. "CPU above 80%" might be normal during peak load.

Tier by severity. Critical pages on-call. Warning creates tickets. Info logs for trend analysis.

CRITICAL · High Error Rate · OK
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05

WARNING · Elevated Latency · OK
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5

INFO · Low Request Rate · PENDING
sum(rate(http_requests_total[5m])) < 10
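If you prefer managing alerts as code rather than in Grafana's UI, the same expressions drop into a Prometheus-style rules file (vmalert evaluates this format against VictoriaMetrics). A sketch for the critical tier; the `for` duration and labels are illustrative:

alerts.yaml
groups:
  - name: api-symptoms
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m                     # condition must hold before paging
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% of traffic for 5 minutes"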
Collectors

Beyond Application Metrics

Applications are just one piece. True observability requires visibility into network infrastructure, third-party services, and distributed traces.

📡 Network Infrastructure

Export metrics from network devices—switches, access points, gateways. Track client counts, throughput, DPI stats without polling SNMP.

Unpoller · sFlow-RT · SNMP Exporter
🌐 Edge & CDN

Pull analytics from edge providers. Request volumes, cache hit ratios, threat mitigation stats. Visibility into traffic before it reaches origin.

Cloudflare Exporter · Fastly · Akamai
🔗 OpenTelemetry

Vendor-neutral standard for traces, metrics, and logs. Instrument once, export anywhere. OTEL Collector bridges ecosystems.

otelcol-contrib · Jaeger · Tempo
🔒 Security Monitoring

Track authentication events, failed logins, certificate expiry. Integrate with identity providers for session analytics.

Auth exporters · Cert checks · WAF logs
📦 DevOps Pipeline

Git repository stats, CI/CD pipeline durations, artifact storage usage. Connect deployment frequency to production health.

Gitea/GitHub exporters · Jenkins · ArgoCD
🗄️ Data Infrastructure

Database connection pools, query latencies, replication lag. Cache hit rates, queue depths, storage utilization.

PostgreSQL · Redis · RabbitMQ exporters
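Each of these collectors ends up as just another scrape job for vmagent. A sketch for two of the data-infrastructure exporters (target hostnames are assumptions; the ports are the exporters' conventional defaults):

scrape_configs.yaml
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']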
vector.toml
[sources.docker_logs]
type = "docker_logs"
include_containers = ["api", "web"]

[transforms.parse_json]
type = "remap"
inputs = ["docker_logs"]
source = '''
. = parse_json!(.message)
.timestamp = now()
'''

[sinks.seq]
type = "http"
inputs = ["parse_json"]
uri = "http://seq:5341/api/events/raw"
encoding.codec = "json"
Logging

Vector + Seq

Metrics tell you what's happening. Logs tell you why. Our logging stack complements metrics perfectly.

Vector aggregates logs from containers, parses them into structured format, and forwards to Seq for search and correlation.

When Grafana shows error spikes, jump directly to Seq with the relevant time range and service filter.
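For `parse_json!` in the Vector transform to have something to parse, services need to emit JSON logs in the first place. In a Rust service that can be a one-line subscriber setup (a sketch using the `tracing` ecosystem; the crate features and log fields are assumptions):

main.rs
use tracing::info;

fn main() {
    // One JSON object per line on stdout, which Vector's docker_logs source picks up.
    // Requires tracing-subscriber with the "json" feature enabled.
    tracing_subscriber::fmt().json().init();

    info!(path = "/api/users", status = 200, "request handled");
}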

The Outcome

We no longer get surprised by production issues. Problems are visible before users report them. Deployments include metric verification. On-call rotations are calmer.

The 3 AM incident that started this journey couldn't happen today. That's the promise of observability: not just knowing something is wrong, but understanding enough to fix it fast.