
Prometheus and Grafana Production Setup: Scrape Configs, Alerting Rules, and Dashboard Best Practices
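Prometheus and Grafana form the observability foundation of most Kubernetes-based platforms. Prometheus collects and stores time-series metrics and evaluates alerting rules, Alertmanager turns firing alerts into notifications, and Grafana visualizes the data. This guide covers a complete production setup, from the first scrape to actionable alerts.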
Architecture Overview
Application --> Prometheus --> Alertmanager --> PagerDuty / Slack
                    |
                 Grafana (dashboards)
                    |
                 Thanos (long-term storage)
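Key components:
- Prometheus server: scrapes targets, evaluates rules, and stores data in its TSDB
- Alertmanager: deduplicates, groups, and routes alerts
- Pushgateway: accepts metrics pushed by batch jobs
- Exporters: expose metrics from third-party systems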

Installing with kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml
Production values.yaml (passed to helm install as prometheus-values.yaml)
prometheus:
  prometheusSpec:
    # Data is dropped when either the time or the size limit is reached.
    retention: 15d
    retentionSize: 50GB
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 100Gi
    # false = do not restrict discovery to monitors carrying the Helm release label
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
grafana:
  adminPassword: "changeme"  # placeholder; replace before deploying
  persistence:
    enabled: true
    size: 10Gi
alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        memory: 128Mi
Scrape Configuration
ServiceMonitor for Kubernetes Services
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-api
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-api
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
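A ServiceMonitor selects Service objects by label and refers to the scrape port by name, so the Service it targets has to match on both. The following is a minimal sketch of a Service the ServiceMonitor above would pick up. The port number 8080 and the pod selector are assumptions for illustration; only the `app: my-api` label, the `production` namespace, and the port name `http` come from the ServiceMonitor.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-api
  namespace: production        # must be listed in namespaceSelector.matchNames
  labels:
    app: my-api                # matched by spec.selector.matchLabels
spec:
  selector:
    app: my-api                # assumption: pods carry the same label
  ports:
    - name: http               # must equal endpoints[].port in the ServiceMonitor
      port: 8080               # assumption: the application serves /metrics here
      targetPort: 8080
```

If the port is unnamed or named differently, the ServiceMonitor matches the Service but produces no scrape targets.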
Static Scrape Config for External Systems
- job_name: 'blackbox-http'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://api.example.com/health
        - https://app.example.com
  relabel_configs:
    # Pass the URL to probe as the ?target= query parameter.
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the probed URL as the instance label.
    - source_labels: [__param_target]
      target_label: instance
    # Send the actual scrape to the blackbox exporter.
    - target_label: __address__
      replacement: blackbox-exporter:9115
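The job above is a raw Prometheus scrape config, and the text does not say where it lives in an operator-managed install. One option, sketched here as an assumption, is the `prometheus.prometheusSpec.additionalScrapeConfigs` key in the kube-prometheus-stack values file. The sketch also assumes a blackbox exporter is deployed separately and reachable at `blackbox-exporter:9115`.

```yaml
prometheus:
  prometheusSpec:
    # Raw scrape jobs appended to the configuration the operator generates.
    additionalScrapeConfigs:
      - job_name: 'blackbox-http'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - https://api.example.com/health
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter:9115
```

Jobs added this way bypass ServiceMonitor selection, so they are not affected by the `release` label.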

PromQL Essentials
Error Rate
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
Latency Percentiles
histogram_quantile(0.99,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
Resource Utilization
# CPU usage per pod, as a percentage of its CPU limit
100 * sum by (pod, namespace) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
) / sum by (pod, namespace) (
  kube_pod_container_resource_limits{resource="cpu"}
)

# Memory working set per pod, as a fraction of its memory limit
sum by (pod, namespace) (container_memory_working_set_bytes{container!=""})
  / sum by (pod, namespace) (kube_pod_container_resource_limits{resource="memory"})
Alerting Rules

PrometheusRule Resource
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: api.rules
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              /
              sum(rate(http_requests_total[5m])) by (service)
            ) > 0.01
          for: 5m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
            runbook_url: "https://runbooks.example.com/high-error-rate"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              sum by (le, service) (
                rate(http_request_duration_seconds_bucket[5m])
              )
            ) > 0.5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency on {{ $labels.service }}"
        - alert: ServiceDown
          expr: up{job=~"my-api.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.job }} is down"
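Alert rules like these can be checked before deployment with `promtool test rules`. The sketch below tests only the ServiceDown alert. It assumes the `groups:` list from the PrometheusRule has been copied into a plain rule file named `api-alerts.rules.yaml` (a hypothetical file name), since promtool reads Prometheus rule files rather than the CRD wrapper. The instance address is also made up for the test.

```yaml
# Hypothetical test file, run with: promtool test rules api-alerts.test.yaml
rule_files:
  - api-alerts.rules.yaml      # assumption: contains the groups from spec.groups
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The target reports up == 0 for the whole test window.
      - series: 'up{job="my-api", instance="10.0.0.1:8080"}'
        values: '0x5'
    alert_rule_test:
      # At 2m the alert has been active longer than its "for: 1m" clause.
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-api
              instance: 10.0.0.1:8080
            exp_annotations:
              summary: "Service my-api is down"
```

The expected labels omit `alertname`, which promtool adds on its own.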
Multi-Window Burn-Rate SLO Alert
- alert: ErrorBudgetBurnHigh
  expr: |
    (
      job:slo_errors_per_request:ratio_rate1h{job="api"} > (14.4 * 0.001)
    ) and (
      job:slo_errors_per_request:ratio_rate5m{job="api"} > (14.4 * 0.001)
    )
  labels:
    severity: critical
    page: "true"
  annotations:
    summary: "High error budget burn rate"
    description: "Burning 14.4x budget. Investigate immediately."
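The alert above reads two recorded series, `job:slo_errors_per_request:ratio_rate1h` and `job:slo_errors_per_request:ratio_rate5m`, that are not defined elsewhere in this guide. A minimal sketch of recording rules that could produce them follows. It assumes errors are counted from the same `http_requests_total` metric and `status` label used in the earlier alerts, and that `0.001` is the error budget of a 99.9% SLO. The group name `slo.rules` is hypothetical.

```yaml
groups:
  - name: slo.rules            # hypothetical group name
    rules:
      # Short window: makes the alert stop firing soon after the problem ends.
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Long window: confirms the burn is sustained rather than a brief spike.
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total[1h]))
```

The factor 14.4 is the burn rate at which 2% of a 30-day error budget is consumed in one hour: 0.02 × 720 h / 1 h = 14.4. With a 0.001 budget, the alert therefore fires when the error ratio exceeds 1.44% in both windows.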
Alertmanager Configuration
global:
  resolve_timeout: 5m
  # Required by the slack_configs below; placeholder value, replace with a real webhook.
  slack_api_url: 'SLACK_WEBHOOK_URL'
route:
  group_by: ['alertname', 'service', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default
  routes:
    # Routes are evaluated in order; the first match wins.
    - matchers:
        - severity = critical
      receiver: pagerduty
    - matchers:
        - team = backend
      receiver: slack-backend
receivers:
  - name: default
    slack_configs:
      - channel: '#alerts'
  - name: pagerduty
    pagerduty_configs:
      - service_key: 'PAGERDUTY_KEY'  # placeholder
  - name: slack-backend
    slack_configs:
      - channel: '#backend-alerts'
        send_resolved: true
inhibit_rules:
  # Suppress the warning when a critical alert with the same name and service is firing.
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname', 'service']
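Grafana Dashboard Best Practices
Organize dashboards in layers:
- Overview: RED metrics (rate, errors, duration) for all services
- Service: drill-down into a single service
- Infrastructure: node and cluster resource utilization
- Business: user-facing KPIs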
Variable Templates for Reusable Dashboards
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 2
      },
      {
        "name": "service",
        "type": "query",
        "query": "label_values(http_requests_total{namespace=\"$namespace\"}, service)",
        "refresh": 2
      }
    ]
  }
}
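Dashboard JSON such as the templating block above can be kept in version control and loaded declaratively. The sketch below wraps a trimmed dashboard in a ConfigMap. It assumes the Grafana dashboard sidecar shipped with the chart is enabled and watches ConfigMaps labeled `grafana_dashboard`; check the label name and value against the installed chart version. The ConfigMap name, dashboard title, and `uid` are hypothetical.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: service-overview-dashboard   # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"           # assumption: label watched by the dashboard sidecar
data:
  service-overview.json: |
    {
      "title": "Service Overview",
      "uid": "service-overview",
      "templating": {
        "list": [
          {
            "name": "namespace",
            "type": "query",
            "datasource": "Prometheus",
            "query": "label_values(kube_pod_info, namespace)",
            "refresh": 2
          }
        ]
      },
      "panels": []
    }
```

Keeping one dashboard per ConfigMap makes reviews and rollbacks simpler than editing dashboards in the Grafana UI.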
Long-Term Storage with Thanos
containers:
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.35.0
    args:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://localhost:9090
      - --objstore.config-file=/etc/thanos/objstore.yaml

# objstore.yaml
type: S3
config:
  bucket: my-prometheus-metrics
  endpoint: s3.amazonaws.com
  region: us-east-1
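The sidecar reads its bucket configuration from `/etc/thanos/objstore.yaml`, but the guide does not show how that file reaches the container. One common approach, sketched here as an assumption, is to store it in a Kubernetes Secret and mount the Secret at `/etc/thanos`. The Secret name `thanos-objstore` is hypothetical, and S3 credentials are left out on the assumption that they come from the pod's IAM role or are added to the `config` section.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore        # hypothetical name
  namespace: monitoring
type: Opaque
stringData:
  # The key becomes the file name when the Secret is mounted as a volume.
  objstore.yaml: |
    type: S3
    config:
      bucket: my-prometheus-metrics
      endpoint: s3.amazonaws.com
      region: us-east-1
```

A Secret is preferable to a ConfigMap here because the same file usually ends up holding access keys.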
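Conclusion
A production Prometheus/Grafana stack demands careful attention to retention policy, alerting logic, and dashboard design. ServiceMonitors make Kubernetes service discovery declarative. Multi-window burn-rate alerts provide SLO-based reliability signals. Thanos extends storage beyond the limits of the local TSDB. Follow these patterns and your monitoring stack becomes a first-class reliability tool rather than an afterthought.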