
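OpenTelemetry Distributed Tracing: Auto-Instrumentation, Custom Spans, Sampling Strategies, and Jaeger Integration
Distributed tracing answers the question "why is this request slow?" across service boundaries. OpenTelemetry (OTel) provides a vendor-neutral specification and SDKs that work with any compatible backend, including Jaeger, Tempo, Zipkin, Honeycomb, and Datadog. This guide covers instrumentation, from auto-instrumentation to fine-grained manual spans, and production sampling strategies that control cost without losing critical data.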
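Core concepts
- A Trace is the collection of spans that represents the journey of a single request
- A Span is a named, timed operation; spans form a tree
- Context Propagation carries the trace context (TraceID, SpanID) across service boundaries in HTTP headers (W3C TraceContext)
- An Exporter sends spans to a backend (OTLP, Jaeger, Zipkin)
- The Collector receives, processes, and exports telemetry data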
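
To make context propagation concrete, here is a minimal sketch of the W3C TraceContext `traceparent` header that carries the TraceID and SpanID between services. The type and function names (`TraceParent`, `parseTraceParent`, `formatTraceParent`) are illustrative and not part of the OpenTelemetry API, and the parser handles only version `00`. In practice the SDK's propagator does this for you.

```typescript
// Sketch of the W3C Trace Context `traceparent` header, version 00 only.
// Layout: version "-" trace-id "-" parent-id "-" trace-flags
// Names below are illustrative, not part of the OpenTelemetry API.
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters (the caller's span)
  sampled: boolean; // bit 0 of the trace-flags byte
}

const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | undefined {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (!match) return undefined;
  const traceId = match[1];
  const spanId = match[2];
  const flags = match[3];
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

function formatTraceParent(ctx: TraceParent): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}
```

A downstream service that receives this header creates its spans with the same `traceId` and uses `spanId` as the parent, which is how spans from different services end up in one tree.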

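Architecture
Service A --[OTLP/gRPC]--> OTel Collector --[OTLP]--> Jaeger
Service B --[OTLP/HTTP]--> OTel Collector --[OTLP]--> Grafana Tempo
The Collector decouples your services from the backend, so you can switch backends by changing Collector configuration rather than application code.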
Node.js auto-instrumentation
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/resources \
  @opentelemetry/sdk-trace-base
// tracing.ts - must be loaded before any other module
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Resource attributes identify this service in the tracing backend.
  // Note: newer 2.x SDK releases replace `new Resource({...})` with `resourceFromAttributes({...})`.
  resource: new Resource({
    'service.name': process.env.SERVICE_NAME || 'my-service',
    'service.version': process.env.SERVICE_VERSION || '1.0.0',
    'deployment.environment': process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        // Skip health checks and metrics scrapes so they do not produce traces.
        // Boolean(...) ensures the hook returns a boolean even when req.url is undefined.
        ignoreIncomingRequestHook: (req) =>
          Boolean(req.url?.includes('/health') || req.url?.includes('/metrics')),
      },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enhancedDatabaseReporting: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
  // Follow the caller's sampling decision; sample 10% of new root traces
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

sdk.start();

// Flush pending spans before the process exits
process.on('SIGTERM', () => sdk.shutdown().finally(() => process.exit(0)));
// server.ts
import './tracing'; // must be the first import
import express from 'express';
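
The sampler configured in tracing.ts combines two rules: follow the parent's decision when the request already belongs to a trace, otherwise keep roughly 10% of new traces based on the trace ID. The sketch below models only that decision order. It is a simplified stand-in, not the SDK's actual algorithm, and `ParentInfo`, `ratioDecision`, and `shouldSampleSketch` are hypothetical names.

```typescript
// Simplified model of ParentBased(TraceIdRatioBased(ratio)).
// NOT the SDK's exact algorithm; it only illustrates the decision order.
interface ParentInfo {
  sampled: boolean; // sampled flag carried by the incoming trace context
}

function ratioDecision(traceId: string, ratio: number): boolean {
  // Treat the first 8 hex characters as an unsigned 32-bit number and keep
  // the trace when that number falls below ratio * 2^32. Because the result
  // depends only on the trace ID, every service reaches the same decision.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 4294967296;
}

function shouldSampleSketch(traceId: string, ratio: number, parent?: ParentInfo): boolean {
  // A parent's decision always wins, so a trace is never half-recorded.
  if (parent) return parent.sampled;
  // Root span: decide from the trace ID.
  return ratioDecision(traceId, ratio);
}
```

Deciding from the trace ID instead of a random number is what keeps the decision consistent: two services that see the same trace ID without a parent decision still agree on whether to record it.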
Custom spans
Auto-instrumentation covers HTTP, database, and cache calls. Add custom spans for business-level operations:
import { trace, SpanStatusCode, SpanKind } from '@opentelemetry/api';

// fraudCheckService, chargeCard, and PaymentBlockedError are assumed to be defined elsewhere in the application.
const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('payment.process', {
    kind: SpanKind.INTERNAL,
    attributes: {
      'order.id': orderId,
      'payment.amount': amount,
      'payment.currency': 'USD',
    },
  }, async (span) => {
    try {
      // Nested span: becomes a child of payment.process because that span is active
      const fraudResult = await tracer.startActiveSpan('payment.fraud_check', async (fraudSpan) => {
        try {
          const result = await fraudCheckService.check(orderId, amount);
          fraudSpan.setAttributes({
            'fraud.score': result.score,
            'fraud.decision': result.decision,
          });
          return result;
        } finally {
          fraudSpan.end();
        }
      });

      if (fraudResult.decision === 'block') {
        // The catch block below records the exception and marks the span as failed
        throw new PaymentBlockedError(fraudResult.reason);
      }

      const result = await chargeCard(orderId, amount);
      span.setAttributes({ 'payment.transaction_id': result.transactionId });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err instanceof Error ? err.message : String(err),
      });
      throw err;
    } finally {
      // Always end the span, otherwise it is never exported
      span.end();
    }
  });
}
Adding events (logs within a span)
span.addEvent('cache.miss', {
  'cache.key': cacheKey,
  'cache.size_bytes': keySize,
});
span.addEvent('retry.attempt', {
  'retry.count': attemptNumber,
  'retry.delay_ms': delay,
});
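
To illustrate how an event differs from an attribute, the toy model below stores each event with its own timestamp inside the span. It is an in-memory sketch for explanation only: `SpanSketch` and `SpanEventSketch` are hypothetical names, not OpenTelemetry SDK types, and the injectable clock exists just to make the example deterministic.

```typescript
// Toy in-memory model of a span with events; not the OpenTelemetry SDK.
type AttrValue = string | number | boolean;

interface SpanEventSketch {
  name: string;
  timeMs: number; // when the event happened
  attributes: Record<string, AttrValue>;
}

class SpanSketch {
  readonly spanName: string;
  readonly startMs: number;
  readonly events: SpanEventSketch[] = [];
  private readonly now: () => number;

  constructor(spanName: string, now: () => number = Date.now) {
    this.spanName = spanName;
    this.now = now;
    this.startMs = now();
  }

  // Unlike an attribute, an event is a point in time within the span.
  addEvent(name: string, attributes: Record<string, AttrValue> = {}): this {
    this.events.push({ name, timeMs: this.now(), attributes });
    return this;
  }

  // Offsets relative to the span start, similar to what a trace UI displays.
  timeline(): string[] {
    return this.events.map((e) => `+${e.timeMs - this.startMs}ms ${e.name}`);
  }
}
```

Attributes describe the span as a whole, while events mark moments inside it, which makes them a good fit for cache misses, retries, and similar occurrences that can happen several times in one operation.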
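Go instrumentation
// otel.go
package otel

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// InitTracer configures the global tracer provider.
// The caller should call Shutdown on the returned provider before exiting so buffered spans are flushed.
func InitTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	res := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceName("my-go-service"),
		semconv.ServiceVersion("1.0.0"),
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		// Follow the caller's sampling decision; sample 10% of new root traces
		sdktrace.WithSampler(sdktrace.ParentBased(
			sdktrace.TraceIDRatioBased(0.1),
		)),
	)
	otel.SetTracerProvider(tp)
	// The Go SDK's default propagator is a no-op, so register W3C TraceContext explicitly.
	otel.SetTextMapPropagator(propagation.TraceContext{})
	return tp, nil
}

// Instrument an HTTP handler (orderHandler must implement http.Handler)
import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

http.Handle("/api/orders", otelhttp.NewHandler(orderHandler, "orders.list"))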
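Sampling strategies
Tail sampling in the OTel Collector
Tail sampling makes its decision after the Collector has received the complete trace, so it can keep every error and every slow request. Two caveats apply: the Collector can only choose among the traces it receives, so a 10% head sampler in the SDK limits these policies to that 10%; and all spans of a trace must reach the same Collector instance.
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep traces that contain an error
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Always keep slow traces (>1s)
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      # Keep 5% of all traces; in effect, 5% of the fast, successful ones
      - name: probabilistic-ok
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
      # Always keep traces from important users
      - name: important-users
        type: string_attribute
        string_attribute:
          key: user.tier
          values: [enterprise, vip]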
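
The four policies are independent: a trace is kept as soon as any one of them votes to keep it. The sketch below is a toy model of that combination logic, not the Collector's implementation. `FinishedTrace`, `Policy`, and the `keep*` helpers are hypothetical names, and the percentage policy uses a simple trace-ID bucket as a stand-in for the Collector's hashing.

```typescript
// Toy model of how the tail-sampling policies combine (logical OR).
// Names and data shapes are illustrative only.
interface FinishedSpan {
  isError: boolean;
  attributes: Record<string, string>;
}

interface FinishedTrace {
  traceId: string;    // 32 hex characters
  durationMs: number; // end-to-end duration of the whole trace
  spans: FinishedSpan[];
}

type Policy = (t: FinishedTrace) => boolean;

const keepErrors: Policy = (t) => t.spans.some((s) => s.isError);

const keepSlow = (thresholdMs: number): Policy => (t) => t.durationMs > thresholdMs;

// Deterministic per trace ID: map the last 8 hex characters to [0, 100).
const keepPercentage = (percentage: number): Policy => (t) =>
  (parseInt(t.traceId.slice(-8), 16) / 4294967296) * 100 < percentage;

const keepTier = (tiers: string[]): Policy => (t) =>
  t.spans.some((s) => {
    const tier = s.attributes['user.tier'];
    return tier !== undefined && tiers.includes(tier);
  });

// A trace is kept when at least one policy matches.
function keepTrace(t: FinishedTrace, policies: Policy[]): boolean {
  return policies.some((policy) => policy(t));
}

const policiesSketch: Policy[] = [
  keepErrors,
  keepSlow(1000),
  keepPercentage(5),
  keepTier(['enterprise', 'vip']),
];
```

Because the policies are OR-ed, the probabilistic policy sets the baseline for ordinary traffic, and the other three only ever add traces on top of it.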
OpenTelemetry Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 5s
    limit_mib: 512
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter goes first; to enable tail sampling, define tail_sampling
      # under processors and list it here between memory_limiter and batch
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, otlp/tempo]
Jaeger deployment
# jaeger all-in-one, intended for development environments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.57
          ports:
            - containerPort: 16686  # UI
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
Correlating traces and logs
Attach the trace and span IDs to every log line so you can jump from a log entry to its trace:
import { trace } from '@opentelemetry/api';
import winston from 'winston';

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.printf(({ level, message, timestamp, ...meta }) => {
      // Both IDs are undefined when the log call happens outside an active span
      const span = trace.getActiveSpan();
      const traceId = span?.spanContext().traceId;
      const spanId = span?.spanContext().spanId;
      return JSON.stringify({ timestamp, level, message, traceId, spanId, ...meta });
    })
  ),
  transports: [new winston.transports.Console()],
});
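Conclusion
OpenTelemetry provides a vendor-neutral, future-proof foundation for observability. Auto-instrumentation captures most spans with minimal code changes. Custom spans expose business-level context that infrastructure metrics cannot provide. Tail sampling in the Collector controls cost without sacrificing visibility into errors and slow requests. Jaeger or Grafana Tempo provides the UI for navigating these traces and quickly diagnosing production issues.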