OpenTelemetry Distributed Tracing: Auto-Instrumentation, Custom Spans, Sampling Strategies, and Jaeger Integration

Implement distributed tracing with OpenTelemetry: configure auto-instrumentation for Node.js and Go, add custom spans and attributes, tune tail sampling, and, in Jaeger, …

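Distributed tracing answers the question "why is this request slow?" across service boundaries. OpenTelemetry (OTel) provides a vendor-neutral SDK and specification that work with any backend, including Jaeger, Tempo, Zipkin, Honeycomb, and Datadog. This guide covers instrumentation, from auto-instrumentation to fine-grained manual spans, along with production sampling strategies that control cost without losing critical data.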

Core Concepts

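  • A Trace is the collection of spans that represents the journey of a single request
  • A Span is a named, timed operation; spans form a tree
  • Context Propagation carries the trace context (TraceID, SpanID) across service boundaries in HTTP headers (W3C Trace Context)
  • An Exporter sends spans to a backend (OTLP, Jaeger, Zipkin)
  • The Collector receives, processes, and exports telemetry data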
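
To make context propagation concrete, here is a minimal sketch of the W3C Trace Context `traceparent` header that carries the TraceID and SpanID between services. The type `TraceParentFields` and the functions `parseTraceParent` and `formatTraceParent` are names invented for this illustration and are not part of any OpenTelemetry package; the SDK's built-in propagator does this work for you. The sketch only handles version `00` of the header.

```typescript
// Layout of the header value (version 00):
//   "00" - trace-id (32 hex chars) - parent-id (16 hex chars) - trace-flags (2 hex chars)
// All names below are hypothetical and exist only for this illustration.
interface TraceParentFields {
  traceId: string;  // 16 bytes as lowercase hex
  spanId: string;   // 8 bytes as lowercase hex (the caller's span)
  sampled: boolean; // bit 0 of trace-flags
}

const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParentFields | undefined {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return undefined;
  const [, traceId, spanId, flags] = match;
  // The specification treats all-zero IDs as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceParent(fields: TraceParentFields): string {
  const flags = fields.sampled ? '01' : '00';
  return `00-${fields.traceId}-${fields.spanId}-${flags}`;
}
```

A receiving service would call `parseTraceParent(req.headers['traceparent'])`, start its own span as a child of the returned `spanId`, and send a freshly formatted header on its outgoing requests.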

Architecture

Service A --[OTLP/gRPC]--> OTel Collector --[OTLP]--> Jaeger
Service B --[OTLP/HTTP]--> OTel Collector --[OTLP]--> Grafana Tempo

The Collector decouples your services from the backend, so you can switch backends without changing application code.

Node.js Auto-Instrumentation

npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/resources \
  @opentelemetry/sdk-trace-base

// tracing.ts - must be loaded before any other module
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

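const sdk = new NodeSDK({
  // `new Resource(...)` is the 1.x API; @opentelemetry/resources 2.x replaces it with resourceFromAttributes(...)
  resource: new Resource({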
    'service.name': process.env.SERVICE_NAME || 'my-service',
    'service.version': process.env.SERVICE_VERSION || '1.0.0',
    'deployment.environment': process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        // The hook must return a boolean; optional chaining alone can yield undefined
        ignoreIncomingRequestHook: (req) =>
          Boolean(req.url?.includes('/health') || req.url?.includes('/metrics')),
      },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enhancedDatabaseReporting: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown().finally(() => process.exit(0)));

// server.ts
import './tracing';  // must be the first import
import express from 'express';
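
The sampler in `tracing.ts` wraps `TraceIdRatioBasedSampler(0.1)` in a `ParentBasedSampler`, which means new traces are kept with 10% probability and child spans inherit whatever their parent decided. The sketch below illustrates that idea only. It is not the SDK's implementation: `ratioDecision`, `shouldSample`, and `ParentInfo` are hypothetical names, and reading the last 32 bits of the trace ID as a random number is a simplification chosen for this example.

```typescript
// Simplified illustration of parent-based, ratio-based head sampling.
// Hypothetical names; the real samplers live in @opentelemetry/sdk-trace-base.
interface ParentInfo {
  sampled: boolean; // the sampled flag received from the calling service
}

function ratioDecision(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Assumption: trace IDs are random, so their last 8 hex characters
  // (32 bits) behave like a uniformly distributed number.
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < Math.floor(ratio * 0x100000000);
}

function shouldSample(traceId: string, ratio: number, parent?: ParentInfo): boolean {
  // A child span follows its parent so that a trace is kept or dropped as a whole.
  if (parent) return parent.sampled;
  // A root span has no parent, so the ratio decides.
  return ratioDecision(traceId, ratio);
}
```

Because the decision depends only on the trace ID, every service that evaluates the same root trace with the same ratio reaches the same answer, which keeps traces from being partially recorded.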

Custom Spans

Auto-instrumentation covers HTTP, database, and cache calls. Add custom spans for business-level operations:

import { trace, SpanStatusCode, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('payment.process', {
    kind: SpanKind.INTERNAL,
    attributes: {
      'order.id': orderId,
      'payment.amount': amount,
      'payment.currency': 'USD',
    },
  }, async (span) => {
    try {
      const fraudResult = await tracer.startActiveSpan('payment.fraud_check', async (fraudSpan) => {
        try {
          const result = await fraudCheckService.check(orderId, amount);
          fraudSpan.setAttributes({
            'fraud.score': result.score,
            'fraud.decision': result.decision,
          });
          return result;
        } finally {
          fraudSpan.end();
        }
      });

      if (fraudResult.decision === 'block') {
        span.setStatus({ code: SpanStatusCode.ERROR, message: 'Blocked' });
        throw new PaymentBlockedError(fraudResult.reason);
      }

      const result = await chargeCard(orderId, amount);
      span.setAttributes({ 'payment.transaction_id': result.transactionId });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err as Error);
      // Include the message so the error is readable in the trace UI
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      throw err;
    } finally {
      span.end();
    }
  });
}
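
The try/catch/finally pattern in `processPayment` repeats for every custom span, so it can be worth factoring out. The following is a minimal sketch of such a wrapper. `withSpan`, `MinimalSpan`, and `MinimalTracer` are hypothetical names: the two interfaces declare only the members the helper needs, modelled on the `Span` and `Tracer` types of `@opentelemetry/api`, so that the example is self-contained. The status constants mirror the numeric values of `SpanStatusCode.OK` and `SpanStatusCode.ERROR`.

```typescript
// Hypothetical helper that centralises span error handling.
const STATUS_OK = 1;    // numeric value of SpanStatusCode.OK
const STATUS_ERROR = 2; // numeric value of SpanStatusCode.ERROR

// Minimal subset of the API's Span, declared here only for the sketch.
interface MinimalSpan {
  setStatus(status: { code: number; message?: string }): unknown;
  recordException(error: Error): void;
  end(): void;
}

// Minimal subset of the API's Tracer, declared here only for the sketch.
interface MinimalTracer {
  startActiveSpan<T>(spanName: string, fn: (span: MinimalSpan) => T): T;
}

async function withSpan<T>(
  tracer: MinimalTracer,
  spanName: string,
  work: (span: MinimalSpan) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(spanName, async (span) => {
    try {
      const result = await work(span);
      span.setStatus({ code: STATUS_OK });
      return result;
    } catch (err) {
      // Normalise non-Error throwables before recording them on the span.
      const error = err instanceof Error ? err : new Error(String(err));
      span.recordException(error);
      span.setStatus({ code: STATUS_ERROR, message: error.message });
      throw err;
    } finally {
      // Always end the span, otherwise it is never exported.
      span.end();
    }
  });
}
```

With a helper like this, the fraud check could shrink to something like `await withSpan(tracer, 'payment.fraud_check', () => fraudCheckService.check(orderId, amount))`, assuming the real tracer satisfies the minimal interface.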

Adding Events (Logs Within a Span)

span.addEvent('cache.miss', {
  'cache.key': cacheKey,
  'cache.size_bytes': keySize,
});

span.addEvent('retry.attempt', {
  'retry.count': attemptNumber,
  'retry.delay_ms': delay,
});

Go Instrumentation

// otel.go
package telemetry

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// InitTracer configures the global TracerProvider. Call tp.Shutdown(ctx)
// on exit so that buffered spans are flushed.
func InitTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    res := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName("my-go-service"),
        semconv.ServiceVersion("1.0.0"),
    )

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.1),
        )),
    )

    otel.SetTracerProvider(tp)
    // Without a propagator, the Go SDK does not pass trace context between services.
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))
    return tp, nil
}

// Instrument an HTTP handler
import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

http.Handle("/api/orders", otelhttp.NewHandler(orderHandler, "orders.list"))

Sampling Strategies

Tail Sampling in the OTel Collector

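Tail sampling makes its keep-or-drop decision in the Collector after a trace's spans have arrived (here, 30 seconds after the first span, per decision_wait), so it can keep every error and every slow trace. This only works if the SDKs export all spans, so set the head-sampling ratio to 1.0 when relying on it, and make sure all spans of a trace reach the same Collector instance: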

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep traces that contain errors
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }

      # Always keep slow traces (>1s)
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }

      # Keep 5% of successful, fast traces
      - name: probabilistic-ok
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

      # Always keep traces from important users
      - name: important-users
        type: string_attribute
        string_attribute:
          key: user.tier
          values: [enterprise, vip]
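
The four policies above are combined with OR semantics: a trace is kept as soon as any one of them matches. The sketch below models that evaluation so the interaction is easier to reason about. It is an illustration only: the Collector implements tail sampling in Go, every name here (`FinishedSpan`, `BufferedTrace`, `SamplingPolicy`, `shouldKeepTrace`, and the policy factories) is hypothetical, and the probabilistic policy is reduced to a deterministic bucket derived from the trace ID.

```typescript
// Illustrative model of tail-sampling policy evaluation (hypothetical names).
interface FinishedSpan {
  startMs: number;
  endMs: number;
  isError: boolean;
  attributes: Record<string, string>;
}

interface BufferedTrace {
  traceId: string; // 32 hex characters
  spans: FinishedSpan[];
}

type SamplingPolicy = (trace: BufferedTrace) => boolean;

// status_code policy: keep if any span reported an error.
const keepErrors: SamplingPolicy = (t) => t.spans.some((s) => s.isError);

// latency policy: measure from the earliest start to the latest end.
const keepSlow = (thresholdMs: number): SamplingPolicy => (t) => {
  if (t.spans.length === 0) return false;
  const start = Math.min(...t.spans.map((s) => s.startMs));
  const end = Math.max(...t.spans.map((s) => s.endMs));
  return end - start > thresholdMs;
};

// probabilistic policy, simplified: map the trace ID to one of 10,000 buckets.
const keepPercentage = (percentage: number): SamplingPolicy => (t) =>
  parseInt(t.traceId.slice(-8), 16) % 10000 < percentage * 100;

// string_attribute policy: keep if any span carries one of the listed values.
const keepAttribute = (key: string, values: string[]): SamplingPolicy => (t) =>
  t.spans.some((s) => {
    const value = s.attributes[key];
    return value !== undefined && values.includes(value);
  });

const tailPolicies: SamplingPolicy[] = [
  keepErrors,
  keepSlow(1000),
  keepPercentage(5),
  keepAttribute('user.tier', ['enterprise', 'vip']),
];

function shouldKeepTrace(trace: BufferedTrace): boolean {
  return tailPolicies.some((policy) => policy(trace));
}
```

One consequence of the OR semantics is that the 5% policy applies to every trace, not only to successful fast ones; it simply makes no difference for traces that an earlier policy already keeps.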

OpenTelemetry Collector Configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 5s
    limit_mib: 512

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
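      # To apply a tail_sampling processor, define it under processors and list it here:
      # processors: [memory_limiter, tail_sampling, batch]
      processors: [memory_limiter, batch]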
      exporters: [otlp/jaeger, otlp/tempo]
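
The `batch` processor in this configuration flushes on whichever comes first: `send_batch_size` spans collected, or `timeout` elapsed since the batch was started. The sketch below models those two triggers with an injected clock so the behaviour can be followed step by step. `SpanBatcher` and its members are hypothetical names for this illustration and do not reflect how the Collector is implemented internally.

```typescript
// Illustrative model of size-or-timeout batching (hypothetical class).
class SpanBatcher<T> {
  private pending: T[] = [];
  private firstItemAtMs: number | undefined = undefined;
  private readonly sendBatchSize: number;
  private readonly timeoutMs: number;
  private readonly exportBatch: (batch: T[]) => void;

  constructor(sendBatchSize: number, timeoutMs: number, exportBatch: (batch: T[]) => void) {
    this.sendBatchSize = sendBatchSize;
    this.timeoutMs = timeoutMs;
    this.exportBatch = exportBatch;
  }

  // Add one item; flush immediately once the batch is full.
  add(item: T, nowMs: number): void {
    if (this.pending.length === 0) this.firstItemAtMs = nowMs;
    this.pending.push(item);
    if (this.pending.length >= this.sendBatchSize) this.flush();
  }

  // Call periodically; flushes a partial batch once the timeout has elapsed.
  tick(nowMs: number): void {
    if (this.firstItemAtMs !== undefined && nowMs - this.firstItemAtMs >= this.timeoutMs) {
      this.flush();
    }
  }

  private flush(): void {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.firstItemAtMs = undefined;
    this.exportBatch(batch);
  }
}
```

With the values from the configuration, `new SpanBatcher(1024, 1000, send)` would export under heavy load every 1024 spans and under light load about once per second, which is why a short timeout keeps traces visible in the UI quickly even on quiet services.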

Jaeger Deployment

# jaeger-all-in-one for development environments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.57
          ports:
            - containerPort: 16686  # UI
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200

Correlating Traces and Logs

Attach the trace ID and span ID to every log line so that logs can be correlated with traces:

import { trace } from '@opentelemetry/api';
import winston from 'winston';

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.printf(({ level, message, timestamp, ...meta }) => {
      const span = trace.getActiveSpan();
      const traceId = span?.spanContext().traceId;
      const spanId = span?.spanContext().spanId;
      return JSON.stringify({ timestamp, level, message, traceId, spanId, ...meta });
    })
  ),
  transports: [new winston.transports.Console()],
});
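
Once every log line carries a `traceId`, pulling up the logs for a trace you found in Jaeger becomes a simple filter. The sketch below shows that lookup over JSON lines shaped like the logger output above. `CorrelatedLog` and `logsForTrace` are hypothetical names for this illustration; in practice a log backend would run the equivalent query.

```typescript
// Shape of one JSON log line produced by the logger above (extra fields are ignored).
interface CorrelatedLog {
  timestamp: string; // ISO 8601, so string order equals time order
  level: string;
  message: string;
  traceId?: string;
  spanId?: string;
}

// Hypothetical helper: return the log entries of one trace in chronological order.
function logsForTrace(jsonLines: string[], traceId: string): CorrelatedLog[] {
  const matches: CorrelatedLog[] = [];
  for (const line of jsonLines) {
    try {
      const entry = JSON.parse(line) as CorrelatedLog;
      if (entry.traceId === traceId) matches.push(entry);
    } catch {
      // Skip lines that are not valid JSON objects, such as startup banners.
    }
  }
  return matches.sort((a, b) => String(a.timestamp).localeCompare(String(b.timestamp)));
}
```

Lines logged outside any active span have no `traceId` (JSON.stringify omits undefined fields), so they never match and stay out of the result.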

Conclusion

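OpenTelemetry provides a vendor-neutral, future-proof foundation for observability. Auto-instrumentation captures most spans with minimal code changes. Custom spans expose business-level context that infrastructure metrics cannot provide. Tail sampling in the Collector controls cost without sacrificing visibility into errors and slow requests. Jaeger or Grafana Tempo provides the UI for navigating these traces and quickly diagnosing production issues.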