
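The gap between "it runs" and "production-ready"
Getting an application to run on Kubernetes is easy. Making it production-ready, with appropriate resource limits, security policies, observability, and runbooks, takes considerably more work.
This guide covers the patterns and configurations that experienced Kubernetes engineers apply before going to production.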

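Resource requests and limits: getting them right
Misconfigured resource requests are one of the most common causes of production problems:
resources:
  requests:
    memory: "256Mi"   # request: the scheduler uses this for placement
    cpu: "100m"       # the node must have 256Mi and 100m still unallocated
  limits:
    memory: "512Mi"   # limit: a container that exceeds it is killed (OOMKilled)
    cpu: "1000m"      # limit: CPU use above this is throttled (not killed)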
Finding the right values:
# Step 1: deploy to dev/staging without limits first
# Step 2: monitor actual usage (requires metrics-server)
kubectl top pods --containers
# Or use the Vertical Pod Autoscaler in recommendation mode:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # "Off" = recommend only; running Pods are not changed
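Common mistakes:
# ❌ No requests: the scheduler cannot tell where the Pod fits
# ❌ Requests = limits: no burst headroom, so CPU is throttled whenever load spikes
# ❌ Very high CPU limits: can starve other Pods on the node
# ❌ Very high memory limits: can push the node itself into OOM
# ✅ Pattern: requests ≈ average usage, limits ≈ 2-3x requests (more for bursty CPU)
resources:
  requests:
    memory: "256Mi"   # ~average memory usage
    cpu: "100m"       # ~average CPU usage
  limits:
    memory: "512Mi"   # 2x to absorb peaks
    cpu: "500m"       # 5x for CPU bursts (throttling is better than being killed)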
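One way to guard against the "no requests" mistake is a namespace-level LimitRange, which fills in defaults for containers that omit them. The following is a minimal sketch, assuming a `production` namespace; the object name and the values are illustrative, not recommendations:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources   # hypothetical name
  namespace: production                # assumed namespace
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        memory: "256Mi"
        cpu: "100m"
      default:               # applied when a container sets no limits
        memory: "512Mi"
        cpu: "500m"
```

Defaults like these are a safety net rather than a substitute for measured values, so workloads with known usage should still set their own requests and limits.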
PodDisruptionBudgets
A PDB stops Kubernetes from evicting too many Pods at once during voluntary disruptions such as node maintenance and upgrades:
# Ensure at least 2 Pods are always available
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2   # or: maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
# Check PDB status
kubectl get pdb
# NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# my-app-pdb   2               N/A               1                     5d
# When a drain would evict a Pod in violation of the PDB:
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# Error: Cannot evict pod as it would violate the pod's disruption budget.
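The `maxUnavailable` alternative noted in the PDB comment can be sketched as follows. It is often the more forgiving form when replica counts change: `minAvailable: 2` on a Deployment scaled down to two replicas allows zero evictions and blocks every drain, whereas `maxUnavailable: 1` always permits one. This sketch assumes the same `app: my-app` label and is meant to replace the PDB above, not to run alongside it:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1      # at most one matching Pod may be evicted at a time
  selector:
    matchLabels:
      app: my-app
```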

RBAC: role-based access control
# Create a ServiceAccount for your application (principle of least privilege)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
---
# Role: the permitted actions (namespace-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-app-role
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["my-app-secrets"]   # this specific Secret only
    verbs: ["get"]
---
# RoleBinding: bind the Role to the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: my-app
    namespace: production
roleRef:
  kind: Role
  apiGroup: rbac.authorization.k8s.io
  name: my-app-role
---
# Developer access: read-only permissions in a specific namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: namespace-viewer
rules:
  - apiGroups: ["", "apps", "autoscaling"]
    resources: ["pods", "deployments", "replicasets", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
  # pods/exec is deliberately not listed: RBAC rules only grant access,
  # so omitting a resource is what denies it (a rule with empty verbs is rejected)
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-team-viewer
  namespace: staging
subjects:
  - kind: Group
    name: "dev-team"   # comes from your OIDC provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: namespace-viewer
  apiGroup: rbac.authorization.k8s.io
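The `my-app` ServiceAccount defined earlier in this section only takes effect once a workload runs under it. A minimal sketch of how a Deployment might reference it, assuming the `app: my-app` label and the `my-app:1.0.0` image used elsewhere in this guide:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      serviceAccountName: my-app   # Pods get the permissions bound to this account
      containers:
        - name: app
          image: my-app:1.0.0
```

To check what the account is actually allowed to do, `kubectl auth can-i get secret/my-app-secrets --as=system:serviceaccount:production:my-app -n production` should answer `yes`, and the same query for any other Secret should answer `no`.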
Network policies
By default, all Pods can talk to each other. Network policies add firewall rules (they only take effect if the cluster's network plugin enforces NetworkPolicy):
# Deny all inbound traffic in the namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}   # applies to all Pods
  policyTypes:
    - Ingress
---
# Allow specific traffic to reach the API
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend      # only from frontend Pods
        - namespaceSelector:
            matchLabels:
              name: monitoring   # and from the monitoring namespace (it must carry this label)
      ports:
        - protocol: TCP
          port: 3000
---
# Allow database access only from API Pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-db-from-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 5432
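A detail that is easy to miss in the API policy: two separate entries under `from` are ORed, so any frontend Pod or any Pod in the monitoring namespace is admitted. Putting both selectors in a single entry ANDs them instead. A sketch of the stricter form, assuming the monitoring Pods carry a hypothetical `app: prometheus` label:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-from-prometheus   # hypothetical name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        # One list entry holding both selectors = AND:
        # only Pods labelled app=prometheus inside the monitoring namespace
        - namespaceSelector:
            matchLabels:
              name: monitoring
          podSelector:
            matchLabels:
              app: prometheus       # assumed label
      ports:
        - protocol: TCP
          port: 3000
```

The only difference from the OR form is the missing dash before `podSelector`, which is why this is worth checking in review.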

Managing secrets with the External Secrets Operator
Keeping secrets only in Kubernetes Secrets is not ideal: the values are merely base64-encoded, not encrypted, and sit in etcd in plain form unless encryption at rest is enabled. Use the External Secrets Operator together with AWS Secrets Manager or HashiCorp Vault:
# After installing the External Secrets Operator:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: ClusterSecretStore
  target:
    name: db-credentials    # creates this K8s Secret
    creationPolicy: Owner
  data:
    - secretKey: url        # key in the K8s Secret
      remoteRef:
        key: production/db  # AWS Secrets Manager path
        property: url       # JSON property inside the secret
    - secretKey: password
      remoteRef:
        key: production/db
        property: password
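The ExternalSecret above refers to a ClusterSecretStore named `aws-secretsmanager`, which has to exist before anything syncs. A minimal sketch of what it might look like, assuming AWS authentication through a service account with an IAM role attached; the region, the service account name, and its namespace are placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secretsmanager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1              # assumed region
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets        # assumed service account
            namespace: external-secrets   # assumed namespace
```

Which API version applies depends on the operator release installed, so it is worth confirming with `kubectl api-resources` before copying either manifest.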
Observability stack
# ServiceMonitor (Prometheus Operator): scrape the application's metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-metrics
  labels:
    release: prometheus   # must match the Prometheus operator's selector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics       # port name in the Service
      path: /metrics
      interval: 30s
---
# PrometheusRule: define alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          # sum() on both sides so the ratio is 5xx over all requests;
          # without it the division only pairs each 5xx series with itself
          expr: |
            sum(rate(http_requests_total{status=~"5..", app="my-app"}[5m])) /
            sum(rate(http_requests_total{app="my-app"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on my-app"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: critical
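The ServiceMonitor selects Services by label and scrapes a port by name, so the Service has to expose both. A sketch of a matching Service, assuming the application serves metrics on port 9090 (the port number is a placeholder):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app          # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: my-app
  ports:
    - name: http
      port: 80
      targetPort: 3000
    - name: metrics      # referenced by endpoints[].port in the ServiceMonitor
      port: 9090         # assumed metrics port
      targetPort: 9090
```

A ServiceMonitor that silently scrapes nothing is usually down to one of these two: the label sits on the Pods but not on the Service, or the port is unnamed.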
Deployment strategies
# Blue-green deployment with a Service
---
# Blue Deployment (current)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      version: blue
  template:
    metadata:
      labels:
        app: my-app
        version: blue
    spec:
      containers:
        - name: app
          image: my-app:1.0.0
---
# Green Deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      version: green
  template:
    metadata:
      labels:
        app: my-app
        version: green
    spec:
      containers:
        - name: app
          image: my-app:2.0.0
---
# Service: switch traffic by changing the selector
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue   # ← change to "green" to switch traffic
  ports:
    - port: 80
      targetPort: 3000
# Canary deployment with weighted traffic splitting
# (using Nginx Ingress annotations)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # 10% of traffic to the new version
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-v2   # new version
                port:
                  number: 80
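A canary Ingress does nothing on its own: ingress-nginx applies the weight relative to a regular, non-canary Ingress for the same host and path. A sketch of that primary Ingress, assuming the stable version sits behind a Service named `my-app` and the ingress class is called `nginx`:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app                 # hypothetical name for the primary Ingress
spec:
  ingressClassName: nginx      # assumed class name
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app   # stable version, receives the remaining 90%
                port:
                  number: 80
```

Raising `canary-weight` step by step and finally pointing the primary Ingress at the new Service completes the rollout.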
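Init containers and sidecars
spec:
  initContainers:
    # Init containers run one at a time, in order, before the main containers start
    - name: wait-for-db
      image: busybox
      command: ['sh', '-c',
        'until nc -z postgres-service 5432; do echo waiting; sleep 2; done']
    - name: migrate-db   # runs only once the database is reachable
      image: my-app:2.0.0
      command: ["./migrate", "--up"]
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
  containers:
    - name: app
      image: my-app:2.0.0
      volumeMounts:
        - name: log-volume   # the app writes its logs here
          mountPath: /var/log/app
    # Sidecar containers
    - name: log-shipper
      image: fluent/fluent-bit:latest   # pin a specific tag in production
      # Ships logs from the shared volume to the logging system
      volumeMounts:
        - name: log-volume
          mountPath: /var/log/app
    - name: cloud-sql-proxy   # common pattern for GCP Cloud SQL
      image: gcr.io/cloudsql-docker/gce-proxy:latest
      command: ["/cloud_sql_proxy", "-instances=project:region:instance=tcp:5432"]
  volumes:
    - name: log-volume   # shared between the app and the log shipper
      emptyDir: {}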
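On clusters with native sidecar support (the SidecarContainers feature, enabled by default since Kubernetes 1.29), a sidecar can instead be declared as an init container with `restartPolicy: Always`. It then starts before the main container and keeps running alongside it. A sketch of the log shipper in that form, reusing the names from the spec above and assuming the cluster version supports the field:

```yaml
spec:
  initContainers:
    - name: log-shipper
      image: fluent/fluent-bit:latest   # pin a specific tag in production
      restartPolicy: Always             # marks this init container as a sidecar
      volumeMounts:
        - name: log-volume
          mountPath: /var/log/app
  containers:
    - name: app
      image: my-app:2.0.0
      volumeMounts:
        - name: log-volume
          mountPath: /var/log/app
  volumes:
    - name: log-volume
      emptyDir: {}
```

The practical gain is ordering: the shipper is up before the app writes its first log line and is stopped only after the app exits, which a plain second container does not guarantee.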
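The biggest productivity gain from learning Kubernetes production patterns comes from understanding the declarative model: you describe the state you want, and the cluster continuously converges toward it. Debugging then mostly comes down to asking: "What is the desired state? What is the actual state? Why don't they match?"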