Elasticsearch in Production: Indexing, Search Relevance, and Cluster Operations

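Building production-grade Elasticsearch: ILM for time-series data, mapping design, relevance tuning for BM25 and vector search, bulk indexing, and cluster health management.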

Elasticsearch in Production: A Practical Operations Guide

Elasticsearch powers search at GitHub, Wikipedia, and Shopify. Running it in production requires understanding how it differs from a traditional database.

Index Mapping Design

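Unlike a traditional database, Elasticsearch does not let you change the type or analyzer of an existing field in place. You can add new fields to a mapping, but altering an existing one means reindexing into a new index. Get the design right at the start: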

PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index.refresh_interval": "30s",
    "analysis": {
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball", "synonym_filter"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": ["iphone,apple phone => phone", "laptop,notebook,computer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "name": {
        "type": "text",
        "analyzer": "product_analyzer",
        "fields": {
          "keyword": { "type": "keyword" },
          "completion": { "type": "completion" }
        }
      },
      "price": { "type": "scaled_float", "scaling_factor": 100 },
      "category": { "type": "keyword" },
      "in_stock": { "type": "boolean" },
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
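
Because a mapping change means reindexing, a common way to keep searches running during the move is to query through an alias and repoint it once the copy finishes. The following is a minimal sketch of the two request bodies involved, for `POST /_reindex` and `POST /_aliases`. The versioned index names (`products_v1`, `products_v2`) and the alias `products_search` are hypothetical and are not part of the mapping above.

```python
import json
from typing import Any, Dict


def reindex_body(source_index: str, dest_index: str) -> Dict[str, Any]:
    """Body for POST /_reindex: copy documents from source_index to dest_index."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def alias_swap_body(alias: str, old_index: str, new_index: str) -> Dict[str, Any]:
    """Body for POST /_aliases: the remove and add are applied as one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: create products_v2 with the new mapping first,
    # then copy the data and repoint the alias.
    print(json.dumps(reindex_body("products_v1", "products_v2"), indent=2))
    print(json.dumps(alias_swap_body("products_search", "products_v1", "products_v2"), indent=2))
```

Clients that only ever query the alias do not need to change when the underlying index does.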

Index Lifecycle Management (ILM)

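Automate rollover, optimization, and deletion of time-series indices. Note that the freeze action used in the cold phase is deprecated since 7.14 and does nothing in 8.x: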

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d", "max_docs": 100000000 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": { "min_age": "30d", "actions": { "freeze": {} } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}
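
The hot phase above lists three rollover conditions, and rollover happens as soon as any one of them is met, not all three. A minimal sketch of that decision in Python follows; the `IndexStats` type and `should_rollover` function are hypothetical illustrations of the rule, not part of any Elasticsearch client.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class IndexStats:
    primary_size_bytes: int  # total size of all primary shards
    age_days: float          # time since the index was created
    doc_count: int           # number of documents in the index


def should_rollover(
    stats: IndexStats,
    max_size_bytes: int = 50 * GB,
    max_age_days: float = 1.0,
    max_docs: int = 100_000_000,
) -> bool:
    """Return True when at least one rollover condition has been reached."""
    return (
        stats.primary_size_bytes >= max_size_bytes
        or stats.age_days >= max_age_days
        or stats.doc_count >= max_docs
    )


if __name__ == "__main__":
    quiet_day = IndexStats(primary_size_bytes=2 * GB, age_days=1.5, doc_count=1_000_000)
    print(should_rollover(quiet_day))  # age alone is enough to trigger rollover
```

In practice this means a low-traffic index still rolls over daily, while a burst of traffic can trigger several rollovers in one day.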

Search Relevance Tuning

GET /products/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [{
            "multi_match": {
              "query": "wireless headphones",
              "fields": ["name^3", "description^1", "tags^2"],
              "type": "best_fields",
              "fuzziness": "AUTO",
              "minimum_should_match": "75%"
            }
          }],
          "filter": [
            { "term": { "in_stock": true } },
            { "range": { "price": { "gte": 20, "lte": 500 } } }
          ]
        }
      },
      "functions": [
        { "gauss": { "rating": { "origin": 5.0, "scale": 2.0, "decay": 0.5 } }, "weight": 2 },
        { "gauss": { "created_at": { "origin": "now", "scale": "30d", "decay": 0.5 } }, "weight": 0.5 }
      ],
      "score_mode": "multiply"
    }
  }
}
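
To see what the two `gauss` functions in the query above contribute, it helps to compute the decay by hand. The sketch below follows the Gaussian decay formula from the Elasticsearch documentation, in which a value exactly `scale` away from `origin` scores `decay`. The function names are hypothetical, and document age is expressed in days as a simplification of the date arithmetic.

```python
import math


def gauss_decay(value: float, origin: float, scale: float,
                decay: float = 0.5, offset: float = 0.0) -> float:
    """Gaussian decay: 1.0 at the origin, `decay` at a distance of `scale`."""
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    distance = max(0.0, abs(value - origin) - offset)
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))


def combined_multiplier(rating: float, age_days: float) -> float:
    """Mirror the query: each function is scaled by its weight, then
    score_mode "multiply" multiplies the function scores together."""
    rating_factor = 2.0 * gauss_decay(rating, origin=5.0, scale=2.0)
    recency_factor = 0.5 * gauss_decay(age_days, origin=0.0, scale=30.0)
    return rating_factor * recency_factor


if __name__ == "__main__":
    for rating, age in [(5.0, 0.0), (3.0, 30.0), (4.5, 7.0)]:
        print(rating, age, round(combined_multiplier(rating, age), 4))
```

With the default `boost_mode` of `multiply`, the text score is multiplied by this value. Because the weights 2 and 0.5 cancel, a brand-new five-star product keeps its text score unchanged and every other product is scaled down.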

Hybrid Search: BM25 + Vector Embeddings

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("https://localhost:9200")
model = SentenceTransformer('all-mpnet-base-v2')  # 768-dim output, matching "dims": 768 in the mapping (all-MiniLM-L6-v2 yields 384)

def semantic_search(query: str, size: int = 10):
    embedding = model.encode(query).tolist()
    return es.search(
        index='products',
        body={
            "query": {
                "bool": {
                    "should": [{
                        "multi_match": {
                            "query": query,
                            "fields": ["name^2", "description"],
                            "boost": 0.7
                        }
                    }]
                }
            },
            "knn": {
                "field": "embedding",
                "query_vector": embedding,
                "k": 10,
                "num_candidates": 100,
                "boost": 0.3
            },
            "size": size
        }
    )["hits"]["hits"]
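
When a request carries both a `query` and a top-level `knn` section, each hit's score is the boosted query score plus the boosted kNN score. The sketch below reproduces that arithmetic under the assumption that the `cosine` similarity score is `(1 + cosine) / 2`, as the `dense_vector` documentation describes. The helper names are hypothetical.

```python
from typing import Optional


def cosine_to_score(cosine: float) -> float:
    """Map cosine similarity in [-1, 1] to a kNN score in [0, 1]."""
    return (1.0 + cosine) / 2.0


def hybrid_score(bm25_score: Optional[float], cosine: Optional[float],
                 query_boost: float = 0.7, knn_boost: float = 0.3) -> float:
    """Sum of the boosted parts; a hit found by only one side gets only that part."""
    total = 0.0
    if bm25_score is not None:
        total += query_boost * bm25_score
    if cosine is not None:
        total += knn_boost * cosine_to_score(cosine)
    return total


if __name__ == "__main__":
    print(hybrid_score(10.0, 0.6))   # matched by both the text query and kNN
    print(hybrid_score(None, 1.0))   # found by kNN only
```

BM25 scores are unbounded while the kNN score stays within [0, 1], so boosts of 0.7 and 0.3 do not amount to a 70/30 split. It is worth inspecting real score ranges before settling on boost values.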

Bulk Indexing

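from elasticsearch.helpers import parallel_bulk

def generate_actions(products):
    # One bulk action per product; product_id doubles as the document _id
    for p in products:
        yield {"_index": "products", "_id": p["product_id"], "_source": p}

# `es` is the client created earlier; `products` (an iterable of dicts) and
# `handle_error` are assumed to be defined by the caller.
for ok, action in parallel_bulk(
    es, generate_actions(products),
    thread_count=4, chunk_size=1000,
    max_chunk_bytes=10 * 1024 * 1024,
    raise_on_error=False,  # yield failures instead of raising BulkIndexError
):
    if not ok:
        handle_error(action)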
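
The `chunk_size` and `max_chunk_bytes` arguments both cap the size of each bulk request, and whichever limit is hit first closes the chunk. The following is a simplified sketch of that batching rule, not the helper's actual implementation. `chunk_actions` is a hypothetical name, and the byte count is approximated from the JSON serialization of each action.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def chunk_actions(actions: Iterable[Dict[str, Any]], chunk_size: int,
                  max_chunk_bytes: int) -> Iterator[List[Dict[str, Any]]]:
    """Group actions into chunks bounded by action count and serialized size."""
    chunk: List[Dict[str, Any]] = []
    chunk_bytes = 0
    for action in actions:
        size = len(json.dumps(action).encode("utf-8")) + 1  # +1 for the newline
        # Close the current chunk if adding this action would break either limit.
        if chunk and (len(chunk) >= chunk_size or chunk_bytes + size > max_chunk_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(action)
        chunk_bytes += size
    if chunk:
        yield chunk


if __name__ == "__main__":
    docs = [{"_id": i} for i in range(5)]
    print([len(c) for c in chunk_actions(docs, chunk_size=2, max_chunk_bytes=10**6)])
```

An action larger than `max_chunk_bytes` still goes out on its own in this sketch, so very large documents are worth checking for before indexing.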

Cluster Health Monitoring

GET /_cluster/health
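# green: all shards assigned | yellow: some replicas unassigned | red: at least one primary shard unassigned (data unavailable, risk of loss)

GET /_cat/nodes?v&h=name,heapPercent,cpu,load_1m
# heapPercent > 85%: warning sign of GC pressure

# Clear unassigned replicas (single-node development environments only)
PUT /my_index/_settings
{ "number_of_replicas": 0 }

# Explain why a shard is unassigned
GET /_cluster/allocation/explain

# Circuit breaker (guards against OOM)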
PUT /_cluster/settings
{
  "persistent": { "indices.breaker.total.limit": "70%" }
}
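
The checks above can be folded into a small script that turns the two responses into warnings. This sketch assumes the `_cluster/health` response parsed as a dict and the `_cat/nodes` output requested with `format=json`, so that each node is a dict keyed by the requested column names with string values. The function name and the exact key names are assumptions.

```python
from typing import Any, Dict, List


def assess_cluster(health: Dict[str, Any], nodes: List[Dict[str, str]],
                   heap_warn_percent: int = 85) -> List[str]:
    """Return human-readable warnings; an empty list means nothing to report."""
    warnings: List[str] = []
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    if status == "red":
        warnings.append(f"cluster is red: primary shards unassigned ({unassigned} shards)")
    elif status == "yellow":
        warnings.append(f"cluster is yellow: replicas unassigned ({unassigned} shards)")
    for node in nodes:
        heap = int(node["heapPercent"])
        if heap > heap_warn_percent:
            warnings.append(f"node {node['name']}: heap at {heap}%, GC pressure likely")
    return warnings


if __name__ == "__main__":
    sample_health = {"status": "yellow", "unassigned_shards": 3}
    sample_nodes = [{"name": "es-1", "heapPercent": "91"}, {"name": "es-2", "heapPercent": "40"}]
    for line in assess_cluster(sample_health, sample_nodes):
        print(line)
```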

Slow Logs

PUT /my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "1s",
  "index.search.slowlog.threshold.fetch.warn": "500ms"
}

Force Merge for Read-Only Indices

POST /my_old_index/_forcemerge?max_num_segments=1
# Reduces the segment count and reclaims disk space held by deleted documents in cold indices

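Teams that run Elasticsearch clusters smoothly tend to define mappings up front, apply ILM to time-series data, tune relevance for their business domain, and proactively monitor heap usage.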