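Elasticsearch: Multi-fields and Configuring a Custom Analyzer

1. Multi-fields

Related reading: https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-params.html

When dynamic mapping is used, Elasticsearch maps a string field to the text type and adds a keyword sub-field by default. The keyword type holds exact values: Elasticsearch does not tokenize them but indexes each value as a single term, so a query matches only when it equals the whole value. The text type, by contrast, holds full text, that is, unstructured text data that is analyzed into terms.

The following request creates the users index with an explicit mapping: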

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword" // add a sub-field named keyword, of type keyword
          }
        }
      },
      "address": {
        "type": "text",
        "fields": {
          "other_address": { // sub-field other_address
            "type": "text",
            "analyzer": "stop", // index time: the stop analyzer produces the terms for the inverted index
            "search_analyzer": "whitespace" // search time: the whitespace analyzer tokenizes the query text
          }
        }
      }
    }
  }
}

Index the following document:

PUT users/_doc/1
{
  "name": "Hello ShotoZheng",
  "address": "fish Eating, I'M Winer"
}

// No hit: the keyword sub-field holds the whole value "Hello ShotoZheng" as a single term
POST users/_search
{
  "query": {
    "match": {
      "name.keyword": "ShotoZheng"
    }
  }
}

// Hit: the query text equals the whole indexed value
POST users/_search
{
  "query": {
    "match": {
      "name.keyword": "Hello ShotoZheng"
    }
  }
}
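
To see why the first query misses and the second hits, here is a toy Python model of the two ways the name field is indexed. It is only a sketch under assumptions: analyze_text is a hypothetical stand-in for a text analyzer (it merely lowercases and splits on non-word characters), and plain sets stand in for the inverted index. It is not how Elasticsearch is implemented.

```python
import re

def analyze_text(value):
    """Hypothetical stand-in for a text analyzer: lowercase, split on non-word characters."""
    return [t for t in re.split(r"\W+", value.lower()) if t]

def analyze_keyword(value):
    """A keyword field is not tokenized: the whole value is a single term."""
    return [value]

doc_name = "Hello ShotoZheng"

# Terms that would end up in each (toy) inverted index
name_terms = set(analyze_text(doc_name))             # terms for the text field "name"
name_keyword_terms = set(analyze_keyword(doc_name))  # terms for "name.keyword"

def matches(query, analyze, indexed_terms):
    """Analyze the query text the same way as the field, then look up each term."""
    return any(term in indexed_terms for term in analyze(query))

print(matches("ShotoZheng", analyze_keyword, name_keyword_terms))        # False
print(matches("Hello ShotoZheng", analyze_keyword, name_keyword_terms))  # True
print(matches("ShotoZheng", analyze_text, name_terms))                   # True
```

The last line suggests that a partial search such as ShotoZheng belongs on the text field name, while name.keyword is the field to use for whole-value matching.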

2. search_analyzer

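In the example above, the other_address sub-field uses the stop analyzer as its analyzer and whitespace as its search_analyzer. First, the two settings mean the following:

analyzer: at index time, the value of a text field is analyzed into terms, which are then written to the inverted index.
search_analyzer: at search time, the query text for a text field is analyzed first, and the resulting terms are then looked up in the inverted index.

Since address.other_address uses the stop analyzer, we can use the _analyze API to see which terms it stores in the inverted index: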

GET /users/_analyze
{
  "field": "address.other_address",  // analyze with the analyzer of address.other_address
  "text": "fish Eating, I'M Winer"
}

// Result: the stop analyzer splits on non-letter characters (so I'M becomes i and m), lowercases, and removes English stop words
{
  "tokens" : [
    {
      "token" : "fish",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "eating",
      "start_offset" : 5,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "i",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "m",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "winer",
      "start_offset" : 17,
      "end_offset" : 22,
      "type" : "word",
      "position" : 4
    }
  ]
}

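Next, search address.other_address for I'M winer. Because search_analyzer is whitespace, the query text is split into I'M and winer, and each term is looked up in the inverted index. I'M finds nothing because the indexed terms are all lowercase, but winer is found, so the following query returns the document: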

POST users/_search
{
  "query": {
    "match": {
      "address.other_address": "I'M winer"
    }
  }
}

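The following query, by contrast, returns nothing: the whitespace analyzer does not lowercase its tokens, and neither I'M nor Winer exists in the inverted index: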

POST users/_search
{
  "query": {
    "match": {
      "address.other_address": "I'M Winer"
    }
  }
}

Note that when search_analyzer is not set, it defaults to the analyzer configured for the field.
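
The mismatch between index-time and search-time analysis can be reproduced with a toy Python model. This is a sketch, not Elasticsearch code: stop_analyze and whitespace_analyze are hypothetical approximations of the stop and whitespace analyzers, STOP_WORDS is only an excerpt of the English stop word list, and a set stands in for the inverted index.

```python
import re

# Assumption: a short excerpt of the English stop word list, enough for this example
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "the", "this", "to"}

def stop_analyze(text):
    """Toy stop analyzer: split on non-letters, lowercase, drop stop words."""
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    return [t for t in tokens if t not in STOP_WORDS]

def whitespace_analyze(text):
    """Toy whitespace analyzer: split on whitespace, keep case and punctuation."""
    return text.split()

# Index time: the "analyzer" decides which terms are stored
indexed_terms = set(stop_analyze("fish Eating, I'M Winer"))

def match(query):
    """Search time: the "search_analyzer" produces the query terms; any shared term is a hit."""
    return any(term in indexed_terms for term in whitespace_analyze(query))

print(match("I'M winer"))  # True, through the term "winer"
print(match("I'M Winer"))  # False, "Winer" was indexed in lowercase only
```

Pairing a lowercasing analyzer with a case-preserving search_analyzer, as here, makes searches case-sensitive in a way users rarely expect; the combination is useful for demonstration, but the two analyzers are normally chosen so that they produce compatible terms.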

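3. Custom analyzers

Related reading: https://www.elastic.co/guide/en/elasticsearch/reference/7.4/analysis-analyzers.html

When the analyzers built into Elasticsearch do not meet your needs, you can define a custom analyzer by combining the following three kinds of components, which run in this order:

Character Filters
Tokenizer
Token Filters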

3.1 Tokenizer

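A Tokenizer splits the original text into terms (tokens) according to a set of rules. Elasticsearch ships with tokenizers such as whitespace, standard, keyword, path_hierarchy, uax_url_email, and pattern. You can also develop a plugin in Java to implement your own Tokenizer.

The following example uses the whitespace tokenizer to split text on whitespace: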

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "123-456, I-test! test-990 650-555-1234"
}

// Tokenization result
{
  "tokens" : [
    {
      "token" : "123-456,",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "I-test!",
      "start_offset" : 9,
      "end_offset" : 16,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "test-990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "650-555-1234",
      "start_offset" : 26,
      "end_offset" : 38,
      "type" : "word",
      "position" : 3
    }
  ]
}
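
The offsets and positions in this response can be reproduced with a few lines of Python. The function below is a hypothetical toy equivalent of the whitespace tokenizer, written only to make the meaning of start_offset, end_offset, and position concrete.

```python
import re

def whitespace_tokenize(text):
    """Toy whitespace tokenizer: one token per run of non-whitespace characters."""
    return [
        {
            "token": m.group(),
            "start_offset": m.start(),  # index of the token's first character
            "end_offset": m.end(),      # index just past the token's last character
            "type": "word",
            "position": pos,            # ordinal of the token in the stream
        }
        for pos, m in enumerate(re.finditer(r"\S+", text))
    ]

tokens = whitespace_tokenize("123-456, I-test! test-990 650-555-1234")
for tok in tokens:
    print(tok)
```

Punctuation such as the comma and the exclamation mark stays attached to its token, because only whitespace counts as a separator.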

3.2 Character Filters

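Character Filters process the text before it reaches the Tokenizer, for example by adding, removing, or replacing characters. Several Character Filters can be configured on one analyzer. Because they change the text, they affect the position and offset information that the Tokenizer reports. Elasticsearch ships with the following three Character Filters:

HTML strip (html_strip): removes HTML tags
Mapping (mapping): replaces the configured strings
Pattern replace (pattern_replace): replaces text that matches a regular expression

A few examples:

// Replace characters with a mapping char filter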
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ "- => _"]
      }
    ],
  "text": "123-456, I-test! test-990 650-555-1234"
}

// Replace emoticons with a mapping char filter
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ ":) => happy", ":( => sad"]
      }
    ],
    "text": ["I am feeling :)", "Feeling :( today"]
}

// Replace with a regular expression
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "pattern_replace",
        "pattern" : "http://(.*)",
        "replacement" : "$1"
      }
    ],
    "text" : "http://www.elastic.co"
}
// Result of the regular expression replacement
{
  "tokens" : [
    {
      "token" : "www.elastic.co", // http:// has been removed
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
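
What the mapping and pattern_replace char filters do to the text before tokenization can be approximated in Python as follows. Both functions are hypothetical toy versions built on the re module, not the Elasticsearch implementations, and they ignore the offset correction that the real filters perform.

```python
import re

def mapping_char_filter(text, mappings):
    """Toy mapping char filter: replace every key with its value in a single pass."""
    # Try longer keys first so that overlapping keys prefer the longest match
    keys = sorted(mappings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mappings[m.group()], text)

def pattern_replace_char_filter(text, pattern, replacement):
    """Toy pattern_replace char filter built on re.sub."""
    # Python's re module writes group 1 as \1 where the Elasticsearch setting uses $1
    return re.sub(pattern, replacement, text)

print(mapping_char_filter("123-456, I-test!", {"-": "_"}))
print(mapping_char_filter("Feeling :( today", {":)": "happy", ":(": "sad"}))
print(pattern_replace_char_filter("http://www.elastic.co", r"http://(.*)", r"\1"))
```

The tokenizer only ever sees the filtered text, which is why the standard tokenizer in the requests above receives 123_456 rather than 123-456.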

3.3 Token Filters

Token Filters add, modify, or remove the terms that the Tokenizer emits. Elasticsearch ships with Token Filters such as lowercase, stop, and synonym. For example:

// Tokenize the text on whitespace, then lowercase the tokens and remove stop words such as in, are, and this
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop"],
  "text": ["The girls in China are playing this game!"]
}
// Analysis result
{
  "tokens" : [
    {
      "token" : "girls",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "china",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "playing",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game!",
      "start_offset" : 36,
      "end_offset" : 41,
      "type" : "word",
      "position" : 7
    }
  ]
}
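
Two details of this response are worth modelling: the surviving tokens keep their original positions (1, 3, 5, 7), and the order of the filters matters. The sketch below is a toy Python model with hypothetical function names; STOP_WORDS is only an excerpt of the English stop word list, and the stop filter is modelled as case-sensitive, which is the default behaviour of the Elasticsearch stop filter.

```python
# Assumption: a short excerpt of the default English stop word list
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "the", "this", "to"}

def whitespace_tokenize(text):
    """Split on whitespace and record each token's position."""
    return [{"token": tok, "position": pos} for pos, tok in enumerate(text.split())]

def lowercase_filter(tokens):
    """Lowercase every token, leaving positions untouched."""
    return [{**t, "token": t["token"].lower()} for t in tokens]

def stop_filter(tokens):
    """Drop stop words (case-sensitive); survivors keep their original positions."""
    return [t for t in tokens if t["token"] not in STOP_WORDS]

def run(text, filters):
    """Tokenize, then apply the token filters in the given order."""
    tokens = whitespace_tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return [(t["token"], t["position"]) for t in tokens]

text = "The cats are in THE box"
print(run(text, [lowercase_filter, stop_filter]))  # [('cats', 1), ('box', 5)]
print(run(text, [stop_filter, lowercase_filter]))  # [('the', 0), ('cats', 1), ('the', 4), ('box', 5)]
```

With stop placed before lowercase, The and THE slip past the stop word check, which is why the request above lists lowercase first.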

3.4 Custom Analyzer

You can define your own analyzer in the index settings, for example:

// Define the custom analyzer my_custom_analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard", // tokenize with the standard tokenizer
          "char_filter": [
            "html_strip" // strip HTML markup
          ],
          "filter": [
            "lowercase" // lowercase the tokens
          ]
          ]
        }
      }
    }
  }
}

// Use the custom analyzer
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}

// Analysis result
{
  "tokens" : [
    {
      "token" : "is",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "this",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "déjà",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "vu",
      "start_offset" : 16,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
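
The way a custom analyzer chains its three stages can be sketched in Python. Everything here is a simplified, hypothetical model: html_strip and standard_like_tokenize only approximate the real html_strip char filter and standard tokenizer, build_analyzer is an illustrative helper rather than an Elasticsearch API, and the sample text is an ASCII variant of the one used above.

```python
import re

def html_strip(text):
    """Toy html_strip char filter: drop anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)

def standard_like_tokenize(text):
    """Rough stand-in for the standard tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Toy lowercase token filter."""
    return [t.lower() for t in tokens]

def build_analyzer(char_filters, tokenizer, token_filters):
    """Compose the stages in order: char filters, then the tokenizer, then token filters."""
    def analyze(text):
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

my_custom_analyzer = build_analyzer([html_strip], standard_like_tokenize, [lowercase])
print(my_custom_analyzer("Is this <b>Deja Vu</b>?"))  # ['is', 'this', 'deja', 'vu']
```

An analyzer has exactly one tokenizer but any number of char filters and token filters, which is why the settings take a single tokenizer value and arrays for char_filter and filter.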