Elasticsearch: Multi Match Query

Related reading: https://www.elastic.co/guide/en/elasticsearch/reference/7.4/query-dsl-multi-match-query.html

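Elasticsearch provides the multi_match query for matching one query string against several fields. For example, the following request finds documents whose subject or message field matches the text this is a test: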

GET /_search
{
  "query": {
    "multi_match" : {
      "query":    "this is a test", 
      "fields": [ "subject", "message" ] 
    }
  }
}

multi_match supports several query types. This article covers the following three:

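Best fields (best_fields): when the fields compete with each other yet are related, the score comes from the best-matching field.
Most fields (most_fields): a common technique for English content is to stem words and add synonyms in the main field (english analyzer) so that more documents match, and to index the same text in a subfield (standard analyzer) for more exact matching. The additional fields act as signals that raise the relevance of matching documents; the more fields that match, the better.
Cross fields (cross_fields): for some entities, such as a person's name, an address, or book information, the information has to be identified across several fields, and each field holds only part of the whole. The goal is to find as many of the query terms as possible in any of the listed fields.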

1. best_fields

best_fields is the default type, so it does not have to be specified explicitly. First, index the following documents:

PUT /blogs/_doc/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /blogs/_doc/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

POST blogs/_doc/3
{
  "title" : "Luick brown rabbits",
  "body" : "Brown rabbits are commonly seen. Quick and pets"
}

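The following search ranks documents by their best-matching field, with an effect similar to dis_max. Document 3 is the best match because its body field contains both quick and pets; document 2 comes next, and document 1 last.

Setting minimum_should_match to 80% means that a field must match at least 80% of the terms in the analyzed query for the document to be returned. Quick pets is analyzed into 2 terms, so at least 2 * 0.8 = 1.6, rounded down to 1, term must match. Because all three documents contain at least one of the terms quick or pets, the DSL below returns all three. When 80% is changed to 100%, a single field must contain both quick and pets, so only document 3 is returned.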

POST blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields",
      "query": "Quick pets",
      "fields": ["title","body"],
      "tie_breaker": 0.2,
      "minimum_should_match": "80%"
    }
  }
}
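
The rounding described above can be checked with a small Python sketch. It is only an illustration of the arithmetic for positive percentages; the function name min_should_match is made up for this example and is not part of any Elasticsearch client.

```python
def min_should_match(term_count: int, percent: int) -> int:
    """Number of query terms that must match for a positive percentage.

    Illustrative only: covers values such as "80%" and rounds the result down.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point surprises when rounding down.
    return term_count * percent // 100


# "Quick pets" is assumed to be analyzed into two terms.
terms = "Quick pets".lower().split()
print(min_should_match(len(terms), 80))   # 1
print(min_should_match(len(terms), 100))  # 2
```

With two terms, 80% requires one matching term and 100% requires both, which is why raising the value narrows the result to document 3.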

2. most_fields

Start with an example. The request below sets the analyzer of the title field in the titles index to english, and then two documents are added:

PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}
POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }

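Next, search for barking dogs. Intuitively, document 2 contains the exact text and should have the higher relevance score, but document 1 actually scores higher. The reason is that the title field uses the english analyzer, so the inverted index stores the terms dog and bark for both documents. The scoring algorithm gives shorter fields a higher score, so document 1, with the shorter title, outscores document 2.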

GET titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}
// Response: document 1 has the higher relevance score
{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.42221838,
    "hits" : [
      {
        "_index" : "titles",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.42221838,
        "_source" : {
          "title" : "My dog barks"
        }
      },
      {
        "_index" : "titles",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.320886,
        "_source" : {
          "title" : "I see a lot of barking dogs on the road "
        }
      }
    ]
  }
}
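
The effect of field length can be reproduced with a rough BM25 calculation in Python. This is a sketch under several assumptions that the article does not state: the default BM25 parameters (k1 = 1.2, b = 0.75), a single shard, and field lengths of 3 and 6 tokens after the english analyzer has removed stop words. The helper name bm25_term_score is made up for this example.

```python
import math

K1 = 1.2   # assumed default BM25 k1
B = 0.75   # assumed default BM25 b


def bm25_term_score(doc_count, docs_with_term, term_freq, field_len, avg_field_len):
    """BM25 score of one term in one field (a common formulation, assumed here)."""
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Longer fields get a larger denominator, hence a smaller score.
    tf = term_freq / (term_freq + K1 * (1 - B + B * field_len / avg_field_len))
    return (K1 + 1) * idf * tf


# Assumed tokens after english analysis:
# doc 1 -> my, dog, bark ; doc 2 -> i, see, lot, bark, dog, road
field_lengths = {"1": 3, "2": 6}
avg_len = sum(field_lengths.values()) / len(field_lengths)

scores = {}
for doc_id, length in field_lengths.items():
    # Both query terms (bark, dog) occur once in each of the 2 documents.
    scores[doc_id] = 2 * bm25_term_score(2, 2, 1, length, avg_len)

print(scores)
```

Under these assumptions the sketch gives roughly 0.422 for document 1 and 0.321 for document 2, in line with the response above: the term statistics are identical, so the shorter title wins.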

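The fix is to rebuild the mapping and add a std subfield to title that uses the standard analyzer. PUT /titles fails if the index already exists, so delete the titles index first (DELETE /titles) and re-index the two documents after creating the new mapping. A multi-field search then matches on title to include as many documents as possible, which improves recall, while title.std serves as a signal that pushes the more relevant documents to the top of the results.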

PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {"std": {"type": "text","analyzer": "standard"}}
      }
    }
  }
}

GET /titles/_search
{
 "query": {
    "multi_match": {
        "query":  "barking dogs",
        "type":   "most_fields",
        "fields": [ "title", "title.std" ]
    }
  }
}
// Response
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.4569323,
    "hits" : [
      {
        "_index" : "titles",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.4569323,
        "_source" : {
          "title" : "I see a lot of barking dogs on the road "
        }
      },
      {
        "_index" : "titles",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.42221838,
        "_source" : {
          "title" : "My dog barks"
        }
      }
    ]
  }
}
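
How the per-field scores combine can be sketched in Python. The function names are made up for this example. The title score of document 2 is taken from the earlier single-field response, and its title.std score is inferred by subtracting that value from the total above, so treat both as illustrative inputs rather than values reported per field by Elasticsearch.

```python
def most_fields_score(field_scores):
    """most_fields: add up the scores of every matching field."""
    return sum(field_scores.values())


def best_fields_score(field_scores, tie_breaker=0.0):
    """best_fields: best field score plus tie_breaker times the other field scores."""
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie_breaker * others


# Only fields that matched are listed; title.std for document 2 is inferred.
doc1 = {"title": 0.42221838}
doc2 = {"title": 0.320886, "title.std": 1.1360463}

print(most_fields_score(doc1), most_fields_score(doc2))
print(best_fields_score(doc2, tie_breaker=0.2))
```

Document 1 matches only on title, so its score is unchanged, while document 2 gains the title.std score and moves to the top. The second print shows, for contrast, how best_fields with a tie_breaker of 0.2 would weight the same inputs.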

3. cross_fields

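cross_fields supports searching across fields. For example, with the documents below we want to use China Smith to query the last_name and country fields. The best result is a document whose last_name contains Smith and whose country contains China.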

POST users/_doc/1
{
  "first_name": "Shoto",
  "last_name": "Smith",
  "country": "China"
}
POST users/_doc/2
{
  "first_name": "Will",
  "last_name": "Smith",
  "country": "Poland"
}
POST users/_doc/3
{
  "first_name": "Will",
  "last_name": "Smith",
  "country": "Hongkong(China)"
}

This can be done with cross_fields. The query and response are as follows:

POST users/_search
{
  "query": {
    "multi_match": {
      "query": "China Smith",
      "type": "cross_fields",
      "operator": "and", 
      "fields": ["last_name", "country"]
    }
  }
}
// Response
{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.65707976,
    "hits" : [
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.65707976,
        "_source" : {
          "first_name" : "Shoto",
          "last_name" : "Smith",
          "country" : "China"
        }
      },
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.52372307,
        "_source" : {
          "first_name" : "Will",
          "last_name" : "Smith",
          "country" : "Hongkong(China)"
        }
      }
    ]
  }
}
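
Why documents 1 and 3 match but document 2 does not can be sketched in Python. The sketch models only the matching logic of the "and" operator, not scoring. The analyze function is a rough stand-in for the standard analyzer, and all function names are made up for this example.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_centric_match(doc, query, fields):
    """cross_fields with "and": every term must appear in at least one field."""
    field_tokens = [set(analyze(doc[f])) for f in fields]
    return all(any(term in tokens for tokens in field_tokens)
               for term in analyze(query))


def field_centric_match(doc, query, fields):
    """best_fields or most_fields with "and": one field must contain every term."""
    terms = analyze(query)
    return any(all(term in set(analyze(doc[f])) for term in terms)
               for f in fields)


users = {
    "1": {"first_name": "Shoto", "last_name": "Smith", "country": "China"},
    "2": {"first_name": "Will", "last_name": "Smith", "country": "Poland"},
    "3": {"first_name": "Will", "last_name": "Smith", "country": "Hongkong(China)"},
}
fields = ["last_name", "country"]

cross = [i for i, d in users.items() if term_centric_match(d, "China Smith", fields)]
field_centric = [i for i, d in users.items() if field_centric_match(d, "China Smith", fields)]
print(cross)          # ['1', '3']
print(field_centric)  # []
```

The term-centric check treats last_name and country as one combined field, so China and Smith may come from different fields. The field-centric check finds nothing here because no single field contains both terms.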