ElasticSearch-Multi Match Query

Related reading: https://www.elastic.co/guide/en/elasticsearch/reference/7.4/query-dsl-multi-match-query.html

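ElasticSearch provides the multi_match query for matching a query string against several fields. For example, the following request retrieves documents whose subject or message field matches the text this is a test: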

GET /_search
{
  "query": {
    "multi_match": {
      "query": "this is a test",
      "fields": ["subject", "message"]
    }
  }
}

multi_match supports several query types. This article covers the following three:

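  • Best fields (best_fields): when fields compete with each other yet are related, the score comes from the best-matching field.
  • Most fields (most_fields): a common technique for English content is to index the main field with the english analyzer, which extracts stems (and can add synonyms) so that more documents match, and to index the same text in a sub-field with the standard analyzer for more precise matching. The other fields act as signals that raise the relevance of matching documents: the more fields match, the better.
  • Cross fields (cross_fields): for some entities, such as person names, addresses, or book information, the information has to be established across several fields, and each field is only one part of the whole. The goal is to find as many of the query terms as possible in any of the listed fields.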
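
To make "the score comes from the best-matching field" concrete, here is a minimal Python sketch of how best_fields combines per-field scores. It assumes the dis_max behaviour described in the Elasticsearch reference: the best field's score, plus tie_breaker times the scores of the other matching fields. The function name and the sample scores are hypothetical and only illustrate the arithmetic; this is not Elasticsearch's actual implementation.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores the way a dis_max query is documented to."""
    matching = [s for s in field_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    others = sum(matching) - best
    # The best field dominates; the other fields contribute a fraction.
    return best + tie_breaker * others


# Hypothetical scores of one document in two fields (title, body).
scores = [0.5, 1.0]
print(best_fields_score(scores))       # only the best field counts
print(best_fields_score(scores, 0.2))  # best field plus 20% of the rest
```

With tie_breaker left at 0, a document that matches in two fields scores no higher than one that matches equally well in a single field; a small tie_breaker lets the extra field break such ties.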

1. best_fields

best_fields is the default type, so it does not need to be specified explicitly. First, create the following index and documents:

PUT /blogs/_doc/1
{
  "title": "Quick brown rabbits",
  "body": "Brown rabbits are commonly seen."
}

PUT /blogs/_doc/2
{
  "title": "Keeping pets healthy",
  "body": "My quick brown fox eats rabbits on a regular basis."
}

POST blogs/_doc/3
{
  "title": "Luick brown rabbits",
  "body": "Brown rabbits are commonly seen. Quick and pets"
}

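The following search ranks documents by their best-matching field, much like dis_max. Document 3 is the best match because its body field contains both terms, quick and pets; document 2 comes next and document 1 last. Setting minimum_should_match to 80% means that a document is returned only if it matches at least 80% of the terms in the analyzed query. Quick pets analyzes to 2 terms, so the threshold is 2 * 0.8 = 1.6, rounded down to 1: a document must match at least one term. All three documents contain at least one of the terms quick and pets, so the DSL below returns all three. If 80% is changed to 100%, a document must contain both quick and pets in the same field, so only document 3 is returned.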

POST blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields",
      "query": "Quick pets",
      "fields": ["title", "body"],
      "tie_breaker": 0.2,
      "minimum_should_match": "80%"
    }
  }
}
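
The minimum_should_match arithmetic for this query can be checked with a few lines of Python. This is a minimal sketch that assumes a positive percentage, rounded down as the Elasticsearch documentation describes; negative values and combined forms are not modelled, and the helper name required_matches is hypothetical.

```python
def required_matches(num_terms: int, percent: int) -> int:
    """Minimum number of query terms that must match for a positive percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Integer arithmetic rounds down, e.g. 2 * 80% = 1.6 becomes 1.
    return num_terms * percent // 100


terms = "Quick pets".lower().split()
for pct in (80, 100):
    print(f"{pct}% of {len(terms)} terms -> {required_matches(len(terms), pct)}")
```

For the two-term query, 80% requires one matching term and 100% requires both.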

2. most_fields

Let's start with an example. The requests below set the analyzer of the title field in the titles index to english and then add two documents:

PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }

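Now search for barking dogs. Intuitively, document 2 contains the exact text and should have the higher relevance score, but in fact document 1 scores higher. The reason is that the title field uses the english analyzer, so the inverted index stores the terms dog and bark for both documents. The scoring algorithm gives shorter fields a higher score, so document 1 outscores document 2.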

GET titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}
// Response: document 1 has the higher relevance score
{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.42221838,
    "hits" : [
      {
        "_index" : "titles",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.42221838,
        "_source" : {
          "title" : "My dog barks"
        }
      },
      {
        "_index" : "titles",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.320886,
        "_source" : {
          "title" : "I see a lot of barking dogs on the road "
        }
      }
    ]
  }
}
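
The effect of stemming on the two titles can be imitated with a toy analyzer. The sketch below is only an illustration: the stop-word set and the two suffix rules are assumptions chosen for these sentences, whereas the real english analyzer uses a full stop-word list and a proper stemmer. It shows that both titles end up containing the terms bark and dog, and that the first title is the shorter one.

```python
import re

# Assumed mini stop-word list, just enough for the two example titles.
STOP_WORDS = {"a", "of", "on", "the"}


def toy_english_terms(text):
    """Lowercase, drop stop words, and strip a trailing 'ing' or 's'."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOP_WORDS:
            continue
        for suffix in ("ing", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms


def plain_terms(text):
    """Lowercase and split only, roughly what an unstemmed field would store."""
    return re.findall(r"[a-z]+", text.lower())


doc1 = "My dog barks"
doc2 = "I see a lot of barking dogs on the road "
query = "barking dogs"

print(toy_english_terms(query))
print(toy_english_terms(doc1))
print(toy_english_terms(doc2))
print(plain_terms(doc1))
print(plain_terms(doc2))
```

After stemming, the query terms appear in both titles, so only field length separates them. Without stemming, only the second title contains barking and dogs literally.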

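The fix is to rebuild the mapping and add a std sub-field to title that uses the standard analyzer. With the request shown here, the existing titles index has to be deleted first and the two documents indexed again afterwards. A multi-field query then matches on title to include as many documents as possible and improve recall, while title.std acts as a signal that pushes the more relevant documents to the top of the results.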

PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {"std": {"type": "text", "analyzer": "standard"}}
      }
    }
  }
}

GET /titles/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "type": "most_fields",
      "fields": ["title", "title.std"]
    }
  }
}
// Response
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.4569323,
    "hits" : [
      {
        "_index" : "titles",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.4569323,
        "_source" : {
          "title" : "I see a lot of barking dogs on the road "
        }
      },
      {
        "_index" : "titles",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.42221838,
        "_source" : {
          "title" : "My dog barks"
        }
      }
    ]
  }
}
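
The two responses can be used to estimate how much the title.std sub-field contributed. The sketch below assumes that most_fields adds the per-field scores together and that the title score of document 2 is the same as in the earlier single-field query; both are assumptions, though the unchanged score of document 1 is consistent with them. The function name is hypothetical.

```python
def most_fields_score(field_scores):
    """Assumed combination rule: the per-field scores are added together."""
    return sum(field_scores)


# Scores copied from the two responses in this article.
title_score_doc2 = 0.320886   # match query on title only
combined_doc2 = 1.4569323     # most_fields on title and title.std
title_score_doc1 = 0.42221838

# Under the assumption above, the remainder is what title.std contributed.
title_std_doc2 = combined_doc2 - title_score_doc2
print(round(title_std_doc2, 7))

# Document 1 has no match in title.std, so its score stays unchanged.
print(most_fields_score([title_score_doc1, 0.0]))
```

Under these assumptions, most of document 2's score comes from the exact match in title.std, which is why it now outranks document 1.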

3. cross_fields

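cross_fields supports searching across fields. For example, given the documents below, suppose we want to search the last_name and country fields with China Smith. The ideal result is the set of documents whose last_name contains Smith and whose country contains China.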

POST users/_doc/1
{
  "first_name": "Shoto",
  "last_name": "Smith",
  "country": "China"
}

POST users/_doc/2
{
  "first_name": "Will",
  "last_name": "Smith",
  "country": "Poland"
}

POST users/_doc/3
{
  "first_name": "Will",
  "last_name": "Smith",
  "country": "Hongkong(China)"
}

This can be done with cross_fields. The query and response are as follows:

POST users/_search
{
  "query": {
    "multi_match": {
      "query": "China Smith",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["last_name", "country"]
    }
  }
}
// Response
{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.65707976,
    "hits" : [
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.65707976,
        "_source" : {
          "first_name" : "Shoto",
          "last_name" : "Smith",
          "country" : "China"
        }
      },
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.52372307,
        "_source" : {
          "first_name" : "Will",
          "last_name" : "Smith",
          "country" : "Hongkong(China)"
        }
      }
    ]
  }
}
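
Why documents 1 and 3 match while document 2 does not can be reproduced with a small Python sketch of the matching rule. It assumes that cross_fields with operator and is term-centric (every query term must appear in at least one of the listed fields), whereas a field-centric type such as best_fields with operator and needs a single field that contains every term. The tokenizer is a simplified stand-in for the standard analyzer, scoring is ignored, and all function names are hypothetical.

```python
import re

USERS = {
    "1": {"first_name": "Shoto", "last_name": "Smith", "country": "China"},
    "2": {"first_name": "Will", "last_name": "Smith", "country": "Poland"},
    "3": {"first_name": "Will", "last_name": "Smith", "country": "Hongkong(China)"},
}


def tokens(text):
    """Simplified analysis: lowercase and split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))


def cross_fields_and(doc, query, fields):
    """Term-centric: each query term must occur in at least one field."""
    field_terms = [tokens(doc[f]) for f in fields]
    return all(any(t in ft for ft in field_terms) for t in tokens(query))


def per_field_and(doc, query, fields):
    """Field-centric: one single field must contain every query term."""
    wanted = tokens(query)
    return any(wanted <= tokens(doc[f]) for f in fields)


fields = ["last_name", "country"]
cross = [i for i, d in USERS.items() if cross_fields_and(d, "China Smith", fields)]
single = [i for i, d in USERS.items() if per_field_and(d, "China Smith", fields)]
print(cross)
print(single)
```

The term-centric rule selects documents 1 and 3, in line with the response above, while the field-centric rule selects none because no single field holds both China and Smith.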