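Notes on Less Common Elasticsearch Operations

Sorting by a nested field

Elasticsearch can sort by a field inside nested objects, but a few details matter. First, the field must be mapped as type nested; otherwise Elasticsearch treats it as an ordinary object field. Second, the sort clause must say which nested path the field belongs to and, optionally, which nested documents to consider. The requests in these notes use the nested_path and nested_filter parameters for this; they have been deprecated since 6.1 in favor of a nested object with path and filter, so check which form your version accepts. Finally, the sort clause needs a mode parameter, which tells Elasticsearch how to pick a single sort value from an array or multi-valued field. mode accepts the following values: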

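min: use the lowest value.
max: use the highest value.
sum: use the sum of all values (numeric fields only).
avg: use the average of all values (numeric fields only).
median: use the median of all values (numeric fields only).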
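
To make the modes concrete, here is a small Python sketch of how each one collapses several values into a single sort key. It is only an in-memory model of the behaviour described above, not Elasticsearch code; the function name `reduce_sort_values` and the sample ratings are made up for illustration.

```python
from statistics import median


def reduce_sort_values(values, mode):
    """Collapse the values of a multi-valued field into one sort key."""
    reducers = {
        "min": min,
        "max": max,
        "sum": sum,
        "avg": lambda vs: sum(vs) / len(vs),
        "median": median,
    }
    if mode not in reducers:
        raise ValueError(f"unsupported sort mode: {mode}")
    return reducers[mode](values)


ratings = [3, 5, 4]  # hypothetical ratings taken from one document's comments
for mode in ("min", "max", "sum", "avg", "median"):
    print(mode, reduce_sort_values(ratings, mode))
```

The sketch assumes a non-empty list of numbers; how Elasticsearch treats documents without any value is controlled separately and is not modelled here.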

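For example, suppose there is a nested field named comments with two sub-fields, user and rating, and documents should be ordered by the average rating of their comments, highest first. The request looks like this: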

http

GET /_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "comments.rating": {
        "order": "desc",
        "mode": "avg",
        "nested_path": "comments",
        "nested_filter": {
          "match_all": {}
        }
      }
    }
  ]
}
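
For newer versions, the same sort can be expressed with the nested object (path plus an optional filter) instead of nested_path and nested_filter. The Python sketch below only builds the request body as a dictionary; `nested_sort` is a hypothetical helper written for these notes, and sending the body to a cluster is left out.

```python
import json


def nested_sort(field, path, order="asc", mode=None, nested_filter=None):
    """Build one sort clause for a field that lives inside a nested object."""
    nested = {"path": path}
    if nested_filter is not None:
        nested["filter"] = nested_filter
    clause = {"order": order, "nested": nested}
    if mode is not None:
        clause["mode"] = mode
    return {field: clause}


body = {
    "query": {"match_all": {}},
    "sort": [nested_sort("comments.rating", "comments", order="desc", mode="avg")],
}
print(json.dumps(body, indent=2))
```

A match_all filter adds nothing, so the helper leaves filter out unless one is passed in.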

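Sorting by multiple nested fields

The sort array can hold several sort clauses, and each clause on a nested field needs its own nested_path (and nested_filter, if one is used). For example, to extend the request above so that documents are ordered first by average comment rating, descending, and then by the alphabetically first commenter name, ascending: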

http

GET /_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "comments.rating": {
        "order": "desc",
        "mode": "avg",
        "nested_path": "comments",
        "nested_filter": {
          "match_all": {}
        }
      }
    },
    {
      "comments.user": {
        "order": "asc",
        "mode": "min",
        "nested_path": "comments",
        "nested_filter": {
          "match_all": {}
        }
      }
    }
  ]
}
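
The effect of the two sort clauses can be sketched in plain Python: each document gets a tuple of sort keys, and ties on the first key are broken by the second. The documents and ids below are invented, and this is only a model of the ordering, not of how Elasticsearch evaluates it.

```python
docs = [
    {"id": "a", "comments": [{"user": "zoe", "rating": 5}, {"user": "bob", "rating": 3}]},
    {"id": "b", "comments": [{"user": "amy", "rating": 4}, {"user": "kim", "rating": 4}]},
    {"id": "c", "comments": [{"user": "dan", "rating": 5}]},
]


def sort_key(doc):
    ratings = [c["rating"] for c in doc["comments"]]
    users = [c["user"] for c in doc["comments"]]
    avg_rating = sum(ratings) / len(ratings)
    # Negate the average so a plain ascending sort yields descending ratings;
    # min(users) plays the role of "mode": "min" on comments.user.
    return (-avg_rating, min(users))


ordered_ids = [doc["id"] for doc in sorted(docs, key=sort_key)]
print(ordered_ids)  # ['c', 'b', 'a']
```

Documents a and b share an average of 4.0, so the second key decides: "amy" sorts before "bob", which puts b ahead of a.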

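Note that this orders the top-level documents by a value derived from their nested documents; it does not order the nested documents themselves. To sort the nested documents within each hit, return them with inner_hits and specify sort inside inner_hits.

Sorting the nested documents themselves

To return and sort nested documents, use a nested query and set inner_hits inside it. inner_hits takes an object that can contain sort, size, from, highlight, and similar options, which control which nested documents come back and in what order. For example, to return the three highest-rated comments of each document, sorted by rating in descending order: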

http

GET /_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match_all": {}
      },
      "inner_hits": {
        "sort": [
          {
            "comments.rating": {
              "order": "desc"
            }
          }
        ],
        "size": 3
      }
    }
  }
}
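
As a rough model of what inner_hits with sort and size returns for a single hit, the following Python sketch sorts one document's comments by rating and keeps the first three. The function name `top_comments` and the sample data are illustrative assumptions.

```python
def top_comments(doc, size=3):
    """Return up to `size` comments of one document, highest rating first."""
    ranked = sorted(doc["comments"], key=lambda c: c["rating"], reverse=True)
    return ranked[:size]


doc = {
    "comments": [
        {"user": "amy", "rating": 2},
        {"user": "bob", "rating": 5},
        {"user": "kim", "rating": 4},
        {"user": "zoe", "rating": 3},
    ]
}
print([c["user"] for c in top_comments(doc)])  # ['bob', 'kim', 'zoe']
```

The selection happens per document, which is why the order of the parent hits is unaffected.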

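Note that inner_hits filters and sorts the nested documents separately for each parent hit, not across the whole result set. To rank nested documents across all matching parents, use a top_hits aggregation (typically inside a nested aggregation) and specify sort there.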

script

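Scripts let you evaluate custom expressions in Elasticsearch. For example, a script can return a computed field value or calculate a custom score for a query. The script query is normally used in a filter context, where it keeps only the documents for which the script returns true. Scripts can slow searches down, so use them sparingly.

Using a script to compute a new field from existing field values

The script_fields parameter returns one or more script-computed fields with each hit.
A script can read field values through doc values, the _source field, or stored fields; each approach has its own trade-offs.
Painless is the default scripting language in Elasticsearch; it is secure, fast, and has a familiar Java-like syntax.

Example:

Suppose product documents have two fields, price and discount, and the search results should include a new field, final_price, computed from them. script_fields can do this as follows: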
http
```
GET /_search
{
    "query": {
        "match_all": {}
    },
    "script_fields": {
        "final_price": {
            "script": {
                "lang": "painless",
                "source": "doc['price'].value * (1 - doc['discount'].value)"
            }
        }
    }
}
```
This uses Painless and reads the field values through doc values.
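
The arithmetic in the script can be checked outside Elasticsearch with a few lines of Python. This sketch assumes discount is stored as a fraction between 0 and 1, and the handling of missing fields is an illustrative choice; in Painless, reading `.value` on a field with no value fails, so a real script would need a guard such as checking `doc['discount'].size() == 0` first.

```python
def final_price(doc):
    """Mirror the script: price * (1 - discount), tolerating missing fields."""
    price = doc.get("price")
    discount = doc.get("discount")
    if price is None:
        return None  # nothing to compute without a price
    if discount is None:
        return price  # assumption: no discount means the full price
    return price * (1 - discount)


print(final_price({"price": 100, "discount": 0.25}))  # 75.0
print(final_price({"price": 100}))                    # 100
```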

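A script can also filter documents through the script query, with constants passed in through params instead of being hard-coded in the source. (For date and time operations, recent versions of Painless expose the java.time API rather than Joda-Time.) For example: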

http

GET /_search
{
    "query": {
        "bool" : {
            "filter" : {
                "script" : {
                    "script" : {
                        "source": "(doc['num1'].value + doc['num2'].value) > params.param1",
                        "params" : {
                            "param1" : 5
                        }
                    }
                }
            }
        }
    }
}

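The script field defines the script's source and parameters; it can be given either as a string or as an object.

`doc-values` vs `_source` vs `stored_fields`

doc values are an on-disk, column-oriented structure that stores non-analyzed field values and speeds up sorting and aggregations. They are enabled by default for every field type except text. Reading doc values fetches just the fields that are needed, without loading the document's whole _source.
_source is a special field that stores the original JSON of the document. It is enabled by default and lets a search return all or part of the document. **Source filtering retrieves and returns only selected fields from _source**.
stored_fields are individual field values stored separately on disk, which avoids loading the whole _source. They are off by default; a field must be mapped with store set to true. Stored fields let a search fetch the specified field values directly from the index.

Example (taken from the official Elasticsearch documentation): suppose there is an index named blog with two fields, title and content. title is a keyword field mapped with store set to true, and content is a text field. The index can be defined like this: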

http

PUT /blog
{
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword",
        "store": true
      },
      "content": {
        "type": "text"
      }
    }
  }
}

Then index a few documents:

http

POST /blog/_bulk
{"index":{"_id":"1"}}
{"title":"Hello World","content":"This is my first blog post."}
{"index":{"_id":"2"}}
{"title":"Goodbye World","content":"This is my last blog post."}
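
The bulk body is newline-delimited JSON: an action line followed by a source line for each document. A minimal Python sketch of producing that body is shown below; `bulk_index_payload` is a hypothetical helper, and actually posting the payload to /blog/_bulk is left out.

```python
import json


def bulk_index_payload(docs):
    """Serialize {doc_id: source} pairs into a newline-delimited bulk body."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API expects every line, including the last one, to end with a newline.
    return "\n".join(lines) + "\n"


payload = bulk_index_payload({
    "1": {"title": "Hello World", "content": "This is my first blog post."},
    "2": {"title": "Goodbye World", "content": "This is my last blog post."},
})
print(payload, end="")
```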

script_fields can now read the field values in three ways:

Using doc values: this only works for non-analyzed fields, such as the keyword field title. For example:

http

GET /blog/_search
{
    "query": {
        "match_all": {}
    },
    "script_fields": {
        "doc_value_title": {
            "script": {
                "source": "doc['title'].value"
            }
        }
    }
}

Using _source: this works for any field, but the whole _source of the document has to be loaded. In a Painless script the source is exposed as a Map under params['_source'], so fields are read with bracket or dot notation (for example params['_source']['title']), and Map methods such as get() and containsKey() are available; see https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-api-reference-shared-java-lang.html#painless-api-reference-shared-Map for details on Map methods. The script below joins the title and content fields with a space in between, and for the first document returns something like "Hello World This is my first blog post.":

http

GET /blog/_search
{
    "query": {
        "match_all": {}
    },
    "script_fields": {
        "_source_title_and_content": {
            "script": {
                "source": "params['_source']['title'] + ' ' + params['_source']['content']"
            }
        }
    }
}

Using stored_fields: this only works for fields mapped with store set to true, such as title. In a Painless script, stored fields are read through params._fields['field_name'], and .value returns the stored value. The script below returns something like "Hello World" for the first document:

http

GET /blog/_search
{
    "query": {
        "match_all": {}
    },
    "script_fields": {
        "stored_field_title": {
            "script": {
                "source": "params._fields['title'].value"
            }
        }
    }
}

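Script-based sorting

Example:

To sort by the length of a string field, a script can compute the string length for each document and return it as the sort value. The field read through doc must have doc values, so it should be a keyword field rather than a text field.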

http

GET /_search
{
    "sort" : [
        {
            "_script" : {
                "type" : "number",
                "script" : {
                    "lang" : "painless",
                    "source" : "doc['field_name'].value.length()"
                },
                "order" : "desc"
            }
        }
    ]
}
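
The ordering this produces is the same as sorting strings by length in descending order, which is easy to check in Python. The sample values are made up.

```python
titles = ["Goodbye World", "Hi", "Hello World"]

# Longest string first, mirroring "order": "desc" on the script's numeric result.
by_length_desc = sorted(titles, key=len, reverse=True)
print(by_length_desc)  # ['Goodbye World', 'Hello World', 'Hi']
```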

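Ingest pipelines

Ingest pipelines preprocess documents before they are indexed, and can be used to clean, transform, or enrich data. They can be created and managed through the Kibana UI or the API. An ingest pipeline consists of a series of configurable tasks called processors. The processors run in order, each making a specific change to the incoming document, and the resulting document is then added to a data stream or index. A processor can also be given a condition based on the document's field values or metadata, so that it runs only for some documents.

Creating an ingest pipeline

In Kibana, go to Stack Management > Ingest Pipelines > Create pipeline > New pipeline, enter a name and description for the pipeline, add processors, and test the pipeline.

Through the API, send a PUT request to _ingest/pipeline/<pipeline_id> with the processors array in the request body. The example below creates an ingest pipeline named my-pipeline with two processors: one lowercases the message field, and the other splits the message field into an array of words.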

http

PUT _ingest/pipeline/my-pipeline
{
    "description" : "describe pipeline",
    "processors" : [
        {
            "lowercase" : {
                "field" : "message"
            }
        },
        {
            "split" : {
                "field" : "message",
                "separator": "\\s"
            }
        }
    ]
}
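
What these two processors do to a document can be modelled in a few lines of Python. This is an in-memory sketch, not the ingest implementation; the function names are chosen to match the processor names, and the separator is treated as a regular expression, as the split processor does.

```python
import re


def lowercase(doc, field):
    doc[field] = doc[field].lower()
    return doc


def split(doc, field, separator):
    doc[field] = re.split(separator, doc[field])
    return doc


def run_pipeline(doc):
    doc = dict(doc)  # work on a copy so the input stays untouched
    doc = lowercase(doc, "message")      # first processor
    doc = split(doc, "message", r"\s")   # second processor
    return doc


print(run_pipeline({"message": "Hello World"}))  # {'message': ['hello', 'world']}
```

Because the separator matches a single whitespace character, consecutive spaces would produce empty strings in the result; a pattern such as \s+ avoids that.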

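Using an ingest pipeline

There are several ways to use an ingest pipeline:

When indexing a document, pass the pipeline query parameter, for example PUT /my-index/_doc/1?pipeline=my-pipeline.

When creating or updating an index template, set the default_pipeline index setting, for example: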

http

PUT _index_template/my-template 
{ 
    "index_patterns": ["my-index-*"], 
    "template": { 
        "settings": { 
            "default_pipeline": "my-pipeline" 
        } 
    } 
}

With an Elastic Beat, set output.elasticsearch.pipeline in the <BEAT_NAME>.yml file, for example:
yaml
```
output.elasticsearch: 
    hosts: ["localhost:9200"] 
    pipeline: my-pipeline
```

Deleting an ingest pipeline

Through the API, send a DELETE request to _ingest/pipeline/<pipeline_id>, for example:
http
```
DELETE _ingest/pipeline/my-pipeline
```

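Note: if an index template or an Elastic Beat still refers to this pipeline, update or remove that reference first; otherwise indexing documents will fail.

Modifying an ingest pipeline (same as creating one)

Through the API, send a PUT request to _ingest/pipeline/<pipeline_id> with the new processors array in the request body, for example: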

http

PUT _ingest/pipeline/my-pipeline 
{ 
    "description" : "describe pipeline", 
    "processors" : [ ... ] 
}

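Note: modifying a pipeline affects every indexing operation that uses it, so test the modified pipeline first to confirm that it behaves as expected.

Testing an ingest pipeline

In Kibana, go to Stack Management > Ingest Pipelines > Create pipeline > New pipeline, enter the pipeline name and description, add processors, then click the Test pipeline button, enter test documents, and inspect the output.

Through the API, send a POST request to _ingest/pipeline/<pipeline_id>/_simulate with a docs array in the request body, where each doc contains a _source field, for example: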

http

POST _ingest/pipeline/my-pipeline/_simulate 
{ 
    "docs": [ 
        { 
            "_source": 
            { 
                "message": "Hello World" 
            } 
        } 
    ] 
}

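Commonly used processors in ingest pipelines

Processors run in order, each making a specific change to the incoming document. There are many processor types, covering operations such as adding, removing, or renaming fields, extracting or converting values, and detecting or parsing structured data. Several processors can be combined into a custom ingest pipeline as needed. Some commonly used processors:

append: append one or more values to an array field.
convert: convert a field's value to another data type.
date: parse a string field into a date field.
dissect: extract several sub-fields from a string field using a delimiter-based pattern.
drop: drop the whole document so that it is not indexed.
enrich: enrich the document with data defined by an enrich policy.
geoip: add geolocation information based on an IP address field.
grok: parse unstructured text with grok expressions and extract values.
json: parse a JSON string into an object and store it in a target field.
lowercase: convert a string field to lowercase.
remove: remove one or more fields.
rename: rename an existing field.
set: set a field to a value, creating the field if it does not exist.

In addition, the pipeline processor can call another ingest pipeline. (To send data from Beats through an Elasticsearch ingest pipeline, set the pipeline parameter in the Beats configuration file.)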
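
To illustrate how processors chain together and how a condition makes one of them optional, here is a small in-memory model in Python. The functions loosely imitate convert, set, rename, and remove; the field names, the condition, and the `PIPELINE` structure are all invented for the example and do not reflect how Elasticsearch implements processors.

```python
def convert(doc, field, to_type):
    doc[field] = to_type(doc[field])


def set_field(doc, field, value):
    doc[field] = value


def rename(doc, field, target_field):
    doc[target_field] = doc.pop(field)


def remove(doc, *fields):
    for name in fields:
        doc.pop(name, None)


# Each entry is (condition, processor); a condition of None means "always run".
PIPELINE = [
    (None, lambda d: convert(d, "status", int)),
    (lambda d: d["status"] >= 500, lambda d: set_field(d, "level", "error")),
    (None, lambda d: rename(d, "msg", "message")),
    (None, lambda d: remove(d, "debug")),
]


def run(doc):
    doc = dict(doc)  # work on a copy
    for condition, processor in PIPELINE:
        if condition is None or condition(doc):
            processor(doc)
    return doc


print(run({"status": "503", "msg": "upstream timeout", "debug": True}))
```

Order matters here: the condition on status only works because convert has already turned the string into an integer.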

_cat API

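The _cat APIs return cluster information as compact, column-aligned text. They are meant for people reading the output in the Kibana console or on the command line to inspect and monitor the state of the cluster. They are not intended for applications; to get the same information in an application, use the corresponding JSON API.

Commonly used _cat APIs

_cat/indices: high-level information about the indices in the cluster, including the backing indices of data streams.
_cat/nodes: information about the nodes in the cluster.
_cat/health: the health status of the cluster.
_cat/allocation: how shards are allocated across the cluster.
_cat/shards: detailed information about the shards in the cluster.
_cat/master: information about the current master node.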
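
For quick ad-hoc inspection, the column-aligned text can be split into rows by whitespace. The Python sketch below parses a made-up sample of what a call such as GET _cat/indices?v&h=health,status,index,docs.count might print; `parse_cat_table` is a hypothetical helper, and the approach breaks if a cell is empty or contains spaces, which is one more reason to prefer the JSON APIs in real applications.

```python
def parse_cat_table(text):
    """Turn column-aligned _cat output (requested with ?v for a header row) into dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Hypothetical output; real index names and counts will differ.
sample = """\
health status index docs.count
green  open   blog           2
yellow open   logs        1500
"""

rows = parse_cat_table(sample)
print(rows[0]["index"], rows[0]["docs.count"])  # blog 2
```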
