Elasticsearch tokenization
Edited on 2022-11-21 21:49:36
Installing the Chinese and pinyin analyzers
https://github.com/medcl/elasticsearch-analysis-ik
https://github.com/medcl/elasticsearch-analysis-pinyin
Download the plugin release that matches your Elasticsearch version, unzip it, and move it into the plugins directory:
root@57d58faf9b1e:/usr/share/elasticsearch/plugins# ls
ik  pinyin
Restart Elasticsearch for the plugins to take effect.
Testing
Default (standard) analyzer
curl -H "Content-Type: application/json" -XPOST 'localhost:9200/_analyze?pretty' -d'
{
  "analyzer": "standard",
  "text":"22强烈推荐11"
}'
IK Chinese analyzer
curl -H "Content-Type: application/json" -XPOST 'localhost:9200/_analyze?pretty' -d'
{
  "analyzer": "ik_max_word",
  "text":"22强烈推荐11"
}'
Pinyin analyzer
curl -H "Content-Type: application/json" -XPOST 'localhost:9200/_analyze?pretty' -d'
{
  "analyzer": "pinyin",
  "text":"22强烈推荐11"
}'
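Create an index named article with the following settings and mappings. Note that the mapping-level boost parameter used on title.pinyin is deprecated in Elasticsearch 7.x and was removed in 8.0; on 8.x, drop it and boost at query time instead (for example, by listing the field as title.pinyin^10 in the query).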
{
  "settings": {
    "index":{
      "number_of_shards": "1",
      "number_of_replicas": "0",
      "analysis" : {
        "analyzer" : {
          "default" : {
            "tokenizer" : "ik_max_word"
          },
          "pinyin_analyzer" : {
            "tokenizer" : "my_pinyin"
          }
        },
        "tokenizer" : {
          "my_pinyin" : {
            "keep_separate_first_letter" : "false",
            "lowercase" : "true",
            "type" : "pinyin",
            "limit_first_letter_length" : "16",
            "keep_original" : "true",
            "keep_full_pinyin" : "true"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer":"ik_max_word",
        "fields" : {
          "pinyin" : {
            "type" : "text",
            "term_vector" : "with_positions_offsets",
            "analyzer" : "pinyin_analyzer",
            "boost" : 10.0
          }
        }
      },
      "content": {
        "type": "text",
        "analyzer":"ik_max_word"
      },
      "create_time": {
        "type": "long"
      },
      "id": {
        "type": "long"
      },
      "update_time": {
        "type": "long"
      }
    }
  }
}
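The post does not show the request that actually creates the index. A minimal sketch, assuming the JSON above has been loaded into a Python dict and that Elasticsearch listens on localhost:9200 as in the curl examples; the function name build_create_index_request is hypothetical, and the request is only built here, not sent:

```python
import json
import urllib.request


def build_create_index_request(index_body, index="article",
                               base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request that creates the index.

    index_body is the settings/mappings document as a Python dict.
    base_url is an assumption taken from the curl examples in the post.
    """
    data = json.dumps(index_body).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=data,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

To send it, pass the returned object to urllib.request.urlopen; a file such as article.json (a hypothetical name) can be read with json.load first.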
PHP
Once the data has been imported, you can test a search:
    public function search($keyword, $page=1, $max=10) {
        $params = [
            'index' => 'article',
            'body' => [
                'query' => ['multi_match' => ['query' => $keyword, 'fields'=>['title', 'title.pinyin', 'content']]],
                '_source'=>['id', 'title', 'content', 'create_time'],
                'highlight'=>['fields'=>['title'=>new \stdClass(), 'content'=>new \stdClass()]],
                "sort"=>['_doc'],
                'from'=>($page-1)*$max, // from and size work like LIMIT offset, count in SQL
                'size'=>$max,
            ]
        ];
        return $this->cache()->search($params);
    }
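The same request body, including the from/size pagination arithmetic, can be sketched in Python for readers who do not use PHP. This is an illustration under assumptions, not the author's code: the function name build_search_body is hypothetical, and the input check is an addition.

```python
def build_search_body(keyword, page=1, size=10):
    """Return the search body used in the PHP example as a plain dict.

    Pages are 1-based, so page 1 starts at offset 0.
    """
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                # Search the IK-analyzed fields and the pinyin sub-field.
                "fields": ["title", "title.pinyin", "content"],
            }
        },
        "_source": ["id", "title", "content", "create_time"],
        # Empty objects ask for default highlighting on each field.
        "highlight": {"fields": {"title": {}, "content": {}}},
        "sort": ["_doc"],
        "from": (page - 1) * size,  # offset of the first hit
        "size": size,               # number of hits per page
    }
```

For example, page 3 with 10 hits per page skips the first 20 hits.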
Advanced
Custom dictionary
# Add my.dic to the IK config directory
echo '朝阳公园'>./elasticsearch/plugins/ik/config/my.dic
# Register the custom dictionary
vi ./elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml
...
<entry key="ext_dict">my.dic</entry>
...
# Finally, restart Elasticsearch
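When a dictionary grows beyond a single echo, it helps to generate the file with a script. IK dictionaries are UTF-8 text with one word per line. A minimal sketch, assuming Python; the function name write_dictionary and the cleanup rules (trimming whitespace, dropping blanks and duplicates) are assumptions for illustration:

```python
def write_dictionary(path, words):
    """Write words to path, one per line, as UTF-8 with LF line endings.

    Blank entries and duplicates are dropped; first-seen order is kept.
    Returns the list of words actually written.
    """
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    # newline="\n" prevents line-ending translation on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for word in cleaned:
            f.write(word + "\n")
    return cleaned
```

Calling write_dictionary with the path to my.dic and a list of words produces a file that can be referenced from the ext_dict entry shown above.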
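The configuration file also has an entry for an extension stopword dictionary (ext_stopwords), which is used to help with segmentation. Here are the first few lines of one of the bundled extension stopword dictionaries: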
$ head -n 5 extra_stopword.dic
也
了
仍
从
以
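In other words, when the IK analyzer encounters one of these words, it assumes that the preceding text does not combine with it to form a longer word.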
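IK also supports remote dictionaries, whose advantage is that they can be hot-reloaded without a restart. The format is the same as for local dictionaries: one word per line, with \n as the line separator. The configured URL must also satisfy the following requirement:
The HTTP response must include two headers, Last-Modified and ETag. Both are strings, and whenever either one changes, the plugin fetches the new word list and updates its dictionary.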
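The header requirement above can be met by almost any static file server. As a minimal sketch, assuming Python's standard http.server module, the handler below serves one dictionary file and sends both headers on HEAD and GET. Deriving ETag from a SHA-256 hash of the content and Last-Modified from the file's modification time is an assumption for illustration, and the names make_handler and start_server are hypothetical:

```python
import hashlib
import os
import threading
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(dict_path):
    """Build a handler class that serves a single dictionary file."""

    class DictHandler(BaseHTTPRequestHandler):
        def _send_headers(self):
            with open(dict_path, "rb") as f:
                body = f.read()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            # Last-Modified tracks the file's mtime, ETag tracks its content,
            # so editing the file changes at least one of the two.
            self.send_header(
                "Last-Modified",
                formatdate(os.path.getmtime(dict_path), usegmt=True),
            )
            self.send_header("ETag", '"%s"' % hashlib.sha256(body).hexdigest())
            self.end_headers()
            return body

        def do_HEAD(self):
            self._send_headers()  # headers only, no body

        def do_GET(self):
            self.wfile.write(self._send_headers())

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return DictHandler


def start_server(dict_path, host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), make_handler(dict_path))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The resulting URL would then go into the remote_ext_dict entry of IKAnalyzer.cfg.xml. In production, a regular web server such as nginx serving a static file typically provides both headers already.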
