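Reposted from: https://zhuanlan.zhihu.com/p/29183128
Introduction: Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine behind a RESTful web interface. Elasticsearch is written in Java and released as open source under the Apache License, and it is one of the most popular enterprise search engines today. It was designed for cloud environments and aims to deliver real-time search that is stable, reliable, fast, and easy to install and use.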

Elasticsearch ships with many built-in analyzers. The sections below compare the built-in analyzers with commonly used Chinese analyzers.

Built-in analyzers:

1. standard analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html

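How to use: http://www.yiibai.com/lucene/lucene_standardanalyzer.html

Its handling of English matches StopAnalyzer, and it supports Chinese by splitting text into single characters. It lowercases tokens and removes stop words and punctuation. (This describes the Lucene 3.6 StandardAnalyzer used below; the Elasticsearch standard analyzer leaves stop-word removal disabled by default.)

/** StandardAnalyzer */
public void standardAnalyzer(String msg){
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    this.getTokens(analyzer, msg);
}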

2. simple analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html

How to use: http://www.yiibai.com/lucene/lucene_simpleanalyzer.html

More capable than WhitespaceAnalyzer: it first splits the text on non-letter characters and then lowercases every token. As a result it discards numeric characters.

/** SimpleAnalyzer */
    public void simpleAnalyzer(String msg){
        SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }
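
To make the "split on non-letters, then lowercase" rule concrete, here is a minimal plain-JDK sketch. It is not Lucene's implementation; the class name SimpleAnalyzerSketch and the method tokenize are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleAnalyzerSketch {
    // Splits on any non-letter character and lowercases each token,
    // imitating the behaviour described for SimpleAnalyzer (not Lucene's code).
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter (digit, space, punctuation) closes the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, World 123 ES5"));
    }
}
```

Digits never reach a token here, which is why "123" disappears and "ES5" is reduced to "es".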

3. whitespace analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html

How to use: http://www.yiibai.com/lucene/lucene_whitespaceanalyzer.html

Splits on whitespace only. It does not lowercase characters, does not support Chinese, and applies no other normalization to the tokens it produces.

/** WhitespaceAnalyzer */
    public void whitespaceAnalyzer(String msg){
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

4. stop analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html

How to use: http://www.yiibai.com/lucene/lucene_stopanalyzer.html

StopAnalyzer goes beyond SimpleAnalyzer: on top of SimpleAnalyzer's behavior it removes common English words (such as "the" and "a"), and you can supply your own stop-word list as needed. It does not support Chinese.

/** StopAnalyzer */
   public void stopAnalyzer(String msg){
       StopAnalyzer analyzer = new StopAnalyzer(Version.LUCENE_36);
       this.getTokens(analyzer, msg);
   }

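5. keyword analyzer

KeywordAnalyzer treats the entire input as a single token, which is convenient for indexing and retrieving special kinds of text. It works well for building index terms from postal codes, addresses, and similar fields.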

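6. pattern analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html

A pattern analyzer splits text into terms using a regular expression (terms are what remains after the token filters have run). It accepts the following settings:

Setting      Description
lowercase    Whether terms are lowercased. Defaults to true.
pattern      The regular expression pattern. Defaults to \W+.
flags        Regular expression flags.
stopwords    List of stop words used to initialize the stop filter. Defaults to an empty list.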
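
As a rough illustration of the default settings (pattern \W+ with lowercasing on), the following plain-JDK sketch splits a string with java.util.regex. The class and method names are hypothetical, and it ignores the flags and stopwords settings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class PatternAnalyzerSketch {
    // Default separator from the settings table: one or more non-word characters.
    private static final Pattern SPLIT = Pattern.compile("\\W+");

    public static List<String> tokenize(String text, boolean lowercase) {
        List<String> tokens = new ArrayList<>();
        for (String part : SPLIT.split(text)) {
            if (part.isEmpty()) {
                continue; // a leading separator yields an empty first element
            }
            tokens.add(lowercase ? part.toLowerCase(Locale.ROOT) : part);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Foo_bar baz-QUX, 42!", true));
    }
}
```

Note that in Java, \W without the UNICODE_CHARACTER_CLASS flag treats Chinese characters as separators, so this default pattern would drop Chinese text entirely in the sketch.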

7. language analyzers

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html

A collection of analyzers for text in specific languages (arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai). Unfortunately there is no dedicated Chinese analyzer, so they are not considered here.

8. snowball analyzer

A snowball analyzer consists of the standard tokenizer followed by four filters: standard filter, lowercase filter, stop filter, and snowball filter.

Using the snowball analyzer is generally discouraged in Lucene.

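9. custom analyzer

A user-defined analyzer. It combines one tokenizer with zero or more token filters and zero or more char filters. The name of a custom analyzer must not start with "_".

The following are settings that can be set for a custom analyzer type:

Setting                   Description
tokenizer                 Logical or registered name of the tokenizer.
filter                    Logical or registered names of the token filters.
char_filter               Logical or registered names of the character filters.
position_increment_gap    Position gap inserted between the values of a multi-valued field, which keeps phrase and proximity queries from matching across values. Defaults to 100.

Example template: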

index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myTokenFilter2]
                char_filter : [my_html]
                position_increment_gap: 256
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
        char_filter :
              my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
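
The template wires three stages together: a char filter, a tokenizer, and a list of token filters. The plain-JDK sketch below imitates that order of operations under simplifying assumptions: a regex stands in for html_strip, whitespace splitting stands in for the standard tokenizer, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CustomAnalyzerSketch {
    // Char filter stage: replace anything that looks like an HTML tag with a space.
    static String stripHtml(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    // Tokenizer stage: whitespace splitting stands in for the standard tokenizer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Token filter stages: drop stop words, then drop tokens outside [min, max] length.
    static List<String> analyze(String text, Set<String> stopwords, int min, int max) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(stripHtml(text))) {
            if (stopwords.contains(token)) {
                continue;
            }
            if (token.length() < min || token.length() > max) {
                continue;
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("stop1", "stop2"));
        System.out.println(analyze("<p>stop1 keeps <b>useful</b> words</p>", stop, 0, 5));
    }
}
```

The order matters: the char filter sees raw text, the tokenizer sees filtered text, and each token filter sees the output of the previous stage.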

10. fingerprint analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-analyzer.html

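Chinese analyzers:

1. ik-analyzer

https://github.com/wks/ik-analyzer

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java.

It uses its own "forward iterative finest-grained segmentation algorithm" and supports two modes, fine-grained and maximum word length. Its stated throughput is 830,000 characters per second (1600 KB/s).

It uses a multi-sub-processor analysis design that handles English letters, digits, and Chinese words, and it is compatible with Korean and Japanese characters.

Dictionary storage is optimized for a smaller memory footprint, and user dictionary extensions are supported.

IKQueryParser is a query analyzer optimized for Lucene full-text search (strongly recommended by its author). It introduces simple search expressions and uses a disambiguation algorithm to optimize how query keywords are combined, which can substantially raise Lucene's hit rate.

Maven usage: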

<dependency>
    <groupId>org.wltea.ik-analyzer</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>3.2.8</version>
</dependency>

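Until IK Analyzer is published to the Maven Central Repository, you need to install it manually, either into your local repository or onto your own Maven repository server.

To install it into the local Maven repository, run the following command, which compiles, packages, and installs in one step: mvn install -Dmaven.test.skip=true

Adding Chinese segmentation to Elasticsearch

Install the IK analysis plugin

https://github.com/medcl/elasticsearch-analysis-ik

Enter the elasticsearch-analysis-ik-master directory

2. How to install Chinese analyzers (IK + pinyin) in Elasticsearch: http://www.cnblogs.com/xing901022/p/5910139.html

3. Configuring and using the IK Chinese analyzer in Elasticsearch: http://blog.csdn.net/jam00/article/details/52983056

IK provides two analyzers:

ik_max_word: splits the text at the finest granularity, producing as many words as possible.

ik_smart: splits the text at the coarsest granularity; characters already assigned to one word are not reused by another.

Difference: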

# ik_max_word

curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '聯想是全球最大的筆記本廠商'
# Response

{
  "tokens" : [
    {
      "token" : "聯想",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "全球",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "最大",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "筆記本",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "筆記",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "本廠",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "廠商",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 8
    }
  ]
}

# ik_smart

curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_smart' -d '聯想是全球最大的筆記本廠商'

# Response

{
  "tokens" : [
    {
      "token" : "聯想",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "全球",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "最大",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "筆記本",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "廠商",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
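
The two outputs above differ only in whether overlapping words are kept. The sketch below reproduces that difference with a tiny hypothetical dictionary. It is not IK's algorithm: the fine-grained mode here simply emits every dictionary word found at each position, and greedy forward longest match is swapped in for the coarse mode. The class name, method names, and dictionary contents are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IkModesSketch {
    // Hypothetical mini dictionary; real IK dictionaries are far larger.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "聯想", "全球", "最大", "筆記本", "筆記", "本廠", "廠商"));
    static final int MAX_WORD_LEN = 3;

    // Fine-grained: emit every dictionary word starting at each position (longest first).
    // A character that no emitted word covers is emitted on its own.
    static List<String> maxWord(String text) {
        List<String> out = new ArrayList<>();
        int covered = 0; // exclusive end of the furthest word emitted so far
        for (int i = 0; i < text.length(); i++) {
            boolean found = false;
            for (int len = Math.min(MAX_WORD_LEN, text.length() - i); len >= 2; len--) {
                String candidate = text.substring(i, i + len);
                if (DICT.contains(candidate)) {
                    out.add(candidate);
                    covered = Math.max(covered, i + len);
                    found = true;
                }
            }
            if (!found && i >= covered) {
                out.add(text.substring(i, i + 1));
                covered = i + 1;
            }
        }
        return out;
    }

    // Coarse: greedy forward longest match; every character lands in exactly one token.
    static List<String> smart(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(MAX_WORD_LEN, text.length() - i);
            while (len > 1 && !DICT.contains(text.substring(i, i + len))) {
                len--;
            }
            out.add(text.substring(i, i + len));
            i += len;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "聯想是全球最大的筆記本廠商";
        System.out.println(maxWord(text));
        System.out.println(smart(text));
    }
}
```

With this dictionary, the fine-grained mode keeps the overlapping words 筆記, 本廠, and 廠商 alongside 筆記本, while the coarse mode keeps only 筆記本 and 廠商.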

Next, create an index named iktest that uses ik: define an analyzer named ik whose tokenizer is ik_max_word, and create a type named article with a subject field that uses the ik_max_word analyzer.

curl -XPUT 'http://localhost:9200/iktest?pretty' -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "ik" : {
                    "tokenizer" : "ik_max_word"
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "subject" : {
                    "type" : "string",
                    "analyzer" : "ik_max_word"
                }
            }
        }
    }
}'

Bulk-index a few documents. I set the _id metadata explicitly so the documents are easy to inspect; the subject values are a few news headlines picked at random.

curl -XPOST 'http://localhost:9200/iktest/article/_bulk?pretty' -d '
{ "index" : { "_id" : "1" } }
{"subject" : "＂閨蜜＂崔順實被韓檢方傳喚 韓總統府促徹查真相" }
{ "index" : { "_id" : "2" } }
{"subject" : "韓舉行＂護國訓練＂ 青瓦臺:決不許國家安全出問題" }
{ "index" : { "_id" : "3" } }
{"subject" : "媒體稱FBI已經取得搜查令 檢視希拉里電郵" }
{ "index" : { "_id" : "4" } }
{"subject" : "村上春樹獲安徒生獎 演講中談及歐洲排外問題" }
{ "index" : { "_id" : "5" } }
{"subject" : "希拉里團隊炮轟FBI 參院民主黨領袖批其“違法”" }
'

Search for "希拉里和韓國" (Hillary and South Korea):

curl -XPOST 'http://localhost:9200/iktest/article/_search?pretty' -d'
{
    "query" : { "match" : { "subject" : "希拉里和韓國" }},
    "highlight" : {
        "pre_tags" : ["<font color='red'>"],
        "post_tags" : ["</font>"],
        "fields" : {
            "subject" : {}
        }
    }
}
'
# Response
{
  "took" : 113,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.034062363,
    "hits" : [ {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "2",
      "_score" : 0.034062363,
      "_source" : {
        "subject" : "韓舉行＂護國訓練＂ 青瓦臺:決不許國家安全出問題"
      },
      "highlight" : {
        "subject" : [ "<font color=red>韓</font>舉行＂護<font color=red>國</font>訓練＂ 青瓦臺:決不許國家安全出問題" ]
      }
    }, {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "3",
      "_score" : 0.0076681254,
      "_source" : {
        "subject" : "媒體稱FBI已經取得搜查令 檢視希拉里電郵"
      },
      "highlight" : {
        "subject" : [ "媒體稱FBI已經取得搜查令 檢視<font color=red>希拉里</font>電郵" ]
      }
    }, {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "5",
      "_score" : 0.006709609,
      "_source" : {
        "subject" : "希拉里團隊炮轟FBI 參院民主黨領袖批其“違法”"
      },
      "highlight" : {
        "subject" : [ "<font color=red>希拉里</font>團隊炮轟FBI 參院民主黨領袖批其“違法”" ]
      }
    }, {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "1",
      "_score" : 0.0021509775,
      "_source" : {
        "subject" : "＂閨蜜＂崔順實被韓檢方傳喚 韓總統府促徹查真相"
      },
      "highlight" : {
        "subject" : [ "＂閨蜜＂崔順實被<font color=red>韓</font>檢方傳喚 <font color=red>韓</font>總統府促徹查真相" ]
      }
    } ]
  }
}

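The highlight property is used here so the result can be rendered directly in HTML, with the matched characters or words shown in red. To run an exact-term (filter-style) search instead, change match to term.

Hot-word update configuration

Internet slang changes daily. How can newly trending words (or other specific terms) be picked up by search in near real time?

First, test with ik: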

curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '
成龍原名陳港生
'
# Response
{
  "tokens" : [ {
    "token" : "成龍",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "原名",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "陳",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_CHAR",
    "position" : 2
  }, {
    "token" : "港",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "生",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "CN_CHAR",
    "position" : 4
  } ]
}

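The main IK dictionary does not contain the word "陳港生", so it was split apart. Let's configure it.

Edit the IK configuration file: {ES directory}/plugins/ik/config/ik/IKAnalyzer.cfg.xml

Change it as follows: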

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Configure your own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Configure remote extension dictionaries here -->
    <entry key="remote_ext_dict">http://192.168.1.136/hotWords.php</entry>
    <!-- Configure remote extension stop-word dictionaries here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

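I use the remote extension dictionary here because other programs can update it and no ES restart is needed, which is convenient. A custom mydict.dic dictionary is also easy to use: one word per line, added by hand.

Because it is a remote dictionary, it must be a reachable URL. It can be a page or a txt document, but the output must be UTF-8 encoded.

Contents of hotWords.php: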

$s = <<<'EOF'
陳港生
元樓
藍瘦
EOF;
header('Last-Modified: '.gmdate('D, d M Y H:i:s', time()).' GMT', true, 200);
header('ETag: "5816f349-19"');
echo $s;

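ik reads two response headers, Last-Modified and ETag; a change in either one triggers an update, and ik polls once per minute. Restart Elasticsearch and check the startup log: the three words should have been loaded.

Run the request above again and the response shows that the ik analyzer now matches "陳港生". In the same way, proper names specific to our company (for example 永輝, 永輝超市, 永輝云創, 云創, ...) can be added to the dictionary by hand.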
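
The "reload when either header changes" rule can be sketched in a few lines of plain Java. This is an illustration of the rule as described above, not IK's actual polling code; the class RemoteDictWatcherSketch and its method needsReload are hypothetical, and the HTTP request itself is left out.

```java
import java.util.Objects;

public class RemoteDictWatcherSketch {
    // Header values seen on the previous poll (null before the first poll).
    private String lastModified;
    private String etag;

    // Returns true when either header differs from the previous poll,
    // meaning the remote dictionary should be downloaded and reloaded.
    public boolean needsReload(String newLastModified, String newEtag) {
        boolean changed = !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(etag, newEtag);
        lastModified = newLastModified;
        etag = newEtag;
        return changed;
    }

    public static void main(String[] args) {
        RemoteDictWatcherSketch watcher = new RemoteDictWatcherSketch();
        // First poll: nothing seen before, so a reload is needed.
        System.out.println(watcher.needsReload("Mon, 31 Oct 2016 07:30:00 GMT", "\"5816f349-19\""));
        // Same headers again: no reload.
        System.out.println(watcher.needsReload("Mon, 31 Oct 2016 07:30:00 GMT", "\"5816f349-19\""));
        // ETag changed: reload.
        System.out.println(watcher.needsReload("Mon, 31 Oct 2016 07:30:00 GMT", "\"5816f349-20\""));
    }
}
```

One consequence worth noticing: the hotWords.php script above sends the current time as Last-Modified on every request, so under this rule the dictionary would look changed on every poll. Sending a stable timestamp (for example the dictionary's last edit time) would avoid the needless reloads.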

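2. jieba Chinese segmentation

Features:

1. Supports three segmentation modes:

Precise mode: tries to cut the sentence as accurately as possible; suited to text analysis.
Full mode: scans out every word that can be formed from the sentence; very fast, but it cannot resolve ambiguity.
Search-engine mode: building on precise mode, splits long words again to improve recall; suited to search-engine segmentation.

2. Supports Traditional Chinese segmentation

3. Supports custom dictionaries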
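
Search-engine mode can be pictured as a post-processing step over precise-mode output: each long word is kept, and any shorter dictionary words found inside it are emitted as well. The sketch below illustrates that idea in plain Java. It is not jieba's code (jieba is a Python library), and the class name, method name, and mini dictionary are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SearchModeSketch {
    // Hypothetical mini dictionary of shorter words.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "中國", "科學", "學院", "科學院", "計算"));

    // For each precise-mode word, first emit the 2-character and then the
    // 3-character dictionary words contained in it, then the word itself.
    static List<String> forSearch(List<String> preciseWords) {
        List<String> out = new ArrayList<>();
        for (String word : preciseWords) {
            for (int n = 2; n <= 3; n++) {
                if (word.length() > n) {
                    for (int i = 0; i + n <= word.length(); i++) {
                        String gram = word.substring(i, i + n);
                        if (DICT.contains(gram)) {
                            out.add(gram);
                        }
                    }
                }
            }
            out.add(word);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(forSearch(Arrays.asList("中國科學院", "計算所")));
    }
}
```

Emitting the sub-words raises recall: a query for 科學院 can now match a document that precise mode would have indexed only as 中國科學院.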

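3. THULAC

THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University. It provides Chinese word segmentation and part-of-speech tagging. THULAC has the following characteristics:

Strong capability. It is trained on what its developers describe as the world's largest manually segmented and POS-tagged Chinese corpus (about 58 million characters), which gives the model strong tagging ability.

High accuracy. On the standard Chinese Treebank (CTB5) dataset it reaches an F1 of 97.3% for segmentation and 92.9% for POS tagging, on par with the best methods on that dataset.

Fairly fast. Joint segmentation and POS tagging runs at 300 KB/s, about 150,000 characters per second. Segmentation alone reaches 1.3 MB/s.

Release of the thulac4j Chinese segmentation tool

1. Normalizes the segmentation dictionary and removes some useless words;

2. Rewrites the construction algorithm for the DAT (double-array trie); the generated DAT is about 8% smaller, which saves memory;

3. Optimizes the segmentation algorithm to improve segmentation speed.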

<dependency>
  <groupId>io.github.yizhiru</groupId>
  <artifactId>thulac4j</artifactId>
  <version>${thulac4j.version}</version>
</dependency>

http://www.cnblogs.com/en-heng/p/6526598.html

thulac4j supports two segmentation modes:

SegOnly mode: segmentation only, without POS tagging;

SegPos mode: segmentation with POS tagging.

// SegOnly mode
String sentence = "滔滔的流水，向著波士頓灣無聲逝去";
SegOnly seg = new SegOnly("models/seg_only.bin");
System.out.println(seg.segment(sentence));
// [滔滔, 的, 流水, ，, 向著, 波士頓灣, 無聲, 逝去]

// SegPos mode
SegPos pos = new SegPos("models/seg_pos.bin");
System.out.println(pos.segment(sentence));
//[滔滔/a, 的/u, 流水/n, ，/w, 向著/p, 波士頓灣/ns, 無聲/v, 逝去/v]

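4. NLPIR

NLPIR from the Institute of Computing Technology, Chinese Academy of Sciences: http://ictclas.nlpir.org/nlpir/ (analyzes Chinese text online)

Download: https://github.com/NLPIR-team/NLPIR

A simple Java tutorial for the CAS segmentation system (NLPIR): http://www.cnblogs.com/wukongjiuwo/p/4092480.html

5. ansj analyzer

https://github.com/NLPchina/ansj_seg

A Java implementation of Chinese segmentation based on n-gram + CRF + HMM.

Segmentation speed is roughly 2 million characters per second (measured on a MacBook Air), with accuracy above 96%.

It currently implements Chinese segmentation, Chinese name recognition, user-defined dictionaries, keyword extraction, automatic summarization, and keyword tagging. It can be applied to natural language processing work and suits projects that demand high segmentation quality.

Maven dependency: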

<dependency>
            <groupId>org.ansj</groupId>
            <artifactId>ansj_seg</artifactId>
            <version>5.1.1</version>
</dependency>

Demo:

String str = "歡迎使用ansj_seg,(ansj中文分詞)在這里如果你遇到什么問題都可以聯系我.我一定盡我所能.幫助大家.ansj_seg更快,更準,更自由!" ;
System.out.println(ToAnalysis.parse(str));

歡迎/v,使用/v,ansj/en,_,seg/en,,,(,ansj/en,中文/nz,分詞/n,),在/p,這里/r,如果/c,你/r,遇到/v,什么/r,問題/n,都/d,可以/v,聯系/v,我/r,./m,我/r,一定/d,盡我所能/l,./m,幫助/v,大家/r,./m,ansj/en,_,seg/en,更快/d,,,更/d,準/a,,,更/d,自由/a,!

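6. LTP from Harbin Institute of Technology

https://github.com/HIT-SCIR/ltp

LTP defines an XML-based representation for language processing results. On top of it, LTP provides a rich and efficient bottom-up set of Chinese language processing modules (six core technologies covering lexical, syntactic, and semantic analysis), application programming interfaces based on dynamic link libraries (DLLs), and visualization tools. It can also be used as a web service.

For LTP usage, see: http://ltp.readthedocs.io/zh_CN/latest/

7. Paoding (庖丁解牛)

Download: http://pan.baidu.com/s/1eQ88SZS

Usage takes the following steps:

Configure the dic files: edit paoding-dic-home.properties inside paoding-analysis.jar, uncomment "#paoding.dic.home=dic", and set it to the local path of your dic files, e.g. /home/hadoop/work/paoding-analysis-2.0.4-beta/dic
Add the JARs to your project: import paoding-analysis.jar, commons-logging.jar, lucene-analyzers-2.2.0.jar, and lucene-core-2.2.0.jar. You can then use Paoding's Chinese segmentation in code, for example: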

Analyzer analyzer = new PaodingAnalyzer(); // create the analyzer
String text = "庖丁系統是個完全基于lucene的中文分詞系統，它就是重新建了一個analyzer，叫做PaodingAnalyzer，這個analyer的核心任務就是生成一個可以切詞TokenStream。"; // text to segment
TokenStream tokenStream = analyzer.tokenStream(text, new StringReader(text)); // obtain the token stream
try {
    Token t;
    while ((t = tokenStream.next()) != null)
    {
        System.out.println(t); // print each token
    }
} catch (IOException e) {
    e.printStackTrace();
}

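8. Sogou online segmentation

Sogou's online segmentation uses a character-tagging approach, built mainly on a linear-chain CRF model. The part-of-speech tagging module is based mainly on a structured linear model.

Available online at: http://www.sogou.com/labs/webservice/

9. word segmentation

Address: https://github.com/ysc/word

word is a distributed Chinese segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an n-gram model to resolve ambiguity. It accurately recognizes English, digits, and quantity expressions such as dates and times, and it can recognize out-of-vocabulary words such as person, place, and organization names. Its behavior can be changed through custom configuration files. It supports user dictionaries, automatic detection of dictionary changes, large-scale distributed environments, flexible selection among segmentation algorithms, and a refine feature for fine control over segmentation results, along with word-frequency statistics, POS tagging, synonym tagging, antonym tagging, and pinyin tagging. It offers 10 segmentation algorithms and 10 text-similarity algorithms, and it integrates seamlessly with Lucene, Solr, Elasticsearch, and Luke. Note: word 1.3 requires JDK 1.8.

Maven dependency: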

<dependencies>
    <dependency>
        <groupId>org.apdplat</groupId>
        <artifactId>word</artifactId>
        <version>1.3</version>
    </dependency>
</dependencies>

Elasticsearch plugin:

1. Open a command line and change to the Elasticsearch bin directory:
cd elasticsearch-2.1.1/bin

2. Run the plugin script to install the word segmentation plugin:
./plugin install http://apdplat.org/word/archive/v1.4.zip

Notes on installation:
    If you see:
        ERROR: failed to download
    or
        Failed to install word, reason: failed to download
    or
        ERROR: incorrect hash (SHA1)
    run the command again; if it still fails, retry a couple more times.

For the Elasticsearch 1.x series, use this command instead:
./plugin -u http://apdplat.org/word/archive/v1.3.1.zip -i word

3. Edit elasticsearch-2.1.1/config/elasticsearch.yml and add:
index.analysis.analyzer.default.type : "word"
index.analysis.tokenizer.default.type : "word"

4. Start Elasticsearch and test the result by opening this URL in Chrome:
http://localhost:9200/_analyze?analyzer=word&text=楊尚川是APDPlat應用級產品開發平臺的作者

5. Custom configuration
Edit the configuration file elasticsearch-2.1.1/plugins/word/word.local.conf

6. Choose a segmentation algorithm
Edit elasticsearch-2.1.1/config/elasticsearch.yml and add:
index.analysis.analyzer.default.segAlgorithm : "ReverseMinimumMatching"
index.analysis.tokenizer.default.segAlgorithm : "ReverseMinimumMatching"

Valid values for segAlgorithm:
Forward maximum matching: MaximumMatching
Reverse maximum matching: ReverseMaximumMatching
Forward minimum matching: MinimumMatching
Reverse minimum matching: ReverseMinimumMatching
Bidirectional maximum matching: BidirectionalMaximumMatching
Bidirectional minimum matching: BidirectionalMinimumMatching
Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching
Full segmentation: FullSegmentation
Minimal word count: MinimalWordCount
Maximum n-gram score: MaxNgramScore
If unspecified, the default is bidirectional maximum matching: BidirectionalMaximumMatching
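
To show what one of these algorithm names means in practice, here is a minimal plain-JDK sketch of reverse maximum matching. It illustrates the general technique only and is not the word library's implementation; the class name, the tiny dictionary, and the maximum word length are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class ReverseMaximumMatchingSketch {
    // Hypothetical mini dictionary.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "研究", "研究生", "生命", "起源"));
    static final int MAX_WORD_LEN = 3;

    // Scans from the end of the text, each time taking the longest dictionary
    // word that ends at the current position (or a single character if none).
    static List<String> segment(String text) {
        LinkedList<String> out = new LinkedList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(MAX_WORD_LEN, end);
            while (len > 1 && !DICT.contains(text.substring(end - len, end))) {
                len--;
            }
            out.addFirst(text.substring(end - len, end));
            end -= len;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segment("研究生命起源"));
    }
}
```

Scanning direction changes the result: from the right, 研究生命起源 splits into 研究, 生命, 起源, whereas a forward longest-match scan with the same dictionary would take 研究生 first and leave 命 stranded. The bidirectional variants in the list above run both directions and choose between the two results.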

10. jcseg analyzer

https://code.google.com/archive/p/jcseg/

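11. Stanford segmenter

An open-source segmentation tool from Stanford University that now supports Chinese.

First, download Stanford Word Segmenter version 3.5.2 from [1], take the data folder from it, and place it under src/main/resources in your Maven project.

Then add the Maven dependencies: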

<properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <corenlp.version>3.6.0</corenlp.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>${corenlp.version}</version>
        </dependency>
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>${corenlp.version}</version>
            <classifier>models</classifier>
        </dependency>
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>${corenlp.version}</version>
            <classifier>models-chinese</classifier>
        </dependency>
    </dependencies>

Test:

import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;

public class CoreNLPSegment {

    private static CoreNLPSegment instance;
    private CRFClassifier         classifier;

    private CoreNLPSegment(){
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        classifier = new CRFClassifier(props);
        classifier.loadClassifierNoExceptions("data/ctb.gz", props);
        classifier.flags.setProperties(props);
    }

    public static CoreNLPSegment getInstance() {
        if (instance == null) {
            instance = new CoreNLPSegment();
        }

        return instance;
    }

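    public String[] doSegment(String data) {
        // toArray() with no argument returns Object[], and casting that to String[] throws
        // ClassCastException; passing a typed array makes the runtime array a real String[].
        return (String[]) classifier.segmentString(data).toArray(new String[0]);
    }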

    public static void main(String[] args) {

        String sentence = "他和我在學校里常打桌球。";
        String ret[] = CoreNLPSegment.getInstance().doSegment(sentence);
        for (String str : ret) {
            System.out.println(str);
        }

    }

}

Blog posts:

https://blog.sectong.com/blog/corenlp_segment.html

http://blog.csdn.net/lightty/article/details/51766602

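12. Smartcn

Smartcn is an open-source Chinese segmentation system under the Apache 2.0 license, written in Java and derived from the ICTCLAS segmenter from the Institute of Computing Technology, Chinese Academy of Sciences. I noticed long ago that Lucene had gained a Chinese segmentation contribution. At the time I only skimmed the .class file names, which already made it clear that it was another adaptation of ICTCLAS.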

http://lucene.apache.org/core/5_1_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html

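13. pinyin analyzer

The pinyin analyzer lets users type pinyin and still find the related keywords. For example, typing yonghui in a shop's search box matches 永輝, which makes for a much better search experience.

The pinyin analyzer is installed the same way as IK. Download: https://github.com/medcl/elasticsearch-analysis-pinyin

For the available parameters, see the README on GitHub.

In version 1.8, this analyzer provides two analysis rules:

pinyin: converts Chinese characters to plain pinyin;
pinyin_first_letter: extracts the first letter of each character's pinyin.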
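
The two rules can be illustrated with a lookup table. The sketch below is a toy under stated assumptions: the three-entry table is hypothetical and stands in for the plugin's full pinyin dictionary, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PinyinSketch {
    // Hypothetical three-entry table; a real converter needs a full dictionary.
    static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('劉', "liu");
        TABLE.put('德', "de");
        TABLE.put('華', "hua");
    }

    // "pinyin" rule: one pinyin syllable per character; unknown characters pass through.
    static List<String> fullPinyin(String text) {
        List<String> out = new ArrayList<>();
        for (char c : text.toCharArray()) {
            out.add(TABLE.getOrDefault(c, String.valueOf(c)));
        }
        return out;
    }

    // "pinyin_first_letter" rule: concatenate the first letter of each syllable.
    static String firstLetters(String text) {
        StringBuilder sb = new StringBuilder();
        for (String syllable : fullPinyin(text)) {
            sb.append(syllable.charAt(0));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fullPinyin("劉德華"));
        System.out.println(firstLetters("劉德華"));
    }
}
```

Indexing both forms is what lets a query such as liu or ldh find the same name.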

Usage:

1. Create an index with a custom pinyin analyzer

curl -XPUT http://localhost:9200/medcl/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                    }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}'

2.Test Analyzer, analyzing a chinese name, such as 劉德華

http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer

{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "劉德華",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    }
  ]
}

3.Create mapping

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "store": "no",
                        "term_vector": "with_offsets",
                        "analyzer": "pinyin_analyzer",
                        "boost": 10
                    }
                }
            }
        }
    }
}'

4.Indexing

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"劉德華"}'

5.Let's search

http://localhost:9200/medcl/folks/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:%e5%88%98%e5%be%b7
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:de+hua

6.Using Pinyin-TokenFilter

curl -XPUT http://localhost:9200/medcl1/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_first_letter_and_full_pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_first_letter_and_full_pinyin_filter" : {
                    "type" : "pinyin",
                    "keep_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_none_chinese" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "trim_whitespace" : true,
                    "keep_none_chinese_in_first_letter" : true
                }
            }
        }
    }
}'

Token test: 劉德華 張學友 郭富城 黎明 四大天王

curl -XGET http://localhost:9200/medcl1/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer

{
  "tokens" : [
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zxy",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "gfc",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "lm",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "sdtw",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    }
  ]
}

7.Used in phrase query

(1)、

 PUT /medcl/
  {
      "index" : {
          "analysis" : {
              "analyzer" : {
                  "pinyin_analyzer" : {
                      "tokenizer" : "my_pinyin"
                      }
              },
              "tokenizer" : {
                  "my_pinyin" : {
                      "type" : "pinyin",
                      "keep_first_letter":false,
                      "keep_separate_first_letter" : false,
                      "keep_full_pinyin" : true,
                      "keep_original" : false,
                      "limit_first_letter_length" : 16,
                      "lowercase" : true
                  }
              }
          }
      }
  }
  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "劉德華"
    }}
  }

(2)、

PUT /medcl/
  {
      "index" : {
          "analysis" : {
              "analyzer" : {
                  "pinyin_analyzer" : {
                      "tokenizer" : "my_pinyin"
                      }
              },
              "tokenizer" : {
                  "my_pinyin" : {
                      "type" : "pinyin",
                      "keep_first_letter":false,
                      "keep_separate_first_letter" : true,
                      "keep_full_pinyin" : false,
                      "keep_original" : false,
                      "limit_first_letter_length" : 16,
                      "lowercase" : true
                  }
              }
          }
      }
  }

  POST /medcl/folks/andy
  {"name":"劉德華"}

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "劉德h"
    }}
  }

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "劉dh"
    }}
  }

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "dh"
    }}
  }

14. Mmseg analyzer

Also supports Elasticsearch.

Download: https://github.com/medcl/elasticsearch-analysis-mmseg/releases (pick the release that matches your version)

How to use:

1. Create an index:

curl -XPUT http://localhost:9200/index

2. Create the mapping:

curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'
{
        "properties": {
            "content": {
                "type": "text",
                "term_vector": "with_positions_offsets",
                "analyzer": "mmseg_maxword",
                "search_analyzer": "mmseg_maxword"
            }
        }

}'

3.Indexing some docs

curl -XPOST http://localhost:9200/index/fulltext/1 -d'
{"content":"美國留給伊拉克的是個爛攤子嗎"}
'

curl -XPOST http://localhost:9200/index/fulltext/2 -d'
{"content":"公安部：各地校車將享最高路權"}
'

curl -XPOST http://localhost:9200/index/fulltext/3 -d'
{"content":"中韓漁警沖突調查：韓警平均每天扣1艘中國漁船"}
'

curl -XPOST http://localhost:9200/index/fulltext/4 -d'
{"content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"}
'

4. Query with highlighting

curl -XPOST http://localhost:9200/index/fulltext/_search  -d'
{
    "query" : { "term" : { "content" : "中國" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'

5. Result:

{
    "took": 14,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 2,
        "hits": [
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "4",
                "_score": 2,
                "_source": {
                    "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
                },
                "highlight": {
                    "content": [
                        "<tag1>中國</tag1>駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首 "
                    ]
                }
            },
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "3",
                "_score": 2,
                "_source": {
                    "content": "中韓漁警沖突調查：韓警平均每天扣1艘中國漁船"
                },
                "highlight": {
                    "content": [
                        "均每天扣1艘<tag1>中國</tag1>漁船 "
                    ]
                }
            }
        ]
    }
}

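Reference blog post:

Adding Chinese segmentation to Elasticsearch: http://blog.csdn.net/dingzfang/article/details/42776693

15. bosonnlp (BosonNLP Chinese analyzer)

Download: https://github.com/bosondata/elasticsearch-analysis-bosonnlp

How to use:

Before starting Elasticsearch, edit elasticsearch.yml in the config folder to define the BosonNLP Chinese analyzer and fill in your BosonNLP API_TOKEN and the segmentation API URL, by appending the following to the end of the file: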

index:
  analysis:
    analyzer:
      bosonnlp:
          type: bosonnlp
          API_URL: http://api.bosonnlp.com/tag/analysis
          # You MUST give the API_TOKEN value, otherwise it doesn't work
          API_TOKEN: *PUT YOUR API TOKEN HERE*
          # Please uncomment if you want to specify ANY ONE of the following
          # arguments, otherwise the DEFAULT value will be used, i.e.,
          # space_mode is 0,
          # oov_level is 3,
          # t2s is 0,
          # special_char_conv is 0.
          # More details can be found in bosonnlp docs:
          # http://docs.bosonnlp.com/tag.html
          #
          #
          # space_mode: put your value here(range from 0-3)
          # oov_level: put your value here(range from 0-4)
          # t2s: put your value here(range from 0-1)
          # special_char_conv: put your value here(range from 0-1)

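Notes:

You must set API_URL to the given segmentation endpoint and replace PUT YOUR API TOKEN HERE in API_TOKEN with your BosonNLP API token; otherwise the BosonNLP analyzer cannot be used. The API_TOKEN is issued when you register a BosonNLP account.

If the configuration file already defines other analyzers, add the bosonnlp analyzer under the existing analyzer key as shown above.

If several nodes need the BosonNLP segmentation plugin, install it and edit the YAML file as above on every node.

BosonNLP segmentation also offers four parameters (space_mode, oov_level, t2s, special_char_conv) to cover different segmentation needs. To keep the defaults, change nothing; otherwise uncomment the relevant parameter and assign a value.

Test:

Create an index

curl -XPUT 'localhost:9200/test'

Check that the analyzer is configured correctly

curl -XGET 'localhost:9200/test/_analyze?analyzer=bosonnlp&pretty' -d '這是玻森數據分詞的測試'

Result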

{
  "tokens" : [ {
    "token" : "這",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "是",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "玻森",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "數據",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "分詞",
    "start_offset" : 6,
    "end_offset" : 8,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "的",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "測試",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "word",
    "position" : 6
  } ]
}
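One quick way to read a response like this is to confirm that each token's offsets really point at that token in the input. The sketch below does this in Python on an abbreviated copy of the response (the `type` field is omitted); the function name `offsets_are_consistent` is hypothetical. It also assumes the tokens cover the text without gaps, which holds for this sample because it contains no punctuation or whitespace:

```python
import json

text = "這是玻森數據分詞的測試"

# Abbreviated copy of the _analyze response for the sample sentence.
response = json.loads("""
{"tokens": [
  {"token": "這",   "start_offset": 0, "end_offset": 1,  "position": 0},
  {"token": "是",   "start_offset": 1, "end_offset": 2,  "position": 1},
  {"token": "玻森", "start_offset": 2, "end_offset": 4,  "position": 2},
  {"token": "數據", "start_offset": 4, "end_offset": 6,  "position": 3},
  {"token": "分詞", "start_offset": 6, "end_offset": 8,  "position": 4},
  {"token": "的",   "start_offset": 8, "end_offset": 9,  "position": 5},
  {"token": "測試", "start_offset": 9, "end_offset": 11, "position": 6}
]}
""")


def offsets_are_consistent(text, tokens):
    """True if each token equals text[start:end] and the tokens tile the text."""
    cursor = 0
    for tok in tokens:
        start, end = tok["start_offset"], tok["end_offset"]
        if start != cursor or text[start:end] != tok["token"]:
            return False
        cursor = end
    return cursor == len(text)


print(offsets_are_consistent(text, response["tokens"]))
```

Python string indices count code points, so this comparison is only safe for text without characters outside the Basic Multilingual Plane.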

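Configuring a token filter

The current BosonNLP analyzer has no built-in token filter. If you need to filter tokens, you can build a custom analyzer from the BosonNLP tokenizer and the token filters that ES provides.

Steps

Configuring a custom analyzer takes three steps:

Add the BosonNLP tokenizer

In elasticsearch.yml, add a tokenizer section under analysis, and put the BosonNLP tokenizer configuration in it: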

index:
  analysis:
    analyzer:
      ...
    tokenizer:
      bosonnlp:
          type: bosonnlp
          API_URL: http://api.bosonnlp.com/tag/analysis
          # You MUST give the API_TOKEN value, otherwise it doesn't work
          API_TOKEN: *PUT YOUR API TOKEN HERE*
          # Please uncomment if you want to specify ANY ONE of the following
          # arguments, otherwise the DEFAULT value will be used, i.e.,
          # space_mode is 0,
          # oov_level is 3,
          # t2s is 0,
          # special_char_conv is 0.
          # More details can be found in bosonnlp docs:
          # http://docs.bosonnlp.com/tag.html
          #
          #
          # space_mode: put your value here(range from 0-3)
          # oov_level: put your value here(range from 0-4)
          # t2s: put your value here(range from 0-1)
          # special_char_conv: put your value here(range from 0-1)

Add a token filter

In elasticsearch.yml, add a filter section under analysis, and put the configuration of the filter you need in it (the example below uses the lowercase filter):

index:
  analysis:
    analyzer:
      ...
    tokenizer:
      ...
    filter:
      lowercase:
          type: lowercase

Add the custom analyzer

In elasticsearch.yml, add the custom analyzer's configuration to the analyzer section under analysis (in the example below, the custom analyzer is named filter_bosonnlp):

index:
  analysis:
    analyzer:
      ...
      filter_bosonnlp:
          type: custom
          tokenizer: bosonnlp
          filter: [lowercase]

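Custom analyzers

Although Elasticsearch ships with a number of ready-made analyzers, its real power in this area is that you can build custom analyzers by combining character filters, tokenizers, and token filters in a configuration that suits your particular data.

Character filters:

Character filters tidy up a string before it is tokenized. For example, if our text is HTML, it will contain tags such as <p> or <div> that we do not want indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities such as &aacute; into the corresponding Unicode character á.

An analyzer may have zero or more character filters.

Tokenizers:

An analyzer must have exactly one tokenizer. The tokenizer breaks the string into individual terms, or tokens. The standard tokenizer, used by the standard analyzer, splits a string into terms on word boundaries and removes most punctuation, but other tokenizers with different behavior exist.

Token filters:

After tokenization, the resulting token stream passes through the specified token filters, in the order in which they are specified.

Token filters can modify, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but Elasticsearch offers many more. Stemming token filters reduce words to their stems. The asciifolding filter removes diacritics, turning a word like "très" into "tres". The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete.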
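To make two of these token filters concrete, here is a toy Python sketch of diacritic folding and edge n-grams. It is only an approximation written for illustration: the function names `fold_diacritics` and `edge_ngrams` are hypothetical, and the real asciifolding filter covers many more characters than this decomposition trick does.

```python
import unicodedata


def fold_diacritics(token):
    """Strip combining marks after canonical decomposition (rough asciifolding)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=1, max_gram=3):
    """Return the prefixes of `token` whose length is min_gram..max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


print(fold_diacritics("très"))  # tres
print(edge_ngrams("quick"))     # ['q', 'qu', 'qui']
```

Indexing the prefixes of each term is what makes edge n-grams useful for autocomplete: a query for "qu" can match the indexed token "qu" directly instead of scanning for terms that start with it.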

Creating a custom analyzer

Character filters, tokenizers, and token filters are configured in their respective sections under analysis:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

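As an example, we will build a custom analyzer that does the following:

1、Strips out HTML using the html_strip character filter.

2、Replaces & with " and " using a custom mapping character filter: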

"char_filter": {
    "&_to_and": {
        "type":       "mapping",
        "mappings": [ "&=> and "]
    }
}

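3、Tokenizes the text using the standard tokenizer.

4、Lowercases terms using the lowercase token filter.

5、Removes a custom list of stopwords using a custom stop token filter: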

"filter": {
    "my_stopwords": {
        "type":        "stop",
        "stopwords": [ "the", "a" ]
    }
}

Our analyzer definition combines the predefined tokenizer and filters with the custom filters we configured above:

"analyzer": {
    "my_analyzer": {
        "type":           "custom",
        "char_filter":  [ "html_strip", "&_to_and" ],
        "tokenizer":      "standard",
        "filter":       [ "lowercase", "my_stopwords" ]
    }
}

Put together, the complete create-index request looks like this:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
}}}
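A common mistake in a request like this is for the analyzer to reference a filter name that is never defined. The Python sketch below checks the analysis block for such dangling names before it is sent. The function `undefined_references` is hypothetical, and the `BUILTIN` table lists only the built-in components used in this example, not everything Elasticsearch provides:

```python
# Built-in components used in this example only (an assumption, not a full list).
BUILTIN = {
    "char_filter": {"html_strip"},
    "tokenizer": {"standard"},
    "filter": {"lowercase"},
}

analysis = {
    "char_filter": {"&_to_and": {"type": "mapping", "mappings": ["&=> and "]}},
    "filter": {"my_stopwords": {"type": "stop", "stopwords": ["the", "a"]}},
    "analyzer": {
        "my_analyzer": {
            "type": "custom",
            "char_filter": ["html_strip", "&_to_and"],
            "tokenizer": "standard",
            "filter": ["lowercase", "my_stopwords"],
        }
    },
}


def undefined_references(analysis):
    """List (analyzer, kind, name) for every name that is neither built in nor defined."""
    missing = []
    for analyzer_name, spec in analysis.get("analyzer", {}).items():
        for kind in ("char_filter", "tokenizer", "filter"):
            refs = spec.get(kind, [])
            if isinstance(refs, str):  # "tokenizer" holds a single name
                refs = [refs]
            known = BUILTIN[kind] | set(analysis.get(kind, {}))
            missing += [(analyzer_name, kind, r) for r in refs if r not in known]
    return missing


print(undefined_references(analysis))  # []
```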

Once the index has been created, use the analyze API to test the new analyzer:

GET /my_index/_analyze?analyzer=my_analyzer
The quick & brown fox

The abbreviated result below shows that our analyzer is working correctly:

{
  "tokens" : [
      { "token" :   "quick",    "position" : 2 },
      { "token" :   "and",      "position" : 3 },
      { "token" :   "brown",    "position" : 4 },
      { "token" :   "fox",      "position" : 5 }
    ]
}
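To see why "The" disappears while the positions of the remaining tokens stay at 2 to 5, the chain can be imitated in plain Python. This is a rough sketch, not how Elasticsearch implements it: the regular expressions are simplified stand-ins for html_strip and the standard tokenizer, and positions are counted from 1 only to match the abbreviated output above.

```python
import re

STOPWORDS = {"the", "a"}  # same list as the my_stopwords filter


def my_analyzer(text):
    """Imitate my_analyzer: char filters, tokenizer, then token filters."""
    text = re.sub(r"<[^>]+>", "", text)    # rough stand-in for html_strip
    text = text.replace("&", " and ")      # the "&_to_and" mapping
    words = re.findall(r"\w+", text)       # rough stand-in for the standard tokenizer
    tokens = []
    for position, word in enumerate(words, start=1):
        term = word.lower()                # lowercase filter
        if term not in STOPWORDS:          # stop filter drops the term but keeps the gap
            tokens.append({"token": term, "position": position})
    return tokens


for tok in my_analyzer("The quick & brown fox"):
    print(tok)
```

Positions are assigned before the stop filter runs, so removing "the" leaves a gap at position 1 instead of renumbering the remaining tokens.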

The analyzer is of little use until we tell Elasticsearch where to apply it. We can apply it to a string field with a mapping like this:

PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":      "string",
            "analyzer":  "my_analyzer"
        }
    }
}
