spaCy v2.0, Part 1 (a rough translation): Adding Languages

Adding full support for a language touches many different parts of the spaCy library. This guide explains how to fit everything together, and points you to the specific workflows for each component.


WORKING ON SPACY'S SOURCE

To add a new language to spaCy, you'll need to modify the library's code. The easiest way to do this is to clone the repository (https://github.com/explosion/spaCy) and build spaCy from source. For more information on this, see the installation guide. Unlike spaCy's core, which is mostly written in Cython, all language data is stored in regular Python files. This means that you won't have to rebuild anything in between – you can simply make edits and reload spaCy to test them.

Obviously, there are lots of ways you can organise your code when you implement your own language data. This guide will focus on how it's done within spaCy. For full language support, you'll need to create a Language subclass, define custom language data, like a stop list and tokenizer exceptions, and test the new tokenizer. Once the language is set up, you can build the vocabulary, including word frequencies, Brown clusters and word vectors. Finally, you can train the tagger and parser, and save the model to a directory.

For some languages, you may also want to develop a solution for lemmatization and morphological analysis.


Language data

Every language is different – and usually full of exceptions and special cases, especially amongst the most common words. Some of these exceptions are shared across languages, while others are entirely specific – usually so specific that they need to be hard-coded. The lang module contains all language-specific data, organised in simple Python files. This makes the data easy to update and extend.

The shared language data in the directory root includes rules that can be generalised across languages – for example, rules for basic punctuation, emoji, emoticons, single-letter abbreviations and norms for equivalent tokens with different spellings, like " and ”. This helps the models make more accurate predictions. The individual language data in a submodule contains rules that are only relevant to a particular language. It also takes care of putting together all components and creating the Language subclass – for example, English or German.

from spacy.lang.en import English

from spacy.lang.de import German

nlp_en = English() # includes English data

nlp_de = German() # includes German data


Stop words (stop_words.py)

List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return True for is_stop.

Tokenizer exceptions (tokenizer_exceptions.py)

Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.".

Norm exceptions (norm_exceptions.py)

Special-case rules for normalising tokens to improve the model's predictions, for example on American vs. British spelling.

Punctuation rules (punctuation.py)

Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.

Character classes (char_classes.py)

Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons.

Lexical attributes (lex_attrs.py)

Custom functions for setting lexical attributes on tokens, e.g. like_num, which includes language-specific words like "ten" or "hundred" (or 十 and 百 in Chinese).

Syntax iterators (syntax_iterators.py)

Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks.

Lemmatizer (lemmatizer.py)

Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was".

Tag map (tag_map.py)

Dictionary mapping strings in your tag set to Universal Dependencies tags.

Morph rules (morph_rules.py)

Exception rules for morphological analysis of irregular words like personal pronouns.


The individual components expose variables that can be imported within a language module, and added to the language's Defaults. Some components, like the punctuation rules, usually don't need much customisation and can simply be imported from the global rules. Others, like the tokenizer and norm exceptions, are very specific and will make a big difference to spaCy's performance on the particular language and training a language model.


SHOULD I EVER UPDATE THE GLOBAL DATA?

Reusable language data is collected as atomic pieces in the root of the spacy.lang package. Often, when a new language is added, you'll find a pattern or symbol that's missing. Even if it isn't common in other languages, it might be best to add it to the shared language data, unless it has some conflicting interpretation. For instance, we don't expect to see guillemot quotation symbols (« and ») in English text. But if we do see them, we'd probably prefer the tokenizer to split them off.

FOR LANGUAGES WITH NON-LATIN CHARACTERS

In order for the tokenizer to split suffixes, prefixes and infixes, spaCy needs to know the language's character set. If the language you're adding uses non-latin characters, you might need to add the required character classes to the global char_classes.py. spaCy uses the regex library to keep this simple and readable. If the language requires very specific punctuation rules, you should consider overwriting the default regular expressions with your own in the language's Defaults. For Chinese, for example, the full-width punctuation marks would need to be defined.


The Language subclass

Language-specific code and resources should be organised into a subpackage of spaCy, named according to the language's ISO code. For instance, code and resources specific to Spanish are placed into a directory spacy/lang/es, which can be imported as spacy.lang.es. For Chinese, they would go into spacy/lang/zh and be importable as spacy.lang.zh.

To get started, you can use our templates for the most important files. Here's what the class template looks like:

__INIT__.PY (EXCERPT)

# import language-specific data
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .lex_attrs import LEX_ATTRS

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc

# create Defaults class in the module scope (necessary for pickling!)
class XxxxxDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'xx'  # language ISO code

    # optional: replace flags with custom functions, e.g. like_num()
    lex_attr_getters.update(LEX_ATTRS)

    # merge base exceptions and custom tokenizer exceptions
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS

# create actual Language class
class Xxxxx(Language):
    lang = 'xx'  # language ISO code
    Defaults = XxxxxDefaults  # override defaults

# set default export – this allows the language class to be lazy-loaded
__all__ = ['Xxxxx']


WHY LAZY-LOADING?

Some languages contain large volumes of custom data, like lemmatizer lookup tables, or complex regular expressions that are expensive to compute. As of spaCy v2.0, Language classes are not imported on initialisation and are only loaded when you import them directly, or load a model that requires a language to be loaded. To lazy-load languages in your application, you can use the util.get_lang_class() helper function with the two-letter language code as its argument.
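For example, this is a minimal sketch of lazy-loading the English class by its ISO code (any installed language code works the same way):

from spacy import util

lang_cls = util.get_lang_class('en')   # look up the Language subclass by ISO code
nlp = lang_cls()                       # same as instantiating spacy.lang.en.English directly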


Stop words

A "stop list" is a classic trick from the early days of information retrieval when search was largely about keyword presence and absence. It is still sometimes useful today to filter out common words from a bag-of-words model. To improve readability, STOP_WORDS are separated by spaces and newlines, and added as a multiline string.

WHAT DOES SPACY CONSIDER A STOP WORD?

There's no particularly principled logic behind what words should be added to the stop list. Make a list that you think might be useful to people and is likely to be unsurprising. As a rule of thumb, words that are very rare are unlikely to be useful stop words. For Chinese, a practical starting point is an established stop word list such as the Fudan or HIT (Harbin Institute of Technology) lists. The key is how the list is defined and used; here's what the definition looks like:

EXAMPLE

STOP_WORDS = set(""" a about above across after afterwards again against all almost alone along already also although always am among amongst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond both bottom but by """.split())

The words inside the triple-quoted string are the stop words – for Chinese, you would put a Chinese stop word list there instead.
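As a quick sanity check – a minimal sketch, assuming the English data – tokens that match an entry in STOP_WORDS report is_stop as True:

from spacy.lang.en import English

nlp = English()
doc = nlp(u"these are apples")
assert doc[0].is_stop          # "these" is in the English STOP_WORDS
assert not doc[2].is_stop      # "apples" is not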

IMPORTANT NOTE

When adding stop words from an online source, always include the link in a comment. Make sure to proofread and double-check the words carefully. A lot of the lists available online have been passed around for years and often contain mistakes, like unicode errors or random words that have once been added for a specific use case, but don't actually qualify.


Tokenizer exceptions

spaCy's tokenization algorithm lets you deal with whitespace-delimited chunks separately. This makes it easy to define special-case rules, without worrying about how they interact with the rest of the tokenizer. Whenever the key string is matched, the special-case rule is applied, giving the defined sequence of tokens. You can also attach attributes to the subtokens covered by your special case, such as the subtokens' LEMMA or TAG.

IMPORTANT NOTE

If an exception consists of more than one token, the ORTH values combined always need to match the original string. The way the original string is split up can be pretty arbitrary sometimes – for example, "gonna" is split into "gon" (lemma "go") and "na" (lemma "to"). Because of how the tokenizer works, it's currently not possible to split single-letter strings into multiple tokens.
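For illustration, a special case like "gonna" could be written as follows (a sketch; the real entry lives in the English tokenizer_exceptions.py and also sets norms):

from spacy.symbols import ORTH, LEMMA

TOKENIZER_EXCEPTIONS = {
    "gonna": [{ORTH: "gon", LEMMA: "go"},
              {ORTH: "na", LEMMA: "to"}]}   # the ORTH values joined give back "gonna"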

Unambiguous abbreviations, like month names or locations in English, should be added to exceptions with a lemma assigned, for example {ORTH: "Jan.", LEMMA: "January"}. Since the exceptions are added in Python, you can use custom logic to generate them more efficiently and make your data less verbose. How you do this ultimately depends on the language – many of these cases simply don't arise in Chinese. Here's an example of how exceptions for time formats like "1a.m." and "1am" are generated in the English tokenizer_exceptions.py:

from ...symbols import ORTH, LEMMA

# use short, internal variable for readability
_exc = {}

for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        # always keep an eye on string interpolation!
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "a.m."}]
    for period in ["p.m.", "pm"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "p.m."}]

# only declare this at the bottom
TOKENIZER_EXCEPTIONS = _exc
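Once these exceptions are part of the English defaults, a string like "9am" comes out as two tokens – roughly like this (a sketch):

from spacy.lang.en import English

nlp = English()
doc = nlp(u"Meet me at 9am")
assert [t.text for t in doc] == ['Meet', 'me', 'at', '9', 'am']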

GENERATING TOKENIZER EXCEPTIONS

Keep in mind that generating exceptions only makes sense if there's a clearly defined and finite number of them, like common contractions in English. This is not always the case – in Spanish for instance, infinitive or imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In cases like this, spaCy shouldn't be generating exceptions for all verbs. Instead, this will be handled at a later stage during lemmatization.

When adding the tokenizer exceptions to the Defaults, you can use the update_exc() helper function to merge them with the global base exceptions (including one-letter abbreviations and emoticons). The function performs a basic check to make sure exceptions are provided in the correct format. It can take any number of exceptions dicts as its arguments, and will update and overwrite the exceptions in this order. For example, if your language's tokenizer exceptions include a custom tokenization pattern for "a.", it will overwrite the base exceptions with the language's custom one.

EXAMPLE

from ...symbols import ORTH, LEMMA
from ...util import update_exc

BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}

tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}

ABOUT SPACY'S CUSTOM PRONOUN LEMMA

Unlike verbs and common nouns, there's no clear base form of a personal pronoun. Should the lemma of "me" be "I", or should we normalize person as well, giving "it" – or maybe "he"? spaCy's solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns.


Norm exceptions

In addition to ORTH or LEMMA, tokenizer exceptions can also set a NORM attribute. This is useful to specify a normalised version of the token – for example, the norm of "n't" is "not". By default, a token's norm equals its lowercase text. If the lowercase spelling of a word exists, norms should always be in lowercase.

NORMS VS. LEMMAS

doc = nlp(u"I'm gonna realise")

norms = [token.norm_ for token in doc]

lemmas = [token.lemma_ for token in doc]

assert norms == ['i', 'am', 'going', 'to', 'realize']

assert lemmas == ['i', 'be', 'go', 'to', 'realise']

spaCy usually tries to normalise words with different spellings to a single, common spelling. This has no effect on any other token attributes, or tokenization in general, but it ensures that equivalent tokens receive similar representations. This can improve the model's predictions on words that weren't common in the training data, but are equivalent to other words – for example, "realize" and "realise", or "thx" and "thanks".

Similarly, spaCy also includes global base norms (spacy/lang/norm_exceptions.py, https://github.com/explosion/spaCy/blob/master/spacy/lang/norm_exceptions.py) for normalising different styles of quotation marks and currency symbols. Even though $ and € are very different, spaCy normalises them both to $. This way, they'll always be seen as similar, no matter how common they were in the training data.

Norm exceptions can be provided as a simple dictionary. For more examples, see the English norm_exceptions.py.

EXAMPLE

NORM_EXCEPTIONS = {
    "cos": "because",
    "fav": "favorite",
    "accessorise": "accessorize",
    "accessorised": "accessorized"
}

To add the custom norm exceptions lookup table, you can use the add_lookups() helper function. It takes the default attribute getter function as its first argument, plus a variable list of dictionaries. If a string's norm is found in one of the dictionaries, that value is used – otherwise, the default function is called and the token is assigned its default norm.

lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
                                     NORM_EXCEPTIONS, BASE_NORMS)

The order of the dictionaries is also the lookup order – so if your language's norm exceptions overwrite any of the global exceptions, they should be added first. Also note that the tokenizer exceptions will always have priority over the attribute getters.
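With the English data (which includes the "cos" entry from the example above), the norm lookup behaves roughly like this – a sketch:

from spacy.lang.en import English

nlp = English()
doc = nlp(u"cos Thursday")
assert doc[0].norm_ == "because"   # found in the norm exceptions table
assert doc[1].norm_ == "thursday"  # default norm: the lowercase text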


Lexical attributes

spaCy provides a range of Token attributes that return useful information on that token – for example, whether it's uppercase or lowercase, a left or right punctuation mark, or whether it resembles a number or email address. Most of these functions, like is_lower or like_url, should be language-independent. Others, like like_num (which includes both digits and number words), require some customisation.

BEST PRACTICES

English number words are pretty simple, because even large numbers consist of individual tokens, and we can get away with splitting and matching strings against a list. In other languages, like German, "two hundred and thirty-four" is one word, and thus one token. Here, it's best to match a string against a list of number word fragments (instead of a technically almost infinite list of possible number words). The same approach works for Chinese number words. Here's the English lex_attrs.py as an example:

LEX_ATTRS.PY

from ...attrs import LIKE_NUM

_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
              'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
              'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
              'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
              'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
              'gajillion', 'bazillion']

def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    return False

LEX_ATTRS = {
    LIKE_NUM: like_num
}

By updating the default lexical attributes with a custom LEX_ATTRS dictionary in the language's defaults via lex_attr_getters.update(LEX_ATTRS), only the new custom functions are overwritten.
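The like_num function defined above can be checked directly, without loading spaCy at all:

assert like_num("10,000.50")
assert like_num("1/2")
assert like_num("Eleven")
assert not like_num("hello")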


Syntax iterators

Syntax iterators are functions that compute views of a Doc object based on its syntax. At the moment, this data is only used for extracting noun chunks, which are available as the Doc.noun_chunks property. Because base noun phrases work differently across languages, the rules to compute them are part of the individual language's data. If a language does not include a noun chunks iterator, the property won't be available. For examples, see the existing syntax iterators:

NOUN CHUNKS EXAMPLE

doc = nlp(u'A phrase with another phrase occurs.')

chunks = list(doc.noun_chunks)

assert chunks[0].text == "A phrase"

assert chunks[1].text == "another phrase"
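For reference, here is a heavily simplified sketch of what a language's syntax_iterators.py can look like in spaCy v2 – the real English implementation additionally filters on dependency labels and avoids overlapping spans, so treat this only as a structural outline:

from ...symbols import NOUN, PROPN, PRON

def noun_chunks(obj):
    # yield (start, end, label) triples; Doc.noun_chunks wraps them as Spans
    doc = obj.doc
    np_label = doc.vocab.strings.add('NP')
    for word in obj:
        if word.pos in (NOUN, PROPN, PRON):
            yield word.left_edge.i, word.i + 1, np_label

SYNTAX_ITERATORS = {
    'noun_chunks': noun_chunks
}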



Lemmatizer

As of v2.0, spaCy supports simple lookup-based lemmatization. This is usually the quickest and easiest way to get started. The data is stored in a dictionary mapping a string to its lemma. To determine a token's lemma, spaCy simply looks it up in the table. Here's an example from the Spanish language data:

LANG/ES/LEMMATIZER.PY (EXCERPT)

LOOKUP = {
    "aba": "abar",
    "ababa": "abar",
    "ababais": "abar",
    "ababan": "abar",
    "ababanes": "ababán",
    "ababas": "abar",
    "ababoles": "ababol",
    "ababábites": "ababábite"
}

To provide a lookup lemmatizer for your language, import the lookup table and add it to the Language class as lemma_lookup:

lemma_lookup = dict(LOOKUP)
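The lookup itself is just a dictionary access – with the LOOKUP table from the excerpt above in scope, something like this (a sketch):

lemma = LOOKUP.get("ababa", "ababa")   # fall back to the surface form if missing
assert lemma == "abar"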


Tag map

Most treebanks define a custom part-of-speech tag scheme, striking a balance between level of detail and ease of prediction. While it's useful to have custom tagging schemes, it's also useful to have a common scheme, to which the more specific tags can be related. The tagger can learn a tag scheme with any arbitrary symbols. However, you need to define how those symbols map down to the Universal Dependencies tag set. This is done by providing a tag map.

The keys of the tag map should be strings in your tag set. The values should be a dictionary. The dictionary must have an entry POS whose value is one of the Universal Dependencies tags. Optionally, you can also include morphological features or other token attributes in the tag map as well. This allows you to do simple rule-based morphological analysis.

EXAMPLE

from ..symbols import POS, NOUN, VERB, DET

TAG_MAP = {
    "NNS":  {POS: NOUN, "Number": "plur"},
    "VBG":  {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "DT":   {POS: DET}
}
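With the TAG_MAP from this example in scope, the mapping can be checked directly (a sketch):

from spacy.symbols import POS, NOUN

# the fine-grained treebank tag "NNS" maps to the coarse UD tag NOUN,
# plus the morphological feature Number=plur
assert TAG_MAP["NNS"][POS] == NOUN
assert TAG_MAP["NNS"]["Number"] == "plur"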


Morph rules

The morphology rules let you set token attributes such as lemmas, keyed by the extended part-of-speech tag and token text. The morphological features and their possible values are language-specific and based on the Universal Dependencies scheme.

EXAMPLE

from ..symbols import LEMMA

MORPH_RULES = {
    "VBZ": {
        "am":  {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
        "are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "is":  {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
        "'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "'s":  {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}
    }
}

In the example above, "am" is assigned the lemma "be" together with the features VerbForm=Fin, Person=One, Tense=Pres and Mood=Ind.

IMPORTANT NOTE

The morphological attributes are currently not all used by spaCy. Full integration is still being developed. In the meantime, it can still be useful to add them, especially if the language you're adding includes important distinctions and special cases. This ensures that as soon as full support is introduced, your language will be able to assign all possible attributes.
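To make the tag map and morph rules take effect, they are added to the language's Defaults just like the other components. A sketch following the template from earlier, importing the English data purely for illustration:

from spacy.language import Language
from spacy.lang.en.tag_map import TAG_MAP
from spacy.lang.en.morph_rules import MORPH_RULES

class XxxxxDefaults(Language.Defaults):
    tag_map = dict(TAG_MAP)          # treebank tag -> Universal Dependencies mapping
    morph_rules = dict(MORPH_RULES)  # per-tag exception rules like the VBZ block above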


Testing the language

Before using the new language or submitting a pull request to spaCy, you should make sure it works as expected. This is especially important if you've added custom regular expressions for token matching or punctuation – you don't want to be causing regressions.

SPACY'S TEST SUITE

spaCy uses the pytest framework for testing. For more details on how the tests are structured and best practices for writing your own tests, see our tests documentation (https://github.com/explosion/spaCy/blob/master/spacy/tests).

The easiest way to test your new tokenizer is to run the language-independent "tokenizer sanity" tests located in tests/tokenizer. This will test for basic behaviours like punctuation splitting, URL matching and correct handling of whitespace. In the conftest.py, add the new language ID to the list of _languages:

_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'he', 'hu', 'it', 'nb',
              'nl', 'pl', 'pt', 'sv', 'xx']  # new language here

GLOBAL TOKENIZER TEST EXAMPLE

# use fixture by adding it as an argument
def test_with_all_languages(tokenizer):
    # will be performed on ALL language tokenizers
    tokens = tokenizer(u'Some text here.')

The language will now be included in the tokenizer test fixture, which is used by the basic tokenizer tests. If you want to add your own tests that should be run over all languages, you can use this fixture as an argument of your test function.


Writing language-specific tests

It's recommended to always add at least some tests with examples specific to the language. Language tests should be located in tests/lang in a directory named after the language ID. You'll also need to create a fixture for your tokenizer in the conftest.py. Always use the get_lang_class() helper function within the fixture, instead of importing the class at the top of the file. This will load the language data only when it's needed. (Otherwise, all data would be loaded every time you run a test.)

@pytest.fixture
def en_tokenizer():
    return util.get_lang_class('en').Defaults.create_tokenizer()

When adding test cases, always parametrize them – this will make it easier for others to add more test cases without having to modify the test itself. You can also add parameter tuples, for example, a test sentence and its expected length, or a list of expected tokens. Here's an example of an English tokenizer test for combinations of punctuation and abbreviations:

EXAMPLE TEST

@pytest.mark.parametrize('text,length', [
    ("The U.S. Army likes Shock and Awe.", 8),
    ("U.N. regulations are not a part of their concern.", 10),
    ("“Isn't it?”", 6)])
def test_en_tokenizer_handles_punct_abbrev(en_tokenizer, text, length):
    tokens = en_tokenizer(text)
    assert len(tokens) == length


Training

spaCy expects that common words will be cached in a Vocab instance. The vocabulary caches lexical features, and makes it easy to use information from unlabelled text samples in your models. Specifically, you'll usually want to collect word frequencies, and train word vectors. To generate the word frequencies from a large, raw corpus, you can use the word_freqs.py script from the spaCy developer resources.

Note that your corpus should not be preprocessed (i.e. you need punctuation, for example). The word frequencies should be generated as a tab-separated file with three columns:

1. The number of times the word occurred in your language sample.

2. The number of distinct documents the word occurred in.

3. The word itself.

ES_WORD_FREQS.TXT

6361109   111 Aunque
23598543  111 aunque
10097056  111 claro
193454    111 aro
7711123   111 viene
12812323  111 mal
23414636  111 momento
2014580   111 felicidad
233865    111 repleto
15527     111 eto
235565    111 deliciosos
17259079  111 buena
71155     111 Anímate
37705     111 anímate
33155     111 cuéntanos
2389171   111 cuál
961576    111 típico
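A minimal sketch of how such a file could be produced with your language's tokenizer – a hypothetical stand-in for the official word_freqs.py, which is parallelised and more robust:

from collections import Counter
from spacy.lang.es import Spanish

nlp = Spanish()
word_counts, doc_counts = Counter(), Counter()

with open('corpus.txt', encoding='utf8') as f:       # assumption: one document per line
    for line in f:
        tokens = [t.text for t in nlp.make_doc(line.strip())]
        word_counts.update(tokens)
        doc_counts.update(set(tokens))

with open('es_word_freqs.txt', 'w', encoding='utf8') as out:
    for word, freq in word_counts.items():
        out.write('%d\t%d\t%s\n' % (freq, doc_counts[word], word))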


BROWN CLUSTERS

Additionally, you can use distributional similarity features provided by the Brown clustering algorithm. You should train a model with between 500 and 1000 clusters. A minimum frequency threshold of 10 usually works well.

You should make sure you use the spaCy tokenizer for your language to segment the text for your word frequencies. This will ensure that the frequencies refer to the same segmentation standards you'll be using at run-time. For instance, spaCy's English tokenizer segments "can't" into two tokens. If we segmented the text by whitespace to produce the frequency counts, we'd have incorrect frequency counts for the tokens "ca" and "n't".


Training the word vectors

Word2vec and related algorithms let you train useful word similarity models from unlabelled text. This is a key part of using deep learning for NLP with limited labelled data. The vectors are also useful by themselves – they power the .similarity() methods in spaCy. For best results, you should pre-process the text with spaCy before training the Word2vec model. This ensures your tokenization will match. You can use our word vectors training script (https://github.com/explosion/spacy-dev-resources/blob/master/training/word_vectors.py), which pre-processes the text with your language-specific tokenizer and trains the model using Gensim (https://radimrehurek.com/gensim/). The vectors.bin file should consist of one word and vector per line.
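If you don't use the provided script, the pre-processing step itself is straightforward: run the corpus through your language's tokenizer and write out whitespace-joined tokens, one document per line, for whatever Word2vec implementation you prefer. A sketch:

from spacy.lang.es import Spanish

nlp = Spanish()
with open('corpus.txt', encoding='utf8') as f, \
     open('corpus.tokenized.txt', 'w', encoding='utf8') as out:
    for line in f:
        tokens = [t.text for t in nlp.make_doc(line.strip())]
        out.write(' '.join(tokens) + '\n')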


Training the tagger and parser

You can now train the model using a corpus for your language annotated with Universal Dependencies. If your corpus uses the CoNLL-U format, i.e. files with the extension .conllu, you can use the convert command to convert it to spaCy's JSON format for training. Once you have your UD corpus transformed into JSON, you can train your model using spaCy's train command.

For more details and examples of how to train the tagger and dependency parser, see the usage guide on training.
