Python 2.7
IDE PyCharm 5.0.3
A little data-analysis warm-up; I've made it to the natural language processing part anyway.
Before We Begin
This article leans on data cleaning, regular expressions, dictionaries, and lists; without those basics it may be heavy going.
What Is an N-Gram Model?
In natural language there is a model called the n-gram, which denotes a sequence of n consecutive words in text or speech. When analyzing natural language, using n-grams, or looking for frequent phrases, makes it easy to break a sentence down into text fragments. (Quoted from Python網(wǎng)絡(luò)數(shù)據(jù)采集 / Web Scraping with Python, by Ryan Mitchell.)
Put simply, the goal is to find the core topic words. How do we decide which words are core? As a rule, whatever is repeated, i.e. mentioned the most times, is what the text most wants to say, and that is the core. The examples below build on this idea.
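Before diving in, here is a minimal sketch of what splitting a sentence into 2-grams looks like (the sample sentence is made up purely for illustration):
# every pair of adjacent words in the sentence is one 2-gram
sentence = "the quick brown fox jumps"  # made-up sample text
words = sentence.split(' ')
print [" ".join(words[i:i+2]) for i in range(len(words) - 1)]
# ['the quick', 'quick brown', 'brown fox', 'fox jumps']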
A Quick Supplement
Both of the tricks below show up in the main example; let's try them out on their own first to see what they do.
1. string.punctuation holds all the ASCII punctuation characters and pairs nicely with strip()
import string

words = ['a,', 'b!', 'cj!/n']  # renamed from `list` to avoid shadowing the built-in
cleaned = []
for w in words:
    w = w.strip(string.punctuation)  # trim punctuation off both ends
    cleaned.append(w)
print cleaned
['a', 'b', 'cj!/n']
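Note that 'cj!/n' survived untouched: strip() only trims characters from the two ends of a string, and since the final character 'n' is not punctuation, the inner '!' and '/' are shielded. A quick check (illustrative only):
import string
print 'cj!/'.strip(string.punctuation)   # 'cj'     -- trailing punctuation gets trimmed
print 'cj!/n'.strip(string.punctuation)  # 'cj!/n'  -- the trailing 'n' shields the rest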
2. operator.itemgetter()
The itemgetter function in the operator module fetches the specified item(s) from its operand; the arguments are the indices of the data you want to pull out.
Example:
import operator

d = {'name1': '2',  # renamed from `dict` to avoid shadowing the built-in
     'name2': '1'}
# d.items() returns the (key, value) pairs; itemgetter(0) sorts by key
print sorted(d.items(), key=operator.itemgetter(0), reverse=True)
[('name2', '1'), ('name1', '2')]
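The n-gram code below sorts by value rather than by key, which is simply itemgetter(1); a quick illustration with the same made-up dictionary:
import operator
d = {'name1': '2', 'name2': '1'}
# itemgetter(1) picks the value out of each (key, value) pair, so this sorts by value
print sorted(d.items(), key=operator.itemgetter(1), reverse=True)
# [('name1', '2'), ('name2', '1')]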
2-gram
Let's stick with two-word phrases. Here is an annotated example to walk through:
import urllib2
import re
import string
import operator

def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces and lowercase everything
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation markers like [1]
    input = re.sub(' +', " ", input)           # collapse runs of spaces into a single space
    input = bytes(input)                       # a no-op under Python 2.7 (bytes is str); carried over from the book's Python 3 code
    #input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')  # split on spaces, returning a list of words
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation holds all punctuation marks
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):  # keep real words, plus the single-letter words 'a' and 'i'
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}  # dict mapping each n-gram to its count
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:  # frequency counting
            output[ngramTemp] = 0    # the classic dict-counter idiom
        output[ngramTemp] += 1
    return output

# Option 1: fetch the text straight from the web
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file; handy for testing since no network is needed
#content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: sort in descending order of count
print(sortedNGrams)
[('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ... and so on
The example above simply ranks two-word phrases by how often they occur, but that is not quite what we want. What earthly use is 'of the' showing up two-hundred-odd times? So the next step is to weed out these connectives and prepositions.
Deeper
# -*- coding: utf-8 -*-
import urllib2
import re
import string
import operator

# helper to filter out the most common English words
def isCommon(ngram):
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have",
        "it", "i", "that", "for", "you", "he", "with", "on", "do", "say",
        "this", "they", "is", "an", "at", "but", "we", "his", "from",
        "not", "by", "she", "or", "as", "what", "go", "their", "can", "who",
        "get", "if", "would", "her", "all", "my", "make", "about", "know",
        "will", "up", "one", "time", "has", "been", "there", "year", "so",
        "think", "when", "which", "them", "some", "me", "people", "take", "out",
        "into", "just", "see", "him", "your", "come", "could", "now", "than",
        "like", "other", "how", "then", "its", "our", "two", "more", "these",
        "want", "way", "look", "first", "also", "new", "because", "day",
        "use", "no", "man", "find", "here", "thing", "give", "many", "well"]
    return ngram in commonWords

def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces and lowercase everything
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation markers like [1]
    input = re.sub(' +', " ", input)           # collapse runs of spaces into a single space
    input = bytes(input)                       # a no-op under Python 2.7 (bytes is str); carried over from the book's Python 3 code
    #input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')  # split on spaces, returning a list of words
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation holds all punctuation marks
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):  # keep real words, plus the single-letter words 'a' and 'i'
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}  # dict mapping each n-gram to its count
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        # drop the phrase if either word is a common word (note: this check assumes n == 2)
        if isCommon(ngramTemp.split()[0]) or isCommon(ngramTemp.split()[1]):
            pass
        else:
            if ngramTemp not in output:  # frequency counting
                output[ngramTemp] = 0    # the classic dict-counter idiom
            output[ngramTemp] += 1
    return output

# return the first sentence that contains the given core phrase
def getFirstSentenceContaining(ngram, content):
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""

# Option 1: fetch the text straight from the web
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file; handy for testing since no network is needed
#content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: sort in descending order of count
print(sortedNGrams)
for top3 in range(3):
    print "###" + getFirstSentenceContaining(sortedNGrams[top3][0], content.lower()) + "###"
[('united states', 10), ('general government', 4), ('executive department', 4), ('legisltive bojefferson', 3), ('same causes', 3), ('called upon', 3), ('chief magistrate', 3), ('whole country', 3), ('government should', 3), ... and so on
### the constitution of the united states is the instrument containing this grant of power to the several departments composing the government###
### the general government has seized upon none of the reserved rights of the states###
### such a one was afforded by the executive department constituted by the constitution###
As this example shows, we filtered out the connectives, ranked the remaining core phrases, and then pulled out the sentences containing them. Here I only grabbed the top three; boiling an article of two or three hundred sentences down to three or four of them is, I think, rather neat.
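One caveat: the filter inside getNgrams hard-codes two words (split()[0] and split()[1]), so it only works for n = 2. A minimal sketch of a version generalized to any n, reusing the same isCommon and cleanInput helpers from above, might look like this:
def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        # skip the n-gram if any of its words is a common word
        if any(isCommon(word) for word in ngramTemp.split()):
            continue
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output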
BUT
The method above only works for texts with a clear central theme, such as speeches at formal occasions. For fiction it is downright disastrous; I tried several English novels and the resulting "summaries" were gibberish.
Finally
The material comes from chapter 8 of Python網(wǎng)絡(luò)數(shù)據(jù)采集 (Web Scraping with Python), but the book's code targets Python 3.x and some of the examples would not run as printed, so I tidied things up and modified a few snippets before I could reproduce the book's results.
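For reference, under Python 3.x the urllib2 fetch above would be written with urllib.request instead (a sketch of the equivalent call, not the book's exact code):
# Python 3.x equivalent of the urllib2 call used in this article
from urllib.request import urlopen
content = urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read().decode('ascii', 'ignore')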
Acknowledgments
Python網(wǎng)絡(luò)數(shù)據(jù)采集 (Web Scraping with Python), Ryan Mitchell, 人民郵電出版社
Introduction to Python's strip() function
Python's sorted function and operator.itemgetter