Python 2.7
IDE PyCharm 5.0.3
A little data-analysis warm-up; I've made it to the natural language processing part anyway.
Before We Begin
This article leans on data cleaning, regular expressions, dictionaries, and lists; without those basics it may be heavy going.
What Is an N-Gram Model?
In natural language there is a model called the n-gram, which denotes a sequence of n consecutive words in text or speech. When analyzing natural language, using n-grams, or looking for frequent phrases, makes it easy to break a sentence down into text fragments. (Quoted from Python網(wǎng)絡(luò)數(shù)據(jù)采集 / Web Scraping with Python, by Ryan Mitchell.)
Put simply, the goal is to find the core topic words. How do we decide which words are core? As a rule, whatever is repeated, i.e. mentioned the most times, is what the text most wants to say, and that is the core. The examples below build on this idea.
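Before diving in, here is a minimal sketch of what splitting a sentence into 2-grams looks like (the sample sentence is made up purely for illustration):
# every pair of adjacent words in the sentence is one 2-gram
sentence = "the quick brown fox jumps"  # made-up sample text
words = sentence.split(' ')
print [" ".join(words[i:i+2]) for i in range(len(words) - 1)]
# ['the quick', 'quick brown', 'brown fox', 'fox jumps']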
A Quick Supplement
Both of the tricks below show up in the main example; let's try them out on their own first to see what they do.
1. string.punctuation holds all the ASCII punctuation characters and pairs nicely with strip()
import string

words = ['a,', 'b!', 'cj!/n']  # renamed from `list` to avoid shadowing the built-in
cleaned = []
for w in words:
    w = w.strip(string.punctuation)  # trim punctuation off both ends
    cleaned.append(w)
print cleaned
['a', 'b', 'cj!/n']
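Note that 'cj!/n' survived untouched: strip() only trims characters from the two ends of a string, and since the final character 'n' is not punctuation, the inner '!' and '/' are shielded. A quick check (illustrative only):
import string
print 'cj!/'.strip(string.punctuation)   # 'cj'     -- trailing punctuation gets trimmed
print 'cj!/n'.strip(string.punctuation)  # 'cj!/n'  -- the trailing 'n' shields the rest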
2. operator.itemgetter()
The itemgetter function in the operator module fetches the specified item(s) from its operand; the arguments are the indices of the data you want to pull out.
Example:
import operator

d = {'name1': '2',  # renamed from `dict` to avoid shadowing the built-in
     'name2': '1'}
# d.items() returns the (key, value) pairs; itemgetter(0) sorts by key
print sorted(d.items(), key=operator.itemgetter(0), reverse=True)
[('name2', '1'), ('name1', '2')]
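The n-gram code below sorts by value rather than by key, which is simply itemgetter(1); a quick illustration with the same made-up dictionary:
import operator
d = {'name1': '2', 'name2': '1'}
# itemgetter(1) picks the value out of each (key, value) pair, so this sorts by value
print sorted(d.items(), key=operator.itemgetter(1), reverse=True)
# [('name1', '2'), ('name2', '1')]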
2-gram
Let's stick with two-word phrases. Here is an annotated example to walk through:
import urllib2
import re
import string
import operator

def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces and lowercase everything
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation markers like [1]
    input = re.sub(' +', " ", input)           # collapse runs of spaces into a single space
    input = bytes(input)                       # a no-op under Python 2.7 (bytes is str); carried over from the book's Python 3 code
    #input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')  # split on spaces, returning a list of words
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation holds all punctuation marks
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):  # keep real words, plus the single-letter words 'a' and 'i'
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}  # dict mapping each n-gram to its count
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:  # frequency counting
            output[ngramTemp] = 0    # the classic dict-counter idiom
        output[ngramTemp] += 1
    return output

# Option 1: fetch the text straight from the web
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file; handy for testing since no network is needed
#content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: sort in descending order of count
print(sortedNGrams)
[('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ... and so on
The example above simply ranks two-word phrases by how often they occur, but that is not quite what we want. What earthly use is 'of the' showing up two-hundred-odd times? So the next step is to weed out these connectives and prepositions.
Deeper
# -*- coding: utf-8 -*-
import urllib2
import re
import string
import operator

# helper to filter out the most common English words
def isCommon(ngram):
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have",
        "it", "i", "that", "for", "you", "he", "with", "on", "do", "say",
        "this", "they", "is", "an", "at", "but", "we", "his", "from",
        "not", "by", "she", "or", "as", "what", "go", "their", "can", "who",
        "get", "if", "would", "her", "all", "my", "make", "about", "know",
        "will", "up", "one", "time", "has", "been", "there", "year", "so",
        "think", "when", "which", "them", "some", "me", "people", "take", "out",
        "into", "just", "see", "him", "your", "come", "could", "now", "than",
        "like", "other", "how", "then", "its", "our", "two", "more", "these",
        "want", "way", "look", "first", "also", "new", "because", "day",
        "use", "no", "man", "find", "here", "thing", "give", "many", "well"]
    return ngram in commonWords

def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces and lowercase everything
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation markers like [1]
    input = re.sub(' +', " ", input)           # collapse runs of spaces into a single space
    input = bytes(input)                       # a no-op under Python 2.7 (bytes is str); carried over from the book's Python 3 code
    #input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')  # split on spaces, returning a list of words
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation holds all punctuation marks
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):  # keep real words, plus the single-letter words 'a' and 'i'
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}  # dict mapping each n-gram to its count
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        # drop the phrase if either word is a common word (note: this check assumes n == 2)
        if isCommon(ngramTemp.split()[0]) or isCommon(ngramTemp.split()[1]):
            pass
        else:
            if ngramTemp not in output:  # frequency counting
                output[ngramTemp] = 0    # the classic dict-counter idiom
            output[ngramTemp] += 1
    return output

# return the first sentence that contains the given core phrase
def getFirstSentenceContaining(ngram, content):
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""

# Option 1: fetch the text straight from the web
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file; handy for testing since no network is needed
#content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: sort in descending order of count
print(sortedNGrams)
for top3 in range(3):
    print "###" + getFirstSentenceContaining(sortedNGrams[top3][0], content.lower()) + "###"
[('united states', 10), ('general government', 4), ('executive department', 4), ('legisltive bojefferson', 3), ('same causes', 3), ('called upon', 3), ('chief magistrate', 3), ('whole country', 3), ('government should', 3), ... and so on
### the constitution of the united states is the instrument containing this grant of power to the several departments composing the government###
### the general government has seized upon none of the reserved rights of the states###
### such a one was afforded by the executive department constituted by the constitution###
As this example shows, we filtered out the connectives, ranked the remaining core phrases, and then pulled out the sentences containing them. Here I only grabbed the top three; boiling an article of two or three hundred sentences down to three or four of them is, I think, rather neat.
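One caveat: the filter inside getNgrams hard-codes two words (split()[0] and split()[1]), so it only works for n = 2. A minimal sketch of a version generalized to any n, reusing the same isCommon and cleanInput helpers from above, might look like this:
def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        # skip the n-gram if any of its words is a common word
        if any(isCommon(word) for word in ngramTemp.split()):
            continue
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output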
BUT
The method above only works for texts with a clear central theme, such as speeches at formal occasions. For fiction it is downright disastrous; I tried several English novels and the resulting "summaries" were gibberish.
Finally
The material comes from chapter 8 of Python網(wǎng)絡(luò)數(shù)據(jù)采集 (Web Scraping with Python), but the book's code targets Python 3.x and some of the examples would not run as printed, so I tidied things up and modified a few snippets before I could reproduce the book's results.
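For reference, under Python 3.x the urllib2 fetch above would be written with urllib.request instead (a sketch of the equivalent call, not the book's exact code):
# Python 3.x equivalent of the urllib2 call used in this article
from urllib.request import urlopen
content = urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read().decode('ascii', 'ignore')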
Acknowledgments
Python網(wǎng)絡(luò)數(shù)據(jù)采集 (Web Scraping with Python), Ryan Mitchell, 人民郵電出版社
Introduction to Python's strip() function
Python's sorted function and operator.itemgetter