Machine Learning in Python (Scikit-learn) (reposted from Renren)

Machine Learning in Python (Scikit-learn)-(No.1)

Author: Fan Miao (Renren)

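1. Small talk

Machine learning (ML), natural language processing (NLP) and the like have been really hot lately... I wonder whether they will stay hot in a few years, once everybody is playing with ML... Someone has to keep adding firewood. I am just a lost little page boy with limited knowledge, so I hope the ML gurus out there will give me pointers :).

Recently I wanted to go through the existing ML tools systematically, and found that a fairly good one is this: http://scikit-learn.org/stable/index.html.

It is a good choice for ML researchers at the beginner and intermediate stages. The one shortcoming is that it lacks some large-scale ML, since after all it runs on a single machine. Later I will think about writing something on ADMM (there were a great many related papers at this year's ICML); I once gave a tutorial on it in the Learning Group at MSRA.

Its user guide in particular has very little fluff and gets straight to the point: http://scikit-learn.org/stable/user_guide.html

Don't expect this toolkit to contain anything new. But if you truly master these classics, you are basically God Like! :) Especially when you start a business with ML, you may really get to use an idea or two. The trained way of thinking is probably what stays with you from university; the rest is long forgotten.

Let's take a rough tour of what this ML toolkit can do. There is a lot of content, so I will update gradually; if you want to know more about a specific part, leave a comment, since I really can't cover everything in detail at once. (I will try to publish one small chapter per week and learn step by step: first understand the model and explain it, then try it on suitable data, post the code and plots, and finally use the test results to compare the models where appropriate.) I will post all the code to CSDN or GitHub later.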

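--------------------------------------------------- (fancy divider) ---------------------------------------------------

2. Setup

Recommended learning setup: Python 2.7, the PyCharm IDE (a good Python IDE that I recommend; if you have written Java in Eclipse you will pick it up quickly), numpy, scipy. A few other packages need to be downloaded as well; leave a comment if you hit problems while configuring. I suggest simply doing this on Windows; I hardly use Linux.

Some readers suggested that I also explain the Windows setup in detail. Indeed, this chain of configuration steps is really not that simple, so I found a bare Windows 7 Ultimate SP1 x64 machine to reproduce the whole process.

First comes Python 2.7. (Remember that Python 3.x and 2.x are not the same thing at all, and 3.x is not backward compatible. If you installed Python 3.x just to get the higher version number, all I can say is... well... you have been had. The simplest way to think of it: treat the two versions as different programming languages; even the print syntax differs.)

1. The Python 2.7.x interpreter for x64 Windows. Download: https://www.python.org/download/releases/2.7.8/ Note that the 64-bit one is the Windows X86-64 MSI Installer (2.7.8).

To test whether Python is set up in your environment, type python at the command line. If that reports an error, you need to configure the environment by hand; a web search will solve this (in short, add the path of your Python installation folder to the PATH environment variable).

2. Then install PyCharm. This is the IDE I used during my internship at Hulu; Tao Ge recommended it, and it is pretty good. Because the full version is paid, I recommend downloading the Community edition: http://www.jetbrains.com/pycharm/download/. After installation it should ask you to select the Python interpreter you just installed, and then you can do some simple Python programming. If you have used Eclipse, you will pick it up very quickly.

3. Next, install the Python extension packages that sklearn depends on. An unofficial site at a university in California hosts builds that install directly on Windows: http://www.lfd.uci.edu/~gohlke/pythonlibs/. It is very practical; download the installers for Python 2.7 amd64 and run them. The packages to download, in dependency order: Numpy-MKL, SciPy, Scikit-learn. With these you can start learning Scikit-learn.

4. In addition, if you want to draw plots like I do, you need matplotlib, which also has an online manual: http://matplotlib.org/contents.html. It needs a set of supporting packages too. matplotlib requires these libraries: numpy, dateutil, pytz, pyparsing, six. All of them are available from the download site I just recommended.

Once all of the above is done, you can test the setup with my first linear regression script (in 1.1.1 below), which displays a figure and also saves it as a PNG image.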
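Before running the full script, a quick import check can tell you which of the packages listed above are actually installed. The sketch below is only an illustration: the helper name check_packages is my own, and the module names are assumed to be the import names of the packages from steps 3 and 4.

```python
import importlib


def check_packages(names):
    """Return {module name: True/False} depending on whether it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Import names assumed for the packages mentioned in steps 3 and 4
    wanted = ["numpy", "scipy", "sklearn", "matplotlib",
              "dateutil", "pytz", "pyparsing", "six"]
    for name, ok in sorted(check_packages(wanted).items()):
        print("%-12s %s" % (name, "OK" if ok else "MISSING"))
```

Anything reported as MISSING still needs its installer from the download site.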

------------------------------ (fancy divider) ------------------------------

3. Data

Before using the tools, let me introduce a few datasets that I will use.

Most of the data here comes from this classic machine learning site:

https://archive.ics.uci.edu/ml/

sklearn.datasets bundles some of the data from that site. (If you are new to Python, you need a little Python knowledge here: as in Java, you import a ready-made module before using it. The full name of this Python ML toolkit is sklearn; we are looking at data, so the next level down is datasets.) The Python code:

import sklearn.datasets

Dataset 1: Boston house prices (suited to regression), referred to below as boston

This line loads it:

boston = sklearn.datasets.load_boston()

To see the dataset description, use:

print(boston.DESCR)

Output:

Data Set Characteristics:

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):

- CRIM per capita crime rate by town

- ZN proportion of residential land zoned for lots over 25,000 sq.ft.

- INDUS proportion of non-retail business acres per town

- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

- NOX nitric oxides concentration (parts per 10 million)

- RM average number of rooms per dwelling

- AGE proportion of owner-occupied units built prior to 1940

- DIS weighted distances to five Boston employment centres

- RAD index of accessibility to radial highways

- TAX full-value property-tax rate per $10,000

- PTRATIO pupil-teacher ratio by town

- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

- LSTAT % lower status of the population

- MEDV Median value of owner-occupied homes in $1000's

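There are 506 samples in total, each with 13 features.

For example, the first feature is the crime rate, the sixth is the average number of rooms per dwelling, and so on.

boston.data gives the 506 * 13 feature matrix

boston.target gives the corresponding 506 prices

Dataset 2: Iris flowers (suited to simple classification), referred to as iris

import sklearn.datasets

iris = sklearn.datasets.load_iris()

iris.data gives the features

iris.target gives the corresponding class labels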

Data Set Characteristics:

:Number of Instances: 150 (50 in each of three classes)

:Number of Attributes: 4 numeric, predictive attributes and the class

:Attribute Information:

- sepal length in cm

- sepal width in cm

- petal length in cm

- petal width in cm

- class:

- Iris-Setosa

- Iris-Versicolour

- Iris-Virginica

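Practically every ML beginner knows this dataset: three species of iris. The features and class labels are obtained in the same way as above.

150 samples in total, 3 classes, 4 features.

Dataset 3: Diabetes (a regression problem), diabetes

This dataset is odd: it comes without a description. I also checked the original UCI site and found no good description there either.

import sklearn.datasets

diabetes = sklearn.datasets.load_diabetes()

print(diabetes.keys())

The output contains only data and target.

I also looked at the values. They seem to have gone through extra normalization, so the original appearance of the data can no longer be recognized.

Below is the limited description I copied from the website: 442 samples, 10 features, every feature value a continuous real number between -0.2 and 0.2. I first guessed the integer target might be blood glucose; according to the scikit-learn documentation it is a quantitative measure of disease progression one year after baseline.

Samples total: 442

Dimensionality: 10

Features: real, -.2 < x < .2

Targets: integer 25 - 346

Dataset 4: Handwritten digit recognition (multi-class classification, 10 classes, 0-9), digits

import sklearn.datasets

digits = sklearn.datasets.load_digits()

1797 samples in total, about 180 per class. Each handwritten digit is an 8*8 image, and each pixel is an integer from 0 to 16.

To sum up, you can load these datasets and play with them; they are fairly representative. Later I will show how to use sklearn to download larger datasets, such as the large MNIST handwritten digit database.

In short, the features are in *.data, and the corresponding class labels or regression values are in *.target.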
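To make the *.data / *.target convention concrete, here is a small sketch that loads three of the datasets above and reports their sizes. The helper name describe is my own; the loaders are the ones already used in this section, and the sketch assumes scikit-learn is installed.

```python
import sklearn.datasets


def describe(bunch):
    """Return (n_samples, n_features, n_targets) for a loaded dataset."""
    return bunch.data.shape[0], bunch.data.shape[1], len(bunch.target)


iris = sklearn.datasets.load_iris()
digits = sklearn.datasets.load_digits()
diabetes = sklearn.datasets.load_diabetes()

for name, bunch in [("iris", iris), ("digits", digits), ("diabetes", diabetes)]:
    # One row of .data per sample, one entry of .target per sample
    print("%s: %d samples, %d features, %d targets" % ((name,) + describe(bunch)))
```

Note that each 8*8 digit image shows up in digits.data as a flat row of 64 pixel values.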

Talk without practice is no good, so for every method I introduce I will run an actual test on the datasets above and give results and figures where appropriate.

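------------------------------ (fancy divider) ------------------------------

4. Hands-on

1. Supervised learning

Supervised learning is the most commonly used: classification, regression for prediction (predicting stocks and the like, though that does not really suit the market over here).

1.1. Generalized Linear Models

The most general linear models.

Multiply each feature x by its weight w and sum the results, trying to get as close as possible to your target y. What the machine learns is w.

This is the most widely applied model. It is really just how people weigh all sorts of factors and then give an overall score.

1.1.1. Ordinary Least Squares

The objective function is

$$\min_w \lVert Xw - y \rVert_2^2$$

that is, the total sum of squared errors should be as small as possible.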
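For a single feature the least-squares solution can be written down directly, which makes the objective easier to picture. The sketch below is a plain-Python illustration under that one-feature assumption (the function name fit_line and the toy data are mine); it is not how LinearRegression is implemented.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b for a single feature."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Toy data lying exactly on y = 2x + 1
w, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print("w = %.2f, b = %.2f" % (w, b))
```

With several features the same idea applies, only w becomes a vector and the sums become matrix products.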

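For the code, import sklearn.linear_model; sklearn.linear_model.LinearRegression() is the class you need. It is fine for something like a simple house price estimate. (Don't call it prediction, that is not accurate enough. You can only estimate, say, rental prices: grab some data from SouFun, which has ready-made features such as location, area, orientation and so on, and regress a rough price for fun.)

Let's try it on the Boston house prices (indentation matters in all the Python code below!):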

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''
import sklearn.datasets
import sklearn.linear_model
import numpy.random
import numpy.linalg
import matplotlib.pyplot

if __name__ == "__main__":
    # Load boston dataset (load_boston was removed in scikit-learn 1.2)
    boston = sklearn.datasets.load_boston()

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(boston.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = boston.data[shuffleIdx[:sampleBoundary]]
    train_targets = boston.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = boston.data[shuffleIdx[sampleBoundary:]]
    test_targets = boston.target[shuffleIdx[sampleBoundary:]]

    # Train
    linearRegression = sklearn.linear_model.LinearRegression()
    linearRegression.fit(train_features, train_targets)

    # Predict
    predict_targets = linearRegression.predict(test_features)

    # Evaluation: mean absolute error (L1 norm of the residual / sample count)
    n_test_samples = len(test_targets)
    X = range(n_test_samples)
    error = numpy.linalg.norm(predict_targets - test_targets, ord=1) / n_test_samples
    print("Ordinary Least Squares (Boston) Error: %.2f" % (error))

    # Draw
    matplotlib.pyplot.plot(X, predict_targets, 'r--', label='Predict Price')
    matplotlib.pyplot.plot(X, test_targets, 'g:', label='True Price')
    legend = matplotlib.pyplot.legend()
    matplotlib.pyplot.title("Ordinary Least Squares (Boston)")
    matplotlib.pyplot.ylabel("Price")
    matplotlib.pyplot.savefig("Ordinary Least Squares (Boston).png", format='png')
    matplotlib.pyplot.show()

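Result:

Ordinary Least Squares (Boston) Error: 3.35. Roughly speaking, each prediction is off from the true price by 3,350 US dollars on average; the unit of this value is 1000 USD (see the dataset description).

The figure produced by the script compares the predicted and the actual prices. Here 50% of the data was randomly sampled for training and 50% for prediction. The result is decent, so this linear model looks acceptable.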
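The error printed by the script is the L1 norm of the residual divided by the number of test samples, in other words the mean absolute error. A minimal plain-Python sketch of the same computation follows; the function name and the three price pairs are made up purely for illustration.

```python
def mean_absolute_error(predicted, actual):
    """L1 norm of the residual divided by the number of samples."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / float(len(actual))


predicted = [24.0, 30.5, 18.0]   # made-up predictions, in units of $1000
actual = [22.0, 33.0, 19.5]      # made-up true prices, in units of $1000

error = mean_absolute_error(predicted, actual)
print("Error: %.2f (about %d USD per prediction)" % (error, round(error * 1000)))
```

Because the targets are in units of $1000, an error of 3.35 reads as about 3,350 USD.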

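1.1.2. Ridge Regression

Ridge regression adds a regularization term to the objective function above; ridge uses the 2-norm (L2 norm).

The purpose of this norm is to keep all the learned weights fairly balanced. Our data cannot be guaranteed to be well behaved, and sometimes nearly collinear, noisy samples aggravate the unbalanced learning of the weight coefficients.

Once some feature is noisy and its weight happens to be large, the regression result is ruined.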
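To see how the L2 penalty pulls a weight toward zero, consider the simplest possible case: one feature and no intercept. Minimising sum((w*x - y)^2) + alpha * w^2 then has a closed-form answer. The sketch below assumes exactly that simplified setting (the function name and toy data are mine).

```python
def ridge_weight(xs, ys, alpha):
    """Minimise sum((w*x - y)**2) + alpha * w**2 for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # The penalty simply adds alpha to the denominator, shrinking w
    return sxy / (sxx + alpha)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # exactly y = 2x
for alpha in [0.0, 1.0, 10.0]:
    print("alpha = %5.1f -> w = %.3f" % (alpha, ridge_weight(xs, ys, alpha)))
```

With alpha = 0 this is ordinary least squares; the larger the penalty factor, the smaller the learned weight.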

OK, let's try ridge regression on the Boston house prices again.

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''
import sklearn.datasets
import sklearn.linear_model
import numpy.random
import numpy.linalg
import matplotlib.pyplot

if __name__ == "__main__":
    # Load boston dataset
    boston = sklearn.datasets.load_boston()

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(boston.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = boston.data[shuffleIdx[:sampleBoundary]]
    train_targets = boston.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = boston.data[shuffleIdx[sampleBoundary:]]
    test_targets = boston.target[shuffleIdx[sampleBoundary:]]

    # Train with Cross Validation
    # RidgeCV cross-validates the candidate penalty factors listed here and
    # keeps the one that performs best within the training data.
    # In the run shown below it picked 0.1.
    ridgeRegression = sklearn.linear_model.RidgeCV(alphas=[0.01, 0.05, 0.1, 0.5, 1.0, 10.0])
    ridgeRegression.fit(train_features, train_targets)
    print("Alpha = %s" % ridgeRegression.alpha_)

    # Predict
    predict_targets = ridgeRegression.predict(test_features)

    # Evaluation
    n_test_samples = len(test_targets)
    X = range(n_test_samples)
    error = numpy.linalg.norm(predict_targets - test_targets, ord=1) / n_test_samples
    print("Ridge Regression (Boston) Error: %.2f" % (error))

    # Draw
    matplotlib.pyplot.plot(X, predict_targets, 'r--', label='Predict Price')
    matplotlib.pyplot.plot(X, test_targets, 'g:', label='True Price')
    legend = matplotlib.pyplot.legend()
    matplotlib.pyplot.title("Ridge Regression (Boston)")
    matplotlib.pyplot.ylabel("Price (1000 U.S.D)")
    matplotlib.pyplot.savefig("Ridge Regression (Boston).png", format='png')
    matplotlib.pyplot.show()

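Output:

Alpha = 0.1

Ridge Regression (Boston) Error: 3.21

With this result the error is about 3,210 USD, a little better than the plain linear model above. In addition, the predicted curve has a smaller variance and a slightly smaller amplitude, because the ridge penalty keeps a change in any single feature from having too large an effect on the overall prediction.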

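1.1.3. Lasso

I keep hearing senior colleagues at MSRA talk about this one; it seems to be a rather hot research topic. It simply replaces the 2-norm (L2) with the 1-norm (L1).

The absolute-value constraint pushes the learned weights to be sparser; compressed sensing and the like are related to this.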
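Why does an absolute-value penalty give weights that are exactly zero? For a single weight, minimising 0.5 * (w - z)^2 + t * |w| has a closed-form answer known as soft-thresholding, and coordinate-descent Lasso solvers are built from this step. The sketch below only illustrates that one step (the function name and the sample weights are mine); it is not the library's code.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


weights = [2.0, -0.3, 0.05, -1.5]
# Small weights are zeroed out, large ones are shrunk by the threshold
print([soft_threshold(w, 0.5) for w in weights])
```

An L2 penalty, by contrast, only rescales a weight and never sets it exactly to zero, which is why Lasso rather than ridge produces sparse models.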

For the Boston data this probably won't bring much of a performance gain, because the features are not sparse to begin with. Later we can try 20 newsgroups, which is sparse enough.

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''
import sklearn.datasets
import sklearn.linear_model
import numpy.random
import numpy.linalg
import matplotlib.pyplot

if __name__ == "__main__":
    # Load boston dataset
    boston = sklearn.datasets.load_boston()

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(boston.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = boston.data[shuffleIdx[:sampleBoundary]]
    train_targets = boston.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = boston.data[shuffleIdx[sampleBoundary:]]
    test_targets = boston.target[shuffleIdx[sampleBoundary:]]

    # Train (LassoCV picks the best alpha from the candidates by cross-validation)
    lasso = sklearn.linear_model.LassoCV(alphas=[0.01, 0.05, 0.1, 0.5, 1.0, 10.0])
    lasso.fit(train_features, train_targets)
    print("Alpha = %s" % lasso.alpha_)

    # Predict
    predict_targets = lasso.predict(test_features)

    # Evaluation
    n_test_samples = len(test_targets)
    X = range(n_test_samples)
    error = numpy.linalg.norm(predict_targets - test_targets, ord=1) / n_test_samples
    print("Lasso (Boston) Error: %.2f" % (error))

    # Draw
    matplotlib.pyplot.plot(X, predict_targets, 'r--', label='Predict Price')
    matplotlib.pyplot.plot(X, test_targets, 'g:', label='True Price')
    legend = matplotlib.pyplot.legend()
    matplotlib.pyplot.title("Lasso (Boston)")
    matplotlib.pyplot.ylabel("Price (1000 U.S.D)")
    matplotlib.pyplot.savefig("Lasso (Boston).png", format='png')
    matplotlib.pyplot.show()

Output:

Alpha = 0.01

Lasso (Boston) Error: 3.39

The amplitude of this result is still fairly large, especially for the low-priced houses.

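1.1.4. Elastic Net

I am not sure how best to render this name in Chinese. It simply combines the two regularization terms above (the L1 and L2 priors): it can still train a fairly sparse model (Lasso's contribution) while keeping the benefits of the L2 term from ridge regression. I have not tried this one before and don't know what kind of data suits it best; later I will try a few datasets and compare it with ordinary linear regression.

Naturally, an extra parameter is needed to balance the two prior constraints. One parameter is the penalty factor alpha, which we have seen before; the other is the mixing ratio (l1_ratio in the code below, written ρ in the scikit-learn user guide). These parameters can all be settled with cross-validation (CV). (Every linear model has a matching CV class, and ElasticNetCV is the one for this model. Such CV methods really belong to model selection, because these extra parameters are not the W you want to learn. The penalty factor, the balance factor and so on define different mathematical models; the goal of CV is to choose a suitable model, and only then do you learn W.) This time it is a hodgepodge, with both norms in use: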

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''
import sklearn.datasets
import sklearn.linear_model
import numpy.random
import numpy.linalg
import matplotlib.pyplot

if __name__ == "__main__":
    # Load boston dataset
    boston = sklearn.datasets.load_boston()

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(boston.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = boston.data[shuffleIdx[:sampleBoundary]]
    train_targets = boston.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = boston.data[shuffleIdx[sampleBoundary:]]
    test_targets = boston.target[shuffleIdx[sampleBoundary:]]

    # Train (cross-validate both the penalty factor and the L1/L2 mixing ratio)
    elasticNet = sklearn.linear_model.ElasticNetCV(alphas=[0.01, 0.05, 0.1, 0.5, 1.0, 10.0], l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9])
    elasticNet.fit(train_features, train_targets)
    print("Alpha = %s" % elasticNet.alpha_)
    print("L1 Ratio = %s" % elasticNet.l1_ratio_)

    # Predict
    predict_targets = elasticNet.predict(test_features)

    # Evaluation
    n_test_samples = len(test_targets)
    X = range(n_test_samples)
    error = numpy.linalg.norm(predict_targets - test_targets, ord=1) / n_test_samples
    print("Elastic Net (Boston) Error: %.2f" % (error))

    # Draw
    matplotlib.pyplot.plot(X, predict_targets, 'r--', label='Predict Price')
    matplotlib.pyplot.plot(X, test_targets, 'g:', label='True Price')
    legend = matplotlib.pyplot.legend()
    matplotlib.pyplot.title("Elastic Net (Boston)")
    matplotlib.pyplot.ylabel("Price (1000 U.S.D)")
    matplotlib.pyplot.savefig("Elastic Net (Boston).png", format='png')
    matplotlib.pyplot.show()

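Output:

Alpha = 0.01

L1 Ratio = 0.9

Elastic Net (Boston) Error: 3.14

Looks like "mixed ownership" really is stronger! Do you know what reviewers dread most in a paper title these days? Hybrid... It would be an insult to the word if performance did not improve...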
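To make the role of the two parameters concrete, here is a small sketch of the penalty term that alpha and l1_ratio control, following the form given in the scikit-learn user guide: alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2. The function name and the sample weights are mine, and this covers only the penalty, not the full training objective.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty term mixing the L1 and squared L2 norms of the weights."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


weights = [1.0, -2.0, 0.0]
for ratio in [0.0, 0.5, 0.9, 1.0]:
    print("l1_ratio = %.1f -> penalty = %.4f"
          % (ratio, elastic_net_penalty(weights, alpha=0.01, l1_ratio=ratio)))
```

With l1_ratio = 1 only the L1 part remains (Lasso), and with l1_ratio = 0 only the L2 part remains (a ridge-style penalty); the selected value of 0.9 above therefore sits close to Lasso.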

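1.1.10. Logistic regression

Here I add the rather practical logistic regression. Despite its name, it is generally used for classification.

It uses the sigmoid function $\sigma(z) = 1 / (1 + e^{-z})$ to express which class the weighted combination of a sample's features, $z = w^\top x$, falls into (note: the original figure came from the blog http://blog.csdn.net/marvin521/article/details/9263483).

The sigmoid function is very sensitive to the value of z, but its advantage is that it is continuous and differentiable. This is very important, because it lets us compute W with gradient methods.

Logistic regression has proven to be very good and very easy to use for classification. Google is said to use this model for click-through rate (CTR) prediction as well; I have not verified this, it is only hearsay. Still, the classification result on Iris from the code below shows that it works nicely. (Note that we usually see logistic regression used for binary classification, but it can be extended to multi-class problems. I won't go into that here; see Softmax Regression for reference.)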
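As a quick illustration of the two functions mentioned above, here is a plain-Python sketch of the sigmoid and of softmax, its multi-class generalisation. The function names are my own and the inputs are arbitrary scores; this is not the library's implementation.

```python
import math


def sigmoid(z):
    """Map any real score z to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form that avoids overflow in exp
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print("sigmoid(0) = %.2f" % sigmoid(0.0))
print("softmax([1, 2, 3]) = %s" % ["%.3f" % p for p in softmax([1.0, 2.0, 3.0])])
```

With two classes and scores [z, 0], the first softmax output equals sigmoid(z), which is why softmax regression is regarded as the multi-class version of logistic regression.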

Let's test it on the Iris data:

A quick recap of the Iris data (details are in the data section): 150 samples, 3 classes, 50 samples per class, 4 features per sample, all continuous values. Our goal is to predict the class (0, 1 or 2) for a randomly drawn 50% of the data. (Remember to shuffle the data: the original order is not shuffled, and the first 50 samples all belong to one class. Don't get this wrong.)

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''
import sklearn.datasets
import sklearn.linear_model
import numpy.random
import matplotlib.pyplot

if __name__ == "__main__":
    # Load iris dataset
    iris = sklearn.datasets.load_iris()

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(iris.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = iris.data[shuffleIdx[:sampleBoundary]]
    train_targets = iris.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = iris.data[shuffleIdx[sampleBoundary:]]
    test_targets = iris.target[shuffleIdx[sampleBoundary:]]

    # Train
    logisticRegression = sklearn.linear_model.LogisticRegression()
    logisticRegression.fit(train_features, train_targets)

    # Predict
    predict_targets = logisticRegression.predict(test_features)

    # Evaluation: fraction of test samples whose predicted class is correct
    n_test_samples = len(test_targets)
    X = range(n_test_samples)
    correctNum = 0
    for i in X:
        if predict_targets[i] == test_targets[i]:
            correctNum += 1
    accuracy = correctNum * 1.0 / n_test_samples
    print("Logistic Regression (Iris) Accuracy: %.2f" % (accuracy))

    # Draw: predicted labels on top, true labels below
    matplotlib.pyplot.subplot(2, 1, 1)
    matplotlib.pyplot.title("Logistic Regression (Iris)")
    matplotlib.pyplot.plot(X, predict_targets, 'ro-', label='Predict Labels')
    matplotlib.pyplot.ylabel("Predict Class")
    legend = matplotlib.pyplot.legend()

    matplotlib.pyplot.subplot(2, 1, 2)
    matplotlib.pyplot.plot(X, test_targets, 'g+-', label='True Labels')
    legend = matplotlib.pyplot.legend()
    matplotlib.pyplot.ylabel("True Class")
    matplotlib.pyplot.savefig("Logistic Regression (Iris).png", format='png')
    matplotlib.pyplot.show()

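Output:

Logistic Regression (Iris) Accuracy: 0.95

Using 50% for training and 50% for testing, the classification accuracy reaches 95%.

The figure produced by the script serves as an intuitive aid: because the accuracy is fairly high, the predicted classes and the true classes follow almost the same pattern.

I am about to exceed the word limit; to keep reading, click through to No.2:

http://blog.renren.com/blog/bp/Q7Vlj0xW7D

Continuing from No.1.

We have now walked through the easy-to-understand linear models. Let's see what happens if the feature factors are compound, squared or cubed (that is, polynomial regression). I think there is no settled conclusion here; nobody can be sure that feature combinations make sense. To put it more bluntly, whether features are effective tools for machine learning at all is not settled either, but at least for now they still seem to work.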

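1.1.15. Polynomial regression: extending linear models with basis functions

So far we have focused on finding a linear combination of the features, but in reality not everything is a linear combination. Take house prices: for one feature, say the average floor area, the relationship with price may well be linear. For another, say the number of houses in the area, it is hard to say; it may not be linear, it may be quadratic, or it may be some other complex relationship such as a logistic one, because saturation may cause prices to level off or even fall. Here we consider such polynomial combinations of the features.

This is the original linear combination of the features (written for two features, as in the scikit-learn user guide):

$$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2$$

And this is the degree-2 polynomial combination of the same features:

$$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$$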
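The degree-2 expansion described above can be sketched in a few lines of plain Python: list every product of the input features up to the chosen degree, and then fit an ordinary linear model on the expanded features. The function name below is my own and the code is only an illustration of the idea, not the PolynomialFeatures implementation.

```python
from itertools import combinations_with_replacement


def polynomial_features(x, degree):
    """All monomials of the entries of x up to `degree`, starting with the bias 1."""
    features = []
    for d in range(degree + 1):
        # Each combination picks d feature indices (repeats allowed) to multiply
        for combo in combinations_with_replacement(range(len(x)), d):
            value = 1.0
            for i in combo:
                value *= x[i]
            features.append(value)
    return features


# Two features a = 2 and b = 3 give [1, a, b, a^2, a*b, b^2]
print(polynomial_features([2.0, 3.0], 2))
```

The model is still linear in the weights; only the inputs have been transformed.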

Let's see how to handle this in code, again using the house price data.

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''
import sklearn.datasets
import sklearn.linear_model
import numpy.random
import numpy.linalg
import matplotlib.pyplot
import sklearn.preprocessing

if __name__ == "__main__":
    # Load boston dataset
    boston = sklearn.datasets.load_boston()

    # Data transform: expand the 13 features into all terms up to degree 2
    polynominalData = sklearn.preprocessing.PolynomialFeatures(degree=2).fit_transform(boston.data)

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(boston.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = polynominalData[shuffleIdx[:sampleBoundary]]
    train_targets = boston.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = polynominalData[shuffleIdx[sampleBoundary:]]
    test_targets = boston.target[shuffleIdx[sampleBoundary:]]

    # Train
    linearRegression = sklearn.linear_model.LinearRegression()
    linearRegression.fit(train_features, train_targets)

    # Predict
    predict_targets = linearRegression.predict(test_features)

    # Evaluation
    n_test_samples = len(test_targets)
    X = range(n_test_samples)
    error = numpy.linalg.norm(predict_targets - test_targets, ord=1) / n_test_samples
    print("Polynomial Regression (Degree = 2) (Boston) Error: %.2f" % (error))

    # Draw
    matplotlib.pyplot.plot(X, predict_targets, 'r--', label='Predict Price')
    matplotlib.pyplot.plot(X, test_targets, 'g:', label='True Price')
    legend = matplotlib.pyplot.legend()
    matplotlib.pyplot.title("Polynomial Regression (Degree = 2) (Boston)")
    matplotlib.pyplot.ylabel("Price (1000 U.S.D)")
    matplotlib.pyplot.savefig("Polynomial Regression (Degree = 2) (Boston).png", format='png')
    matplotlib.pyplot.show()

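In this code I use a degree-2 polynomial feature transform (the highest order is 2) and then fit an ordinary linear model.

Output:

Polynomial Regression (Degree = 2) (Boston) Error: 3.26

The error is around 3,260 USD; as I recall, the plain linear regression gave 3,350. So it is slightly better.

Some skeptical readers may ask whether my code is buggy. No problem, let's extend this with a small side topic. Suppose we change just one place:

# Data transform

polynominalData = sklearn.preprocessing.PolynomialFeatures(degree=4).fit_transform(boston.data)

What happens if we change it to degree 4? The consequences are dire...

Output:

Polynomial Regression (Degree = 4) (Boston) Error: 30.19

The error reaches 30,000 USD; this model is completely unusable.

In the resulting plot, the predicted price (red dashed line) oscillates violently, while the true price (green dotted line) mostly hovers around 30. This shows that the model generalizes very poorly to the test data. Someone is bound to ask: "My degree-4 model considers far more feature combinations than the degree-2 one, so how can it test so badly?" Yes, everything was considered and it is still this bad; I can only say "you are overthinking it". In fact, there is simply not enough data to tune that many parameters sensibly, because the model is too complex. This situation is called overfitting, and that plot shows a typical case. Conversely, if the true model is quadratic and you use plain linear regression, the result will also be somewhat worse; that situation is called underfitting.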
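A quick count shows why the degree-4 model has too little data to work with. The number of polynomial terms up to degree d in n features (bias included) is the binomial coefficient C(n + d, d). The sketch below computes that count for the 13 Boston features and compares it with the 253 training samples that the 50% split leaves us; the function name is my own.

```python
def n_polynomial_features(n_features, degree):
    """Number of monomials of total degree <= degree, i.e. C(n_features + degree, degree)."""
    count = 1
    for k in range(1, degree + 1):
        # After step k, count equals C(n_features + k, k), which is an integer
        count = count * (n_features + k) // k
    return count


n_train = int(506 * 0.5)   # training samples after the 50% split
for degree in [1, 2, 4]:
    print("degree %d: %d weights to learn from %d training samples"
          % (degree, n_polynomial_features(13, degree), n_train))
```

At degree 2 there are 105 weights for 253 samples, which is still manageable. At degree 4 there are 2380 weights, far more than the number of training samples, so least squares can match the training data almost arbitrarily well and tells us very little about unseen data.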

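In the era of big data, deep learning models have a huge number of parameters, but there is also a lot of data, so the strong expressive power of complex models gets a chance to show. I think this is the simple reason why deep learning is so effective in areas such as images and speech.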

---------------------------------------------------------------------------------------------------------------------------------

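1.2. Support Vector Machines

The historical fate of support vector machines is a lot like Nokia's: glorious for a long time, and although it now belongs to history, its great contribution cannot be erased. In the 1990s academia was flooded with papers on SVM. Not knowing SVM back then was like not knowing deep learning today; who knows how much scorn you would have attracted :). Actually I don't understand deep learning either... I am used to being scorned, so it no longer bothers me.

This sklearn series is not about mathematical details. It is about how to use things and which model suits which situation. So I agree with the following four-point summary of SVM's advantages, which indirectly tells you when to use SVM:

a. Effective on high-dimensional feature data

b. Still effective when the number of training samples is smaller than the number of feature dimensions (this one is really impressive)

c. Saves memory when storing the model (only those few support vectors matter)

d. Features can be mapped to a higher-dimensional space as needed (the kernel method)

1.2.1. Classification

SVM used for classification is abbreviated SVC (Support Vector Classification). (SVM is not only for classification; this has to be said.) The basic idea is very intuitive: again we look for a hyperplane (for two-class classification), but we want the best one. The original figure came from the blog post http://blog.csdn.net/marvin521/article/details/9286099. In it, there can be infinitely many separating lines like B and C that separate the blue class from the red class, but a separation like D seems more acceptable: for a new data point, splitting like D is more likely to be right. D finds the maximum margin for the known data distribution, leaving ample room for data that has not been seen yet, so the model's generalization ability is maximized. (The key question in machine learning is not how well a model fits the training samples but how well it generalizes, although that is hard to evaluate.) This is why SVM was all the rage in the 1990s, and it really does work well.

Looking further, determining a separating line like D does not depend much on the data points far away from it. Only the points closest to the separating line (a separating hyperplane when there are more feature dimensions) support its position, hence the name support vector machine. The effective data points (feature vectors) that determine the separating line are called support vectors.

Come on, let's get a feel for it with code:

One note first: if we keep using the Iris data, this is a multi-class problem (3 classes), so I think you should roughly know how the SVC tools handle multi-class classification (after all, the example above was a two-class one).

Broadly there are two ways to extend a two-class classifier to a multi-class problem. To be clear, these are not the only multi-class methods; they are the two ways of extending two-class classification to n classes. One is to train n*(n-1)/2 binary classifiers, one for each pair of classes. The other is to take one class as the positive class and lump all the others together as the negative class, which means training n classifiers.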
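The two decomposition schemes can be spelled out by listing the binary problems each one creates. The sketch below is a plain-Python illustration with function names of my own; it only enumerates the sub-problems and does not train anything.

```python
from itertools import combinations


def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes: n*(n-1)/2 in total."""
    return list(combinations(classes, 2))


def one_vs_rest_tasks(classes):
    """One binary problem per class (positive) against all the others: n in total."""
    return [(c, [other for other in classes if other != c]) for c in classes]


iris_classes = [0, 1, 2]
print("one-vs-one :", one_vs_one_tasks(iris_classes))
print("one-vs-rest:", one_vs_rest_tasks(iris_classes))

digit_classes = list(range(10))
print("10 classes: %d one-vs-one classifiers, %d one-vs-rest classifiers"
      % (len(one_vs_one_tasks(digit_classes)), len(one_vs_rest_tasks(digit_classes))))
```

For the 3 Iris classes both schemes happen to need 3 classifiers; the difference only shows up with more classes, such as the 10 digit classes.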

Let's try them all on the Iris data.

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''
import sklearn.datasets
import sklearn.svm
import numpy.random

if __name__ == "__main__":
    # Load iris dataset
    iris = sklearn.datasets.load_iris()

    # Split the dataset with sampleRatio
    sampleRatio = 0.5
    n_samples = len(iris.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = iris.data[shuffleIdx[:sampleBoundary]]
    train_targets = iris.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = iris.data[shuffleIdx[sampleBoundary:]]
    test_targets = iris.target[shuffleIdx[sampleBoundary:]]

    # Train three SVM classifiers.
    # SVC and NuSVC handle multi-class with one-vs-one; LinearSVC uses one-vs-rest.
    classifiers = [
        ("SVC", sklearn.svm.SVC()),
        ("NuSVC", sklearn.svm.NuSVC()),
        ("LinearSVC", sklearn.svm.LinearSVC()),
    ]
    n_test_samples = len(test_targets)

    for name, classifier in classifiers:
        classifier.fit(train_features, train_targets)
        predict_targets = classifier.predict(test_features)

        # Evaluation: fraction of correctly classified test samples
        correctNum = 0
        for i in range(n_test_samples):
            if predict_targets[i] == test_targets[i]:
                correctNum += 1
        accuracy = correctNum * 1.0 / n_test_samples
        print("%s Accuracy: %.2f" % (name, accuracy))

1.3. Stochastic Gradient Descent

1.4. Nearest Neighbors

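1.4.2. Nearest Neighbors Classification

Riding on the Iris classification with Logistic Regression that I just updated, let's see how the nearest neighbors method does classification. (Nearest neighbors can do regression as well; I introduce classification first because it is easier to understand.) This is an instance-based classification method, unlike the regression and classification methods introduced earlier, which all train a concrete mathematical function. The nearest neighbors method does not need any formula to be trained in advance. The idea is simple: "birds of a feather flock together", so samples with similar features most likely share a class. KNN (K Nearest Neighbor) means: around the sample to be classified, find the K samples of known class that are nearest according to a distance on the features; whichever class is most frequent among those K samples is the class assigned to the new sample. In other words, find your closest cronies and then let the majority rule.

Of course the toolkit is not quite that simple. Besides KNN (KNeighborsClassifier) there is also RNN (RadiusNeighborsClassifier). Put plainly, KNN does not care how far away the K nearest points actually are; there are always K relatively nearest points. RNN, however, considers a radius: draw a ball of radius Radius around the sample to be tested (a circle for two-dimensional features; for three or more dimensions you can think of a hypersphere), and everything inside the ball counts. This means there is no guarantee that every test sample considers the same number of nearest samples.

We can also weight the votes of these known samples by their distance, which is of course a natural idea. The code below shows both options.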
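The voting rule, with and without distance weighting, can be sketched in plain Python as follows. The function name, the tiny one-feature dataset and the small constant guarding against division by zero are all my own choices for illustration; KNeighborsClassifier is implemented differently and far more efficiently.

```python
import math
from collections import defaultdict


def knn_predict(train_x, train_y, query, k, weights="uniform"):
    """Classify `query` by a vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample
    dists = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query))), y)
        for x, y in zip(train_x, train_y)
    )
    votes = defaultdict(float)
    for dist, label in dists[:k]:
        if weights == "distance":
            votes[label] += 1.0 / (dist + 1e-12)   # closer neighbours count more
        else:
            votes[label] += 1.0                    # every neighbour counts equally
    return max(votes, key=votes.get)


train_x = [[0.0], [0.1], [1.0], [1.1], [1.2]]
train_y = [0, 0, 1, 1, 1]
query = [0.3]
for weights in ["uniform", "distance"]:
    print("%s -> class %d" % (weights, knn_predict(train_x, train_y, query, 5, weights)))
```

With k = 5 the uniform vote is won by class 1 (three votes to two), while the distance-weighted vote goes to class 0, because the two class-0 samples are much closer to the query.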

We test on Iris once more. This time the split is harsher, 20% for training and 80% for testing, precisely to tell the two weighting schemes [uniform, distance] apart.

'''
Author: Miao Fan
Affiliation: Department of Computer Science and Technology, Tsinghua University, P.R.China.
Email: fanmiao.cslt.thu@gmail.com
'''
import sklearn.datasets
import sklearn.neighbors
import numpy.random
import matplotlib.pyplot
import matplotlib.colors

if __name__ == "__main__":
    # Load iris dataset
    iris = sklearn.datasets.load_iris()

    # Split the dataset with sampleRatio
    sampleRatio = 0.2
    n_samples = len(iris.target)
    sampleBoundary = int(n_samples * sampleRatio)

    # Shuffle the whole data
    shuffleIdx = list(range(n_samples))
    numpy.random.shuffle(shuffleIdx)

    # Make the training data
    train_features = iris.data[shuffleIdx[:sampleBoundary]]
    train_targets = iris.target[shuffleIdx[:sampleBoundary]]

    # Make the testing data
    test_features = iris.data[shuffleIdx[sampleBoundary:]]
    test_targets = iris.target[shuffleIdx[sampleBoundary:]]

    # Train
    n_neighbors = 5  # use the 5 nearest neighbors
    for weights in ['uniform', 'distance']:  # try both weighting schemes
        kNeighborsClassifier = sklearn.neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
        kNeighborsClassifier.fit(train_features, train_targets)

        # Test
        predict_targets = kNeighborsClassifier.predict(test_features)

        # Evaluation
        n_test_samples = len(test_targets)
        X = range(n_test_samples)
        correctNum = 0
        for i in X:
            if predict_targets[i] == test_targets[i]:
                correctNum += 1
        accuracy = correctNum * 1.0 / n_test_samples
        print("K Neighbors Classifier (Iris) Accuracy [weight = '%s']: %.2f" % (weights, accuracy))

        # Draw (only the two petal features, columns 2 and 3, are plotted)
        cmap_bold = matplotlib.colors.ListedColormap(['red', 'blue', 'green'])
        X_test = test_features[:, 2:4]
        X_train = train_features[:, 2:4]
        matplotlib.pyplot.scatter(X_train[:, 0], X_train[:, 1], label='train samples', marker='o', c=train_targets, cmap=cmap_bold)
        matplotlib.pyplot.scatter(X_test[:, 0], X_test[:, 1], label='test samples', marker='+', c=predict_targets, cmap=cmap_bold)
        legend = matplotlib.pyplot.legend()
        matplotlib.pyplot.title("K Neighbors Classifier (Iris) [weight = %s]" % (weights))
        matplotlib.pyplot.savefig("K Neighbors Classifier (Iris) [weight = %s].png" % (weights), format='png')
        matplotlib.pyplot.show()

Output:

K Neighbors Classifier (Iris) Accuracy [weight = 'uniform']: 0.91

K Neighbors Classifier (Iris) Accuracy [weight = 'distance']: 0.93

The distance-weighted method is slightly better, with about 2% higher accuracy. (Note that the two figures produced by the script are drawn from only two of the feature dimensions; there are actually 4.)

1.5. Gaussian Processes

1.6. Cross decomposition

1.7. Naive Bayes

1.8. Decision Trees

1.9. Ensemble methods

1.10. Multiclass and multilabel algorithms

1.11. Feature selection

1.12. Semi-Supervised

1.13. Linear and quadratic discriminant analysis

1.14. Isotonic regression

2. Unsupervised learning

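Then let's start on unsupervised learning: clustering, probability density estimation (outlier detection), dimensionality reduction and so on. Comparatively speaking, this part of the toolkit is much richer than in many other ML packages! Manifold learning and the like are all there.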

2.1. Gaussian mixture models

2.2. Manifold learning

2.3. Clustering

2.4. Biclustering

2.5. Decomposing signals in components (matrix factorization problems)

2.6. Covariance estimation

2.7. Novelty and Outlier Detection

2.8. Density Estimation

2.9. Neural network models (unsupervised)

3. Model selection and evaluation

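Model selection sometimes matters a great deal, especially when you start a business with ML. For many problems, different models all reach about 80% accuracy; how to improve from there is the real question. More than one friend of mine wants to use Deep Learning as the selling point of a PhD or master's thesis proposal in September. To do that well you really need the patience to tune parameters. Thinking back to the gurus who sat in my row at MSRA, all of them with NIPS papers!!! A 1% improvement had them screaming :). What I really want to say is that it just comes down to different parameters... This is Black Magic. As more people play with deep learning, I reckon that in the future it will be the parameters rather than the models that are valuable.

Then there is feature selection, which also has its finer points. If you really start a business with ML, the models are still the same old models; the choice of features and parameters is what shows a person's level. Don't try things blindly, really don't...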

3.1. Cross-validation: evaluating estimator performance

3.2. Grid Search: Searching for estimator parameters

3.3. Pipeline: chaining estimators

3.4. FeatureUnion: Combining feature extractors

3.5. Model evaluation: quantifying the quality of predictions

3.6. Model persistence

3.7. Validation curves: plotting scores to evaluate models

4. Dataset transformations

4.1. Feature extraction

4.2. Preprocessing data

4.3. Kernel Approximation

4.4. Random Projection

4.5. Pairwise metrics, Affinities and Kernels

5. Dataset loading utilities

5.1. General dataset API

5.2. Toy datasets

5.3. Sample images

5.4. Sample generators

5.5. Datasets in svmlight / libsvm format

5.6. The Olivetti faces dataset

5.7. The 20 newsgroups text dataset

5.8. Downloading datasets from the mldata.org repository

5.9. The Labeled Faces in the Wild face recognition dataset

5.10. Forest covertypes

6. Scaling Strategies

6.1. Scaling with instances using out-of-core learning

7. Computational Performance

7.1. Prediction Latency

7.2. Prediction Throughput

7.3. Tips and Tricks

Last edited: 2017.11.27 03:24:47

Copyright belongs to the author. For reposting or content cooperation, please contact the author.
