This post introduces the use of xgboost in R, drawing mainly on the article here.
data loading
require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
Here train contains both data and label; the same holds for test.
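A quick sketch of inspecting that structure (assuming the agaricus data bundled with xgboost, whose feature matrix is a sparse dgCMatrix):

```r
# inspect the training data shipped with the xgboost package
str(train$data)        # sparse matrix of class dgCMatrix
print(dim(train$data)) # rows = observations, columns = binary features
table(train$label)     # labels are 0/1, as required by binary:logistic
```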
basic training
Several parameters must be set when calling xgboost:
objective: the objective function, e.g. binary:logistic for binary classification
max_depth: maximum depth of each tree
nthread: number of threads to use
nrounds: number of trees (boosting rounds)
eta: learning rate
xgb = xgboost(data=train$data, label=train$label, max_depth=2, eta=1, objective='binary:logistic', nrounds=2)
## [1] train-error:0.046522
## [2] train-error:0.022263
xgb.DMatrix
Used to combine train$data and train$label into a single object:
dtrain = xgb.DMatrix(data=train$data, label=train$label)
bst = xgboost(data=dtrain, max_depth=2, eta=1, objective='binary:logistic', nrounds=2)
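Once the data is wrapped in an xgb.DMatrix, the stored label can be read back with getinfo; a small sketch:

```r
dtrain = xgb.DMatrix(data=train$data, label=train$label)
# read the label vector back out of the DMatrix
print(head(getinfo(dtrain, 'label')))
```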
verbose option
Controls how much information is printed during training:
verbose=0: no message
verbose=1: evaluation metric
verbose=2: evaluation metric + tree information
bst = xgboost(data=dtrain, max_depth=2, eta=1, objective='binary:logistic', nrounds=2, verbose=2)
## [22:38:46] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
## [1] train-error:0.046522
## [22:38:46] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
## [2] train-error:0.022263
predict
After training the model bst with xgboost, we can use it to predict the labels of test$data.
pred = predict(bst, test$data)
prediction = as.numeric(pred > 0.5)
print(head(prediction, 5))
err = mean(as.numeric(prediction != test$label))
print(paste('test-error =', err))
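Beyond the overall error rate, a confusion table gives a fuller picture; a short sketch building on the prediction vector from the code above:

```r
# cross-tabulate predicted class against the true label
# (prediction and test come from the preceding code)
print(table(predicted=prediction, actual=test$label))
```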
xgb.train
With this method, the evaluation metric on the test set is computed after every round, which helps select a model that does not overfit.
dtrain = xgb.DMatrix(data=train$data, label=train$label)
dtest = xgb.DMatrix(data=test$data, label=test$label)
watchlist = list(train=dtrain, test=dtest)
bst = xgb.train(data=dtrain, max_depth=2, eta=1, nthread=2, nrounds=2, watchlist=watchlist, objective='binary:logistic')
linear boosting
The models above are all based on boosted trees. By setting the booster parameter booster='gblinear' and removing eta (which does not apply to the linear booster), we can use linear boosting instead.
bst <- xgb.train(data=dtrain, booster = "gblinear", max_depth=2, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
# set two eval_metric values to monitor the model's performance
save &amp; load
# DMatrix save & load
xgb.DMatrix.save(dtrain, 'dtrain.buffer')
dtrain = xgb.DMatrix('dtrain.buffer')
# model save & load
xgb.save(bst, 'xgboost.model')
bst = xgb.load('xgboost.model')
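To verify that saving and loading round-trips correctly, compare predictions from the original and reloaded models; a quick check, assuming bst and test from above:

```r
pred_before = predict(bst, test$data)
xgb.save(bst, 'xgboost.model')
bst2 = xgb.load('xgboost.model')
pred_after = predict(bst2, test$data)
# predictions should be unchanged after the round trip
stopifnot(isTRUE(all.equal(pred_before, pred_after)))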
variable importance
import_mat = xgb.importance(names(train$data), model=bst)
print(import_mat)
xgb.plot.importance(importance_matrix=import_mat)
viewing trees
Use xgb.dump(model, with_stats=TRUE) to dump the trees as text, and xgb.plot.tree(model) to plot the trees in the model.
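A minimal sketch of both calls on the bst model trained above (note that xgb.plot.tree requires the DiagrammeR package to be installed):

```r
# dump the trees as human-readable text, including split statistics
dump_text = xgb.dump(bst, with_stats=TRUE)
print(head(dump_text, 10))
# render the trees graphically (needs the DiagrammeR package)
xgb.plot.tree(model=bst)
```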