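Today I saw someone ask how to make readr::read_csv() read every character variable as a factor when reading a csv file. HY, an expert, suggested mutate_if() from the dplyr package, which converts variable types very quickly. I later searched around and found that fread() in the data.table package lets you set stringsAsFactors = T directly when reading a csv. So I compared the speed of readr::read_csv() + dplyr::mutate_if() against data.table::fread(), with base R's read.csv() as the baseline.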
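For readers unfamiliar with mutate_if(is.character, factor), here is a minimal base R sketch of the same idea: convert only the character columns to factors and leave the rest alone. The small data frame and its column names are made up for illustration and are not part of the benchmark.

```r
# Hypothetical example data: two character columns and one numeric column
df_small <- data.frame(
  v1 = c("A1", "A2", "A1"),
  v2 = c("A3", "A3", "A2"),
  n  = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert just those to factors
chr_cols <- vapply(df_small, is.character, logical(1))
df_small[chr_cols] <- lapply(df_small[chr_cols], factor)

# Inspect the resulting column types
str(df_small)
```

The numeric column n is untouched, which is the same selective behaviour the is.character predicate gives mutate_if().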
Dataset 1: 10 columns, 10 levels per column, 100,000 rows
library(dplyr)
library(data.table)
library(readr)
# test1: 10 columns with 10 levels for each column, 100,000 rows
v1<-as.factor(paste('A',c(1:10), sep=''))
df<-data.frame(matrix(nrow=100000))
for(i in 1:10){
df[,i]<-sample(v1, 100000, replace = T)
names(df)[i]<-paste('v', i, sep='')
}
write.csv(df, '/Users/xiatt/Desktop/compare_read_csv_with_factors.csv', row.names = F)
system.time(x1<-read.csv('/Users/xiatt/Desktop/compare_read_csv_with_factors.csv', header=T, stringsAsFactors = T))
# user system elapsed
# 1.080 0.054 1.326
system.time(x2<-read_csv('/Users/xiatt/Desktop/compare_read_csv_with_factors.csv', col_names = T))
# user system elapsed
# 0.153 0.021 0.261
system.time(x2<-x2 %>% mutate_if(is.character, factor))
# user system elapsed
# 0.089 0.016 0.157
system.time(x3<-fread(input='/Users/xiatt/Desktop/compare_read_csv_with_factors.csv', stringsAsFactors = T))
# user system elapsed
# 0.111 0.012 0.255
system.time(x3<-as.data.frame(x3))
# user system elapsed
# 0.001 0.000 0.002
Because fread() returns a data.table object, one more step is needed to convert it to a data.frame.
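As a side note, the conversion can also be done in other ways. The sketch below assumes a small made-up file in a temporary location (not the benchmark file) and shows two options that data.table provides: the data.table = FALSE argument of fread(), and setDF(), which converts by reference. These variants were not timed in this post.

```r
library(data.table)

# Hypothetical small csv written to a temporary file
path <- tempfile(fileext = ".csv")
write.csv(data.frame(v1 = c("A1", "A2", "A1"), v2 = c("A3", "A3", "A2")),
          path, row.names = FALSE)

# Option 1: ask fread() to return a plain data.frame directly
x_df <- fread(path, stringsAsFactors = TRUE, data.table = FALSE)

# Option 2: read as a data.table, then convert in place (by reference)
x_dt <- fread(path, stringsAsFactors = TRUE)
setDF(x_dt)

class(x_df)
class(x_dt)
```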
Looking at elapsed time only: fread + as.data.frame is slightly faster
Method | Step 1 (s) | Step 2 (s) | Total (s) |
---|---|---|---|
read.csv | 1.326 | | 1.326 |
read_csv+mutate_if() | 0.261 | 0.157 | 0.418 |
fread+as.data.frame | 0.255 | 0.002 | 0.257 |
Dataset 2: 100 columns, 10 levels per column, 100,000 rows
v1<-as.factor(paste('A',c(1:10), sep=''))
df<-data.frame(matrix(nrow=100000))
for(i in 1:100){
df[,i]<-sample(v1, 100000, replace = T)
names(df)[i]<-paste('v', i, sep='')
}
write.csv(df, '/Users/xiatt/Desktop/compare_read_csv_with_factors2.csv', row.names = F)
system.time(x1<-read.csv('/Users/xiatt/Desktop/compare_read_csv_with_factors2.csv', header=T, stringsAsFactors = T))
# user system elapsed
# 12.406 1.200 19.187
system.time(x2<-read_csv('/Users/xiatt/Desktop/compare_read_csv_with_factors2.csv', col_names = T))
# user system elapsed
# 1.816 0.309 2.909
system.time(x2<-x2 %>% mutate_if(is.character, factor))
# user system elapsed
# 0.833 0.222 1.163
system.time(x3<-fread(input='/Users/xiatt/Desktop/compare_read_csv_with_factors2.csv', stringsAsFactors = T))
# user system elapsed
# 1.117 0.275 2.277
system.time(x3<-as.data.frame(x3))
# user system elapsed
# 0.025 0.088 0.115
Looking at elapsed time only: fread() starts to pull ahead
Method | Step 1 (s) | Step 2 (s) | Total (s) |
---|---|---|---|
read.csv | 19.187 | | 19.187 |
read_csv+mutate_if() | 2.909 | 1.163 | 4.072 |
fread+as.data.frame | 2.277 | 0.115 | 2.392 |
Dataset 3: 100 columns, 100 levels per column, 1,000,000 rows
read.csv() is skipped here, since it would cook my computer.
v1<-as.factor(paste('A',c(1:100), sep=''))
df<-data.frame(matrix(nrow=1000000))
for(i in 1:100){
df[,i]<-sample(v1, 1000000, replace = T)
names(df)[i]<-paste('v', i, sep='')
}
write.csv(df, '/Users/xiatt/Desktop/compare_read_csv_with_factors3.csv', row.names = F)
system.time(x2<-read_csv('/Users/xiatt/Desktop/compare_read_csv_with_factors3.csv', col_names = T))
# user system elapsed
# 22.708 13.303 55.010
system.time(x2<-x2 %>% mutate_if(is.character, factor))
# user system elapsed
# 6.074 2.329 9.411
system.time(x3<-fread(input='/Users/xiatt/Desktop/compare_read_csv_with_factors3.csv', stringsAsFactors = T))
# user system elapsed
# 15.236 6.787 38.246
system.time(x3<-as.data.frame(x3))
# user system elapsed
# 0.238 0.809 1.072
Looking at elapsed time only: the gap is fairly clear here, and fread() is faster.
Method | Step 1 (s) | Step 2 (s) | Total (s) |
---|---|---|---|
read_csv+mutate_if() | 55.010 | 9.411 | 64.421 |
fread+as.data.frame | 38.246 | 1.072 | 39.318 |
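Every number in the tables comes from a single system.time() call, so run-to-run noise is not visible. One way to smooth that out is to repeat each measurement and take the median. The helper name time_median, the repeat count and the tiny temporary file below are assumptions for illustration only.

```r
# Hypothetical helper: call f() several times, return the median elapsed seconds
time_median <- function(f, times = 5) {
  elapsed <- vapply(
    seq_len(times),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

# Small made-up csv so the sketch runs quickly
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(v1 = sample(paste0("A", 1:10), 1000, replace = TRUE)),
  path, row.names = FALSE
)

# Median elapsed time of read.csv() over 5 runs
t_base <- time_median(function() read.csv(path, stringsAsFactors = TRUE))
t_base
```

The same helper could wrap the read_csv() and fread() calls, so that all methods are compared on medians.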
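Other comparisons
- With the same number of rows and columns, increasing the number of levels per column to 100 does not affect read time.
- Setting or omitting stringsAsFactors = T in fread() does not affect read time either.
- Converting each column's type inside data.table is not much faster than mutate_if().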
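The post does not show the code behind the last point. A common way to convert column types inside data.table is assignment by reference with := and .SDcols, sketched below on a tiny made-up table. This is an assumption about what such a conversion looks like, not necessarily the exact code that was timed.

```r
library(data.table)

# Hypothetical example table: two character columns and one numeric column
dt <- data.table(
  v1 = c("A1", "A2", "A1"),
  v2 = c("A3", "A3", "A2"),
  n  = c(1.5, 2.5, 3.5)
)

# Names of the character columns
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]

# Convert those columns to factors in place (by reference)
dt[, (chr_cols) := lapply(.SD, factor), .SDcols = chr_cols]

str(dt)
```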
Conclusion
So the conclusion is that fread() from the data.table package is a little bit faster.
Some further reading
- The readr authors' own comparison of readr and data.table::fread(), which is refreshingly candid:
Compared to fread, readr functions:
Are slower (currently ~1.2-2x slower. If you want absolutely the best performance, use data.table::fread().
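- A comparison of data.table and pandas processing speed: grouping. The conclusion is that data.table is slightly faster. Python's ParaText, recommended by HY, takes on all comers and looks very impressive (link).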