Speeding up the performance of write.table

I have a data.frame that I want to write out. The dimensions of my data.frame are 256 rows by 65536 columns. Are there faster alternatives to write.csv?

If all of your columns are of the same class, convert to a matrix before writing out; this provides a nearly 6x speed-up. Also, you can use write.matrix() from the MASS package, though it did not prove faster in this example. Maybe I didn't set something up correctly:

    #Fake data
    m <- matrix(runif(256*65536), nrow = 256)

    #As a data.frame
    system.time(write.csv(as.data.frame(m), "dataframe.csv"))
    #----------
    #   user  system elapsed
    # 319.53   13.65  333.76

    #As a matrix
    system.time(write.csv(m, "matrix.csv"))
    #----------
    #   user  system elapsed
    #  52.43    0.88   53.59

    #Using write.matrix()
    require(MASS)
    system.time(write.matrix(m, "writematrix.csv"))
    #----------
    #   user  system elapsed
    # 113.58   59.12  172.75

EDIT

To address the concern raised below that the results above are unfair to the data.frame, here are some more results and timings showing that the overall message is still "convert your data object to a matrix if possible. If that isn't possible, deal with it, or reconsider why you need to write out a 200MB+ file in CSV format if timing is of the utmost importance":

    #This is a data.frame
    m2 <- as.data.frame(matrix(runif(256*65536), nrow = 256))

    #This is still 6x slower
    system.time(write.csv(m2, "dataframe.csv"))
    #   user  system elapsed
    # 317.85   13.95  332.44

    #This even includes the overhead of converting with as.matrix in the timing
    system.time(write.csv(as.matrix(m2), "asmatrix.csv"))
    #   user  system elapsed
    #  53.67    0.92   54.67

So, nothing really changes. To confirm this is reasonable, consider the relative time cost of as.data.frame():

    m3 <- as.matrix(m2)
    system.time(as.data.frame(m3))
    #   user  system elapsed
    #   0.77    0.00    0.77

So, not a big deal, and it doesn't skew the message as the comment below suggests. If you're still not convinced that using write.csv() on large data.frames is a bad idea performance-wise, consult the manual under Note:

 write.table can be slow for data frames with large numbers (hundreds or more) of columns: this is inevitable as each column could be of a different class and so must be handled separately. If they are all of the same class, consider using a matrix instead. 
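In practice, you can check whether every column really does share a single class before deciding to take the matrix shortcut. A minimal sketch under that assumption (the helper name is mine, not from base R):

    # Hypothetical helper: take the fast matrix path only when all columns share one class
    write_csv_fast <- function(df, file) {
      single_class <- length(unique(vapply(df, function(x) class(x)[1], character(1)))) == 1L
      if (single_class) {
        write.csv(as.matrix(df), file)   # all columns the same class: convert, then write
      } else {
        write.csv(df, file)              # mixed classes: fall back to the normal path
      }
    }

    write_csv_fast(as.data.frame(m), "fast_or_not.csv")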

And finally, if you still can't sleep at night over saving speed, consider moving to a native RData object:

    system.time(save(m2, file = "thisisfast.RData"))
    #   user  system elapsed
    #  21.67    0.12   21.81

data.table::fwrite(), contributed by Otto Seiskari, is available in versions 1.9.8+. Matt has made additional enhancements on top (including parallelization) and wrote an article about it. Please report any issues on the tracker.

First, here is a comparison on the same dimensions used by @chase above (i.e. a very large number of columns: 65,000 columns (!) x 256 rows), together with fwrite and write_feather, so that we have some consistency across machines. Note the huge difference compress=FALSE makes in base R.

    # -----------------------------------------------------------------------------
    # function  | object type | output type | compress= | Runtime | File size |
    # -----------------------------------------------------------------------------
    # save      | matrix      | binary      | FALSE     |   0.3s  |   134MB   |
    # save      | data.frame  | binary      | FALSE     |   0.4s  |   135MB   |
    # feather   | data.frame  | binary      | FALSE     |   0.4s  |   139MB   |
    # fwrite    | data.table  | csv         | FALSE     |   1.0s  |   302MB   |
    # save      | matrix      | binary      | TRUE      |  17.9s  |    89MB   |
    # save      | data.frame  | binary      | TRUE      |  18.1s  |    89MB   |
    # write.csv | matrix      | csv         | FALSE     |  21.7s  |   302MB   |
    # write.csv | data.frame  | csv         | FALSE     | 121.3s  |   302MB   |
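For reference, timings like those in the table can be reproduced with calls along the following lines (a sketch only; the file names are arbitrary, and the data.table and feather packages need to be installed):

    library(data.table)
    library(feather)

    m  <- matrix(runif(256*65536), nrow = 256)
    df <- as.data.frame(m)

    system.time(save(m,  file = "m.RData",  compress = FALSE))   # save, matrix
    system.time(save(df, file = "df.RData", compress = FALSE))   # save, data.frame
    system.time(write_feather(df, "df.feather"))                 # feather
    system.time(fwrite(as.data.table(df), "dt.csv"))             # fwrite
    system.time(write.csv(m,  "m.csv"))                          # write.csv, matrix
    system.time(write.csv(df, "df.csv"))                         # write.csv, data.frame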

Note that fwrite() runs in parallel. The timings shown here are on a 13" Macbook Pro with 2 cores and 1 thread/core (+2 virtual threads via hyperthreading), a 512GB SSD, 256KB/core L2 cache, and 4MB L4 cache. Depending on your system spec, YMMV.
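On recent data.table versions you can check or limit how many threads fwrite uses; a small sketch (the thread count of 2 is only an example):

    library(data.table)

    getDTthreads()     # threads data.table will use by default on this machine
    setDTthreads(2)    # e.g. restrict data.table (and hence fwrite) to 2 threads

    # fwrite also accepts a per-call nThread argument
    fwrite(as.data.table(mtcars), "mtcars.csv", nThread = 2)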

I also re-ran the benchmarks on relatively more realistic (and bigger) data:

    library(data.table)
    NN <- 5e6 # at this number of rows, the .csv output is ~800Mb on my machine
    set.seed(51423)
    DT <- data.table(
      str1 = sample(sprintf("%010d",1:NN)), #ID field 1
      str2 = sample(sprintf("%09d",1:NN)),  #ID field 2
      # varying length string field--think names/addresses, etc.
      str3 = replicate(NN,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
      # factor-like string field with 50 "levels"
      str4 = sprintf("%05d",sample(sample(1e5,50),NN,T)),
      # factor-like string field with 17 levels, varying length
      str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T), collapse="")),NN,T),
      # lognormally distributed numeric
      num1 = round(exp(rnorm(NN,mean=6.5,sd=1.5)),2),
      # 3 binary strings
      str6 = sample(c("Y","N"),NN,T),
      str7 = sample(c("M","F"),NN,T),
      str8 = sample(c("B","W"),NN,T),
      # right-skewed (integer type)
      int1 = as.integer(ceiling(rexp(NN))),
      num2 = round(exp(rnorm(NN,mean=6,sd=1.5)),2),
      # lognormal numeric that can be positive or negative
      num3 = (-1)^sample(2,NN,T)*round(exp(rnorm(NN,mean=6,sd=1.5)),2))

    # -------------------------------------------------------------------------------
    # function  | object     | out | other args        | Runtime | File size |
    # -------------------------------------------------------------------------------
    # fwrite    | data.table | csv | quote = FALSE     |   1.7s  |  523.2MB  |
    # fwrite    | data.frame | csv | quote = FALSE     |   1.7s  |  523.2MB  |
    # feather   | data.frame | bin | no compression    |   3.3s  |  635.3MB  |
    # save      | data.frame | bin | compress = FALSE  |  12.0s  |  795.3MB  |
    # write.csv | data.frame | csv | row.names = FALSE |  28.7s  |  493.7MB  |
    # save      | data.frame | bin | compress = TRUE   |  48.1s  |  190.3MB  |
    # -------------------------------------------------------------------------------

So fwrite is about 2x faster than feather in this test. This was run on the same machine mentioned above, with fwrite running in parallel on 2 cores.

feather also seems to be a fairly fast binary format, but without any compression yet.


Here is an attempt at showing how fwrite compares with respect to scale:

NB: the benchmark has been updated by running base R's save() with compress = FALSE (since feather is also not compressed).

[benchmark plot: "Relative Speed of fwrite (turbo) vs. rest", time relative to fwrite by number of rows]

So fwrite is the fastest on this data (running on 2 cores), plus it creates a .csv which can easily be viewed, inspected, and passed to grep, sed, etc.
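As an aside on that last point, the written .csv can be filtered with shell tools straight from fread on recent data.table versions; a sketch (the grep pattern and file name are only illustrative):

    library(data.table)

    # peek at the first few lines of the benchmark output from the shell
    system("head -3 fwrite_turbo.csv")

    # read back only lines matching a pattern, piping grep into fread via cmd=
    subset_dt <- fread(cmd = "grep '^0000000001' fwrite_turbo.csv", header = FALSE)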

Code for reproduction:

    require(data.table)
    require(microbenchmark)
    require(feather)
    ns <- as.integer(10^seq(2, 6, length.out = 25))

    DTn <- function(nn) data.table(
      str1 = sample(sprintf("%010d",1:nn)),
      str2 = sample(sprintf("%09d",1:nn)),
      str3 = replicate(nn,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
      str4 = sprintf("%05d",sample(sample(1e5,50),nn,T)),
      str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T), collapse="")),nn,T),
      num1 = round(exp(rnorm(nn,mean=6.5,sd=1.5)),2),
      str6 = sample(c("Y","N"),nn,T),
      str7 = sample(c("M","F"),nn,T),
      str8 = sample(c("B","W"),nn,T),
      int1 = as.integer(ceiling(rexp(nn))),
      num2 = round(exp(rnorm(nn,mean=6,sd=1.5)),2),
      num3 = (-1)^sample(2,nn,T)*round(exp(rnorm(nn,mean=6,sd=1.5)),2))

    count <- data.table(n = ns,
                        c = c(rep(1000, 12), rep(100, 6), rep(10, 7)))

    mbs <- lapply(ns, function(nn){
      print(nn)
      set.seed(51423)
      DT <- DTn(nn)
      microbenchmark(times = count[n==nn,c],
        write.csv = write.csv(DT, "writecsv.csv", quote=FALSE, row.names=FALSE),
        save      = save(DT, file = "save.RData", compress=FALSE),
        fwrite    = fwrite(DT, "fwrite_turbo.csv", quote=FALSE, sep=","),
        feather   = write_feather(DT, "feather.feather"))})

    png("microbenchmark.png", height=600, width=600)
    par(las=2, oma = c(1, 0, 0, 0))
    matplot(ns, t(sapply(mbs, function(x) {
              y <- summary(x)[,"median"]
              y/y[3]})),
            main = "Relative Speed of fwrite (turbo) vs. rest",
            xlab = "", ylab = "Time Relative to fwrite (turbo)",
            type = "l", lty = 1, lwd = 2,
            col = c("red", "blue", "black", "magenta"), xaxt = "n",
            ylim = c(0,25), xlim = c(0, max(ns)))
    axis(1, at = ns, labels = prettyNum(ns, ","))
    mtext("# Rows", side = 1, las = 1, line = 5)
    legend("right", lty = 1, lwd = 3,
           legend = c("write.csv", "save", "feather"),
           col = c("red", "blue", "magenta"))
    dev.off()

Another option is to use the feather file format.

    df <- as.data.frame(matrix(runif(256*65536), nrow = 256))
    system.time(feather::write_feather(df, "df.feather"))
    #>    user  system elapsed
    #>   0.237   0.355   0.617

Feather is a binary file format designed to be very efficient to read and write. It's also designed to work across multiple languages: there are currently R and Python clients, and a Julia client is in the works.

For comparison, here's how long saveRDS takes:

    system.time(saveRDS(df, "df.rds"))
    #>    user  system elapsed
    #>  17.363   0.307  17.856

Now, this is a somewhat unfair comparison, because the default for saveRDS is to compress the data, and here the data is incompressible because it's completely random. Turning compression off makes saveRDS significantly faster:

    system.time(saveRDS(df, "df.rds", compress = FALSE))
    #>    user  system elapsed
    #>   0.181   0.247   0.473

And indeed it's now slightly faster than feather. So why use feather? Well, it's typically faster to read than readRDS(), and you usually write the data relatively few times compared to the number of times you read it.

    system.time(readRDS("df.rds"))
    #>    user  system elapsed
    #>   0.198   0.090   0.287

    system.time(feather::read_feather("df.feather"))
    #>    user  system elapsed
    #>   0.125   0.060   0.185

You could also try the readr package's read_rds (compared with data.table::fread) and write_rds (compared with data.table::fwrite).

Here is a simple example with my dataset (1133 rows and 429499 columns):

Writing the dataset:

    fwrite(rankp2, file="rankp2_429499.txt", col.names=T, row.names=F, quote=F, sep="\t")
    write_rds(rankp2, "rankp2_429499.rds")

Reading the dataset (1133 rows and 429499 columns):

    system.time(fread("rankp2_429499.txt", sep="\t", header=T, fill=TRUE))
    #   user  system elapsed
    # 42.391   0.526  42.949

    system.time(read_rds("rankp2_429499.rds"))
    #   user  system elapsed
    #  2.157   0.388   2.547

Hope it helps.