有没有比rbind.fill（list）有效的方法？

我有一个不同的列的数据框列表，我想把它们合并成一个数据框。我使用rbind.fill来做到这一点。我正在寻找更有效的方法。类似于这里给出的答案

require(plyr) set.seed(45) sample.fun <- function() { nam <- sample(LETTERS, sample(5:15)) val <- data.frame(matrix(sample(letters, length(nam)*10,replace=TRUE),nrow=10)) setNames(val, nam) } ll <- replicate(1e4, sample.fun()) rbind.fill(ll)

更新：请参阅此更新的答案。

UPDATE（eddi）：现在已经在版本1.8.11中作为rbind的fill参数实现了。例如：

 DT1 = data.table(a = 1:2, b = 1:2) DT2 = data.table(a = 3:4, c = 1:2) rbind(DT1, DT2, fill = TRUE) # abc #1: 1 1 NA #2: 2 2 NA #3: 3 NA 1 #4: 4 NA 2

FR＃4790现在join – rbind.fill（来自plyr）likefunction合并data.frames / data.tables列表

注1：

这个解决scheme使用data.table的rbindlist函数来“ data.table ” rbindlist列表，为此，请确保使用版本1.8.9，因为这个bug在版本<1.8.9 。

笔记2：

在绑定data.frames / data.tables列表时，rbindlist将保留第一列的数据types。也就是说，如果第一个data.frame中的列是字符，并且第二个data.frame中的同一列是“factor”，那么rbindlist将导致这个列是一个字符。所以，如果你的data.frame由所有字符列组成，那么你用这个方法的解决scheme将和plyr方法相同。如果不是这样，那么这些值仍然是相同的，但是一些列将是字符而不是因子。之后你必须自己转换成“因素”。希望这种行为将来会改变。

现在这里使用data.table （和来自plyr基准比较）：

 require(data.table) rbind.fill.DT <- function(ll) { # changed sapply to lapply to return a list always all.names <- lapply(ll, names) unq.names <- unique(unlist(all.names)) ll.m <- rbindlist(lapply(seq_along(ll), function(x) { tt <- ll[[x]] setattr(tt, 'class', c('data.table', 'data.frame')) data.table:::settruelength(tt, 0L) invisible(alloc.col(tt)) tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_] setcolorder(tt, unq.names) })) } rbind.fill.PLYR <- function(ll) { rbind.fill(ll) } require(microbenchmark) microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10) # Unit: seconds # expr min lq median uq max neval # t1 <- rbind.fill.DT(ll) 10.8943 11.02312 11.26374 11.34757 11.51488 10 # t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724 10 # for comparison change t2 to data.table setattr(t2, 'class', c('data.table', 'data.frame')) data.table:::settruelength(t2, 0L) invisible(alloc.col(t2)) setcolorder(t2, unique(unlist(sapply(ll, names)))) identical(t1, t2) # [1] TRUE

应该注意的是， plyr的rbind.fill超越这个特定的data.table解决scheme，直到列表大小约为500。

标杆图：

下面是使用seq(1000, 10000, by=1000) 1000，10000 seq(1000, 10000, by=1000)列表长度data.frames的运行情节。我已经在这些不同列表长度的每一个上使用了10次微基准。

在这里输入图像描述

基准要点：

这是基准testing的要点 ，以防有人想要复制结果。

现在rbindlist （和rbind ）已经在data.table （开发版本）中最近的改变/提交下改进了function和速度，而且dplyr有一个更快版本的plyr的rbind.fill ，叫做rbind_all ，这个答案我的似乎有点过时了。

这里是rbindlist的相关新闻条目：

 o 'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249 -> use.names by default is FALSE for backwards compatibility (doesn't bind by names by default) -> rbind(...) now just calls rbindlist() internally, except that 'use.names' is TRUE by default, for compatibility with base (and backwards compatibility). -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE. -> At least one item of the input list has to have non-null column names. -> Duplicate columns are bound in the order of occurrence, like base. -> Attributes that might exist in individual items would be lost in the bound result. -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible. -> And incredibly fast ;). -> Documentation updated in much detail. Closes DR #5158.

所以，我已经在下面比较大的数据上对新的（和更快的版本）进行了基准testing。

新基准：

我们将创build总计10000个data.tables，列的范围从200到300，绑定后的列总数为500。

函数来创build数据：

 require(data.table) ## 1.9.3 commit 1267 require(dplyr) ## commit 1504 devel set.seed(1L) names = paste0("V", 1:500) foo <- function() { cols = sample(200:300, 1) data = setDT(lapply(1:cols, function(x) sample(10))) setnames(data, sample(names)[1:cols]) } n = 10e3L ll = vector("list", n) for (i in 1:n) { .Call("Csetlistelt", ll, i, foo()) }

这是时间：

 ## Updated timings on data.table v1.9.5 - three consecutive runs: system.time(ans1 <- rbindlist(ll, fill=TRUE)) # user system elapsed # 1.993 0.106 2.107 system.time(ans1 <- rbindlist(ll, fill=TRUE)) # user system elapsed # 1.644 0.092 1.744 system.time(ans1 <- rbindlist(ll, fill=TRUE)) # user system elapsed # 1.297 0.088 1.389 ## dplyr's rbind_all - Timings for three consecutive runs system.time(ans2 <- rbind_all(ll)) # user system elapsed # 9.525 0.121 9.761 # user system elapsed # 9.194 0.112 9.370 # user system elapsed # 8.665 0.081 8.780 identical(ans1, setDT(ans2)) # [1] TRUE

如果你同时使用rbind.fill和rbindlist还有什么rbindlist 。结果是通过data.table版本1.8.8完成的，因为1.8.9版本在使用并行函数的时候已经变成了砖块。所以data.table和plyr之间的结果是不一样的，但是它们在data.table或者plyr解决scheme中是一样的。意义平行plyr匹配不平行plyr ，反之亦然。

这是基准/脚本。 parallel.rbind.fill.DT看起来很可怕，但是这是我能拉的最快的一个。

 require(plyr) require(data.table) require(ggplot2) require(rbenchmark) require(parallel) # data.table::rbindlist solutions rbind.fill.DT <- function(ll) { all.names <- lapply(ll, names) unq.names <- unique(unlist(all.names)) rbindlist(lapply(seq_along(ll), function(x) { tt <- ll[[x]] setattr(tt, 'class', c('data.table', 'data.frame')) data.table:::settruelength(tt, 0L) invisible(alloc.col(tt)) tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_] setcolorder(tt, unq.names) })) } parallel.rbind.fill.DT <- function(ll, cluster=NULL){ all.names <- lapply(ll, names) unq.names <- unique(unlist(all.names)) if(is.null(cluster)){ ll.m <- rbindlist(lapply(seq_along(ll), function(x) { tt <- ll[[x]] setattr(tt, 'class', c('data.table', 'data.frame')) data.table:::settruelength(tt, 0L) invisible(alloc.col(tt)) tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_] setcolorder(tt, unq.names) })) }else{ cores <- length(cluster) sequ <- as.integer(seq(1, length(ll), length.out = cores+1)) Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:cores]+1), ":", sequ[2:(cores+1)], "]", sep="", collapse=", ") ll <- eval(parse(text=paste("list(", Call, ")"))) rbindlist(clusterApply(cluster, ll, function(ll, unq.names){ rbindlist(lapply(seq_along(ll), function(x, ll, unq.names) { tt <- ll[[x]] setattr(tt, 'class', c('data.table', 'data.frame')) data.table:::settruelength(tt, 0L) invisible(alloc.col(tt)) tt[, c(unq.names[!unq.names %chin% colnames(tt)]) := NA_character_] setcolorder(tt, unq.names) }, ll=ll, unq.names=unq.names)) }, unq.names=unq.names)) } } # plyr::rbind.fill solutions rbind.fill.PLYR <- function(ll) { rbind.fill(ll) } parallel.rbind.fill.PLYR <- function(ll, cluster=NULL, magicConst=400){ if(is.null(cluster) | ceiling(length(ll)/magicConst) < length(cluster)){ rbind.fill(ll) }else{ cores <- length(cluster) sequ <- as.integer(seq(1, length(ll), length.out = ceiling(length(ll)/magicConst))) Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:(length(sequ)-1)]+1), ":", sequ[2:length(sequ)], "]", sep="", collapse=", ") ll <- eval(parse(text=paste("list(", Call, ")"))) rbind.fill(parLapply(cluster, ll, rbind.fill)) } } # Function to generate sample data of varying list length set.seed(45) sample.fun <- function() { nam <- sample(LETTERS, sample(5:15)) val <- data.frame(matrix(sample(letters, length(nam)*10,replace=TRUE),nrow=10)) setNames(val, nam) } ll <- replicate(10000, sample.fun()) cl <- makeCluster(4, type="SOCK") clusterEvalQ(cl, library(data.table)) clusterEvalQ(cl, library(plyr)) benchmark(t1 <- rbind.fill.PLYR(ll), t2 <- rbind.fill.DT(ll), t3 <- parallel.rbind.fill.PLYR(ll, cluster=cl, 400), t4 <- parallel.rbind.fill.DT(ll, cluster=cl), replications=5) stopCluster(cl) # Results for rbinding 10000 dataframes # done with 4 cores, i5 3570k and 16gb memory # test reps elapsed relative # rbind.fill.PLYR 5 321.80 16.682 # rbind.fill.DT 5 26.10 1.353 # parallel.rbind.fill.PLYR 5 28.00 1.452 # parallel.rbind.fill.DT 5 19.29 1.000 # checking are results equal t1 <- as.matrix(t1) t2 <- as.matrix(t2) t3 <- as.matrix(t3) t4 <- as.matrix(t4) t1 <- t1[order(t1[, 1], t1[, 2]), ] t2 <- t2[order(t2[, 1], t2[, 2]), ] t3 <- t3[order(t3[, 1], t3[, 2]), ] t4 <- t4[order(t4[, 1], t4[, 2]), ] identical(t2, t4) # TRUE identical(t1, t3) # TRUE identical(t1, t2) # FALSE, mismatch between plyr and data.table

正如你可以看到， rbind.fill使它与data.table相媲美，并且即使在数据data.table低的情况下，也可以通过对data.table进行data.table ，从而使速度边际增加。

有没有比rbind.fill（list）有效的方法？

FR＃4790现在join – rbind.fill（来自plyr）likefunction合并data.frames / data.tables列表

注1：

笔记2：

标杆图：

基准要点：

新基准：

函数来创build数据：

按升序/降序快速sortingdata.table

在`data.table`中使用dynamic列名

R用data.table总结多个列

在data.table中设置密钥的目的是什么？

.EACHI在data.table中

为什么pandas在python合并比data.table合并R？

在一个函数调用中将多个列添加到R data.table？

如何重新sortingdata.table列（不复制）

如何在每个组中创建一个滞后变量？

在R中滚动连接data.table