Split comma-separated column into separate rows

I have a data frame like this:

    data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa",
                            "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar",
                            "Alejandro Gonzalez Inarritu",
                            "Alejandro Gonzalez Inarritu,Benicio Del Toro",
                            "Alejandro González Iñárritu", "Alex Proyas",
                            "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock",
                            "Anatole Litvak", "Andrew Adamson,Marilyn Fox",
                            "Andrew Dominik", "Andrew Stanton",
                            "Andrew Stanton,Lee Unkrich",
                            "Angelina Jolie,John Stevenson", "Anne Fontaine",
                            "Anthony Harvey"),
               AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A',
                      'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

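As you can see, some entries in the director column are multiple names separated by commas. I would like to split these entries into separate rows while keeping the values of the other column. For example, the first row of the data frame above should be split into two rows, each with a single name in the director column and 'A' in the AB column.

This old question is frequently used as a dupe target (tagged with r-faq). As of today, it has been answered three times, offering 6 different approaches, but it lacks a benchmark to indicate which approach is the fastest [1].

The benchmarked solutions include

  • Matthew Lundberg's base R approach, modified according to Rich Scriven's comment,
  • Jaap's two data.table methods and two dplyr / tidyr approaches,
  • Ananda's splitstackshape solution,
  • and two additional variants of Jaap's data.table methods.

Overall, 8 different methods were benchmarked on 6 different sizes of data frames using the microbenchmark package (see code below).

The sample data given by the OP consists of only 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times, which gives problem sizes of up to 2 million rows.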
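As a quick sanity check of the problem sizes described above, here is a minimal sketch in base R. The names `n_sample_rows` and `toy` are illustrative assumptions and do not appear in the benchmark code.

```r
# Assumption: only the row count of the sample data (20) matters here
n_sample_rows <- 20L
n_rep <- 10^(0:5)                      # repetition factors 1, 10, ..., 100000
n_rows <- n_sample_rows * n_rep        # resulting number of rows per benchmark
format(n_rows, big.mark = ",", scientific = FALSE)

# rep() repeats a whole vector end to end, which is how the larger
# data frames are built from the sample columns
toy <- c("x,y", "z")                   # hypothetical toy column
length(rep(toy, 3))                    # 2 elements repeated 3 times
```

The largest entry of `n_rows` is the 2 million rows mentioned in the text.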

Benchmark results

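[Figure: microbenchmark results, median execution time in ms versus number of rows, both axes on a log scale, one curve per method]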

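The benchmark results show that for sufficiently large data frames all data.table methods are faster than any other method. For data frames with more than 5000 rows, Jaap's data.table method 2 and the variant DT3 are the fastest, clearly ahead of the slowest methods.

Remarkably, the timings of the two tidyverse methods and the splitstackshape solution are so similar that it is difficult to tell their curves apart in the chart. They are the slowest of the benchmarked methods across all data frame sizes.

For smaller data frames, Matt's base R solution and data.table method 4 seem to have less overhead than the other methods.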

    director <- c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula",
                  "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu",
                  "Alejandro Gonzalez Inarritu,Benicio Del Toro",
                  "Alejandro González Iñárritu", "Alex Proyas", "Alexander Hall",
                  "Alfonso Cuaron", "Alfred Hitchcock", "Anatole Litvak",
                  "Andrew Adamson,Marilyn Fox", "Andrew Dominik", "Andrew Stanton",
                  "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson",
                  "Anne Fontaine", "Anthony Harvey")
    AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A",
            "B", "A", "A", "B", "B", "B", "B", "B", "B", "A")
    library(data.table)
    library(magrittr)

Define function for benchmark runs of problem size n

    run_mb <- function(n) {
      # compute number of benchmark runs depending on problem size `n`
      mb_times <- scales::squish(10000L / n , c(3L, 100L))
      cat(n, " ", mb_times, "\n")
      # create data
      DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
      DT <- as.data.table(DF)
      # start benchmarks
      microbenchmark::microbenchmark(
        matt_mod = {
          s <- strsplit(as.character(DF$director), ',')
          data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
        jaap_DT1 = {
          DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
             ][!is.na(director)]},
        jaap_DT2 = {
          DT[, strsplit(as.character(director), ",", fixed=TRUE),
             by = .(AB, director)][,.(director = V1, AB)]},
        jaap_dplyr = {
          DF %>%
            dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
            tidyr::unnest(director)},
        jaap_tidyr = {
          tidyr::separate_rows(DF, director, sep = ",")},
        cSplit = {
          splitstackshape::cSplit(DF, "director", ",", direction = "long")},
        DT3 = {
          DT[, strsplit(as.character(director), ",", fixed=TRUE),
             by = .(AB, director)][, director := NULL][
               , setnames(.SD, "V1", "director")]},
        DT4 = {
          DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))),
             by = .(AB)]},
        times = mb_times
      )
    }
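To make the choice of repetition count in `run_mb` easier to follow, the sketch below reproduces the clamping step in base R. The helper `clamp` is a hypothetical stand-in for `scales::squish`, written with `pmin` and `pmax` under the assumption that squishing simply limits values to the given range.

```r
# Hypothetical base R stand-in for scales::squish(x, c(lo, hi)):
# values below lo become lo, values above hi become hi
clamp <- function(x, lo, hi) pmax(lo, pmin(hi, x))

n_rep <- 10^(0:5)                          # problem sizes used in the benchmark
mb_times <- clamp(10000 / n_rep, 3, 100)   # benchmark runs per problem size
data.frame(n = n_rep, times = mb_times)
```

Under this assumption, small problems are timed 100 times while the largest ones are timed only 3 times, which keeps the total run time manageable.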

Run benchmark for different problem sizes

    # define vector of problem sizes
    n_rep <- 10L^(0:5)
    # run benchmark for different problem sizes
    mb <- lapply(n_rep, run_mb)

Prepare data for plotting

    mbl <- rbindlist(mb, idcol = "N")
    mbl[, n_row := NROW(director) * n_rep[N]]
    mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
    mba[, expr := forcats::fct_reorder(expr, -median_time)]

Create chart

    library(ggplot2)
    ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) +
      geom_point() +
      geom_smooth(se = FALSE) +
      scale_x_log10(breaks = NROW(director) * n_rep) +
      scale_y_log10() +
      xlab("number of rows") +
      ylab("median of execution time [ms]") +
      ggtitle("microbenchmark results") +
      theme_bw()

Session info & package versions (excerpt)

    devtools::session_info()
    #Session info
    # version  R version 3.3.2 (2016-10-31)
    # system   x86_64, mingw32
    #Packages
    # data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
    # dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
    # forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
    # ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
    # magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
    # microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
    # scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
    # splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
    # tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)

[1] My curiosity was piqued by an exuberant comment, "Orders of magnitude faster!", on an answer to a question that is a duplicate of this one.

Several alternatives:

1) two ways with data.table:

    library(data.table)
    # method 1 (preferred)
    setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
             ][!is.na(director)]
    # method 2
    setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
             ][,.(director = V1, AB)]

2) a dplyr / tidyr combination:

    library(dplyr)
    library(tidyr)
    v %>%
      mutate(director = strsplit(as.character(director), ",")) %>%
      unnest(director)

3) with tidyr only: with tidyr 0.5.0 (and later), you can also use separate_rows:

    separate_rows(v, director, sep = ",")

4) with base R:

    # if 'director' is a character-column:
    stack(setNames(strsplit(v$director, ','), v$AB))
    # if 'director' is a factor-column:
    stack(setNames(strsplit(as.character(v$director), ','), v$AB))
    # note: stack() names the resulting columns 'values' and 'ind'
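The base R `stack` approach may be easier to follow on a tiny input. The sketch below uses a hypothetical three-row data frame `toy` (not the OP's data) and renames the default output columns, which `stack` calls `values` and `ind`.

```r
# Hypothetical toy data, not taken from the question
toy <- data.frame(director = c("Ann Lee,Bo Chen", "Cy Diaz", "Di Evans"),
                  AB = c("A", "B", "A"),
                  stringsAsFactors = FALSE)

# strsplit() yields one character vector per row; setNames() labels each
# vector with its AB value; stack() unrolls the list into long format
res <- stack(setNames(strsplit(toy$director, ","), toy$AB))

# stack() returns columns 'values' and 'ind'; rename them to match the input
names(res) <- c("director", "AB")
res
```

Note that the AB column comes back as a factor, so a call to `as.character` may be needed if a character column is required.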

Naming your original data.frame v, we have this:

    > s <- strsplit(as.character(v$director), ',')
    > data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
                          director AB
    1                 Aaron Blaise  A
    2                   Bob Walker  A
    3               Akira Kurosawa  B
    4               Alan J. Pakula  A
    5                  Alan Parker  A
    6           Alejandro Amenabar  B
    7  Alejandro Gonzalez Inarritu  B
    8  Alejandro Gonzalez Inarritu  B
    9             Benicio Del Toro  B
    10 Alejandro González Iñárritu  A
    11                 Alex Proyas  B
    12              Alexander Hall  A
    13              Alfonso Cuaron  B
    14            Alfred Hitchcock  A
    15              Anatole Litvak  A
    16              Andrew Adamson  B
    17                 Marilyn Fox  B
    18              Andrew Dominik  B
    19              Andrew Stanton  B
    20              Andrew Stanton  B
    21                 Lee Unkrich  B
    22              Angelina Jolie  B
    23              John Stevenson  B
    24               Anne Fontaine  B
    25              Anthony Harvey  A

Note the use of rep to build the new AB column. Here, sapply returns the number of names in each of the original rows.
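The interplay of `rep` and the per-row name counts can be seen in isolation in the following sketch. The list `s` and vector `ab` are hypothetical toy values standing in for the result of `strsplit` and the AB column.

```r
# Hypothetical toy values: two names in the first row, one in the second
s  <- list(c("Ann Lee", "Bo Chen"), "Cy Diaz")
ab <- c("A", "B")

n_names <- sapply(s, FUN = length)   # number of names per original row
new_ab  <- rep(ab, n_names)          # each AB value repeated once per name

data.frame(director = unlist(s), AB = new_ab)

# lengths() (available since R 3.2.0) gives the same counts more directly
identical(n_names, lengths(s))
```

Passing a vector of counts as the second argument makes `rep` repeat each element a different number of times, which keeps the AB values aligned with the flattened names.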

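Late to the party, but another generalized alternative is to use cSplit from my "splitstackshape" package, which has a direction argument. Set this to "long" to get the result you specify: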

    library(splitstackshape)
    head(cSplit(mydf, "director", ",", direction = "long"))
    #              director AB
    # 1:       Aaron Blaise  A
    # 2:         Bob Walker  A
    # 3:     Akira Kurosawa  B
    # 4:     Alan J. Pakula  A
    # 5:        Alan Parker  A
    # 6: Alejandro Amenabar  B