How to sum a variable by group?

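Suppose I have two columns of data. The first contains categories such as "First", "Second", "Third", and so on. The second contains numbers representing how many times I saw the corresponding category.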

For example:

    Category     Frequency
    First        10
    First        15
    First        5
    Second       2
    Third        14
    Third        20
    Second       3

I want to sort the data by Category and sum the Frequency values within each category:

    Category     Frequency
    First        30
    Second       5
    Third        34

How can I do this in R?

Using aggregate:

    aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
      Category  x
    1    First 30
    2   Second  5
    3    Third 34

(Incorporating @thelatemail's comment:) aggregate also has a formula interface:

    aggregate(Frequency ~ Category, x, sum)

Or, if you want to aggregate multiple columns, you can use the . notation (it works for a single column too):

    aggregate(. ~ Category, x, sum)
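To illustrate the . notation with more than one value column, here is a minimal sketch. The data frame `multi` and its `Weight` column are hypothetical; they are not part of the question's data.

```r
# Hypothetical data: the question's categories plus an extra numeric column.
# "Weight" is an assumption made purely for this illustration.
multi <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Weight    = c(1, 2, 3, 4, 5, 6, 7)
)

# "." stands for every column not named on the right-hand side of the formula,
# so both Frequency and Weight are summed within each Category.
totals <- aggregate(. ~ Category, data = multi, FUN = sum)
print(totals)
```

Note that with the . notation every remaining column is passed to the function, so a non-numeric column would make `sum` fail; in that case name the columns explicitly, for example `cbind(Frequency, Weight) ~ Category`.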

Or tapply:

    tapply(x$Frequency, x$Category, FUN=sum)
     First Second  Third
        30      5     34

Using this data:

    x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")),
                    Frequency=c(10,15,5,2,14,20,3))

More recently, you can also use the dplyr package for this purpose:

    library(dplyr)
    x %>%
      group_by(Category) %>%
      summarise(Frequency = sum(Frequency))
    #Source: local data frame [3 x 2]
    #
    #  Category Frequency
    #1    First        30
    #2   Second         5
    #3    Third        34

Or, for multiple summary columns (this works with a single column too):

    x %>%
      group_by(Category) %>%
      summarise_each(funs(sum))

Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.
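As a sketch of the newer spelling, the summarise_each call can be written with summarise_all. This assumes dplyr >= 0.5 is installed. In dplyr 1.0 and later these scoped verbs are in turn superseded by across(), so treat this as an illustration of the update, not necessarily the currently recommended form.

```r
library(dplyr)

# Same data as in the answer, repeated here so the snippet is self-contained.
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second", "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# Apply sum() to every non-grouping column within each Category.
res <- x %>%
  group_by(Category) %>%
  summarise_all(sum)

print(res)
```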

Or, if you have multiple columns to group by, you can list all of them in group_by, separated by commas:

    mtcars %>%
      group_by(cyl, gear) %>%                            # multiple group columns
      summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

For more information, including on the %>% operator, see the introduction to dplyr.

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative:

    library(data.table)
    data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
                      Frequency=c(10,15,5,2,14,20,3))
    data[, sum(Frequency), by = Category]
    #    Category V1
    # 1:    First 30
    # 2:   Second  5
    # 3:    Third 34
    system.time(data[, sum(Frequency), by = Category] )
    #   user  system elapsed
    #  0.008   0.001   0.009

Let's compare that with the same operation on a data.frame, using the aggregate approach above:

    data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                      Frequency=c(10,15,5,2,14,20,3))
    system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
    #   user  system elapsed
    #  0.008   0.000   0.015

And if you want to keep the column name, this is the syntax:

    data[,list(Frequency=sum(Frequency)),by=Category]
    #    Category Frequency
    # 1:    First        30
    # 2:   Second         5
    # 3:    Third        34

The difference becomes more noticeable with larger datasets, as the code below demonstrates:

    data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                      Frequency=rnorm(100000))
    system.time( data[,sum(Frequency),by=Category] )
    #   user  system elapsed
    #  0.055   0.004   0.059

    data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
                      Frequency=rnorm(100000))
    system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
    #   user  system elapsed
    #  0.287   0.010   0.296

For multiple aggregations, you can combine lapply and .SD as follows:

    data[, lapply(.SD, sum), by = Category]
    #    Category Frequency
    # 1:    First        30
    # 2:   Second         5
    # 3:    Third        34
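When the table also holds columns that should not be summed, the .SDcols argument restricts .SD to the columns you name. A minimal sketch, where the table `dt` and its `Weight` and `Label` columns are hypothetical additions for illustration:

```r
library(data.table)

dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Weight    = c(1, 2, 3, 4, 5, 6, 7),  # hypothetical extra numeric column
  Label     = letters[1:7]             # hypothetical non-numeric column
)

# Sum only Frequency and Weight within each Category; Label is left out,
# so sum() is never applied to a character column.
sums <- dt[, lapply(.SD, sum), by = Category, .SDcols = c("Frequency", "Weight")]
print(sums)
```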

This is somewhat related to this question.

You could also just use the by() function:

    x2 <- by(x$Frequency, x$Category, sum)
    do.call(rbind,as.list(x2))

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it is worth being familiar with by() since it is a base function.

    library(plyr)
    ddply(tbl, .(Category), summarise, sum = sum(Frequency))

Just to add a third option:

    require(doBy)
    summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)

EDIT: This is a very old answer. Now I would recommend using group_by and summarise from dplyr, as in @docendo's answer.

If x is a data frame containing your data, then the following will do what you want:

    require(reshape)
    recast(x, Category ~ ., fun.aggregate=sum)

Several years later, just to add another simple base R solution that for some reason is not present here: xtabs

    xtabs(Frequency ~ Category, df)
    # Category
    #  First Second  Third
    #     30      5     34

Or, if you want a data.frame back:

    as.data.frame(xtabs(Frequency ~ Category, df))
    #   Category Freq
    # 1    First   30
    # 2   Second    5
    # 3    Third   34

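While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.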

Here is an example of how this question can be answered with sqldf:

    library(sqldf)  # needed for the sqldf() call below

    x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")),
                    Frequency=c(10,15,5,2,14,20,3))

    sqldf("select
              Category
              ,sum(Frequency) as Frequency
           from x
           group by
              Category")

    ##   Category Frequency
    ## 1    First        30
    ## 2   Second         5
    ## 3    Third        34