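How can I rank observations within groups faster?

I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches, and both have been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.

Ranking observations within groups

I have long data (multiple rows per person, one row per person-observation), and I basically want a variable that tells me how many times the person has been observed so far.

I have the first two columns and want the third one: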

 person wave obs
 pers1  1999 1
 pers1  2000 2
 pers1  2003 3
 pers2  1998 1
 pers2  2001 2

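Right now I'm using two loop-based approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries haven't really helped so far (the problem is hard to phrase).

Thanks for any pointers!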

 # ordered dataset by persnr and year of observation
 person.obs <- person.obs[order(person.obs$PERSNR, person.obs$wave), ]
 person.obs$n.obs = 0
 # first approach: loop through people and assign range
 unp = unique(person.obs$PERSNR)
 unplength = length(unp)
 for(i in 1:unplength) {
   print(unp[i])
   person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs = 1:length(person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs)
   i=i+1
   gc()
 }
 # second approach: loop through rows and reset counter at new person
 pnr = 0
 for(i in 1:length(person.obs[,2])) {
   if(pnr!=person.obs[i,]$PERSNR) {
     pnr = person.obs[i,]$PERSNR
     e = 0
   }
   e=e+1
   person.obs[i,]$n.obs = e
   i=i+1
   gc()
 }

A few alternatives using the data.table and dplyr packages.

data.table:

 library(data.table)
 # setDT(foo) is needed to convert to a data.table
 setDT(foo)[, rn := 1:.N, by = person]

Or with the new rowid function (v1.9.7+, which was only available as the development version at the time of writing):

 setDT(foo)[, rn := rowid(person)]

Both give:

 > foo
    person year rn
 1:  pers1 1999  1
 2:  pers1 2000  2
 3:  pers1 2003  3
 4:  pers2 1998  1
 5:  pers2 2011  2

If you want a true rank, you should use the frank function:

 setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
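The difference between a running counter and a rank only shows up when a person has tied values. The following is a minimal base R sketch of that difference, not the data.table implementation: the data frame `bar` and the helper `dense_rank` are made up for illustration, and `dense_rank` only mimics what a dense tie-breaking method does (tied values share a rank and no ranks are skipped).

```r
# Hypothetical data: pers1 has two observations in the same year (a tie).
bar <- data.frame(
  person = c("pers1", "pers1", "pers1", "pers2", "pers2"),
  year   = c(1999, 2000, 2000, 1998, 2011)
)

# Running counter: the position of each row within its person group.
bar$rn <- ave(seq_along(bar$year), bar$person, FUN = seq_along)

# Illustrative helper (an assumption, not part of any package):
# dense rank, where tied values share a rank and no ranks are skipped.
dense_rank <- function(x) match(x, sort(unique(x)))

# Dense rank of year within each person group.
bar$rk <- ave(bar$year, bar$person, FUN = dense_rank)

print(bar)
```

For pers1 the counter keeps increasing across the tied rows, while the dense rank repeats for the two observations from 2000.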

dplyr：

 library(dplyr)
 # method 1
 foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
 # method 2
 foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())

Both give a similar result:

 > foo
 Source: local data frame [5 x 3]
 Groups: person [2]
   person  year    rn
   (fctr) (dbl) (int)
 1  pers1  1999     1
 2  pers1  2000     2
 3  pers1  2003     3
 4  pers2  1998     1
 5  pers2  2011     2

Marek's answer to this question has proven very useful in the past. I wrote it down and use it almost daily, since it is fast and efficient. We'll use ave() and seq_along().

 foo <- data.frame(person = c(rep("pers1", 3), rep("pers2", 2)), year = c(1999, 2000, 2003, 1998, 2011))
 foo <- transform(foo, obs = ave(rep(NA, nrow(foo)), person, FUN = seq_along))
 foo
   person year obs
 1  pers1 1999   1
 2  pers1 2000   2
 3  pers1 2003   3
 4  pers2 1998   1
 5  pers2 2011   2

Another option using plyr:

 library(plyr)
 ddply(foo, "person", transform, obs2 = seq_along(person))
   person year obs obs2
 1  pers1 1999   1    1
 2  pers1 2000   2    2
 3  pers1 2003   3    3
 4  pers2 1998   1    1
 5  pers2 2011   2    2

Would this do the trick?

 > foo <- data.frame(person = c(rep("pers1", 3), rep("pers2", 2)), year = c(1999, 2000, 2003, 1998, 2011), obs = c(1, 2, 3, 1, 2))
 > foo
   person year obs
 1  pers1 1999   1
 2  pers1 2000   2
 3  pers1 2003   3
 4  pers2 1998   1
 5  pers2 2011   2
 > by(foo, foo$person, nrow)
 foo$person: pers1
 [1] 3
 ------------------------------------------------------------
 foo$person: pers2
 [1] 2

Another option in base R, using aggregate and rank:
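Note that the by() call above returns the number of rows per person rather than a running counter. As a hedged sketch of one way to get from group sizes to the counter the question asks for, the run lengths can be expanded with sequence(). This assumes the rows are already sorted so that each person's rows are contiguous; the variable name `run_lengths` is made up for illustration.

```r
foo <- data.frame(
  person = c(rep("pers1", 3), rep("pers2", 2)),
  year   = c(1999, 2000, 2003, 1998, 2011)
)

# Lengths of consecutive runs of the same person.
# Assumption: rows are sorted so each person forms one contiguous run.
run_lengths <- rle(as.character(foo$person))$lengths

# sequence() expands each run length n into 1:n and concatenates the pieces.
foo$obs <- sequence(run_lengths)

print(foo)
```

Because this works on whole vectors, it avoids both the per-row and the per-person loop.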

 foo$obs <- unlist(aggregate(.~person, foo, rank)[,2])
 #   person year obs
 # 1  pers1 1999   1
 # 2  pers1 2000   2
 # 3  pers1 2003   3
 # 4  pers2 1998   1
 # 5  pers2 2011   2
