# How to sort a data frame by column(s)?

```r
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"),
                 y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))
dd
#     b x y z
# 1  Hi A 8 1
# 2 Med D 3 1
# 3  Hi A 9 1
# 4 Low C 9 2
```

```
R> dd[with(dd, order(-z, b)), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1
```

```
R> dd[order(-dd[,4], dd[,1]), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1
```

### Your choices

• `order` from `base`
• `arrange` from `dplyr`
• `setorder` from `data.table`
• `arrange` from `plyr`
• `sort` from `taRifx`
• `orderBy` from `doBy`
• `sortData` from `Deducer`

```r
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"),
                 y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))
library(taRifx)
sort(dd, f = ~ -z + b)
```

```r
library(plyr)
arrange(dd, desc(z), b)
```

```r
# Load each time
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"),
                 y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))
library(microbenchmark)

# Reload R between benchmarks
microbenchmark(dd[with(dd, order(-z, b)), ],
               dd[order(-dd$z, dd$b), ],
               times = 1000)
```

Median time for `dd[with(dd, order(-z, b)), ]`: 778

Median time for `dd[order(-dd$z, dd$b), ]`: 788

```r
library(taRifx)
microbenchmark(sort(dd, f = ~ -z + b), times = 1000)
```

```r
library(plyr)
microbenchmark(arrange(dd, desc(z), b), times = 1000)
```

```r
library(doBy)
microbenchmark(orderBy(~ -z + b, data = dd), times = 1000)
```

```r
library(Deducer)
microbenchmark(sortData(dd, c("z", "b"), increasing = c(FALSE, TRUE)), times = 1000)
```

```r
esort <- function(x, sortvar, ...) {
  attach(x)
  # Detach on exit; in the original, detach() sat after return() and never ran
  on.exit(detach(x))
  x[with(x, order(sortvar, ...)), ]
}
microbenchmark(esort(dd, -z, b), times = 1000)
```

```r
library(ggplot2)

m <- microbenchmark(
  arrange(dd, desc(z), b),
  sort(dd, f = ~ -z + b),
  dd[with(dd, order(-z, b)), ],
  dd[order(-dd$z, dd$b), ],
  times = 1000
)

uq <- function(x) fivenum(x)[4]  # upper quartile (upper hinge)
lq <- function(x) fivenum(x)[2]  # lower quartile (lower hinge)

y_min <- 0 # min(by(m$time, m$expr, lq))
y_max <- max(by(m$time, m$expr, uq)) * 1.05

p <- ggplot(m, aes(x = expr, y = time)) +
  coord_cartesian(ylim = c(y_min, y_max))
# ggplot2 >= 3.3.0 names these arguments fun / fun.min / fun.max
# (formerly fun.y / fun.ymin / fun.ymax)
p + stat_summary(fun = median, fun.min = lq, fun.max = uq, aes(fill = expr))
```

(Lines extend from the lower quartile to the upper quartile; the dot is the median.)

```r
## The data.frame way
dd[with(dd, order(-z, b)), ]

## The data.table way: (7 fewer characters, but that's not the important bit)
dd[order(-z, b)]
```

```r
quarterlyreport[with(quarterlyreport, order(-z, b)), ]
```

```r
quarterlyreport[with(lastquarterlyreport, order(-z, b)), ]
```

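`data.table` cares about small details like this, so we did something simple to avoid typing variable names twice. It is very simple: `i` is already evaluated within the frame of `dd` automatically. You don't need `with()` at all.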

```r
dd[with(dd, order(-z, b)), ]
```

```r
dd[order(-z, b)]
```

```r
quarterlyreport[with(lastquarterlyreport, order(-z, b)), ]
```

```r
quarterlyreport[order(-z, b)]
```

```r
library(dplyr)

# sort mtcars by mpg, ascending... use desc(mpg) for descending
arrange(mtcars, mpg)

# sort mtcars first by mpg, then by cyl, then by wt
arrange(mtcars, mpg, cyl, wt)
```

```r
arrange(dd, desc(z), b)
#     b x y z
# 1 Low C 9 2
# 2 Med D 3 1
# 3  Hi A 8 1
# 4  Hi A 9 1
```

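The R package `data.table` provides fast and memory-efficient ordering of data.tables with a straightforward syntax (part of which Matt has highlighted quite nicely in his answer). There have been quite a lot of improvements since then, and there is also a new function, `setorder()`. From `v1.9.5+`, `setorder()` also works with data.frames.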

### Data:

```r
require(plyr)
require(doBy)
require(data.table)
require(dplyr)
require(taRifx)

set.seed(45L)
dat = data.frame(b = as.factor(sample(c("Hi", "Med", "Low"), 1e8, TRUE)),
                 x = sample(c("A", "D", "C"), 1e8, TRUE),
                 y = sample(100, 1e8, TRUE),
                 z = sample(5, 1e8, TRUE),
                 stringsAsFactors = FALSE)
```

### Benchmarks:

```r
orderBy(~ -z + b, data = dat)      ## doBy
plyr::arrange(dat, desc(z), b)     ## plyr
arrange(dat, desc(z), b)           ## dplyr
sort(dat, f = ~ -z + b)            ## taRifx
dat[with(dat, order(-z, b)), ]     ## base R

# convert to data.table, by reference
setDT(dat)

dat[order(-z, b)]                  ## data.table, base R like syntax
setorder(dat, -z, b)               ## data.table, using setorder()
                                   ## setorder() now also works with data.frames

# R-session memory usage (BEFORE) = ~2GB (size of 'dat')
# ------------------------------------------------------------
# Package      function    Time (s)  Peak memory  Memory used
# ------------------------------------------------------------
# doBy         orderBy      409.7      6.7 GB       4.7 GB
# taRifx       sort         400.8      6.7 GB       4.7 GB
# plyr         arrange      318.8      5.6 GB       3.6 GB
# base R       order        299.0      5.6 GB       3.6 GB
# dplyr        arrange       62.7      4.2 GB       2.2 GB
# ------------------------------------------------------------
# data.table   order          6.2      4.2 GB       2.2 GB
# data.table   setorder       4.5      2.4 GB       0.4 GB
# ------------------------------------------------------------
```

• `data.table`'s `DT[order(...)]` syntax was about 10x faster than the fastest of the other methods (`dplyr`), while consuming the same amount of memory as `dplyr`.

• `data.table`'s `setorder()` was about 14x faster than the fastest of the other methods (`dplyr`), while taking just 0.4GB of extra memory. `dat` is now in the order we require (as it is updated by reference).

### data.table features:

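• data.table's ordering is extremely fast because it implements radix ordering.

• The syntax `DT[order(...)]` is optimised internally to use data.table's fast ordering as well. You can keep using the familiar base R syntax but speed up the process (and use less memory).

• Most of the time, we don't need the original data.frame or data.table after reordering. That is, we usually assign the result back to the same object, for example: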

```r
DF <- DF[order(...)]
```

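The issue is that this requires at least twice (2x) the memory of the original object. To be memory efficient, data.table therefore also provides a function `setorder()`.

`setorder()` reorders data.tables by reference (in-place), without making any additional copies. It only uses extra memory equal to the size of one column.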
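As a rough illustration of what "by reference" means here, the sketch below sorts a tiny table with `setorder()` and checks that the object still lives at the same memory address afterwards. The table `DT` and its values are made up for the example, and it assumes the `data.table` package is installed; `address()` is data.table's helper for inspecting an object's address.

```r
library(data.table)

# Small illustrative table (hypothetical data, same column names as `dd`)
DT <- data.table(b = c("Hi", "Med", "Hi", "Low"),
                 y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))

addr_before <- address(DT)  # address of the object before sorting
setorder(DT, -z, b)         # reorder in place: z descending, then b ascending
addr_after <- address(DT)   # address after sorting

# TRUE when the same object was reordered rather than replaced by a copy
same_object <- identical(addr_before, addr_after)

print(DT)
print(same_object)
```

Note that there is no assignment around `setorder()`: the call modifies `DT` itself, which is why no second copy of the data is needed.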

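1. It supports `integer`, `logical`, `numeric`, `character` and even `bit64::integer64` types.

Note that `factor`, `Date`, `POSIXct` etc. classes are all `integer` / `numeric` types underneath with additional attributes, and are therefore supported as well.

2. In base R, we cannot use `-` on a character vector to sort by that column in decreasing order. Instead we have to use `-xtfrm(.)`.

However, in data.table, we can just do, for example, `dat[order(-x)]` or `setorder(dat, -x)`.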
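For the base R side of point 2, here is a minimal sketch of sorting a character column in decreasing order. It uses a cut-down, made-up version of `dd`. The `decreasing = c(...)` vector form with `method = "radix"` is an alternative available in reasonably recent versions of base R; it is shown for comparison and is not part of the original answer.

```r
# Hypothetical cut-down version of `dd` with a character column x
dd <- data.frame(x = c("A", "D", "A", "C"),
                 y = c(8, 3, 9, 9),
                 stringsAsFactors = FALSE)

# `-dd$x` would fail for a character vector, so map it to numeric ranks first
by_xtfrm <- dd[order(-xtfrm(dd$x), dd$y), ]

# Alternative: per-column sort direction via radix ordering
by_radix <- dd[order(dd$x, dd$y, decreasing = c(TRUE, FALSE), method = "radix"), ]

print(by_xtfrm)
print(identical(by_xtfrm, by_radix))
```

Both calls sort `x` in descending order and break ties by ascending `y`.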

```r
sort(dd, by = ~ -z + b)
#     b x y z
# 4 Low C 9 2
# 2 Med D 3 1
# 1  Hi A 8 1
# 3  Hi A 9 1
```

```r
library(doBy)
dd <- orderBy(~ -z + b, data = dd)
```

```r
newdata <- A[order(-A$x), ]
```

```r
newdata <- A[order(-A$x, A$y, -A$z), ]
```

```r
library(Deducer)
dd <- sortData(dd, c("z", "b"), increasing = c(FALSE, TRUE))
```

```r
set.seed(1234)

ID        = 1:10
Age       = round(rnorm(10, 50, 1))
diag      = c("Depression", "Bipolar")
Diagnosis = sample(diag, 10, replace = TRUE)

data = data.frame(ID, Age, Diagnosis)

databyAge = data[order(Age), ]
databyAge
```

```r
my.data <- read.table(text = '
  id age  diagnosis
   1  49 Depression
   2  50 Depression
   3  51 Depression
   4  48 Depression
   5  50 Depression
   6  51    Bipolar
   7  49    Bipolar
   8  49    Bipolar
   9  49    Bipolar
  10  49 Depression
', header = TRUE)
```

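```r
# Fails unless an object named `age` exists outside the data frame
databyage = my.data[order(age), ]
```

```r
# Works: the column is referenced through the data frame
databyage = my.data[order(my.data$age), ]
```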
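The difference between the two lines above comes down to where R looks up `age`. A small sketch of that behaviour follows, using a made-up five-row stand-in for `my.data` and assuming no object called `age` exists in the calling environment. The `with()` variant is the form used elsewhere on this page.

```r
# Hypothetical miniature of my.data
my.data <- data.frame(id = 1:5, age = c(49, 50, 51, 48, 50))

# A bare column name is looked up outside the data frame, so this normally
# errors (assuming no object called `age` is defined in the environment)
bare <- tryCatch(my.data[order(age), ],
                 error = function(e) conditionMessage(e))

by_dollar <- my.data[order(my.data$age), ]         # explicit column reference
by_with   <- my.data[with(my.data, order(age)), ]  # with() evaluates inside my.data

print(bare)
print(identical(by_dollar, by_with))
```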

```r
set.seed(1234)

v1 <- c(0,0,0,0, 0,0,0,0, 1,1,1,1, 1,1,1,1)
v2 <- c(0,0,0,0, 1,1,1,1, 0,0,0,0, 1,1,1,1)
v3 <- c(0,0,1,1, 0,0,1,1, 0,0,1,1, 0,0,1,1)
v4 <- c(0,1,0,1, 0,1,0,1, 0,1,0,1, 0,1,0,1)

df.1 <- data.frame(v1, v2, v3, v4)
df.1

# Shuffle the rows
rdf.1 <- df.1[sample(nrow(df.1), nrow(df.1), replace = FALSE), ]
rdf.1

# Sort by all columns, left to right
order.rdf.1 <- rdf.1[do.call(order, as.list(rdf.1)), ]
order.rdf.1

# Sort by all columns, right to left
order.rdf.2 <- rdf.1[do.call(order, rev(as.list(rdf.1))), ]
order.rdf.2

# Sort by an arbitrary column priority: v2, v4, v3, v1
rdf.3 <- data.frame(rdf.1$v2, rdf.1$v4, rdf.1$v3, rdf.1$v1)
rdf.3

order.rdf.3 <- rdf.1[do.call(order, as.list(rdf.3)), ]
order.rdf.3
```

```r
dd <- dd[with(dd, order(-z, b)), ]
```

```r
library(dplyr)
library(data.table)
```

# dplyr

```r
# Note: arrange_() and tbl_df() are deprecated in recent dplyr releases;
# this is the standard-evaluation form as originally posted.
df1 <- tbl_df(iris)

# using strings or formula
arrange_(df1, c('Petal.Length', 'Petal.Width'))
arrange_(df1, ~Petal.Length, ~Petal.Width)
# Source: local data frame [150 x 5]
#
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#           (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
# 1           4.6         3.6          1.0         0.2  setosa
# 2           4.3         3.0          1.1         0.1  setosa
# 3           5.8         4.0          1.2         0.2  setosa
# 4           5.0         3.2          1.2         0.2  setosa
# 5           4.7         3.2          1.3         0.2  setosa
# 6           5.4         3.9          1.3         0.4  setosa
# 7           5.5         3.5          1.3         0.2  setosa
# 8           4.4         3.0          1.3         0.2  setosa
# 9           5.0         3.5          1.3         0.3  setosa
# 10          4.5         2.3          1.3         0.3  setosa
# ..          ...         ...          ...         ...     ...

# Or using a variable
sortBy <- c('Petal.Length', 'Petal.Width')
arrange_(df1, .dots = sortBy)
# Source: local data frame [150 x 5]
#
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#           (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
# 1           4.6         3.6          1.0         0.2  setosa
# 2           4.3         3.0          1.1         0.1  setosa
# 3           5.8         4.0          1.2         0.2  setosa
# 4           5.0         3.2          1.2         0.2  setosa
# 5           4.7         3.2          1.3         0.2  setosa
# 6           5.5         3.5          1.3         0.2  setosa
# 7           4.4         3.0          1.3         0.2  setosa
# 8           4.4         3.2          1.3         0.2  setosa
# 9           5.0         3.5          1.3         0.3  setosa
# 10          4.5         2.3          1.3         0.3  setosa
# ..          ...         ...          ...         ...     ...

# Doing the same operation except sorting Petal.Length in descending order
sortByDesc <- c('desc(Petal.Length)', 'Petal.Width')
arrange_(df1, .dots = sortByDesc)
```

# data.table

```r
dt1 <- data.table(iris)  # not really required, as you can work directly on your data.frame
sortBy   <- c('Petal.Length', 'Petal.Width')
sortType <- c(-1, 1)
setorderv(dt1, sortBy, sortType)
dt1
#      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#   1:          7.7         2.6          6.9         2.3 virginica
#   2:          7.7         2.8          6.7         2.0 virginica
#   3:          7.7         3.8          6.7         2.2 virginica
#   4:          7.6         3.0          6.6         2.1 virginica
#   5:          7.9         3.8          6.4         2.0 virginica
#  ---
# 146:          5.4         3.9          1.3         0.4    setosa
# 147:          5.8         4.0          1.2         0.2    setosa
# 148:          5.0         3.2          1.2         0.2    setosa
# 149:          4.3         3.0          1.1         0.1    setosa
# 150:          4.6         3.6          1.0         0.2    setosa
```

```r
library(BBmisc)
sortByCol(dd, c("z", "b"), asc = c(FALSE, TRUE))
#     b x y z
# 4 Low C 9 2
# 2 Med D 3 1
# 1  Hi A 8 1
# 3  Hi A 9 1
```

```r
library(microbenchmark)
microbenchmark(sortByCol(dd, c("z", "b"), asc = c(FALSE, TRUE)), times = 100000)
# median 202.878

library(plyr)
microbenchmark(arrange(dd, desc(z), b), times = 100000)
# median 148.758

microbenchmark(dd[with(dd, order(-z, b)), ], times = 100000)
# median 115.872
```

```r
dd <- dd[order(dd$b, decreasing = FALSE), ]
```

```r
dd <- dd[order(dd$z, decreasing = TRUE), ]
```