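Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2

I have the following 2 data.frames:

    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])

I want to find the rows that a1 has and a2 doesn't.

Is there a built-in function for this type of operation?

(P.S.: I wrote a solution for it; I am simply curious whether someone has already written more carefully crafted code.)

Here is my solution: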

    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])

    rows.in.a1.that.are.not.in.a2 <- function(a1, a2) {
        # Collapse every row into a single string and compare the strings
        a1.vec <- apply(a1, 1, paste, collapse = "")
        a2.vec <- apply(a2, 1, paste, collapse = "")
        a1.without.a2.rows <- a1[!a1.vec %in% a2.vec, ]
        return(a1.without.a2.rows)
    }

    rows.in.a1.that.are.not.in.a2(a1, a2)

This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

    library(compare)
    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])
    comparison <- compare(a1, a2, allowAll = TRUE)
    comparison$tM
    #   a b
    # 1 1 a
    # 2 2 b
    # 3 3 c

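The function compare gives you a lot of flexibility in terms of which kinds of comparisons are allowed (e.g. changing the order of the elements of each vector, changing the order and names of variables, shortening variables, changing the case of strings). From this, you should be able to figure out what is missing from one or the other. For example (this is not very elegant):

    difference <- data.frame(lapply(1:ncol(a1), function(i) setdiff(a1[, i], comparison$tM[, i])))
    colnames(difference) <- colnames(a1)
    difference
    #   a b
    # 1 4 d
    # 2 5 e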

sqldf provides a nice solution:

    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])

    require(sqldf)

    a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

And the rows that are in both data frames:

    a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

The new version of dplyr has a function, anti_join, for exactly this kind of comparison:

    require(dplyr)
    anti_join(a1, a2)

And semi_join filters the rows in a1 that are also in a2:

    semi_join(a1, a2)

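This is certainly not efficient for this particular purpose, but what I often do in these situations is insert indicator variables into each data.frame and then merge:

    a1$included_a1 <- TRUE
    a2$included_a2 <- TRUE
    res <- merge(a1, a2, all = TRUE)

Missing values in included_a1 will mark the rows that are missing from a1. Similarly for a2.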
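As a minimal sketch of how the merged result can be read, using the question's data: rows that exist only in a1 end up with NA in included_a2. The name only_in_a1 and the is.na() filter are my own illustration, not part of the answer above.

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Indicator columns record which data.frame each merged row came from
a1$included_a1 <- TRUE
a2$included_a2 <- TRUE

# Full outer join on the shared columns (a and b)
res <- merge(a1, a2, all = TRUE)

# Rows with NA in included_a2 were present only in a1
# (only_in_a1 is a hypothetical name used for this illustration)
only_in_a1 <- res[is.na(res$included_a2), c("a", "b")]
only_in_a1
```

The same filter on included_a1 would give the rows that exist only in a2.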

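One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where rows are coded as the same when in fact they are different. The advantage of using merge is that you get, for free, all the error checking that a good solution needs.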
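To make the second problem concrete, here is a small made-up case (the data and variable names are mine, chosen only to trigger the collision). Pasting row values together without a separator maps two different rows to the same string; pasting with a separator that cannot occur in the data is one possible way around it.

```r
# Two different rows that collapse to the same string "111"
a1 <- data.frame(a = 1,  b = 11)
a2 <- data.frame(a = 11, b = 1)

# Keys built as in the question: no separator between column values
naive_key1 <- apply(a1, 1, paste, collapse = "")
naive_key2 <- apply(a2, 1, paste, collapse = "")
naive_result <- a1[!naive_key1 %in% naive_key2, ]

# Keys built column-wise with a separator unlikely to appear in the data
safe_key1 <- do.call(paste, c(a1, sep = "\r"))
safe_key2 <- do.call(paste, c(a2, sep = "\r"))
safe_result <- a1[!safe_key1 %in% safe_key2, ]

nrow(naive_result)  # the a1 row is wrongly treated as present in a2
nrow(safe_result)   # the a1 row is correctly reported as missing from a2
```

The separator version still assumes that both data.frames have their columns in the same order.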

In dplyr:

    setdiff(a1, a2)

Basically, setdiff(bigFrame, smallFrame) gets you the extra records in the first table.

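In the SQLverse this is called an excluding join:

[Image: Venn diagram of an excluding join]

For a good description of all the join options and set subjects, this is one of the best summaries I have seen to date: http://www.vertabelo.com/blog/technical-articles/sql-joins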

But back to this question. Here are the results of the setdiff() code when using the OP's data:

    > a1
      a b
    1 1 a
    2 2 b
    3 3 c
    4 4 d
    5 5 e

    > a2
      a b
    1 1 a
    2 2 b
    3 3 c

    > setdiff(a1, a2)
      a b
    1 4 d
    2 5 e

Even anti_join(a1, a2) will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.

    > df1 <- data.frame(a = 1:5, b = letters[1:5], row = 1:5)
    > df2 <- data.frame(a = 1:3, b = letters[1:3], row = 1:3)
    > df_compare = compare_df(df1, df2, "row")

    > df_compare$comparison_df
      row chng_type a b
    1   4         + 4 d
    2   5         + 5 e

A more complicated example:

    library(compareDF)
    df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                             "Hornet 4 Drive", "Duster 360", "Merc 240D"),
                     id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
                     hp = c(110, 110, 181, 110, 245, 62),
                     cyl = c(6, 6, 4, 6, 8, 4),
                     qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))

    df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                             "Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
                     id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
                     hp = c(110, 110, 93, 110, 175, 105),
                     cyl = c(6, 6, 4, 6, 8, 6),
                     qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))

    # Assumed call (not shown in the original post): group on both id columns
    df_compare = compare_df(df1, df2, c("id1", "id2"))

    > df_compare$comparison_df
      grp chng_type                id1 id2  hp cyl  qsec
    1   1         -  Hornet Sportabout Dus 175   8 17.02
    2   2         +         Datsun 710 Dat 181   4 33.00
    3   2         -         Datsun 710 Dat  93   4 18.61
    4   3         +         Duster 360 Dus 245   8 15.84
    5   7         +          Merc 240D Mer  62   4 20.00
    6   8         -            Valiant Val 105   6 20.22

The package also has an html_output command for quick checking:

    df_compare$html_output

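I adapted the merge function to get this functionality. On larger data frames it uses less memory than the full merge solution, and I can play with the names of the key columns.

Another solution is to use the library prob.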

    # Derived from src/library/base/R/merge.R
    # Part of the R package, http://www.R-project.org
    #
    # This program is free software; you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation; either version 2 of the License, or
    # (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    # GNU General Public License for more details.
    #
    # A copy of the GNU General Public License is available at
    # http://www.r-project.org/Licenses/

    XinY <- function(x, y, by = intersect(names(x), names(y)),
                     by.x = by, by.y = by,
                     notin = FALSE, incomparables = NULL, ...)
    {
        fix.by <- function(by, df)
        {
            ## fix up 'by' to be a valid set of cols by number: 0 is row.names
            if(is.null(by)) by <- numeric(0L)
            by <- as.vector(by)
            nc <- ncol(df)
            if(is.character(by))
                by <- match(by, c("row.names", names(df))) - 1L
            else if(is.numeric(by)) {
                if(any(by < 0L) || any(by > nc))
                    stop("'by' must match numbers of columns")
            } else if(is.logical(by)) {
                if(length(by) != nc)
                    stop("'by' must match number of columns")
                by <- seq_along(by)[by]
            } else stop("'by' must specify column(s) as numbers, names or logical")
            if(any(is.na(by))) stop("'by' must specify valid column(s)")
            unique(by)
        }

        nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
        by.x <- fix.by(by.x, x)
        by.y <- fix.by(by.y, y)
        if((lb <- length(by.x)) != length(by.y))
            stop("'by.x' and 'by.y' specify different numbers of columns")
        if(lb == 0L) {
            ## was: stop("no columns to match on")
            ## returns x (assigned to res so the code after this branch still works)
            res <- x
        } else {
            if(any(by.x == 0L)) {
                x <- cbind(Row.names = I(row.names(x)), x)
                by.x <- by.x + 1L
            }
            if(any(by.y == 0L)) {
                y <- cbind(Row.names = I(row.names(y)), y)
                by.y <- by.y + 1L
            }
            ## create keys from 'by' columns:
            if(lb == 1L) {              # (be faster)
                bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
                by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
            } else {
                ## Do these together for consistency in as.character.
                ## Use same set of names.
                bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
                names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
                bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
                bx <- bz[seq_len(nx)]
                by <- bz[nx + seq_len(ny)]
            }
            comm <- match(bx, by, 0L)
            if (notin) {
                res <- x[comm == 0,]
            } else {
                res <- x[comm > 0,]
            }
        }
        ## avoid a copy
        ## row.names(res) <- NULL
        attr(res, "row.names") <- .set_row_names(nrow(res))
        res
    }

    XnotinY <- function(x, y, by = intersect(names(x), names(y)),
                        by.x = by, by.y = by,
                        notin = TRUE, incomparables = NULL, ...)
    {
        XinY(x, y, by, by.x, by.y, notin, incomparables)
    }

Using the diffobj package:

    library(diffobj)

    diffPrint(a1, a2)
    diffObj(a1, a2)


Your example data does not contain any duplicates, but your solution handles them automatically. This means that some of the answers may not match the results of your function when there are duplicates.
Here is my solution, which handles duplicates the same way yours does. It also scales well!

    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])

    rows.in.a1.that.are.not.in.a2 <- function(a1, a2) {
        a1.vec <- apply(a1, 1, paste, collapse = "")
        a2.vec <- apply(a2, 1, paste, collapse = "")
        a1.without.a2.rows <- a1[!a1.vec %in% a2.vec, ]
        return(a1.without.a2.rows)
    }

    library(data.table)
    setDT(a1)
    setDT(a2)

    # no duplicates - as in example code
    r <- fsetdiff(a1, a2)
    all.equal(r, rows.in.a1.that.are.not.in.a2(a1, a2))
    #[1] TRUE

    # handling duplicates - make some duplicates
    a1 <- rbind(a1, a1, a1)
    a2 <- rbind(a2, a2, a2)
    r <- fsetdiff(a1, a2, all = TRUE)
    all.equal(r, rows.in.a1.that.are.not.in.a2(a1, a2))
    #[1] TRUE

It needs data.table 1.9.7, which can currently be installed from the source repo:

    install.packages("data.table", type = "source", repos = "https://Rdatatable.github.io/data.table")
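To illustrate the point about duplicates without any extra packages, here is a rough base R sketch on made-up data (the data and all names are mine). A key-matching filter like the one in the question keeps every copy of an unmatched row, whereas a set-style difference that de-duplicates its result would keep only one.

```r
# a1 contains the unmatched row (4, "d") twice
a1 <- data.frame(a = c(1, 4, 4), b = c("a", "d", "d"))
a2 <- data.frame(a = 1, b = "a")

# Row keys with a separator that should not occur in the data
key1 <- do.call(paste, c(a1, sep = "\r"))
key2 <- do.call(paste, c(a2, sep = "\r"))

# Key matching keeps both copies of the unmatched row
keep_all <- a1[!key1 %in% key2, ]

# De-duplicating afterwards mimics a set-style difference
set_like <- unique(keep_all)

nrow(keep_all)  # 2
nrow(set_like)  # 1
```

Which of the two behaviours is wanted depends on whether duplicate rows carry meaning in your data.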

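Maybe this is too simplistic, but I have used this solution and find it very useful when I have a primary key that I can use to compare the data sets. Hope it helps.

    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])

    # TRUE for rows of a1 whose key (column a) does not appear in a2
    different.names <- (!a1$a %in% a2$a)
    not.in.a2 <- a1[different.names, ]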

You can use the daff package (which wraps the daff.js library using the V8 package):

    library(daff)

    diff_data(data_ref = a2, data = a1)

This produces the following difference object:

    Daff Comparison: 'a2' vs. 'a1'
    First 6 and last 6 patch lines:
       @@   a   b
    1 ... ... ...
    2       3   c
    3 +++   4   d
    4 +++   5   e
    5 ... ... ...
    6 ... ... ...
    7       3   c
    8 +++   4   d
    9 +++   5   e

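The diff format is described in the Coopy highlighter diff format for tables and should be fairly self-explanatory. The lines with +++ in the first column (@@) are the ones that are new in a1 and not present in a2.

The difference object can be passed to patch_data(), stored for documentation purposes using write_diff(), or visualized using render_diff():

    render_diff(
        diff_data(data_ref = a2, data = a1)
    )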

It generates a neat HTML output.


Another solution is based on match_df in plyr. Here is plyr's match_df:

    match_df <- function (x, y, on = NULL)
    {
        if (is.null(on)) {
            on <- intersect(names(x), names(y))
            message("Matching on: ", paste(on, collapse = ", "))
        }
        keys <- join.keys(x, y, on)
        x[keys$x %in% keys$y, , drop = FALSE]
    }

We can modify it to negate the match:

    library(plyr)
    negate_match_df <- function (x, y, on = NULL)
    {
        if (is.null(on)) {
            on <- intersect(names(x), names(y))
            message("Matching on: ", paste(on, collapse = ", "))
        }
        keys <- join.keys(x, y, on)
        # Keep the rows of x whose key does not appear in y
        x[!(keys$x %in% keys$y), , drop = FALSE]
    }

Then:

    diff <- negate_match_df(a1, a2)
