Ruby: How do I find and return a duplicate value in an array?

arr is an array of strings, for example: ["hello", "world", "stack", "overflow", "hello", "again"]

What would be a simple and elegant way to check whether arr has duplicates, and if so, return one of them (no matter which one)?

Examples:

    ["A", "B", "C", "B", "A"]    # => "A" or "B"
    ["A", "B", "C"]              # => nil

    a = ["A", "B", "C", "B", "A"]
    a.detect{ |e| a.count(e) > 1 }

You can do this several ways, with the first option being the fastest:

 ary = ["A", "B", "C", "B", "A"] ary.group_by{ |e| e }.select { |k, v| v.size > 1 }.map(&:first) ary.sort.chunk{ |e| e }.select { |e, chunk| chunk.size > 1 }.map(&:first) 

And an O(N^2) option (i.e. less efficient):

    ary.select{ |e| ary.count(e) > 1 }.uniq

Just find the first instance where the index of the object (counting from the left) does not equal the index of the object (counting from the right).

    arr.detect {|e| arr.rindex(e) != arr.index(e) }

If there are no duplicates, the return value will be nil.
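For example, a quick sketch with the sample array from the question:

    arr = ["A", "B", "C", "B", "A"]
    arr.index("A")                                     #=> 0
    arr.rindex("A")                                    #=> 4   (the indexes differ, so "A" is a duplicate)
    arr.detect { |e| arr.rindex(e) != arr.index(e) }   #=> "A"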

I believe this is also the fastest solution posted in the thread so far, as it doesn't rely on creating additional objects, and #index and #rindex are implemented in C. The big-O runtime is N^2 and thus slower than Sergio's, but since the "slow" part runs in C, the wall-clock time can be faster.

detect only finds one duplicate. find_all will find them all:

 a = ["A", "B", "C", "B", "A"] a.find_all { |e| a.count(e) > 1 } 

The Ruby Array object has a great method, select:

    select { |item| block } → new_ary
    select → an_enumerator

The first form is what interests you here. It allows you to select objects that pass a test.
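For instance, a trivial sketch of the first form:

    [1, 2, 3, 4, 5].select { |item| item.even? }  #=> [2, 4]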

The Ruby Array object has another method, count:

    count → int
    count(obj) → int
    count { |item| block } → int

In this case, you are interested in duplicates (objects that appear more than once in the array). The appropriate test is a.count(obj) > 1.
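For example, with the question's array (a quick sketch of the count(obj) form):

    a = ["A", "B", "C", "B", "A"]
    a.count        #=> 5
    a.count("B")   #=> 2   (appears more than once, so it is a duplicate)
    a.count("C")   #=> 1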

If a = ["A", "B", "C", "B", "A"], then

    a.select{|item| a.count(item) > 1}.uniq
      => ["A", "B"]

You stated that you want only one object. So pick one.
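For instance, just take the first one (a minimal sketch):

    a.select{|item| a.count(item) > 1}.uniq.first  #=> "A"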

Here are two ways to find a duplicate.

Use a set

    require 'set'

    def find_a_dup_using_set(arr)
      s = Set.new
      arr.find { |e| !s.add?(e) }
    end

    find_a_dup_using_set arr
      #=> "hello"

Use select in place of find to return an array of all duplicates.
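Something along these lines (a sketch; the helper name find_all_dups_using_set is just for illustration):

    require 'set'

    def find_all_dups_using_set(arr)
      s = Set.new
      arr.select { |e| !s.add?(e) }   # keep every element whose add? fails, i.e. every repeat
    end

    find_all_dups_using_set ["A", "B", "C", "B", "A"]  #=> ["B", "A"]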

Use Array#difference

    class Array
      def difference(other)
        h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
        reject { |e| h[e] > 0 && h[e] -= 1 }
      end
    end

    def find_a_dup_using_difference(arr)
      arr.difference(arr.uniq).first
    end

    find_a_dup_using_difference arr
      #=> "hello"

Drop .first to return an array of all duplicates.
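That is, roughly (a sketch built on the Array#difference patch above; the helper name is just for illustration):

    def find_all_dups_using_difference(arr)
      arr.difference(arr.uniq)   # every occurrence beyond the first of each element
    end

    find_all_dups_using_difference ["A", "B", "C", "B", "A"]  #=> ["B", "A"]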

Both methods return nil if there are no duplicates.
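For example, a quick check with a duplicate-free array:

    find_a_dup_using_set(["A", "B", "C"])         #=> nil
    find_a_dup_using_difference(["A", "B", "C"])  #=> nil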

I proposed that Array#difference be added to the Ruby core. More information is in my answer here.

Benchmark

Let's compare the suggested methods. First, we need an array for testing:

    CAPS = ('AAA'..'ZZZ').to_a.first(10_000)

    def test_array(nelements, ndups)
      arr = CAPS[0, nelements-ndups]
      arr = arr.concat(arr[0,ndups]).shuffle
    end

and a method to run the benchmarks for the different test arrays:

    require 'fruity'

    def benchmark(nelements, ndups)
      arr = test_array nelements, ndups
      puts "\n#{ndups} duplicates\n"
      compare(
        Naveed:    -> {arr.detect{|e| arr.count(e) > 1}},
        Sergio:    -> {(arr.inject(Hash.new(0)) {|h,e| h[e] += 1; h}.find {|k,v| v > 1} || [nil]).first },
        Ryan:      -> {(arr.group_by{|e| e}.find {|k,v| v.size > 1} || [nil]).first},
        Chris:     -> {arr.detect {|e| arr.rindex(e) != arr.index(e)} },
        Cary_set:  -> {find_a_dup_using_set(arr)},
        Cary_diff: -> {find_a_dup_using_difference(arr)}
      )
    end

I have not included @JjP's answer because only one duplicate is to be returned, and when his/her answer is modified to do that, it is the same as @Naveed's earlier answer. Nor have I included @Marin's answer which, while posted before @Naveed's, returned all duplicates rather than just one (a minor point, but there is no reason to evaluate both, as they are identical when only one duplicate is returned).

I also modified other answers that returned all duplicates to return just the first one found, but that should have essentially no effect on performance, as they compute all duplicates before selecting one.

First, suppose the array contains 100 elements:

    benchmark(100, 0)
    0 duplicates
    Running each test 64 times. Test will take about 2 seconds.
    Cary_set is similar to Cary_diff
    Cary_diff is similar to Ryan
    Ryan is similar to Sergio
    Sergio is faster than Chris by 4x ± 1.0
    Chris is faster than Naveed by 2x ± 1.0

    benchmark(100, 1)
    1 duplicates
    Running each test 128 times. Test will take about 2 seconds.
    Cary_set is similar to Cary_diff
    Cary_diff is faster than Ryan by 2x ± 1.0
    Ryan is similar to Sergio
    Sergio is faster than Chris by 2x ± 1.0
    Chris is faster than Naveed by 2x ± 1.0

    benchmark(100, 10)
    10 duplicates
    Running each test 1024 times. Test will take about 3 seconds.
    Chris is faster than Naveed by 2x ± 1.0
    Naveed is faster than Cary_diff by 2x ± 1.0 (results differ: AAC vs AAF)
    Cary_diff is similar to Cary_set
    Cary_set is faster than Sergio by 3x ± 1.0 (results differ: AAF vs AAC)
    Sergio is similar to Ryan

Now consider an array with 10,000 elements:

    benchmark(10000, 0)
    0 duplicates
    Running each test once. Test will take about 4 minutes.
    Ryan is similar to Sergio
    Sergio is similar to Cary_set
    Cary_set is similar to Cary_diff
    Cary_diff is faster than Chris by 400x ± 100.0
    Chris is faster than Naveed by 3x ± 0.1

    benchmark(10000, 1)
    1 duplicates
    Running each test once. Test will take about 1 second.
    Cary_set is similar to Cary_diff
    Cary_diff is similar to Sergio
    Sergio is similar to Ryan
    Ryan is faster than Chris by 2x ± 1.0
    Chris is faster than Naveed by 2x ± 1.0

    benchmark(10000, 10)
    10 duplicates
    Running each test once. Test will take about 11 seconds.
    Cary_set is similar to Cary_diff
    Cary_diff is faster than Sergio by 3x ± 1.0 (results differ: AAE vs AAA)
    Sergio is similar to Ryan
    Ryan is faster than Chris by 20x ± 10.0
    Chris is faster than Naveed by 3x ± 1.0

    benchmark(10000, 100)
    100 duplicates
    Cary_set is similar to Cary_diff
    Cary_diff is faster than Sergio by 11x ± 10.0 (results differ: ADG vs ACL)
    Sergio is similar to Ryan
    Ryan is similar to Chris
    Chris is faster than Naveed by 3x ± 1.0

Note that find_a_dup_using_difference(arr) would be much more efficient if Array#difference were implemented in C, which would be the case if it were added to the Ruby core.

find_all() returns an array containing all elements of enum for which block is not false.

To get the duplicate elements:

 >> arr = ["A", "B", "C", "B", "A"] >> arr.find_all { |x| arr.count(x) > 1 } => ["A", "B", "B", "A"] 

Or to get only the uniq duplicated elements:

    >> arr.find_all { |x| arr.count(x) > 1 }.uniq
    => ["A", "B"]

Something like this will work:

    arr = ["A", "B", "C", "B", "A"]
    arr.inject(Hash.new(0)) { |h,e| h[e] += 1; h }.
        select { |k,v| v > 1 }.
        collect { |x| x.first }

That is, put all values into a hash where the key is the array element and the value is the number of occurrences. Then select all elements that occur more than once. Easy.
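Step by step, the intermediate values look roughly like this (a sketch for the sample array, on Ruby 1.9+ where Hash#select returns a Hash):

    arr = ["A", "B", "C", "B", "A"]
    counts = arr.inject(Hash.new(0)) { |h,e| h[e] += 1; h }  #=> {"A"=>2, "B"=>2, "C"=>1}
    dups   = counts.select { |k,v| v > 1 }                   #=> {"A"=>2, "B"=>2}
    dups.collect { |x| x.first }                             #=> ["A", "B"]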

I know this thread is specifically about Ruby, but I landed here looking for how to do this within the context of Ruby on Rails with ActiveRecord, and thought I would share my solution too.

    class ActiveRecordClass < ActiveRecord::Base
      # has two columns, a primary key (id) and an email_address (string)
    end

    ActiveRecordClass.group(:email_address).having("count(*) > 1").count.keys

The example above returns an array of all email addresses that are duplicated in this example's database table (which in Rails would be "active_record_classes").
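If you then need the duplicated records themselves, something along these lines should work (a sketch assuming the same model and column):

    dup_addresses = ActiveRecordClass.group(:email_address).having("count(*) > 1").count.keys
    ActiveRecordClass.where(email_address: dup_addresses)  # all rows sharing a duplicated address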

 a = ["A", "B", "C", "B", "A"] a.each_with_object(Hash.new(0)) {|i,hash| hash[i] += 1}.select{|_, count| count > 1}.keys 

This is an O(n) procedure.

Alternatively, you can do either of the following lines. Also O(n), but with only one iteration:

    a.each_with_object(Hash.new(0).merge dup: []){|x,h| h[:dup] << x if (h[x] += 1) == 2}[:dup]

    a.inject(Hash.new(0).merge dup: []){|h,x| h[:dup] << x if (h[x] += 1) == 2;h}[:dup]

Alas, most of the answers are O(n^2).

Here is an O(n) solution:

    a = %w{the quick brown fox jumps over the lazy dog}
    h = Hash.new(0)
    a.find { |each| (h[each] += 1) == 2 }  # => "the"

What is the complexity of this?

  • Runs in O(n) and breaks on the first match
  • Uses O(n) memory, but only the minimal amount

Now, depending on how frequent duplicates are in your array, these runtimes may actually become even better. For example, if the array of size O(n) has been sampled from a population of only k << n different elements, the complexity for both runtime and space becomes O(k). However, it is more likely that the original poster is validating input and wants to make sure there are no duplicates. In that case both runtime and memory complexity are O(n), since we expect the elements to have no repetitions for the majority of inputs.
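A quick sketch of the early exit with the sample array: only the words scanned before the second "the" ever enter the hash.

    a = %w{the quick brown fox jumps over the lazy dog}
    h = Hash.new(0)
    a.find { |each| (h[each] += 1) == 2 }  #=> "the"
    h.keys  #=> ["the", "quick", "brown", "fox", "jumps", "over"]  ("lazy" and "dog" were never visited)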

Here is my take on finding duplicates in a large data set, such as a legacy dBase table:

    # Assuming ps is an array of 20000 part numbers & we want to find duplicates
    # (actually had to do it recently).
    # Having a result hash with the part number and the number of times the part is
    # duplicated is much more convenient in the real-world application.
    # Takes about 6 seconds to run on my data set
    # - not too bad for an export script handling 20000 parts.

    h = {}   # result hash
    ps.select { |e|
      ct = ps.count(e)
      h[e] = ct if ct > 1
    }; nil   # so that the huge result of select doesn't print in the console

Another way to find duplicates is intersection. Use the & operator:

    a = ['a', 'b', 'c', 'd']
    b = ['c', 'd', 'e', 'f']
    a & b
    => ['c', 'd']

    a = ["A", "B", "C", "B", "A"]
    b = a.select {|e| a.count(e) > 1}.uniq
    c = a - b
    d = b + c

Result:

    d => ["A", "B", "C"]

    r = [1, 2, 3, 5, 1, 2, 3, 1, 2, 1]
    r.group_by(&:itself).map { |k, v| v.size > 1 ? [k] + [v.size] : nil }.compact.sort_by(&:last).map(&:first)
    def firstRepeatedWord(string)
      h_data = Hash.new(0)
      string.split(" ").each { |x| h_data[x] += 1 }
      h_data.key(h_data.values.max)
    end
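A quick usage sketch (note that this actually returns the most frequently repeated word, not necessarily the first repeated one):

    firstRepeatedWord("the quick brown fox jumps over the lazy dog")  #=> "the"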