Word frequency count in Java 8

How do I count the frequency of words in a List in Java 8?

 List<String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");

The result must be:

 {ciao=2, hello=1, bye=2} 

I want to share the solution I found, because at first I expected to use a map-and-reduce approach, but it turned out to be a bit different.

 Map<String, Long> collect = wordsList.stream().collect(groupingBy(Function.identity(), counting())); 

Or, for Integer values:

 Map<String, Integer> collect = wordsList.stream().collect(groupingBy(Function.identity(), summingInt(e -> 1))); 
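For comparison, the same Integer-valued result can also be produced without collectors, by iterating once and using `Map.merge` — a minimal plain-Java sketch (the class and method names here are just for illustration):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountMerge {

    // Counts occurrences of each word with Map.merge:
    // the first time a word is seen it is stored with value 1,
    // afterwards Integer::sum adds 1 to the existing count.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> wordsList = Arrays.asList("hello", "bye", "ciao", "bye", "ciao");
        // Prints the counts; iteration order of a HashMap is unspecified.
        System.out.println(count(wordsList));
    }
}
```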

Edit

I'm adding how to sort the map by value:

 LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
     .stream()
     .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
     .collect(Collectors.toMap(
         Map.Entry::getKey,
         Map.Entry::getValue,
         (v1, v2) -> { throw new IllegalStateException(); },
         LinkedHashMap::new
     ));
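Putting the counting and the sorting together, here is a self-contained sketch (the class and method names are just for illustration): it counts with `groupingBy`/`counting` and then re-collects the entries into a `LinkedHashMap` ordered by descending count.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SortedWordCount {

    // Counts the words, then sorts the entries by descending value and
    // collects them into a LinkedHashMap, which preserves insertion order.
    static LinkedHashMap<String, Long> countSorted(List<String> words) {
        Map<String, Long> counts = words.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .collect(Collectors.toMap(
                Map.Entry::getKey,
                Map.Entry::getValue,
                (v1, v2) -> { throw new IllegalStateException(); },
                LinkedHashMap::new));
    }

    public static void main(String[] args) {
        System.out.println(countSorted(
            Arrays.asList("hello", "bye", "ciao", "bye", "ciao")));
    }
}
```

The merge function throws because no key can occur twice here; any entry with count 2 ("ciao" or "bye") comes before "hello" in the iteration order.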

Note: see the edits below

As an alternative to Mounas' answer, here is an approach that does the word count in parallel:

 import java.util.Arrays;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;

 public class ParallelWordCount
 {
     public static void main(String[] args)
     {
         List<String> list = Arrays.asList(
             "hello", "bye", "ciao", "bye", "ciao");
         Map<String, Integer> counts = list.parallelStream().
             collect(Collectors.toConcurrentMap(
                 w -> w, w -> 1, Integer::sum));
         System.out.println(counts);
     }
 }

Edit: In response to the comments, I ran a small test with JMH, comparing the toConcurrentMap and the groupingByConcurrent approach, with different input list sizes and random words of different lengths. This test suggested that the toConcurrentMap approach is faster. Considering how different these approaches are under the hood, it would be hard to predict something like this.

As a further extension, based on further comments, I extended the test to cover all four combinations of toMap, groupingBy, serial and parallel.

The results are still that the toMap approach is faster, but unexpectedly (at least, for me) the "concurrent" versions in both cases are slower than the serial versions...:

              (method)  (count)  (wordLength)  Mode  Cnt     Score    Error  Units
       toConcurrentMap     1000             2  avgt   50   146,636 ±  0,880  us/op
       toConcurrentMap     1000             5  avgt   50   272,762 ±  1,232  us/op
       toConcurrentMap     1000            10  avgt   50   271,121 ±  1,125  us/op
                 toMap     1000             2  avgt   50    44,396 ±  0,541  us/op
                 toMap     1000             5  avgt   50    46,938 ±  0,872  us/op
                 toMap     1000            10  avgt   50    46,180 ±  0,557  us/op
            groupingBy     1000             2  avgt   50    46,797 ±  1,181  us/op
            groupingBy     1000             5  avgt   50    68,992 ±  1,537  us/op
            groupingBy     1000            10  avgt   50    68,636 ±  1,349  us/op
  groupingByConcurrent     1000             2  avgt   50   231,458 ±  0,658  us/op
  groupingByConcurrent     1000             5  avgt   50   438,975 ±  1,591  us/op
  groupingByConcurrent     1000            10  avgt   50   437,765 ±  1,139  us/op
       toConcurrentMap    10000             2  avgt   50   712,113 ±  6,340  us/op
       toConcurrentMap    10000             5  avgt   50  1809,356 ±  9,344  us/op
       toConcurrentMap    10000            10  avgt   50  1813,814 ± 16,190  us/op
                 toMap    10000             2  avgt   50   341,004 ± 16,074  us/op
                 toMap    10000             5  avgt   50   535,122 ± 24,674  us/op
                 toMap    10000            10  avgt   50   511,186 ±  3,444  us/op
            groupingBy    10000             2  avgt   50   340,984 ±  6,235  us/op
            groupingBy    10000             5  avgt   50   708,553 ±  6,369  us/op
            groupingBy    10000            10  avgt   50   712,858 ± 10,248  us/op
  groupingByConcurrent    10000             2  avgt   50   901,842 ±  8,685  us/op
  groupingByConcurrent    10000             5  avgt   50  3762,478 ± 21,408  us/op
  groupingByConcurrent    10000            10  avgt   50  3795,530 ± 32,096  us/op

I don't have much experience with JMH, and maybe I did something wrong here – suggestions and corrections are welcome:

 import java.util.ArrayList;
 import java.util.List;
 import java.util.Map;
 import java.util.Random;
 import java.util.concurrent.TimeUnit;
 import java.util.function.Function;
 import java.util.stream.Collectors;

 import org.openjdk.jmh.annotations.Benchmark;
 import org.openjdk.jmh.annotations.BenchmarkMode;
 import org.openjdk.jmh.annotations.Mode;
 import org.openjdk.jmh.annotations.OutputTimeUnit;
 import org.openjdk.jmh.annotations.Param;
 import org.openjdk.jmh.annotations.Scope;
 import org.openjdk.jmh.annotations.Setup;
 import org.openjdk.jmh.annotations.State;
 import org.openjdk.jmh.infra.Blackhole;

 @State(Scope.Thread)
 public class ParallelWordCount
 {
     @Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
     public String method;

     @Param({"2", "5", "10"})
     public int wordLength;

     @Param({"1000", "10000"})
     public int count;

     private List<String> list;

     @Setup
     public void initList()
     {
         list = createRandomStrings(count, wordLength, new Random(0));
     }

     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MICROSECONDS)
     public void testMethod(Blackhole bh)
     {
         if (method.equals("toMap"))
         {
             Map<String, Integer> counts = list.stream().collect(
                 Collectors.toMap(w -> w, w -> 1, Integer::sum));
             bh.consume(counts);
         }
         else if (method.equals("toConcurrentMap"))
         {
             Map<String, Integer> counts = list.parallelStream().collect(
                 Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum));
             bh.consume(counts);
         }
         else if (method.equals("groupingBy"))
         {
             Map<String, Long> counts = list.stream().collect(
                 Collectors.groupingBy(
                     Function.identity(), Collectors.<String>counting()));
             bh.consume(counts);
         }
         else if (method.equals("groupingByConcurrent"))
         {
             Map<String, Long> counts = list.parallelStream().collect(
                 Collectors.groupingByConcurrent(
                     Function.identity(), Collectors.<String>counting()));
             bh.consume(counts);
         }
     }

     private static String createRandomString(int length, Random random)
     {
         StringBuilder sb = new StringBuilder();
         for (int i = 0; i < length; i++)
         {
             int c = random.nextInt(26);
             sb.append((char) (c + 'a'));
         }
         return sb.toString();
     }

     private static List<String> createRandomStrings(
         int count, int length, Random random)
     {
         List<String> list = new ArrayList<String>(count);
         for (int i = 0; i < count; i++)
         {
             list.add(createRandomString(length, random));
         }
         return list;
     }
 }

The timings are only similar for the serial case of a list with 10000 elements and 2-letter words.

It could be worth checking whether, for even larger list sizes, the concurrent versions eventually outperform the serial ones, but I currently don't have the time to run another detailed benchmark with all these configurations.

If you use Eclipse Collections, you can just convert the List to a Bag.

 Bag<String> words = Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag();
 Assert.assertEquals(2, words.occurrencesOf("ciao"));
 Assert.assertEquals(1, words.occurrencesOf("hello"));
 Assert.assertEquals(2, words.occurrencesOf("bye"));

This code will work with Java 5 – 8.

Note: I am a committer for Eclipse Collections

Here I'll present the solution I made (the one with grouping is better :)).

 private static Map<String, Integer> test0(List<String> input)
 {
     Set<String> set = input.stream()
         .collect(Collectors.toSet());
     return set.stream()
         .collect(Collectors.toMap(Function.identity(),
             str -> Collections.frequency(input, str)));
 }
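As a self-contained sketch of this approach (the class and method names are just for illustration), here is a slight variant that uses `distinct()` instead of collecting into a Set first. Note that `Collections.frequency` rescans the whole input list once per distinct word, so this is quadratic in the worst case and best kept to small inputs:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FrequencyWordCount {

    // Streams the distinct words, then looks up each one's count with
    // Collections.frequency (a full scan of the input list per word).
    static Map<String, Integer> count(List<String> input) {
        return input.stream()
            .distinct()
            .collect(Collectors.toMap(
                Function.identity(),
                str -> Collections.frequency(input, str)));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("hello", "bye", "ciao", "bye", "ciao")));
    }
}
```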

Just my 0.02$

My other 2 cents, given an array:

 import static java.util.stream.Collectors.*;

 String[] str = {"hello", "bye", "ciao", "bye", "ciao"};
 Map<String, Integer> collected = Arrays.stream(str)
     .collect(groupingBy(Function.identity(),
         collectingAndThen(counting(), Long::intValue)));