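Elegant ways to count the frequency of words in a file

What are elegant and effective ways to count the frequency of each "English" word in a file?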

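First of all, I defined a letter_only std::locale facet so that the stream ignores punctuation and reads only valid "English" letters from the input. That way the stream treats "ways", "ways." and "ways!" as the same word "ways", because it skips punctuation such as "." and "!".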

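    #include <algorithm>
    #include <locale>
    #include <vector>

    // ctype facet that classifies every character except ASCII letters as
    // whitespace, so operator>> skips punctuation and digits between words.
    struct letter_only : std::ctype<char> {
        letter_only() : std::ctype<char>(get_table()) {}

        static std::ctype_base::mask const* get_table() {
            static std::vector<std::ctype_base::mask>
                rc(std::ctype<char>::table_size, std::ctype_base::space);
            // Two separate ranges: filling 'A'..'z' in one go would also mark
            // the six characters between 'Z' and 'a' ([ \ ] ^ _ `) as letters.
            std::fill(&rc['A'], &rc['Z' + 1], std::ctype_base::alpha);
            std::fill(&rc['a'], &rc['z' + 1], std::ctype_base::alpha);
            return &rc[0];
        }
    };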

Solution 1

    #include <fstream>
    #include <iostream>
    #include <locale>
    #include <map>
    #include <string>

    // Uses the letter_only facet defined above.
    int main() {
        std::map<std::string, int> wordCount;
        std::ifstream input;
        input.imbue(std::locale(std::locale(), new letter_only())); // enable reading only letters!
        input.open("filename.txt");
        std::string word;
        while (input >> word) {
            ++wordCount[word];
        }
        for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it) {
            std::cout << it->first << " : " << it->second << std::endl;
        }
    }

Solution 2

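    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <locale>
    #include <map>
    #include <string>

    // Function object that accumulates the counts; std::for_each returns it,
    // and the conversion operator hands the map back to the caller.
    struct Counter {
        std::map<std::string, int> wordCount;
        void operator()(const std::string& item) { ++wordCount[item]; }
        operator std::map<std::string, int>() { return wordCount; }
    };

    // Uses the letter_only facet defined above.
    int main() {
        std::ifstream input;
        input.imbue(std::locale(std::locale(), new letter_only())); // enable reading only letters!
        input.open("filename.txt");
        std::istream_iterator<std::string> start(input);
        std::istream_iterator<std::string> end;
        std::map<std::string, int> wordCount = std::for_each(start, end, Counter());
        for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it) {
            std::cout << it->first << " : " << it->second << std::endl;
        }
    }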

Here is a working solution. It should work with real text (including punctuation):

    #include <cctype>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    std::string getNextToken(std::istream &in)
    {
        const int eof = std::istream::traits_type::eof();
        std::string ans;
        int c = in.get(); // int rather than char, so end-of-file stays distinguishable
        while (c != eof && !std::isalpha(c)) // skip non-letter characters
        {
            c = in.get();
        }
        while (c != eof && std::isalpha(c)) // collect one word, lower-cased
        {
            ans.push_back(static_cast<char>(std::tolower(c)));
            c = in.get();
        }
        return ans;
    }

    int main()
    {
        std::map<std::string, int> words;
        std::ifstream fin("input.txt");

        std::string s;
        while (!(s = getNextToken(fin)).empty()) // an empty token means end of input
            ++words[s];

        for (std::map<std::string, int>::iterator iter = words.begin(); iter != words.end(); ++iter)
            std::cout << iter->first << ' ' << iter->second << std::endl;
    }

Edit: my code now calls tolower on every letter.

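My solution is the following one. First, all symbols are converted to spaces. Then, basically the same approach as in the previous answers is used to extract the words: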

    #include <fstream>
    #include <map>
    #include <sstream>
    #include <string>

    const std::string Symbols = ",;.:-()\t!¡¿?\"[]{}&<>+-*/=#'";
    typedef std::map<std::string, unsigned int> WCCollection;

    void countWords(const std::string &fileName, WCCollection &wcc)
    {
        std::ifstream input( fileName.c_str() );

        if ( input.is_open() ) {
            std::string line;
            std::string word;

            while( std::getline( input, line ) ) {
                // Substitute punctuation symbols with spaces
                // (a non-const iterator is needed to modify the line)
                for(std::string::iterator it = line.begin(); it != line.end(); ++it) {
                    if ( Symbols.find( *it ) != std::string::npos ) {
                        *it = ' ';
                    }
                }

                // Let std::operator>> separate by spaces
                std::istringstream filter( line );
                while( filter >> word ) {
                    ++( wcc[word] );
                }
            }
        }
    }

Pseudocode for an algorithm that I believe is close to what you want:

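    import sys
    from collections import defaultdict

    counts = defaultdict(int)
    # Assumption: the file to analyse is given as the first command-line argument.
    with open(sys.argv[1]) as file:
        for line in file:
            for word in line.split():
                # Keep any token that contains at least one letter.
                if any(x.isalpha() for x in word):
                    counts[word.upper()] += 1

    # Most frequent first.
    freq = sorted(((count, word) for word, count in counts.items()), reverse=True)
    for count, word in freq:
        print("%d\t%s" % (count, word))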

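The case-insensitive comparison is handled naively and will probably combine words that you would not want combined in a fully general sense. Be careful with non-ASCII characters in your implementation of the above. False positives may include "1-800-555-TELL", "0xDEADBEEF" and "42 km", depending on what you want. Missed words include the "911" in "911 emergency services" (I would probably want that counted as three words).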

In short, natural language parsing is hard: depending on your actual use case, you can probably make do with some approximation.
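As one possible approximation in C++, the language of the question, the sketch below uses a stricter rule than the pseudocode above: a token counts as a word only if every character is a letter. The name `isPlainWord` is hypothetical, and the rule itself is an assumption about what you want, not something the answer prescribes.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical stricter filter: accept a token only if it is non-empty and
// consists entirely of letters, as classified by std::isalpha in the
// current C locale.
bool isPlainWord(const std::string& token) {
    return !token.empty() &&
           std::all_of(token.begin(), token.end(), [](unsigned char ch) {
               return std::isalpha(ch) != 0;
           });
}
```

With this rule "1-800-555-TELL" and "0xDEADBEEF" are rejected, but so is "ways." with its trailing period, so punctuation would have to be stripped before the check.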

Perl is arguably not so elegant, but it is very effective.
I posted a solution here: Processing huge text files

In a nutshell,

1) If needed, strip punctuation and convert uppercase to lowercase:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file

2) Count the occurrences of each word. Print the results sorted first by frequency, then alphabetically:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq

I ran this code on a 3.3 GB text file containing 580,000,000 words.
Perl 5.22 completed in under 3 minutes.
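For readers following the C++ answers in this thread, the same output order (descending count, ties broken alphabetically) might be sketched as below. `sortByFrequency` is a hypothetical helper name, and it assumes the counts already sit in a `std::map<std::string, int>` like the one built in the earlier solutions.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: copies the (word, count) pairs out of the map and
// orders them by descending count, breaking ties alphabetically.
std::vector<std::pair<std::string, int>>
sortByFrequency(const std::map<std::string, int>& wordCount) {
    std::vector<std::pair<std::string, int>> result(wordCount.begin(),
                                                    wordCount.end());
    std::sort(result.begin(), result.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) {
                      return a.second > b.second;  // higher count first
                  }
                  return a.first < b.first;        // then alphabetical
              });
    return result;
}
```

The map has to be copied into a vector first because a `std::map` is always ordered by its key and cannot be re-sorted by value in place.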

A simpler way is to count the number of spaces in the file until more than one space is found, if you only consider single spaces between words…

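1) Decide exactly what you mean by "an English word". The definition should cover things like whether "able-bodied" is one word or two, how to handle apostrophes ("Don't trust them!"), whether capitalization is significant, and so on.
2) Create a set of test cases so you can be sure you get all the decisions from step 1 right.
3) Create a tokenizer that reads the next word (as defined in step 1) from the input and returns it in a standard form. Depending on your definition, this might be a simple state machine, a regular expression, or just the extraction operators of <istream> (e.g. std::cin >> word;). Test your tokenizer with all the test cases from step 2.
4) Choose a data structure for keeping the words and their counts. In modern C++ you would probably end up with something like std::map<std::string, unsigned> or std::unordered_map<std::string, int>.
5) Write a loop that gets the next word from the tokenizer and increments its count in the histogram until there are no more words in the input.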
