Build an ASCII chart of a given text's most commonly used words

The challenge:

Build an ASCII chart of the most commonly used words in a given text.

The rules:

  • Only accept a-z, A-Z (alphabetic characters) as part of a word.
  • Ignore casing (She == she for our purposes).
  • Ignore the following words (quite arbitrary, I know): the, and, of, to, a, i, it, in, or, is
  • Clarification: considering don't: this would be counted as two different "words" in the a-z, A-Z ranges: don and t.

  • Optionally (it's too late to be officially changing the specification now) you may choose to drop all single-letter "words" (this could potentially shorten the ignore list).

Parse the given text (read a file specified via command-line argument, or piped in; assume us-ascii) and build us a word frequency chart with the following characteristics:

  • Display the chart (also see the example below) for the 22 most commonly used words (ordered by descending frequency).
  • The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.
  • Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possibly differing bar and word lengths: e.g. the second most common word could be a lot longer than the first while not differing that much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent); a non-golfed scaling sketch follows this list.
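For reference, here is a minimal, non-golfed Python sketch of that scaling rule (this is an illustrative sketch, not one of the submitted solutions; alice.txt is only a placeholder file name, while the ignore list and the 22-word cut-off come from the rules above):

 import re
 from collections import Counter

 IGNORE = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}

 with open("alice.txt") as f:                       # placeholder file name
     words = re.findall(r"[a-z]+", f.read().lower())

 top = Counter(w for w in words if w not in IGNORE).most_common(22)

 # The most constrained word fixes the scale factor, so that
 # "|" + bar + "| " + word + " " never exceeds 80 characters.
 k = min((76 - len(w)) / n for w, n in top)

 print(" " + "_" * int(top[0][1] * k))              # roof of the first bar
 for w, n in top:
     print("|" + "_" * int(n * k) + "| " + w + " ")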

An example:

The text for this example can be found here (Lewis Carroll's "Alice's Adventures in Wonderland").

This particular text would yield the following chart:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|_____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| s 
|____________________________| t 
|__________________________| on 
|__________________________| all 
|_______________________| this 
|_______________________| for 
|_______________________| had 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so 


For your information: these are the frequencies the above chart was built upon:

 [('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358),
 …,
 ('but', 175), ('be', 167), ('not', 166), ('they', 155), ('so', 152)]

A second example (to check whether you implemented the complete spec): replace every occurrence of you in the linked Alice in Wonderland file with superlongstringstring:

 _______________________________________________________________
|_______________________________________________________________| she 
|_______________________________________________________| superlongstringstring 
|____________________________________________________| said 
|______________________________________________| alice 
|________________________________________| was 
|_____________________________________| that 
|_______________________________| as 
|____________________________| her 
|_________________________| with 
|_________________________| at 
|_________________________| s 
|________________________| t 
|_______________________| on 
|______________________| all 
|____________________| this 
|____________________| for 
|____________________| had 
|____________________| but 
|___________________| be 
|__________________| not 
|_________________| they 
|_________________| so 

The winner:

The shortest solution (by character count, per language). Have fun!


Edit: a table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):

 Language              Relaxed  Strict
 =========             =======  ======
 GolfScript                130     143
 Perl                              185
 Windows PowerShell        148     199
 Mathematica                       199
 Ruby                      185     205
 Unix Toolchain            194     228
 Python                    183     243
 Clojure                           282
 Scala                             311
 Haskell                           333
 Awk                               336
 R                                 298
 Javascript                304     354
 Groovy                            321
 Matlab                            404
 C#                                422
 Smalltalk                         386
 PHP                               450
 F#                                452
 TSQL                      483     507

The numbers represent the length of the shortest solution in a given language. "Strict" refers to a solution that implements the complete spec (draws the |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency, etc.). "Relaxed" means some liberties were taken to shorten the solution.

Only solutions of 500 characters or less are included. The list of languages is sorted by the length of the "strict" solution. "Unix Toolchain" is used to signify various solutions that use the traditional *nix shell plus a mix of tools (such as grep, tr, sort, uniq, head, perl, awk).

LabVIEW, 51 nodes, 5 structures, 10 diagrams

Teaching an elephant to tap-dance is never going to be pretty. I'll, ah, skip the character count.

(image: LabVIEW code)

(image: result)

The program flows from left to right:

(image: LabVIEW code, explained)

Ruby 1.9, 185 characters

(Largely based on the other Ruby solutions.)

 w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
 k,l=w[0]
 puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

Doesn't use any command-line switches like the other solutions; just pass the file name as an argument (i.e. ruby1.9 wordfrequency.rb Alice.txt).

Because I use character literals here, this solution only works with Ruby 1.9.

Edit: replaced the semicolons with line breaks for "readability". :P

Edit 2: Shtééf pointed out that I forgot the trailing space; fixed.

Edit 3: removed the trailing space again ;)

GolfScript, 177 175 173 167 164 163 144 131 130 characters

Slow: three minutes for the sample text (130)

 {32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{" |"\~1*2/0*'| '@}/ 

Explanation:

 {            #loop through all characters
  32|.        #convert to uppercase and duplicate
  123%97<     #determine if is a letter
  n@if        #return either the letter or a newline
 }%           #return an array (of ints)
 ]''*         #convert array to a string with magic
 n%           #split on newline, removing blanks (stack is an array of words now)
 "oftoitinorisa"  #push this string
 2/           #split into groups of two, ie ["of" "to" "it" "in" "or" "is" "a"]
 -            #remove any occurrences from the text
 "theandi"3/- #remove "the", "and", and "i"
 $            #sort the array of words
 (1@          #takes the first word in the array, pushes a 1, reorders stack
              #the 1 is the current number of occurrences of the first word
 {            #loop through the array
  .3$>1{;)}if #increment the count or push the next word and a 1
 }/
 ]2/          #gather stack into an array and split into groups of 2
 {~~\;}$      #sort by the latter element - the count of occurrences of each word
 22<          #take the first 22 elements
 .0=~:2;      #store the highest count
 ,76\-:1      #store the length of the first line
 '_':0*' '\@  #make the first line
 {            #loop through each word
  " |"\~      #start drawing the bar
  1*2/0       #divide by zero
  *'| '@      #finish drawing the bar
 }/

"Correct" (hopefully). (143)

 {32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{" |"\~1*^/0*'| '@}/ 

A bit slower: half a minute. (162)

 '"'/' ':S*n/S*'"#{%q '\+" .downcase.tr('^a-z',' ')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{" |"\~1*2/0*'| '@}/ 

The output can be seen in the revision log.

206

shell,grep,tr,grep,sort,uniq,sort,head,perl

 ~ % wc -c wfg
 209 wfg
 ~ % cat wfg
 egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
 ~ % # usage:
 ~ % sh wfg < 11.txt

hm, just noticed above: sort -nr -> sort -n and then head -> tail => 208 🙂
update2: erm, of course the above is silly, as it would then be reversed. So, 209.
update3: optimized the exclusion regex -> 206

 egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'

Just for fun, here's a perl-only version (faster):

 ~ % wc -c pgolf
 204 pgolf
 ~ % cat pgolf
 perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
 ~ % # usage:
 ~ % sh pgolf < 11.txt

Transact SQL set-based solution (SQL Server 2005), 1063 892 873 853 827 820 783 683 647 644 630 characters

Thanks to Gabe for some useful suggestions that reduced the character count.

Note: line breaks added to avoid scroll bars; only the last line break is needed.

 DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A', SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING (@,i+1,1)FROM N WHERE i<LEN(@))SELECT i,L,i-RANK()OVER(ORDER BY i)R INTO #D FROM N WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)SELECT TOP 22 W,-COUNT(*)C INTO # FROM(SELECT DISTINCT R,(SELECT''+L FROM #D WHERE R=b.R FOR XML PATH (''))W FROM #D b)t WHERE LEN(W)>1 AND W NOT IN('the','and','of','to','it', 'in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+ REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+' |'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @

Readable version

 DECLARE @ VARCHAR(MAX), @F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',SINGLE_BLOB)x; /* Loads text file from path C:\WINDOWS\system32\A */ /*Recursive common table expression to generate a table of numbers from 1 to string length (and associated characters)*/ WITH N AS (SELECT 1 i, LEFT(@,1)L UNION ALL SELECT i+1, SUBSTRING(@,i+1,1) FROM N WHERE i<LEN(@) ) SELECT i, L, i-RANK()OVER(ORDER BY i)R /*Will group characters from the same word together*/ INTO #D FROM N WHERE L LIKE'[AZ]'OPTION(MAXRECURSION 0) /*Assuming case insensitive accent sensitive collation*/ SELECT TOP 22 W, -COUNT(*)C INTO # FROM (SELECT DISTINCT R, (SELECT ''+L FROM #D WHERE R=bR FOR XML PATH('') )W /*Reconstitute the word from the characters*/ FROM #D b ) T WHERE LEN(W)>1 AND W NOT IN('the', 'and', 'of' , 'to' , 'it' , 'in' , 'or' , 'is') GROUP BY W ORDER BY C /*Just noticed this looks risky as it relies on the order of evaluation of the variables. I'm not sure that's guaranteed but it works on my machine :-) */ SELECT @F=MIN(($76-LEN(W))/-C), @ =' ' +REPLICATE('_',-MIN(C)*@F)+' ' FROM # SELECT @=@+' |'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @ 

Output

  _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| You |____________________________________________________________| said |_____________________________________________________| Alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| at |_____________________________| with |__________________________| on |__________________________| all |_______________________| This |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| So |___________________| very |__________________| what 

And with the long string:

  _______________________________________________________________ |_______________________________________________________________| she |_______________________________________________________| superlongstringstring |____________________________________________________| said |______________________________________________| Alice |________________________________________| was |_____________________________________| that |_______________________________| as |____________________________| her |_________________________| at |_________________________| with |_______________________| on |______________________| all |____________________| This |____________________| for |____________________| had |____________________| but |___________________| be |__________________| not |_________________| they |_________________| So |________________| very |________________| what 

Ruby, 207 213 211 210 207 203 201 200 characters

An improvement over Anurag's version, incorporating rfusca's suggestion. Also drops the argument to sort and some other minor golfing.

 w=(STDIN.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort.take 22;k,l=w[0];m=76.0-l.size;puts' '+'_'*m;w.map{|f,x|puts"|#{'_'*(m*f/k)}| #{x} "}

Execute it like:

 ruby GolfedWordFrequencies.rb < Alice.txt 

Edit: put 'puts' back; it's needed to avoid quotes in the output.
Edit 2: changed File -> IO
Edit 3: removed /i
Edit 4: removed the parentheses around (f*1.0)
Edit 5: used string addition for the first line; expanded it in place.
Edit 6: made m a float, removing the 1.0. Edit: doesn't work, changes the lengths. Edit: no worse than before
Edit 7: use STDIN.read

Mathematica (297 284 248 244 242 199 characters) pure functional

and a Zipf's law test

Look ma... no variables, no hands, no head

Edit 1 > Defining some shorthand (284 characters)

 f[x_, y_] := Flatten[Take[x, All, y]]; BarChart[f[{##}, -1], BarOrigin -> Left, ChartLabels -> Placed[f[{##}, 1], After], Axes -> None ] & @@ Take[ SortBy[ Tally[ Select[ StringSplit[ToLowerCase[Import[i]], RegularExpression["\\W+"]], !MemberQ[{"the", "and", "of", "to", "a", "i", "it", "in", "or","is"}, #]&] ], Last], -22] 

Some explanation

 Import[]                  # Get The File
 ToLowerCase []            # To Lower Case :)
 StringSplit[ STRING , RegularExpression["\\W+"]]
                           # Split By Words, getting a LIST
 Select[ LIST, !MemberQ[{LIST_TO_AVOID}, #]&]
                           # Select from LIST except those words in LIST_TO_AVOID
                           # Note that !MemberQ[{LIST_TO_AVOID}, #]& is a FUNCTION for the test
 Tally[LIST]               # Get the LIST {word,word,..} and produce another {{word,counter},{word,counter}...}
 SortBy[ LIST ,Last]       # Get the list produced by Tally and sort by counters
                           # Note that counters are the LAST element of {word,counter}
 Take[ LIST ,-22]          # Once sorted, get the biggest 22 counters
 BarChart[f[{##}, -1], ChartLabels -> Placed[f[{##}, 1], After]] &@@ LIST
                           # Get the list produced by Take as input and produce a bar chart
 f[x_, y_] := Flatten[Take[x, All, y]]
                           # Auxiliary to get the list of the first or second element of lists of lists x_ depending upon y
                           # So f[{##}, -1] is the list of counters
                           # and f[{##}, 1] is the list of words (labels for the chart)

Output

(image: output bar chart)

Mathematica is not well suited to golfing, and that's just because of the long, descriptive function names. Functions like "RegularExpression[]" or "StringSplit[]" just make me sob :(.

The Zipf's law test

Zipf's law predicts that for natural-language text, a plot of Log(Rank) vs Log(Occurrences) follows a linear relationship.

The law is used in developing algorithms for cryptography and data compression. (But it's not the "Z" in the LZW algorithm.)

In our text we can test it in the following way:

  f[x_, y_] := Flatten[Take[x, All, y]]; ListLogLogPlot[ Reverse[f[{##}, -1]], AxesLabel -> {"Log (Rank)", "Log Counter"}, PlotLabel -> "Testing Zipf's Law"] & @@ Take[ SortBy[ Tally[ StringSplit[ToLowerCase[b], RegularExpression["\\W+"]] ], Last], -1000] 

The result is (pretty nicely linear):

(image: log-log plot of rank vs. occurrences)

Edit 6 > (242 characters)

Reworked the regular expression (no Select function)
Dropping 1-character words
More efficient definition of function "f"

 f = Flatten[Take[#1, All, #2]]&; BarChart[ f[{##}, -1], BarOrigin -> Left, ChartLabels -> Placed[f[{##}, 1], After], Axes -> None] & @@ Take[ SortBy[ Tally[ StringSplit[ToLowerCase[Import[i]], RegularExpression["(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"]] ], Last], -22] 

Edit 7 > 199 characters

 BarChart[#2, BarOrigin->Left, ChartLabels->Placed[#1, After], Axes->None]&@@ Transpose@Take[SortBy[Tally@StringSplit[ToLowerCase@Import@i, RegularExpression@"(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"],Last], -22] 
  • Transpose and Slot (#1 / #2) arguments replace f
  • We don't need no stinkin' brackets (use f@x instead of f[x] where possible)

C# - 510 451 436 446 434 426 422 characters (minified)

Not that short, but now probably correct! Note that the previous versions did not display the first line of the bars, did not scale the bars correctly, downloaded the file instead of reading it from stdin, and did not include all the required C# verbosity. You could easily shave off many strokes if C# didn't need so much extra cruft. Maybe PowerShell can do better.

 using C=System.Console;   // alias for Console
 using System.Linq;        // for Split, GroupBy, Select, OrderBy, etc.
 class Class               // must define a class
 {
     static void Main()    // must define a Main
     {
         // split into words
         var allwords = System.Text.RegularExpressions.Regex.Split(
             // convert stdin to lowercase
             C.In.ReadToEnd().ToLower(),
             // eliminate stopwords and non-letters
             @"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+")
             .GroupBy(x => x)           // group by words
             .OrderBy(x => -x.Count())  // sort descending by count
             .Take(22);                 // take first 22 words
         // compute length of longest bar + word
         var lendivisor = allwords.Max(y => y.Count() / (76.0 - y.Key.Length));
         // prepare text to print
         var toPrint = allwords.Select(x=> new {
                 // remember bar pseudographics (will be used in two places)
                 Bar = new string('_',(int)(x.Count()/lendivisor)),
                 Word=x.Key })
             .ToList();                 // convert to list so we can index into it
         // print top of first bar
         C.WriteLine(" " + toPrint[0].Bar);
         toPrint.ForEach(x =>           // for each word, print its bar and the word
             C.WriteLine("|" + x.Bar + "| " + x.Word));
     }
 }

422 characters with lendivisor inlined in the form below (which makes it 22 times slower), with line breaks inserted at selected whitespace:

 using System.Linq;using C=System.Console;class M{static void Main(){var a=System.Text.RegularExpressions.Regex.Split(C.In.ReadToEnd().ToLower(),@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+").GroupBy(x=>x).OrderBy(x=>-x.Count()).Take(22);var b=a.Select(x=>new{p=new string('_',(int)(x.Count()/a.Max(y=>y.Count()/(76d-y.Key.Length)))),t=x.Key}).ToList();C.WriteLine(" "+b[0].p);b.ForEach(x=>C.WriteLine("|"+x.p+"| "+x.t));}}

Perl, 237 229 209 characters

(Updated again to beat the Ruby version with some more dirty golf tricks: replacing split/[^a-z]/,lc with lc=~/[a-z]+/g, and dropping a check for empty strings in another place. These were inspired by the Ruby version, so credit where credit is due.)

Update: now with Perl 5.10! Replaced print with say, and used ~~ to avoid a map. This has to be invoked on the command line as perl -E '<one-liner>' alice.txt. Since the whole script is on one line, writing it as a one-liner should not present any difficulty :).

 @s=qw/the and of to a i it in or is/;$c{$_}++foreach grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>;@s=sort{$c{$b}<=>$c{$a}}keys%c;$f=76-length$s[0];say" "."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "foreach@s[0..21];

Note that this version normalizes case. This doesn't shorten the solution, since removing ,lc (the lowercasing) requires you to add A-Z to the split regex, so it's a wash.

If you're on a system where a newline is one character instead of two, you can shorten this by another character by using a literal newline in place of \n. However, I haven't written the sample above that way, since it's "clearer" (ha!) this way.


Here's a mostly correct, but not nearly short enough, Perl solution:

 use strict; use warnings; my %short = map { $_ => 1 } qw/the and of to a i it in or is/; my %count = (); $count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-zA-Z]/ } (<>); my @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21]; my $widest = 76 - (length $sorted[0]); print " " . ("_" x $widest) . "\n"; foreach (@sorted) { my $width = int(($count{$_} / $count{$sorted[0]}) * $widest); print "|" . ("_" x $width) . "| $_ \n"; } 

The following is about as short as it can get while remaining relatively readable. (392 characters.)

 %short = map { $_ => 1 } qw/the and of to a i it in or is/; %count; $count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-z]/, lc } (<>); @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21]; $widest = 76 - (length $sorted[0]); print " " . "_" x $widest . "\n"; print"|" . "_" x int(($count{$_} / $count{$sorted[0]}) * $widest) . "| $_ \n" foreach @sorted; 

Windows PowerShell, 199 characters

 $x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *
 filter f($w){' '+'_'*$w
 $x[-1..-22]|%{"|$('_'*($w*$_.Count/$x[-1].Count))| "+$_.Name}}
 f(76..1|?{!((f $_)-match'.'*80)})[0]

(The last line break isn't necessary, but is included here for readability.)

(The current code and my test files can be found in my SVN repository. I hope my test cases catch the most common errors (bar length, regex matching problems and a few others).)

Assumptions:

  • US-ASCII as input. It would probably get weird with Unicode.
  • The text contains at least two non-stop words.

History

Relaxed version (137), since that is apparently being counted separately now:

 ($x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *)[-1..-22]|%{"|$('_'*(76*$_.Count/$x[-1].Count))| "+$_.Name} 
  • Doesn't close the first bar
  • Doesn't account for the word length of words other than the first

The one-character variation in bar length compared to other solutions is caused by PowerShell using rounding instead of truncation when converting floating-point numbers to integers. Since the task only requires proportional bar lengths, this should be fine.

Compared to other solutions, I took a slightly different approach to determining the longest bar length: I simply try candidate lengths and take the highest one for which no output line exceeds 80 characters.
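The same brute-force idea, sketched in Python rather than PowerShell (my illustration, not the golfed code above); top is assumed to be a list of (word, count) pairs in descending order of count, and round mirrors the rounding behaviour mentioned above:

 def best_width(top, limit=80):
     # Try bar widths from widest to narrowest and keep the first one
     # for which every rendered line stays within the limit.
     for width in range(76, 0, -1):
         lines = ["|" + "_" * round(width * n / top[0][1]) + "| " + w + " "
                  for w, n in top]
         if all(len(line) <= limit for line in lines):
             return width
     return 1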

An older version can be found here.

Ruby, 215 216 218 221 224 236 237 characters

Update 1: Hurray! It's a tie with JS Bangs' solution. I can't think of a way to shave off any more :)

Update 2: Played a dirty golf trick. Changed each to map to save 1 character :)

Update 3: Changed File.read to IO.read, +2. Array#group_by wasn't very fruitful, changed to reduce, +6. The case-insensitive check isn't needed after downcasing and using a lowercase range in the regex, +1. Sorting in descending order is easily done by negating the value, +6. Total savings: +15

Update 4: [0] instead of .first, +3. (@Shtééf)

Update 5: Expanded the variable l in place, +1. Expanded the variable s in place, +2. (@Shtééf)

Update 6: Used string addition instead of interpolation for the first line, +2. (@Shtééf)

 w=(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take 22;m=76-w[0][0].size;puts' '+'_'*m;w.map{|x,f|puts"|#{'_'*(f*1.0/w[0][1]*m)}| #{x} "}

Update 7: I went through a lot of deliberation to use an instance variable to detect the first iteration of the loop. All I got was +1, though there's perhaps some potential. Keeping the previous version, because I believe this one is black magic. (@Shtééf)

 (IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take(22).map{|x,f|@f||(@f=f;puts' '+'_'*(@m=76-x.size));puts"|#{'_'*(f*1.0/@f*@m)}| #{x} "}

Readable version

 string = File.read($_).downcase
 words = string.scan(/[a-z]+/i)
 allowed_words = words - %w{the and of to a i it in or is}
 sorted_words = allowed_words.group_by{ |x| x }.map{ |x,y| [x, y.size] }.sort{ |a,b| b[1] <=> a[1] }.take(22)
 highest_frequency = sorted_words.first
 highest_frequency_count = highest_frequency[1]
 highest_frequency_word = highest_frequency[0]
 word_length = highest_frequency_word.size
 widest = 76 - word_length
 puts " #{'_' * widest}"
 sorted_words.each do |word, freq|
   width = (freq * 1.0 / highest_frequency_count) * widest
   puts "|#{'_' * width}| #{word} "
 end

Usage:

 echo "Alice.txt" | ruby -ln GolfedWordFrequencies.rb 

Output:

  _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |____________________________| s |____________________________| t |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so 

Python 2.x, latitudinarian approach = 227 183 chars

 import sys,re
 t=re.split('\W+',sys.stdin.read().lower())
 r=sorted((-t.count(w),w)for w in set(t)if w not in'andithetoforinis')[:22]
 for l,w in r:print(78-len(r[0][1]))*l/r[0][0]*'=',w

Allowing for freedom in the implementation, I constructed a string concatenation that contains all the words requested for exclusion ( the, and, of, to, a, i, it, in, or, is ) – plus it also excludes the two infamous "words" s and t from the example – and I threw in for free the exclusion for an, for, he . I tried all concatenations of those words against corpus of the words from Alice, King James' Bible and the Jargon file to see if there are any words that will be mis-excluded by the string. And that is how I ended with two exclusion strings: itheandtoforinis and andithetoforinis .
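As a small illustration of that trick (a Python sketch of the idea, not the golfed code above): w not in S is a substring test when S is a string, so one carefully chosen concatenation can stand in for the whole stop-word set, provided no legitimate word happens to occur inside it, which is why the corpus check described above matters. The helper names here are made up for the example.

 S = "andithetoforinis"   # covers the, and, of, to, a, i, it, in, or, is (plus an, for, he)

 def keep(words):
     # substring membership: any word occurring inside S is dropped
     return [w for w in words if w not in S]

 def falsely_excluded(corpus):
     # words a given corpus would lose beyond the intended stop words
     stop = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}
     return sorted(w for w in set(corpus) if w in S and w not in stop)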

PS. Borrowed from other solutions to shorten the code.

 =========================================================================== she ================================================================= you ============================================================== said ====================================================== alice ================================================ was ============================================ that ===================================== as ================================= her ============================== at ============================== with =========================== on =========================== all ======================== this ======================== had ======================= but ====================== be ====================== not ===================== they ==================== so =================== very =================== what ================= little 

Rant

Regarding words to ignore, one would think those would be taken from list of the most used words in English. That list depends on the text corpus used. Per one of the most popular lists ( http://en.wikipedia.org/wiki/Most_common_words_in_English , http://www.english-for-students.com/Frequently-Used-Words.html , http://www.sporcle.com/games/common_english_words.php ), top 10 words are: the be(am/are/is/was/were) to of and a in that have I

The top 10 words from the Alice in Wonderland text are the and to a of it she i you said
The top 10 words from the Jargon File (v4.4.7) are the a of to and in is that or for

So question is why or was included in the problem's ignore list, where it's ~30th in popularity when the word that (8th most used) is not. etc, etc. Hence I believe the ignore list should be provided dynamically (or could be omitted).

Alternative idea would be simply to skip the top 10 words from the result – which actually would shorten the solution (elementary – have to show only the 11th to 32nd entries).


Python 2.x, punctilious approach = 277 243 chars

The chart drawn in the above code is simplified (using only one character for the bars). If one wants to reproduce exactly the chart from the problem description (which was not required), this code will do it:

 import sys,re
 t=re.split('\W+',sys.stdin.read().lower())
 r=sorted((-t.count(w),w)for w in set(t)-set(sys.argv))[:22]
 h=min(9*l/(77-len(w))for l,w in r)
 print'',9*r[0][0]/h*'_'
 for l,w in r:print'|'+9*l/h*'_'+'|',w

I take an issue with the somewhat random choice of the 10 words to exclude the, and, of, to, a, i, it, in, or, is so those are to be passed as command line parameters, like so:
python WordFrequencyChart.py the and of to a i it in or is <"Alice's Adventures in Wonderland.txt"

This is 213 chars + 30 if we account for the "original" ignore list passed on command line = 243

PS. The second code also does an "adjustment" for the lengths of all top words, so none of them will overflow in the degenerate case.

  _______________________________________________________________ |_______________________________________________________________| she |_______________________________________________________| superlongstringstring |_____________________________________________________| said |______________________________________________| alice |_________________________________________| was |______________________________________| that |_______________________________| as |____________________________| her |__________________________| at |__________________________| with |_________________________| s |_________________________| t |_______________________| on |_______________________| all |____________________| this |____________________| for |____________________| had |____________________| but |___________________| be |___________________| not |_________________| they |_________________| so 

Haskell – 366 351 344 337 333 characters

(One line break in main added for readability, and no line break needed at end of last line.)

 import Data.List
 import Data.Char
 l=length
 t=filter
 m=map
 f c|isAlpha c=toLower c|0<1=' '
 h w=(-l w,head w)
 x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
 q?(g,w)=q*(77-l w)`div`g
 b x=m(x!)x
 a(l:r)=(' ':t(=='_')l):l:r
 main=interact$unlines.a.b.take 22.sort.m h.group.sort
  .t(`notElem`words"the and of to a i it in or is").words.m f

How it works is best seen by reading the argument to interact backwards:

  • map f lowercases alphabetics, replaces everything else with spaces.
  • words produces a list of words, dropping the separating whitespace.
  • filter (`notElem` words "the and of to a i it in or is") discards all entries with forbidden words.
  • group . sort sorts the words, and groups identical ones into lists.
  • map h maps each list of identical words to a tuple of the form (-frequency, word) .
  • take 22 . sort sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.
  • b maps tuples to bars (see below).
  • a prepends the first line of underscores, to complete the topmost bar.
  • unlines joins all these lines together with newlines.

The tricky bit is getting the bar length right. I assumed that only underscores counted towards the length of the bar, so || would be a bar of zero length. The function b maps c x over x , where x is the list of histograms. The entire list is passed to c , so that each invocation of c can compute the scale factor for itself by calling u . In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.

Note the trick of using -frequency . This removes the need to reverse the sort since sorting (ascending) -frequency will places the words with the largest frequency first. Later, in the function u , two -frequency values are multiplied, which will cancel the negation out.
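A quick illustration of that negation trick in Python (my own example, not code from this answer; several of the other solutions here rely on the same idea):

 counts = {"she": 553, "you": 481, "said": 462}
 # Sorting (-count, word) pairs in the default ascending order puts the
 # largest counts first, so no reverse/descending flag is needed.
 top = sorted((-n, w) for w, n in counts.items())
 # -> [(-553, 'she'), (-481, 'you'), (-462, 'said')]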

JavaScript 1.8 (SpiderMonkey) – 354

 x={};p='|';e=' ';z=[];c=77
 while(l=readline())l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y)x[y]?x[y].c++:z.push(x[y]={w:y,c:1}))
 z=z.sort(function(a,b)b.c-a.c).slice(0,22)
 for each(v in z){v.r=v.c/z[0].c
 c=c>(l=(77-v.w.length)/v.r)?l:c}for(k in z){v=z[k]
 s=Array(v.r*c|0).join('_')
 if(!+k)print(e+s+e)
 print(p+s+p+e+v.w)}

Sadly, the for([k,v]in z) from the Rhino version doesn't seem to want to work in SpiderMonkey, and readFile() is a little easier than using readline() but moving up to 1.8 allows us to use function closures to cut a few more lines….

Adding whitespace for readability:

 x={};p='|';e=' ';z=[];c=77
 while(l=readline())
   l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,
     function(y) x[y] ? x[y].c++ : z.push( x[y] = {w: y, c: 1} )
   )
 z=z.sort(function(a,b) b.c - a.c).slice(0,22)
 for each(v in z){
   v.r=v.c/z[0].c
   c=c>(l=(77-v.w.length)/v.r)?l:c
 }
 for(k in z){
   v=z[k]
   s=Array(v.r*c|0).join('_')
   if(!+k)print(e+s+e)
   print(p+s+p+e+v.w)
 }

Usage: js golf.js < input.txt

Output:

  _________________________________________________________________________
 |_________________________________________________________________________| she
 |_______________________________________________________________| you
 |____________________________________________________________| said
 |_____________________________________________________| alice
 |_______________________________________________| was
 |___________________________________________| that
 |____________________________________| as
 |________________________________| her
 |_____________________________| at
 |_____________________________| with
 |____________________________| s
 |____________________________| t
 |__________________________| on
 |__________________________| all
 |_______________________| this
 |_______________________| for
 |_______________________| had
 |_______________________| but
 |______________________| be
 |_____________________| not
 |____________________| they
 |____________________| so

(base version – doesn't handle bar widths correctly)

JavaScript (Rhino) – 405 395 387 377 368 343 304 chars

I think my sorting logic is off, but.. I duno. Brainfart fixed.

Minified (abusing \n 's interpreted as a ; sometimes):

 x={};p='|';e=' ';z=[]
 readFile(arguments[0]).toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y){x[y]?x[y].c++:z.push(x[y]={w:y,c:1})})
 z=z.sort(function(a,b){return b.c-a.c}).slice(0,22)
 for([k,v]in z){s=Array((v.c/z[0].c)*70|0).join('_')
 if(!+k)print(e+s+e)
 print(p+s+p+e+v.w)}

PHP CLI version (450 chars)

This solution takes into account the last requirement, which most purists have conveniently chosen to ignore. That cost 170 characters!

Usage: php.exe <this.php> <file.txt>

Minified:

 <?php $a=array_count_values(array_filter(preg_split('/[^a-z]/',strtolower(file_get_contents($argv[1])),-1,1),function($x){return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);}));arsort($a);$a=array_slice($a,0,22);function R($a,$F,$B){$r=array();foreach($a as$x=>$f){$l=strlen($x);$r[$x]=$b=$f*$B/$F;if($l+$b>76)return R($a,$f,76-$l);}return$r;}$c=R($a,max($a),76-strlen(key($a)));foreach($a as$x=>$f)echo '|',str_repeat('-',$c[$x]),"| $x\n";?>

Human readable:

 <?php
 // Read:
 $s = strtolower(file_get_contents($argv[1]));
 // Split:
 $a = preg_split('/[^a-z]/', $s, -1, PREG_SPLIT_NO_EMPTY);
 // Remove unwanted words:
 $a = array_filter($a, function($x){
     return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);
 });
 // Count:
 $a = array_count_values($a);
 // Sort:
 arsort($a);
 // Pick top 22:
 $a = array_slice($a,0,22);
 // Recursive function to adjust bar widths
 // according to the last requirement:
 function R($a,$F,$B){
     $r = array();
     foreach($a as $x=>$f){
         $l = strlen($x);
         $r[$x] = $b = $f * $B / $F;
         if ( $l + $b > 76 ) return R($a,$f,76-$l);
     }
     return $r;
 }
 // Apply the function:
 $c = R($a,max($a),76-strlen(key($a)));
 // Output:
 foreach ($a as $x => $f)
     echo '|',str_repeat('-',$c[$x]),"| $x\n";
 ?>

Output:

 |-------------------------------------------------------------------------| she |---------------------------------------------------------------| you |------------------------------------------------------------| said |-----------------------------------------------------| alice |-----------------------------------------------| was |-------------------------------------------| that |------------------------------------| as |--------------------------------| her |-----------------------------| at |-----------------------------| with |--------------------------| on |--------------------------| all |-----------------------| this |-----------------------| for |-----------------------| had |-----------------------| but |----------------------| be |---------------------| not |--------------------| they |--------------------| so |-------------------| very |------------------| what 

When there is a long word, the bars are adjusted properly:

 |--------------------------------------------------------| she |---------------------------------------------------| thisisareallylongwordhere |-------------------------------------------------| you |-----------------------------------------------| said |-----------------------------------------| alice |------------------------------------| was |---------------------------------| that |---------------------------| as |-------------------------| her |-----------------------| with |-----------------------| at |--------------------| on |--------------------| all |------------------| this |------------------| for |------------------| had |-----------------| but |-----------------| be |----------------| not |---------------| they |---------------| so |--------------| very 

Python 3.1 – 245 229 characters

I guess using Counter is kind of cheating 🙂 I just read about it about a week ago, so this was the perfect chance to see how it works.

 import re,collections
 o=collections.Counter([w for w in re.findall("[a-z]+",open("!").read().lower())if w not in"a and i in is it of or the to".split()]).most_common(22)
 print('\n'.join('|'+76*v//o[0][1]*'_'+'| '+k for k,v in o))

It prints:

 |____________________________________________________________________________| she |__________________________________________________________________| you |_______________________________________________________________| said |_______________________________________________________| alice |_________________________________________________| was |_____________________________________________| that |_____________________________________| as |__________________________________| her |_______________________________| with |_______________________________| at |______________________________| s |_____________________________| t |____________________________| on |___________________________| all |________________________| this |________________________| for |________________________| had |________________________| but |______________________| be |______________________| not |_____________________| they |____________________| so 

Some of the code was "borrowed" from AKX's solution.

perl, 205 191 189 characters/ 205 characters (fully implemented)

Some parts were inspired by the earlier perl/ruby submissions, a couple similar ideas were arrived at independently, the others are original. Shorter version also incorporates some things I saw/learned from other submissions.

Original:

 $k{$_}++for grep{$_!~/^(the|and|of|to|a|i|it|in|or|is)$/}map{lc=~/[a-z]+/g}<>;@t=sort{$k{$b}<=>$k{$a}}keys%k;$l=76-length$t[0];printf" %s
",'_'x$l;printf"|%s| $_
",'_'x int$k{$_}/$k{$t[0]}*$l for@t[0..21];

Latest version down to 191 characters:

 /^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-y///c)/$k{$_=$e[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@e[0,0..21]

Latest version down to 189 characters:

 /^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@_=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-m//)/$k{$_=$_[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@_[0,0..21]

This version (205 char) accounts for the lines with words longer than what would be found later.

 /^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;($r)=sort{$a<=>$b}map{(76-y///c)/$k{$_}}@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
";}@e[0,0..21]

Perl: 203 202 201 198 195 208 203 / 231 chars

 $/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&warn" "."_"x($z*$y)."\n";printf"|%.78s\n","_"x($z*$y)."| $_"}(sort{$x{$b}<=>$x{$a}}keys%x)[0..21]

Alternate, full implementation including indicated behaviour (global bar-squishing) for the pathological case in which the secondary word is both popular and long enough to combine to over 80 chars ( this implementation is 231 chars ):

 $/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;@e=(sort{$x{$b}<=>$x{$a}}keys%x)[0..21];for(@e){$p=(76-y///c)/$x{$_};($y&&$p>$y)||($y=$p)}warn" "."_"x($x{$e[0]}*$y)."\n";for(@e){warn"|"."_"x($x{$_}*$y)."| $_\n"}

The specification didn't state anywhere that this had to go to STDOUT, so I used perl's warn() instead of print – four characters saved there. Used map instead of foreach, but I feel like there could still be some more savings in the split(join()). Still, got it down to 203 – might sleep on it. At least Perl's now under the "shell, grep, tr, grep, sort, uniq, sort, head, perl" char count for now 😉

PS: Reddit says "Hi" 😉

Update: Removed join() in favour of assignment and implicit scalar conversion join. Down to 202. Also please note I have taken advantage of the optional "ignore 1-letter words" rule to shave 2 characters off, so bear in mind the frequency count will reflect this.

Update 2: Swapped out assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, saved 1 more char => 201.

Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block – Swapped out both regexes to no longer use /i option, as no longer needed. Swapped explicit conditional x?y:z construct for traditional perlgolf || implicit conditional construct – /^…$/i?1:$x{$_}++ for /^…$/||$x{$_}++ Saved three characters! => 198, broke the 200 barrier. Might sleep soon… perhaps.

Update 4: Sleep deprivation has made me insane. 好。 More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-char shorter (and much more golfish) y///c – you hear me, GolfScript?? I'm coming for you!!! sob

Update 5: Sleep dep made me forget about the 22row limit and subsequent-line limiting. Back up to 208 with those handled. Not too bad, 13 characters to handle it isn't the end of the world. Played around with perl's regex inline eval, but having trouble getting it to both work and save chars… lol. Updated the example to match current output.

Update 6: Removed unneeded braces protecting (…)for, since the syntactic candy ++ allows shoving it up against the for happily. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.

Update 7: Added second piece of work, full implementation of specs (including the full bar-squishing behaviour for secondary long-words, instead of truncation which most people are doing, based on the original spec without the pathological example case)

Example:

  _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so |___________________| very |__________________| what 

Alternative implementation in pathological case example:

  _______________________________________________________________ |_______________________________________________________________| she |_______________________________________________________| superlongstringstring |____________________________________________________| said |______________________________________________| alice |________________________________________| was |_____________________________________| that |_______________________________| as |____________________________| her |_________________________| with |_________________________| at |_______________________| on |______________________| all |____________________| this |____________________| for |____________________| had |____________________| but |___________________| be |__________________| not |_________________| they |_________________| so |________________| very |________________| what 

F#, 452 chars

Straightforward: get a sequence a of word-count pairs, find the best word-count-per-column multiplier k , then print the results.

 let a=
   stdin.ReadToEnd().Split(" .?!,\":;'\r\n".ToCharArray(),enum 1)
   |>Seq.map(fun s->s.ToLower())|>Seq.countBy id
   |>Seq.filter(fun(w,n)->not(set["the";"and";"of";"to";"a";"i";"it";"in";"or";"is"].Contains w))
   |>Seq.sortBy(fun(w,n)-> -n)|>Seq.take 22
 let k=a|>Seq.map(fun(w,n)->float(78-w.Length)/float n)|>Seq.min
 let u n=String.replicate(int(float(n)*k)-2)"_"
 printfn" %s "(u(snd(Seq.nth 0 a)))
 for(w,n)in a do printfn"|%s| %s "(u n)w

Example (I have different freq counts than you, unsure why):

 % app.exe < Alice.txt _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |_____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |___________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |____________________________| t |____________________________| s |__________________________| on |_________________________| all |_______________________| this |______________________| had |______________________| for |_____________________| but |_____________________| be |____________________| not |___________________| they |__________________| so 

Python 2.6, 347 chars

 import sys,re
 W,x={},"a and i in is it of or the to".split()
 [W.__setitem__(w,W.get(w,0)-1)for w in re.findall("[a-z]+",file("11.txt").read().lower())if w not in x]
 W=sorted(W.items(),key=lambda p:p[1])[:22]
 bm=(76.-len(W[0][0]))/W[0][1]
 U=lambda n:"_"*int(n*bm)
 print "".join(("%s\n|%s| %s "%((""if i else" "+U(n)),U(n),w))for i,(w,n)in enumerate(W))

Output:

  _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |____________________________| s |____________________________| t |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so 

*sh (+curl), partial solution

This is incomplete, but for the hell of it, here's the word-frequency counting half of the problem in 192 bytes:

 curl -s http://www.gutenberg.org/files/11/11.txt|sed -e 's@[^a-z]@\n@gi'|tr '[:upper:]' '[:lower:]'|egrep -v '(^[^a-z]*$|\b(the|and|of|to|a|i|it|in|or|is)\b)' |sort|uniq -c|sort -n|tail -n 22

Gawk — 336 (originally 507) characters

(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops this one broke the formatting); tweak some more; taking up Matt's challenge I desperately tweak so more; found another place to save a few, but gave two back to fix the bar length bug)

Heh heh! I am momentarily ahead of Matt's JavaScript solution counter challenge! 😉 and AKX's python.

The problem seems to call out for a language that implements native associative arrays, so of course I've chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.
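For comparison, here is the same repeated-scan selection sketched in Python (my illustration, not the awk itself): one full pass over the map per output row, taking the current maximum and deleting it.

 def top_k(freq, k=22):
     # freq maps word -> count; work on a copy so the caller's dict survives
     freq = dict(freq)
     rows = []
     for _ in range(min(k, len(freq))):
         best = max(freq, key=freq.get)    # full scan, like the awk inner loop
         rows.append((best, freq.pop(best)))
     return rows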

The awk itself is all terribly inefficient, and with all the golfifications I've made it has gotten to be pretty awful as well.

Minified:

 {gsub("[^a-zA-Z]"," ");for(;NF;NF--)a[tolower($NF)]++} END{split("the and of to ai it in or is",b," "); for(w in b)delete a[b[w]];d=1;for(w in a){e=a[w]/(78-length(w));if(e>d)d=e} for(i=22;i;--i){e=0;for(w in a)if(a[w]>e)e=a[x=w];l=a[x]/d-2; t=sprintf(sprintf("%%%dc",l)," ");gsub(" ","_",t);if(i==22)print" "t; print"|"t"| "x;delete a[x]}} 

line breaks for clarity only: they are not necessary and should not be counted.


Output:

 $ gawk -f wordfreq.awk.min < 11.txt _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |____________________________________________________| alice |______________________________________________| was |__________________________________________| that |___________________________________| as |_______________________________| her |____________________________| with |____________________________| at |___________________________| s |___________________________| t |_________________________| on |_________________________| all |______________________| this |______________________| for |______________________| had |_____________________| but |____________________| be |____________________| not |___________________| they |__________________| so $ sed 's/you/superlongstring/gI' 11.txt | gawk -f wordfreq.awk.min ______________________________________________________________________ |______________________________________________________________________| she |_____________________________________________________________| superlongstring |__________________________________________________________| said |__________________________________________________| alice |____________________________________________| was |_________________________________________| that |_________________________________| as |______________________________| her |___________________________| with |___________________________| at |__________________________| s |__________________________| t |________________________| on |________________________| all |_____________________| this |_____________________| for |_____________________| had |____________________| but |___________________| be |___________________| not |__________________| they |_________________| so 

Readable; 633 characters (originally 949):

 { gsub("[^a-zA-Z]"," "); for(;NF;NF--) a[tolower($NF)]++ } END{ # remove "short" words split("the and of to ai it in or is",b," "); for (w in b) delete a[b[w]]; # Find the bar ratio d=1; for (w in a) { e=a[w]/(78-length(w)); if (e>d) d=e } # Print the entries highest count first for (i=22; i; --i){ # find the highest count e=0; for (w in a) if (a[w]>e) e=a[x=w]; # Print the bar l=a[x]/d-2; # make a string of "_" the right length t=sprintf(sprintf("%%%dc",l)," "); gsub(" ","_",t); if (i==22) print" "t; print"|"t"| "x; delete a[x] } } 

Common LISP, 670 characters

I'm a LISP newbie, and this is an attempt at using a hash table for counting (so probably not the most compact method).

 (flet((r()(let((x(read-char t nil)))(and x(char-downcase x)))))(do((c( make-hash-table :test 'equal))(w NIL)(x(r)(r))y)((not x)(maphash(lambda (k v)(if(not(find k '("""the""and""of""to""a""i""it""in""or""is"):test 'equal))(push(cons k v)y)))c)(setf y(sort y #'> :key #'cdr))(setf y (subseq y 0(min(length y)22)))(let((f(apply #'min(mapcar(lambda(x)(/(- 76.0(length(car x)))(cdr x)))y))))(flet((o(n)(dotimes(i(floor(* n f))) (write-char #\_))))(write-char #\Space)(o(cdar y))(write-char #\Newline) (dolist(x y)(write-char #\|)(o(cdr x))(format t "| ~a~%"(car x)))))) (cond((char<= #\a x #\z)(push x w))(t(incf(gethash(concatenate 'string( reverse w))c 0))(setf w nil)))))

It can be run, for example, with cat alice.txt | clisp -C golf.lisp .

In readable form it is:

 (flet ((r () (let ((x (read-char t nil)))
                (and x (char-downcase x)))))
   (do ((c (make-hash-table :test 'equal)) ; the word count map
        w y                                ; current word and final word list
        (x (r) (r)))                       ; iteration over all chars
       ((not x)
        ; make a list with (word . count) pairs removing stopwords
        (maphash (lambda (k v)
                   (if (not (find k '("" "the" "and" "of" "to" "a" "i" "it" "in" "or" "is")
                                  :test 'equal))
                       (push (cons k v) y)))
                 c)
        ; sort and truncate the list
        (setf y (sort y #'> :key #'cdr))
        (setf y (subseq y 0 (min (length y) 22)))
        ; find the scaling factor
        (let ((f (apply #'min (mapcar (lambda (x) (/ (- 76.0 (length (car x))) (cdr x))) y))))
          ; output
          (flet ((outx (n) (dotimes (i (floor (* n f))) (write-char #\_))))
            (write-char #\Space)
            (outx (cdar y))
            (write-char #\Newline)
            (dolist (x y)
              (write-char #\|)
              (outx (cdr x))
              (format t "| ~a~%" (car x))))))
     ; add alphabetic to current word, and bump word counter
     ; on non-alphabetic
     (cond ((char<= #\a x #\z) (push x w))
           (t (incf (gethash (concatenate 'string (reverse w)) c 0))
              (setf w nil)))))

C (828)

It looks a lot like obfuscated code, and it uses glib for strings, lists and hashes. The character count with wc -m says 828. It does not consider single-char words. To calculate the max length of the bar, it considers the longest possible word among all of them, not only the first 22. Is this a deviation from the spec?

It does not handle failures and it does not release used memory.

 #include <glib.h>
 #define S(X)g_string_##X
 #define H(X)g_hash_table_##X
 GHashTable*h;int m,w=0,z=0;y(const void*a,const void*b){int*A,*B;A=H(lookup)(h,a);B=H(lookup)(h,b);return*B-*A;}void p(void*d,void*u){int *v=H(lookup)(h,d);if(w<22){g_printf("|");*v=*v*(77-z)/m;while(--*v>=0)g_printf("=");g_printf("| %s\n",d);w++;}}main(c){int*v;GList*l;GString*s=S(new)(NULL);h=H(new)(g_str_hash,g_str_equal);char*n[]={"the","and","of","to","it","in","or","is"};while((c=getchar())!=-1){if(isalpha(c))S(append_c)(s,tolower(c));else{if(s->len>1){for(c=0;c<8;c++)if(!strcmp(s->str,n[c]))goto x;if((v=H(lookup)(h,s->str))!=NULL)++*v;else{z=MAX(z,s->len);v=g_malloc(sizeof(int));*v=1;H(insert)(h,g_strdup(s->str),v);}}x:S(truncate)(s,0);}}l=g_list_sort(H(get_keys)(h),y);m=*(int*)H(lookup)(h,g_list_first(l)->data);g_list_foreach(l,p,NULL);}

Perl, 185 char

200 (slightly broken) 199 197 195 193 187 185 characters. Last two newlines are significant. Complies with the spec.

 map$X{+lc}+=!/^(.|the|and|to|i[nst]|o[rf])$/i,/[a-z]+/gfor<>;
 $n=$n>($:=$X{$_}/(76-y+++c))?$n:$:for@w=(sort{$X{$b}-$X{$a}}%X)[0..21];
 die map{$U='_'x($X{$_}/$n);" $U
"x!$z++,"|$U| $_
"}@w

First line loads counts of valid words into %X .

The second line computes minimum scaling factor so that all output lines will be <= 80 characters.

The third line (contains two newline characters) produces the output.

Java – 886 865 756 744 742 744 752 742 714 680 chars

  • Updates before first 742 : improved regex, removed superfluous parameterized types, removed superfluous whitespace.

  • Update 742 > 744 chars : fixed the fixed-length hack. It's only dependent on the 1st word, not other words (yet). Found several places to shorten the code ( \\s in the regex replaced by a space, and ArrayList replaced by Vector ). I'm now looking for a short way to remove the Commons IO dependency and read from stdin.

  • Update 744 > 752 chars : I removed the commons dependency. It now reads from stdin. Paste the text in stdin and hit Ctrl+Z to get result.

  • Update 752 > 742 chars : I removed public and a space, made classname 1 char instead of 2 and it's now ignoring one-letter words.

  • Update 742 > 714 chars : Updated as per comments of Carl: removed redundant assignment (742 > 730), replaced m.containsKey(k) by m.get(k)!=null (730 > 728), introduced substringing of line (728 > 714).

  • Update 714 > 680 chars : Updated as per comments of Rotsor: improved bar size calculation to remove unnecessary casting and improved split() to remove unnecessary replaceAll() .


 import java.util.*;class F{public static void main(String[]a)throws Exception{StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);}} 

A more readable version:

 import java.util.*; class F{ public static void main(String[]a)throws Exception{ StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c)); final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1); List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}}); int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s); for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w); } } 

Output:

  _________________________________________________________________________
 |_________________________________________________________________________| she
 |_______________________________________________________________| you
 |____________________________________________________________| said
 |_____________________________________________________| alice
 |_______________________________________________| was
 |___________________________________________| that
 |____________________________________| as
 |________________________________| her
 |_____________________________| with
 |_____________________________| at
 |__________________________| on
 |__________________________| all
 |_______________________| this
 |_______________________| for
 |_______________________| had
 |_______________________| but
 |______________________| be
 |_____________________| not
 |____________________| they
 |____________________| so
 |___________________| very
 |__________________| what

It pretty much sucks that Java doesn't have String#join() and closures (yet).

Edit by Rotsor:

I have made several changes to your solution:

  • Replaced List with a String[]
  • Reused the 'args' argument instead of declaring my own String array. Also used it as an argument to .ToArray()
  • Replaced StringBuffer with a String (yes, yes, terrible performance)
  • Replaced Java sorting with a selection-sort with early halting (only first 22 elements have to be found)
  • Aggregated some int declaration into a single statement
  • Implemented the non-cheating algorithm finding the most limiting line of output. Implemented it without FP.
  • Fixed the problem of the program crashing when there were less than 22 distinct words in the text
  • Implemented a new algorithm of reading input, which is fast and only 9 characters longer than the slow one.

The condensed code is 688 711 684 characters long:

 import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;(j=System.in.read())>0;w+=(char)j);for(String W:w.toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(W,m.get(W)!=null?m.get(W)+1:1);l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}} 

The fast version ( 720 693 characters)

 import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}} 

A more readable version:

 import java.util.*;class F{public static void main(String[]l)throws Exception{
     Map<String,Integer>m=new HashMap();String w="";
     int i=0,k=0,j=8,x,y,g=22;
     for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{
         if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";
     }}
     l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;
     for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}
     for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}
     String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');
     System.out.println(" "+s);
     for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}
 }

The version without behaviour improvements is 615 characters:

 import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);for(;i<g;++i)for(j=i;++j<l.length;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}i=76-l[0].length();String s=new String(new char[i]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/m.get(l[0]))+"| "+w);}}} 

Scala 2.8, 311 314 320 330 332 336 341 375 characters

Including the long-word adjustment. Ideas borrowed from the other solutions.

Now as a script ( a.scala ):

 val t="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile(argv(0)).mkString.toLowerCase).toSeq.groupBy(w=>w).mapValues(_.size).toSeq.sortBy(-_._2)take 22
 def b(p:Int)="_"*(p*(for((w,c)<-t)yield(76.0-w.size)/c).min).toInt
 println(" "+b(t(0)._2))
 for(p<-t)printf("|%s| %s \n",b(p._2),p._1)

Run with

 scala -howtorun:script a.scala alice.txt 

BTW, the edit from 314 to 311 characters actually removes only 1 character. Someone got the counting wrong before (Windows CRs?).

Clojure 282 strict

 (let[[[_ m]:as s](->>(slurp *in*).toLowerCase(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")frequencies(sort-by val >)(take 22))[b](sort(map #(/(- 76(count(key %)))(val %))s))p #(do(print %1)(dotimes[_(* b %2)](print \_))(apply println %&))](p " " m)(doseq[[k v]s](p \| v \| k)))

Somewhat more legibly:

 (let[[[_ m]:as s](->> (slurp *in*)
                       .toLowerCase
                       (re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")
                       frequencies
                       (sort-by val >)
                       (take 22))
      [b] (sort (map #(/ (- 76 (count (key %))) (val %)) s))
      p #(do (print %1) (dotimes [_ (* b %2)] (print \_)) (apply println %&))]
   (p " " m)
   (doseq [[k v] s] (p \| v \| k)))

Scala, 368 chars

First, a legible version in 592 characters:

 object Alice {
   def main(args:Array[String]) {
     val s = io.Source.fromFile(args(0))
     val words = s.getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase)
     val freqs = words.foldLeft(Map[String, Int]())((countmap, word) => countmap + (word -> (countmap.getOrElse(word, 0)+1)))
     val sortedFreqs = freqs.toList.sort((a, b) => a._2 > b._2)
     val top22 = sortedFreqs.take(22)
     val highestWord = top22.head._1
     val highestCount = top22.head._2
     val widest = 76 - highestWord.length
     println(" " + "_" * widest)
     top22.foreach(t => {
       val width = Math.round((t._2 * 1.0 / highestCount) * widest).toInt
       println("|" + "_" * width + "| " + t._1)
     })
   }
 }

The console output looks like this:

 $ scalac alice.scala
 $ scala Alice aliceinwonderland.txt
  _________________________________________________________________________
 |_________________________________________________________________________| she
 |_______________________________________________________________| you
 |_____________________________________________________________| said
 |_____________________________________________________| alice
 |_______________________________________________| was
 |____________________________________________| that
 |____________________________________| as
 |_________________________________| her
 |______________________________| at
 |______________________________| with
 |_____________________________| s
 |_____________________________| t
 |___________________________| on
 |__________________________| all
 |_______________________| had
 |_______________________| but
 |______________________| be
 |______________________| not
 |____________________| they
 |____________________| so
 |___________________| very
 |__________________| what

We can do some aggressive minifying and get it down to 415 characters:

 object A{def main(args:Array[String]){val l=io.Source.fromFile(args(0)).getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase).foldLeft(Map[String, Int]())((c,w)=>c+(w->(c.getOrElse(w,0)+1))).toList.sort((a,b)=>a._2>b._2).take(22);println(" "+"_"*(76-l.head._1.length));l.foreach(t=>println("|"+"_"*Math.round((t._2*1.0/l.head._2)*(76-l.head._1.length)).toInt+"| "+t._1))}} 

The console session looks like this:

 $ scalac a.scala
 $ scala A aliceinwonderland.txt
  _________________________________________________________________________
 |_________________________________________________________________________| she
 |_______________________________________________________________| you
 |_____________________________________________________________| said
 |_____________________________________________________| alice
 |_______________________________________________| was
 |____________________________________________| that
 |____________________________________| as
 |_________________________________| her
 |______________________________| at
 |______________________________| with
 |_____________________________| s
 |_____________________________| t
 |___________________________| on
 |__________________________| all
 |_______________________| had
 |_______________________| but
 |______________________| be
 |______________________| not
 |____________________| they
 |____________________| so
 |___________________| very
 |__________________| what

I'm sure a Scala expert could do even better.

Update: In the comments Thomas gave an even shorter version, at 368 characters:

 object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}} 

Legibly, at 375 characters:

 object Alice {
   def main(a:Array[String]) {
     val t = (Map[String, Int]() /: (
         for (
           x <- io.Source.fromFile(a(0)).getLines;
           y <- "(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(x)
         ) yield y.toLowerCase
       ).toList)((c, x) => c + (x -> (c.getOrElse(x, 0) + 1))).toList.sortBy(_._2).reverse.take(22)
     val w = 76 - t.head._1.length
     print(" " + "_" * w)
     t map (s => "\n|" + "_" * (s._2 * w / t.head._2) + "| " + s._1) foreach print
   }
 }

Java – 896 chars

931 chars

1233 chars made unreadable

1977 chars "uncompressed"


Update: I have aggressively reduced the character count. Omits single-letter words per updated spec.

I envy C# and LINQ so much.
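
The Java answers here drop single-letter words by putting a lone . inside the ignore alternation: between two \b word boundaries it can only ever match a single character, which covers every one-letter word, so those all become part of the delimiter while longer words survive. A small self-contained sketch (class name and input are made up) using the split form from the first Java solution above:

 import java.util.Arrays;

 class IgnoreDemo {
     public static void main(String[] a) {
         String text = "Alice said: don't do it, I beg of you".toLowerCase();
         // Single-letter "words" and the stop words are folded into the delimiter,
         // so don't splits into don + t and only don is kept.
         String[] words = text.split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+");
         System.out.println(Arrays.toString(words));   // [alice, said, don, do, beg, you]
     }
 }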

 import java.util.*;import java.io.*;import static java.util.regex.Pattern.*;class g{public static void main(String[] a)throws Exception{PrintStream o=System.out;Map<String,Integer> w=new HashMap();Scanner s=new Scanner(new File(a[0])).useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));while(s.hasNext()){String z=s.next().trim().toLowerCase();if(z.equals(""))continue;w.put(z,(w.get(z)==null?0:w.get(z))+1);}List<Integer> v=new Vector(w.values());Collections.sort(v);List<String> q=new Vector();int i,m;i=m=v.size()-1;while(q.size()<22){for(String t:w.keySet())if(!q.contains(t)&&w.get(t).equals(v.get(i)))q.add(t);i--;}int r=80-q.get(0).length()-4;String l=String.format("%1$0"+r+"d",0).replace("0","_");o.println(" "+l);o.println("|"+l+"| "+q.get(0)+" ");for(i=m-1;i>m-22;i--){o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");}}}

"Readable":

 import java.util.*;
 import java.io.*;
 import static java.util.regex.Pattern.*;

 class g {
   public static void main(String[] a) throws Exception {
     PrintStream o = System.out;
     Map<String,Integer> w = new HashMap();
     Scanner s = new Scanner(new File(a[0]))
         .useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));
     while(s.hasNext()) {
       String z = s.next().trim().toLowerCase();
       if(z.equals("")) continue;
       w.put(z,(w.get(z) == null?0:w.get(z))+1);
     }
     List<Integer> v = new Vector(w.values());
     Collections.sort(v);
     List<String> q = new Vector();
     int i,m;
     i = m = v.size()-1;
     while(q.size()<22) {
       for(String t:w.keySet())
         if(!q.contains(t)&&w.get(t).equals(v.get(i)))
           q.add(t);
       i--;
     }
     int r = 80-q.get(0).length()-4;
     String l = String.format("%1$0"+r+"d",0).replace("0","_");
     o.println(" "+l);
     o.println("|"+l+"| "+q.get(0)+" ");
     for(i = m-1; i > m-22; i--) {
       o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");
     }
   }
 }

Output of Alice:

  _________________________________________________________________________
 |_________________________________________________________________________| she
 |_______________________________________________________________| you
 |_____________________________________________________________| said
 |_____________________________________________________| alice
 |_______________________________________________| was
 |____________________________________________| that
 |____________________________________| as
 |_________________________________| her
 |______________________________| with
 |______________________________| at
 |___________________________| on
 |__________________________| all
 |________________________| this
 |________________________| for
 |_______________________| had
 |_______________________| but
 |______________________| be
 |______________________| not
 |____________________| they
 |____________________| so
 |___________________| very
 |__________________| what

Output of Don Quixote (also from Gutenberg):

  ________________________________________________________________________
 |________________________________________________________________________| that
 |________________________________________________________| he
 |______________________________________________| for
 |__________________________________________| his
 |________________________________________| as
 |__________________________________| with
 |_________________________________| not
 |_________________________________| was
 |________________________________| him
 |______________________________| be
 |___________________________| don
 |_________________________| my
 |_________________________| this
 |_________________________| all
 |_________________________| they
 |________________________| said
 |_______________________| have
 |_______________________| me
 |______________________| on
 |______________________| so
 |_____________________| you
 |_____________________| quixote