 

Build an ASCII chart of the most commonly used words in a given text [closed]

LabVIEW 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.

[Image: LabVIEW code]

[Image: results]

The program flows from left to right:

[Image: LabVIEW code explained]


Ruby 1.9, 185 chars

(heavily based on the other Ruby solutions)

w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
k,l=w[0]
puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

Instead of using command-line switches like the other solutions, you can simply pass the filename as an argument (e.g. ruby1.9 wordfrequency.rb Alice.txt).

Since I'm using character-literals here, this solution only works in Ruby 1.9.
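
For anyone who doesn't read golfed Ruby fluently, here is a rough ungolfed sketch of the same algorithm (variable names are mine, not part of the golfed answer):

# Ungolfed sketch of the same logic (illustration only, not char-optimised)
stop_words = %w{the and of to a i it in or is}

words  = ARGF.read.downcase.scan(/[a-z]+/) - stop_words   # Array#- drops every stop-word occurrence
counts = words.group_by { |w| w }.map { |w, hits| [-hits.size, w] }
top    = counts.sort.first(22)                             # most frequent first (counts are negated)

top_count, top_word = top.first                            # top_count is negative, like every count here
width = 76 - top_word.size                                 # bar width reserved for the most frequent word

puts ' ' + '_' * width                                     # the "roof" above the first bar
top.each do |count, word|
  puts '|' + '_' * (count * width / top_count) + '| ' + word
end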

Edit: Replaced semicolons with line breaks for "readability". :P

Edit 2: Shtééf pointed out I forgot the trailing space - fixed that.

Edit 3: Removed the trailing space again ;)


GolfScript, 177 175 173 167 164 163 144 131 130 chars

Slow - 3 minutes for the sample text (130)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"
|"\~1*2/0*'| '@}/

Explanation:

{           #loop through all characters
 32|.       #convert to uppercase and duplicate
 123%97<    #determine whether it is a letter
 n@if       #return either the letter or a newline
}%          #return an array (of ints)
]''*        #convert array to a string with magic
n%          #split on newline, removing blanks (stack is an array of words now)
"oftoitinorisa"   #push this string
2/          #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
-           #remove any occurrences from the text
"theandi"3/-#remove "the", "and", and "i"
$           #sort the array of words
(1@         #takes the first word in the array, pushes a 1, reorders stack
            #the 1 is the current number of occurrences of the first word
{           #loop through the array
 .3$>1{;)}if#increment the count or push the next word and a 1
}/
]2/         #gather stack into an array and split into groups of 2
{~~\;}$     #sort by the latter element - the count of occurrences of each word
22<         #take the first 22 elements
.0=~:2;     #store the highest count
,76\-:1     #store the length of the first line
'_':0*' '\@ #make the first line
{           #loop through each word
"
|"\~        #start drawing the bar
1*2/0       #scale the count and push '_' (1, 2 and 0 are variables: bar width, top count, '_')
*'| '@      #finish drawing the bar
}/
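
In higher-level terms, the counting step above avoids a hash entirely: the word list is sorted, then scanned once with a running counter, emitting a [word, count] pair every time the word changes. A rough Ruby sketch of that idea (illustration only, sample data made up):

# Sketch of the sort-then-scan counting used above (illustration only)
words = %w{cat ape cat bee cat ape}.sort   # GolfScript's $ sorts the word array
pairs = []                                 # finished [word, count] pairs
count = 1
words.each_cons(2) do |current, following|
  if following == current
    count += 1                             # same word again: bump the running counter
  else
    pairs << [current, count]              # word changed: emit the finished pair
    count = 1
  end
end
pairs << [words.last, count]               # flush the final run
pairs.sort_by! { |_, c| -c }               # highest count first, like {~~\;}$
p pairs                                    # => [["cat", 3], ["ape", 2], ["bee", 1]]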

"Correct" (hopefully). (143)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"
|"\~1*^/0*'| '@}/

Less slow - half a minute. (162)

'"'/' ':S*n/S*'"#{%q
'\+"
.downcase.tr('^a-z','
')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"
|"\~1*2/0*'| '@}/

Output visible in revision logs.


shell (grep, tr, grep, sort, uniq, sort, head, perl), 206 chars

~ % wc -c wfg
209 wfg
~ % cat wfg
egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
~ % # usage:
~ % sh wfg < 11.txt

Hm, just noticed: changing sort -nr to sort -n and then head to tail would give 208 :)
Update 2: Erm, of course the above is silly, as the order would then be reversed. So, 209.
Update 3: Optimized the exclusion regexp -> 206.

egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
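
Stage by stage, the final pipeline does roughly the following. This is the same set of commands, just split across lines with comments for readability (the character counts above refer to the one-line form):

# pull out alphabetic runs as word candidates, one per line, case-insensitively
egrep -oi \\b[a-z]+ |
# fold everything to lower case
tr A-Z a-z |
# throw away the ten stop words (whole-word match, inverted)
egrep -wv 'the|and|o[fr]|to|a|i[tns]?' |
# count occurrences: sort, then collapse identical runs with their counts
sort | uniq -c |
# most frequent first, keep the top 22
sort -nr | head -22 |
# draw the chart: line 1 fixes the scale ($q = top count, $x = 76 - length of top word),
# every line becomes a |___| word bar, and the roof line is prepended to the first bar
perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'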



For fun, here's a Perl-only version (much faster):

~ % wc -c pgolf
204 pgolf
~ % cat pgolf
perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
~ % # usage:
~ % sh pgolf < 11.txt

Transact-SQL set-based solution (SQL Server 2005), 1063 892 873 853 827 820 783 683 647 644 630 characters

Thanks to Gabe for some useful suggestions to reduce the character count.

NB: Line breaks added to avoid scrollbars; only the last line break is required.

DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',
SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING
(@,i+1,1)FROM N WHERE i<LEN(@))SELECT i,L,i-RANK()OVER(ORDER BY i)R INTO #D
FROM N WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)SELECT TOP 22 W,-COUNT(*)C
INTO # FROM(SELECT DISTINCT R,(SELECT''+L FROM #D WHERE R=b.R FOR XML PATH
(''))W FROM #D b)t WHERE LEN(W)>1 AND W NOT IN('the','and','of','to','it',
'in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+
REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+' 
|'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @

Readable Version

DECLARE @  VARCHAR(MAX),
        @F REAL
SELECT @=BulkColumn
FROM   OPENROWSET(BULK'A',SINGLE_BLOB)x; /*  Loads text file from path
                                             C:\WINDOWS\system32\A  */

/*Recursive common table expression to
generate a table of numbers from 1 to string length
(and associated characters)*/
WITH N AS
     (SELECT 1 i,
             LEFT(@,1)L

     UNION ALL

     SELECT i+1,
            SUBSTRING(@,i+1,1)
     FROM   N
     WHERE  i<LEN(@)
     )
  SELECT   i,
           L,
           i-RANK()OVER(ORDER BY i)R
           /*Will group characters
           from the same word together*/
  INTO     #D
  FROM     N
  WHERE    L LIKE'[A-Z]'OPTION(MAXRECURSION 0)
             /*Assuming case insensitive accent sensitive collation*/

SELECT   TOP 22 W,
         -COUNT(*)C
INTO     #
FROM     (SELECT DISTINCT R,
                          (SELECT ''+L
                          FROM    #D
                          WHERE   R=b.R FOR XML PATH('')
                          )W
                          /*Reconstitute the word from the characters*/
         FROM             #D b
         )
         T
WHERE    LEN(W)>1
AND      W NOT IN('the',
                  'and',
                  'of' ,
                  'to' ,
                  'it' ,
                  'in' ,
                  'or' ,
                  'is')
GROUP BY W
ORDER BY C

/*Just noticed this looks risky as it relies on the order of evaluation of the 
 variables. I'm not sure that's guaranteed but it works on my machine :-) */
SELECT @F=MIN(($76-LEN(W))/-C),
       @ =' '      +REPLICATE('_',-MIN(C)*@F)+' '
FROM   #

SELECT @=@+' 
|'+REPLICATE('_',-C*@F)+'| '+W
             FROM     #
             ORDER BY C

PRINT @
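
The i-RANK() step is a standard gaps-and-islands trick: the positions of letters belonging to one word are consecutive, so position minus rank-among-letters stays constant inside a word and jumps at every non-letter gap, which is what makes it usable as the grouping key R above. A small Ruby sketch of the same arithmetic, purely for illustration (the example and names are made up):

# Illustration of the i - RANK() grouping key outside SQL (hypothetical example)
text    = "To be or"
letters = text.each_char.with_index(1).select { |c, _| c =~ /[a-z]/i }  # [char, position i]
# 1-based rank among the letters only; i - rank is constant within each word
islands = letters.each_with_index.group_by { |(_, i), idx| i - (idx + 1) }
words   = islands.values.map { |chars| chars.map { |(c, _), _| c }.join }
p words  # => ["To", "be", "or"]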

Output

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| You
|____________________________________________________________| said
|_____________________________________________________| Alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| This
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| So
|___________________| very
|__________________| what

And with the long string

 _______________________________________________________________ 
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| Alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| at
|_________________________| with
|_______________________| on
|______________________| all
|____________________| This
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| So
|________________| very
|________________| what