What are the elegant and effective ways to count the frequency of each "english" word in a file?
First of all, I define a letter_only facet (a std::ctype<char> installed through std::locale) so that the stream ignores punctuation and reads only valid "english" letters from the input. That way, the stream treats the words "ways", "ways." and "ways!" as the same word "ways", because punctuation such as "." and "!" is classified as whitespace and skipped.
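#include <algorithm>
#include <fstream>
#include <iostream>
#include <locale>
#include <map>
#include <string>
#include <vector>

// ctype facet that classifies every character except A-Z and a-z as whitespace.
struct letter_only: std::ctype<char>
{
    letter_only(): std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table()
    {
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);
        // Fill the two letter ranges separately: a single 'A'..'z' fill would
        // also mark the six punctuation characters between 'Z' and 'a' as letters.
        std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};

int main()
{
    std::map<std::string, int> wordCount;
    std::ifstream input;
    // The locale takes ownership of the facet, so no delete is needed.
    input.imbue(std::locale(std::locale(), new letter_only())); // enable reading only letters!
    input.open("filename.txt");
    std::string word;
    while (input >> word)
    {
        ++wordCount[word];
    }
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}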
// Alternative: the same count using std::for_each and a functor.
// Reuses letter_only from above; also needs <algorithm> and <iterator>.
struct Counter
{
    std::map<std::string, int> wordCount;
    void operator()(const std::string & item) { ++wordCount[item]; }
    // Lets the functor returned by std::for_each convert to the map it built.
    operator std::map<std::string, int>() { return wordCount; }
};

int main()
{
    std::ifstream input;
    input.imbue(std::locale(std::locale(), new letter_only())); // enable reading only letters!
    input.open("filename.txt");
    std::istream_iterator<std::string> start(input);
    std::istream_iterator<std::string> end;
    std::map<std::string, int> wordCount = std::for_each(start, end, Counter());
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}
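To try the facet without a file, here is a minimal sketch that imbues a std::istringstream instead of an ifstream. The names letters_only_ctype and count_words are made up for this illustration and are not part of the code above. It assumes an ASCII-compatible character set and keeps the counting case-sensitive, as in the solutions above.

```cpp
#include <algorithm>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical name: a ctype facet that treats everything except
// A-Z and a-z as whitespace (assumes an ASCII-compatible charset).
struct letters_only_ctype : std::ctype<char>
{
    letters_only_ctype() : std::ctype<char>(get_table()) {}

    static const std::ctype_base::mask* get_table()
    {
        // Static storage: the table must outlive every facet that points at it.
        static std::vector<std::ctype_base::mask>
            table(std::ctype<char>::table_size, std::ctype_base::space);
        std::fill(table.begin() + 'A', table.begin() + 'Z' + 1, std::ctype_base::alpha);
        std::fill(table.begin() + 'a', table.begin() + 'z' + 1, std::ctype_base::alpha);
        return &table[0];
    }
};

// Hypothetical helper: counts the words in an in-memory string.
std::map<std::string, int> count_words(const std::string& text)
{
    std::istringstream input(text);
    // The locale takes ownership of the facet and destroys it when done.
    input.imbue(std::locale(std::locale(), new letters_only_ctype()));

    std::map<std::string, int> counts;
    std::string word;
    while (input >> word)   // operator>> splits on whatever the facet calls space
        ++counts[word];
    return counts;
}
```

Calling count_words("ways, ways. ways!") should give a map with the single key "ways". Digits and brackets act as separators too, because the table marks only letters as non-space.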
Perl is arguably not so elegant, but very effective.
I posted a solution here: Processing huge text files
In a nutshell,
1) If needed, strip punctuation and convert uppercase to lowercase:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file
2) Count the occurrence of each word. Print results sorted first by frequency, and then alphabetically:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq
I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in under 3 minutes.
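For comparison with the C++ answers, the ordering used in step 2 (highest frequency first, ties broken alphabetically) can be sketched in C++ roughly as follows. This is an illustration only, not a translation of the Perl script; WordFreq, by_freq_then_word and sorted_by_frequency are hypothetical names, and the input is assumed to be a std::map of counts like the one built in the other answers.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordFreq;

// Higher counts come first; equal counts are ordered alphabetically.
bool by_freq_then_word(const WordFreq& a, const WordFreq& b)
{
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

// Copies the map entries into a vector and sorts them by the rule above.
std::vector<WordFreq> sorted_by_frequency(const std::map<std::string, int>& counts)
{
    std::vector<WordFreq> result(counts.begin(), counts.end());
    std::sort(result.begin(), result.end(), by_freq_then_word);
    return result;
}
```

The copy into a vector is needed because a std::map is always ordered by key, so it cannot be re-sorted by value in place.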
Here is a working solution. It should work with real text (including punctuation):
#include <cctype>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Returns the next run of letters, lowercased, or "" at end of input.
std::string getNextToken(std::istream &in)
{
    std::string ans;
    // get() returns int; keeping it as int preserves EOF and keeps the
    // value in the range that isalpha/tolower accept.
    int c = in.get();
    while (c != EOF && !std::isalpha(c)) // skip non-letter characters
    {
        c = in.get();
    }
    while (std::isalpha(c)) // isalpha(EOF) is false, so this stops at end of input
    {
        ans.push_back(static_cast<char>(std::tolower(c)));
        c = in.get();
    }
    return ans;
}

int main()
{
    std::map<std::string,int> words;
    std::ifstream fin("input.txt");
    std::string s;
    // An empty token means the input is exhausted.
    while (!(s = getNextToken(fin)).empty())
        ++words[s];
    for (std::map<std::string,int>::iterator iter = words.begin(); iter != words.end(); ++iter)
        std::cout << iter->first << ' ' << iter->second << std::endl;
}
Edit: My code now calls tolower on every letter.