
Determining Word Frequency of Specific Terms

I'm a non-computer-science student doing a history thesis that involves determining the frequency of specific terms in a number of texts and then plotting these frequencies over time to identify changes and trends. While I have figured out how to determine word frequencies for a given text file, I am dealing with a (relatively, for me) large number of files (>100) and, for consistency's sake, would like to limit the words included in the frequency count to a specific set of terms (sort of like the opposite of a "stop list").

This should be kept very simple. At the end, all I need is the frequencies of the specific words for each text file I process, preferably in spreadsheet format (a tab-delimited file) so that I can then create graphs and visualizations from that data.

I use Linux day-to-day, am comfortable using the command line, and would love an open-source solution (or something I could run under WINE). That is not a requirement, however.

I see two ways to solve this problem:

  1. Find a way to strip out all the words in a text file EXCEPT for those in the pre-defined list, and then do the frequency count from there, or
  2. Find a way to do a frequency count using just the terms from the pre-defined list.
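For what it's worth, the first approach can be sketched with standard GNU tools; the file names here are placeholders for your own term list and text:

```shell
# Keep only the whitelisted words, then count them.
# -o prints each match on its own line, -i ignores case,
# -w matches whole words only, -F treats each pattern as a fixed string,
# -f reads the patterns from words.txt (one per line).
grep -oiwFf words.txt text.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c
```

This prints one count per whitelisted word that actually appears in the text, which is close to what the spreadsheet step needs.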

Any ideas?

asked Nov 24 '08 by fdsayre


1 Answer

I would go with the second idea. Here is a simple Perl program that reads a list of words from the first file provided, then counts occurrences of each of those words in the second file and prints the results in tab-separated format. The word list should contain one word per line.

#!/usr/bin/perl

use strict;
use warnings;

my $word_list_file = shift;
my $process_file   = shift;

my %word_counts;

# Open the word list file, read a line at a time, remove the newline,
# add the word to the hash of words to track, and initialize its count to zero
open(my $words_fh, '<', $word_list_file) or die "Failed to open list file: $!\n";
while (<$words_fh>) {
  chomp;
  # Store words in lowercase for case-insensitive matching
  $word_counts{lc($_)} = 0;
}
close($words_fh);

# Read the text file one line at a time, break the text up into words
# based on word boundaries (\b), and iterate through the words,
# incrementing a word's count if it is in the hash
open(my $text_fh, '<', $process_file) or die "Failed to open process file: $!\n";

while (<$text_fh>) {
  chomp;
  while (/-$/) {
    # If the line ends in a hyphen, remove the hyphen and
    # continue reading lines until we find one that doesn't
    chop;
    my $next_line = <$text_fh>;
    defined($next_line) ? $_ .= $next_line : last;
  }

  my @words = split /\b/, lc;    # Split the lower-cased version of the string
  foreach my $word (@words) {
    $word_counts{$word}++ if exists $word_counts{$word};
  }
}
close($text_fh);

# Print each word in the hash in alphabetical order along with the
# number of times it was encountered, delimited by tabs (\t)
foreach my $word (sort keys %word_counts) {
  print "$word\t$word_counts{$word}\n";
}

If the file words.txt contains:

linux
frequencies
science
words

And the file text.txt contains the text of your post, the following command:

perl analyze.pl words.txt text.txt

will print:

frequencies     3
linux   1
science 1
words   3

Note that breaking on word boundaries using \b may not work the way you want in all cases; for example, if your text files contain words that are hyphenated across lines, you will need to do something a little more intelligent to match them. In that case you can check whether the last character of a line is a hyphen and, if it is, remove the hyphen and read another line before splitting the line into words.

Edit: Updated version that handles words case-insensitively and handles hyphenated words across lines.

Note that if there are hyphenated words, some of which are broken across lines and some that are not, this won't find them all, because it only removes hyphens at the end of a line. In that case you may want to simply remove all hyphens and match words after the hyphens are gone. You can do this by adding the following line right before the split:

s/-//g;
answered Sep 29 '22 by Robert Gamble