I want to take every word from a text file, and count the word frequency in a dictionary.
Example: 'this is the textfile, and it is used to take words and count'
d = {'this': 1, 'is': 2, 'the': 1, ...}
I am not that far, but I just can't see how to complete it. My code so far:
import sys
argv = sys.argv[1]
data = open(argv)
words = data.read()
data.close()
wordfreq = {}
for i in words:
#there should be a counter and somehow it must fill the dict.
from collections import Counter
t = 'this is the textfile, and it is used to take words and count'
dict(Counter(t.split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}
Or better with removing punctuation before counting:
dict(Counter(t.replace(',', '').replace('.', '').split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile': 1}
If you don't want to use collections.Counter, you can write your own function:
import sys
filename = sys.argv[1]
fp = open(filename)
data = fp.read()
words = data.split()
fp.close()
unwanted_chars = ".,-_ (and so on)"
wordfreq = {}
for raw_word in words:
word = raw_word.strip(unwanted_chars)
if word not in wordfreq:
wordfreq[word] = 0
wordfreq[word] += 1
for finer things, look at regular expressions.
Although using Counter
from the collections
library as suggested by @Michael is a better approach, I am adding this answer just to improve your code. (I believe this will be a good answer for a new Python learner.)
From the comment in your code it seems like you want to improve your code. And I think you are able to read the file content in words (while usually I avoid using read()
function and use for line in file_descriptor:
kind of code).
As words
is a string, in for loop, for i in words:
the loop-variable i
is not a word but a char. You are iterating over chars in the string instead of iterating over words in the string words
. To understand this, notice following code snippet:
>>> for i in "Hi, h r u?":
... print i
...
H
i
,
h
r
u
?
>>>
Because iterating over the given string char by chars instead of word by words is not what you wanted to achieve, to iterate words by words you should use the split
method/function from string class in Python.str.split(str="", num=string.count(str))
method returns a list of all the words in the string, using str as the separator (splits on all whitespace if left unspecified), optionally limiting the number of splits to num.
Notice the code examples below:
Split:
>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?']
loop with split:
>>> for i in "Hi, how are you?".split():
... print i
...
Hi,
how
are
you?
And it looks like something you need. Except for word Hi,
because split()
, by default, splits by whitespaces so Hi,
is kept as a single string (and obviously) you don't want that.
To count the frequency of words in the file, one good solution is to use regex. But first, to keep the answer simple I will be using replace()
method. The method str.replace(old, new[, max])
returns a copy of the string in which the occurrences of old have been replaced with new, optionally restricting the number of replacements to max.
Now check code example below to see what I suggested:
>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?'] # it has , with Hi
>>> "Hi, how are you?".replace(',', ' ').split()
['Hi', 'how', 'are', 'you?'] # , replaced by space then split
loop:
>>> for word in "Hi, how are you?".replace(',', ' ').split():
... print word
...
Hi
how
are
you?
Now, how to count frequency:
One way is use Counter
as @Michael suggested, but to use your approach in which you want to start from empty an dict. Do something like this code sample below:
words = f.read()
wordfreq = {}
for word in .replace(', ',' ').split():
wordfreq[word] = wordfreq.setdefault(word, 0) + 1
# ^^ add 1 to 0 or old value from dict
What am I doing? Because initially wordfreq
is empty you can't assign it to wordfreq[word]
for the first time (it will raise key exception error). So I used setdefault
dict method.
dict.setdefault(key, default=None)
is similar to get()
, but will set dict[key]=default
if key is not already in dict. So for the first time when a new word comes, I set it with 0
in dict using setdefault
then add 1
and assign to the same dict.
I have written an equivalent code using with open instead of single open
.
with open('~/Desktop/file') as f:
words = f.read()
wordfreq = {}
for word in words.replace(',', ' ').split():
wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print wordfreq
That runs like this:
$ cat file # file is
this is the textfile, and it is used to take words and count
$ python work.py # indented manually
{'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2,
'it': 1, 'to': 1, 'take': 1, 'words': 1,
'the': 1, 'textfile': 1}
Using re.split(pattern, string, maxsplit=0, flags=0)
Just change the for loop: for i in re.split(r"[,\s]+", words):
, that should produce the correct output.
Edit: better to find all alphanumeric character because you may have more than one punctuation symbols.
>>> re.findall(r'[\w]+', words) # manually indent output
['this', 'is', 'the', 'textfile', 'and',
'it', 'is', 'used', 'to', 'take', 'words', 'and', 'count']
use for loop as: for word in re.findall(r'[\w]+', words):
How would I write code without using read()
:
File is:
$ cat file
This is the text file, and it is used to take words and count. And multiple
Lines can be present in this file.
It is also possible that Same words repeated in with capital letters.
Code is:
$ cat work.py
import re
wordfreq = {}
with open('file') as f:
for line in f:
for word in re.findall(r'[\w]+', line.lower()):
wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print wordfreq
Used lower()
to convert an upper letter to lower letter.
output:
$python work.py # manually strip output
{'and': 3, 'letters': 1, 'text': 1, 'is': 3,
'it': 2, 'file': 2, 'in': 2, 'also': 1, 'same': 1,
'to': 1, 'take': 1, 'capital': 1, 'be': 1, 'used': 1,
'multiple': 1, 'that': 1, 'possible': 1, 'repeated': 1,
'words': 2, 'with': 1, 'present': 1, 'count': 1, 'this': 2,
'lines': 1, 'can': 1, 'the': 1}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With