I am doing textual analysis on texts that, due to PDF-to-txt conversion errors, sometimes lump words together. So instead of matching words, I want to match strings.
For example, I have the string:
mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'
And I search for
key_words=['loss', 'debt', 'debts', 'elephant']
The output should be of the form:
Filename  Debt  Debts  Loss  Elephant
mystring  2     1      1     0
The code I have works well, except for a few glitches: 1) it does not report the frequency of zero-frequency words (so 'Elephant' would not be in the output); 2) the order of the words in key_words seems to matter (i.e. I sometimes get 1 count each for 'debt' and 'debts', and sometimes it reports only 2 counts for 'debt' and 'debts' is not reported). I could live with the second point if I managed to "print" the variable names to the dataset... but I am not sure how.
Below is the relevant code. Thanks! PS. Needless to say, it is not the most elegant piece of code, but I am slowly learning.
import csv
import glob
import re
import collections
from string import punctuation

bad = set(['debts', 'debt'])

csvfile = open("freq_10k_test.csv", "w", newline='', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)

for filename in glob.glob('*.txt'):
    with open(filename, encoding='utf-8', errors='ignore') as f:
        new_review = [f.read()]
    file_name = [filename]
    freq_all = []
    for review in new_review:
        # strip punctuation before matching
        review_processed = review.lower()
        for p in punctuation:
            review_processed = review_processed.replace(p, '')
        pattern = re.compile("|".join(bad), flags=re.IGNORECASE)
        freq_iter = collections.Counter(pattern.findall(review_processed))
        frequency = [value for (key, value) in sorted(freq_iter.items())]
        freq_all.append(frequency)
    fulldata = [[file_name[i]] + freq for i, freq in enumerate(freq_all)]
    writer.writerows(fulldata)
    csvfile.flush()
You can just pre-initialize the counter, something like this:
freq_iter = collections.Counter()
freq_iter.update({x:0 for x in bad})
freq_iter.update(pattern.findall(review_processed))
One nice thing about Counter is that you don't actually have to pre-initialize it - you can just do c = Counter(); c['key'] += 1 - but nothing prevents you from pre-initializing some values to 0 if you want.
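For example, a minimal sketch with a made-up snippet of text, showing that the pre-seeded keys survive with a count of 0:

```python
import collections
import re

bad = ['debt', 'debts', 'elephant']
text = 'we like some debts and more debt'

# Pre-seed every keyword with 0 so absent words still show up in the output.
freq_iter = collections.Counter({w: 0 for w in bad})
freq_iter.update(re.findall("|".join(bad), text))

print(sorted(freq_iter.items()))
# → [('debt', 2), ('debts', 0), ('elephant', 0)]
```

Without the pre-seeding step, 'elephant' (and 'debts', which the alternation never matches here) would simply be missing from the Counter.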
For the debt/debts thing - that is just an insufficiently specified problem. What do you want the code to do in that case? If you want it to match on the longest pattern matched, you need to sort the list longest-first, that will solve it. If you want both reported, you may need to do multiple searches and save all the results.
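A minimal sketch of the longest-first approach (sorting the keywords by length before joining them into the pattern):

```python
import re

bad = {'debts', 'debt'}

# Sort longest-first so 'debts' is tried before 'debt' in the alternation.
pattern = re.compile("|".join(sorted(bad, key=len, reverse=True)))

print(pattern.findall('debtor debts my debt'))
# → ['debt', 'debts', 'debt']
```

With this ordering, the standalone 'debts' is now reported as 'debts' rather than being swallowed by the shorter 'debt' alternative.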
Updated to add some information on why it can't find debts: that has more to do with how re.findall handles alternation than anything else. The alternatives in debt|debts are tried left to right, so debt matches before debts is ever tried; and once findall has found a match, it doesn't include that text in subsequent matches:
In [2]: re.findall('(debt|debts)', 'debtor debts my debt')
Out[2]: ['debt', 'debt', 'debt']
If you really want to find all instances of every word, you need to do them separately:
In [3]: re.findall('debt', 'debtor debts my debt')
Out[3]: ['debt', 'debt', 'debt']
In [4]: re.findall('debts', 'debtor debts my debt')
Out[4]: ['debts']
However, maybe what you are really looking for is words. In this case, use the \b operator to require a word break:
In [13]: re.findall(r'\bdebt\b', 'debtor debts my debt')
Out[13]: ['debt']
In [14]: re.findall(r'(\b(?:debt|debts)\b)', 'debtor debts my debt')
Out[14]: ['debts', 'debt']
I don't know whether this is what you want or not... in this case, it was able to differentiate debt and debts correctly, but it missed debtor, because that only matches as a substring, and we asked it not to.
Depending on your use case, you may want to look into stemming the text... I believe there is one in nltk that is pretty simple (I used it only once, so I won't try to post an example... this question Combining text stemming and removal of punctuation in NLTK and scikit-learn may be useful); it should reduce debt, debts, and debtor all to the same root word debt, and do similar things for other words. This may or may not be helpful; I don't know what you are doing with it.
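As a hedged sketch of the stemming idea, assuming nltk is installed: note that the plain Porter stemmer reduces debts to debt but leaves debtor unchanged, so a more aggressive stemmer or a lemmatizer would be needed if debtor must also collapse to debt.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ['debt', 'debts', 'debtor']
print([stemmer.stem(w) for w in words])
# → ['debt', 'debt', 'debtor']
```

You would stem both the keywords and the text tokens before counting, so the match is done on root forms rather than raw strings.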
To get the output you want:
mystring = 'The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words = ['loss', 'debt', 'debts', 'elephant']

for kw in key_words:
    count = mystring.count(kw)
    print('%s %s' % (kw, count))
Or for words:
from collections import defaultdict
from string import punctuation

key_words = ['loss', 'debt', 'debts', 'elephant']

# Count every occurrence, stripping punctuation so 'debts.' matches 'debts'.
d = defaultdict(int)
for word in mystring.split():
    d[word.strip(punctuation)] += 1

for kw in key_words:
    print('%s %s' % (kw, d[kw]))
A sleek solution is to use regex:
import regex

mystring = 'The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words = ['loss', 'debt', 'debts', 'elephant']

print({k: len(regex.findall(k, mystring, overlapped=True)) for k in key_words})
which prints:
{'loss': 1, 'debt': 2, 'debts': 1, 'elephant': 0}
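Putting the counts back into the CSV shape the question asks for - a sketch, where the output filename and the use of plain substring counting via str.count are my assumptions; any of the regex-based counting variants above could be swapped in:

```python
import csv

key_words = ['loss', 'debt', 'debts', 'elephant']
mystring = 'The lossof our income made us go into debt but this is not too bad as we like some debts.'

with open('freq_out.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # One fixed column per keyword, so zero-frequency words still appear.
    writer.writerow(['Filename'] + key_words)
    writer.writerow(['mystring'] + [mystring.count(kw) for kw in key_words])
```

Because the column list is fixed up front, 'elephant' gets a 0 in its column instead of silently disappearing, which addresses glitch 1) from the question.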