 

Python 3.5 - Get counter to report zero-frequency items

I am doing textual analysis on texts that, due to PDF-to-txt conversion errors, sometimes lump words together. So instead of matching words, I want to match strings.

For example, I have the string:

mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'

And I search for

key_words=['loss', 'debt', 'debts', 'elephant']

The output should be of the form:

Filename Debt Debts Loss Elephant
mystring  2    1     1    0

The code I have works well, except for a few glitches: 1) it does not report zero-frequency words (so 'elephant' would not be in the output); 2) the order of the words in key_words seems to matter (i.e. I sometimes get 1 count each for 'debt' and 'debts', and sometimes it reports 2 counts for 'debt' and 'debts' is not reported at all). I could live with the second point if I managed to "print" the variable names to the dataset... but I am not sure how.

Below is the relevant code. Thanks! PS. Needless to say, it is not the most elegant piece of code, but I am slowly learning.

import csv
import glob
import re
import collections
from string import punctuation

bad = set(['debts', 'debt'])

csvfile = open("freq_10k_test.csv", "w", newline='', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)
for filename in glob.glob('*.txt'):

    with open(filename, encoding='utf-8', errors='ignore') as f:
        file_name = [filename]

        new_review = [f.read()]
        freq_all = []

        for review in new_review:
            review_processed = review.lower()
            for p in punctuation:
                review_processed = review_processed.replace(p, '')
            # compile and count once, after all punctuation is stripped
            pattern = re.compile("|".join(bad), flags=re.IGNORECASE)
            freq_iter = collections.Counter(pattern.findall(review_processed))

            frequency = [value for (key, value) in sorted(freq_iter.items())]
            freq_all.append(frequency)

    fulldata = [[file_name[i]] + freq for i, freq in enumerate(freq_all)]

    # reuse the writer opened above instead of reopening the file in append mode
    writer.writerows(fulldata)
    csvfile.flush()
asked Jun 29 '17 by anne_t


3 Answers

You can just pre-initialize the counter, something like this:

freq_iter = collections.Counter()
freq_iter.update({x:0 for x in bad})
freq_iter.update(pattern.findall(review_processed))   

One nice thing about Counter is that you don't actually have to pre-initialize it - you can just do c = Counter(); c['key'] += 1, but nothing prevents you from pre-initializing some values to 0 if you want.
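A minimal, self-contained sketch of this pre-initialization, using the example string and key words from the question as literal values:

```python
import re
from collections import Counter

key_words = ['loss', 'debt', 'debts', 'elephant']
text = 'the lossof our income made us go into debt but we like some debts'

pattern = re.compile("|".join(key_words))
freq = Counter({w: 0 for w in key_words})  # zero-frequency words stay visible
freq.update(pattern.findall(text))
print(dict(freq))  # {'loss': 1, 'debt': 2, 'debts': 0, 'elephant': 0}
```

Note that 'debts' still shows 0 here: that is the alternation-order behavior covered below, not a problem with the pre-initialization itself.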

For the debt/debts thing - that is just an underspecified problem. What do you want the code to do in that case? If you want it to match the longest pattern, sort the alternatives longest-first when building the regex; that will solve it. If you want both reported, you need to do multiple searches and save all the results.
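The longest-first sort can be sketched like this (using the question's key words as assumed input):

```python
import re

key_words = ['loss', 'debt', 'debts', 'elephant']
# longest alternatives first, so 'debts' is tried before 'debt'
pattern = re.compile("|".join(sorted(key_words, key=len, reverse=True)))
matches = pattern.findall('the lossof made us go into debt but we like some debts')
print(matches)  # ['loss', 'debt', 'debts']
```

With the unsorted list, the same call would report 'debt' twice and never 'debts'.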

Updated to add some information on why it can't find debts: that has more to do with how regex alternation works than anything else. re.findall tries the alternatives left to right and takes the first one that matches (so debt|debts can never match debts), and once it has consumed a match it does not revisit that text for subsequent matches:

In [2]: re.findall('(debt|debts)', 'debtor debts my debt')
Out[2]: ['debt', 'debt', 'debt']

If you really want to find all instances of every word, you need to do them separately:

In [3]: re.findall('debt', 'debtor debts my debt')
Out[3]: ['debt', 'debt', 'debt']

In [4]: re.findall('debts', 'debtor debts my debt')
Out[4]: ['debts']

However, maybe what you are really looking for is whole words. In this case, use the \b operator to require a word boundary:

In [13]: re.findall(r'\bdebt\b', 'debtor debts my debt')
Out[13]: ['debt']

In [14]: re.findall(r'(\b(?:debt|debts)\b)', 'debtor debts my debt')
Out[14]: ['debts', 'debt']

I don't know whether this is what you want or not... in this case, it was able to differentiate debt and debts correctly, but it missed debtor because it only matches a substring, and we asked it not to.

Depending on your use case, you may want to look into stemming the text... I believe there is one in nltk that is pretty simple (used it only once, so I won't try to post an example... this question Combining text stemming and removal of punctuation in NLTK and scikit-learn may be useful), it should reduce debt, debts, and debtor all to the same root word debt, and do similar things for other words. This may or may not be helpful; I don't know what you are doing with it.
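To illustrate what stemming buys you, here is a crude toy, not nltk's actual algorithm - the crude_stem helper is hypothetical and only strips two hard-coded suffixes:

```python
def crude_stem(word):
    # toy illustration only; a real stemmer (e.g. nltk's PorterStemmer)
    # applies many more rules and conditions
    for suffix in ('s', 'or'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

stems = [crude_stem(w) for w in ['debt', 'debts', 'debtor']]
print(stems)  # ['debt', 'debt', 'debt']
```

After stemming, all three variants collapse to one key, so a single count per root word suffices.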

answered Oct 19 '22 by Corley Brigman


To get just the counts you asked for:

mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words=['loss', 'debt', 'debts', 'elephant']
for kw in key_words:
  count = mystring.count(kw)
  print('%s %s' % (kw, count))

Or for words:

from collections import defaultdict
key_words=['loss', 'debt', 'debts', 'elephant']
d = defaultdict(int)
for word in mystring.split():  # iterate over the full list, not a set, so repeats are counted
  d[word] += 1

for kw in key_words:
  print('%s %s' % (kw, d[kw]))
answered Oct 19 '22 by D. Peter


A sleek solution is to use the third-party regex module, whose findall supports overlapped matches:

import regex
mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words=['loss', 'debt', 'debts', 'elephant']
print({k: len(regex.findall(k, mystring, overlapped=True)) for k in key_words})

This results in:

{'loss': 1, 'debt': 2, 'debts': 1, 'elephant': 0}
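To tie this back to the CSV shape asked for in the question, here is a stdlib-only sketch (plain re is enough here, since counting each key word separately avoids the alternation problem and keeps zero frequencies):

```python
import csv
import io
import re

mystring = 'The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words = ['loss', 'debt', 'debts', 'elephant']

# count each key word separately so shorter words are not shadowed by longer ones
counts = {k: len(re.findall(k, mystring)) for k in key_words}

buf = io.StringIO()  # stand-in for a real file handle
writer = csv.writer(buf)
writer.writerow(['Filename'] + key_words)
writer.writerow(['mystring'] + [counts[k] for k in key_words])
print(buf.getvalue())
```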
answered Oct 19 '22 by ntg