
Python collections.Counter and excluding stuff from JSON

I want to create a visualization of the words my girlfriend and I use most frequently on Facebook. I downloaded all our messages directly from FB as a JSON file and I got the counter working

BUT:

  • Counter also counts element names from the JSON like "sender_name", as well as timestamps, which are 13-digit numbers
  • The JSON file is lacking UTF encoding - I have strings like \u00c5, \u0082a, \u00c5, \u0082a hardcoded into the words

How do I also exclude short, meaningless words like 'you', 'I', 'a', 'but', etc.?

For the first problem I tried creating a dictionary of words to exclude, but I have no idea how to even approach excluding them. Deleting the timestamp numbers is also a problem because they are not constant.

For the second problem I tried just opening the file in a word editor and replacing the symbol codes, but it crashes every time because of the size of the file (more than 1.5 million lines).

Here's the code that I used to print most frequent words:

import re
import collections
import json

file = open('message.json', encoding="utf8")
a = file.read()

words = re.findall(r'\w+', a)

most_common = collections.Counter(map(str.lower, words)).most_common(50)
print(most_common)

And JSON file structure looks like this:

{
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
    },
Marsin Ka asked Dec 03 '25 23:12


1 Answer

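The problem is that you are running findall over the raw text of the whole file, so the JSON keys and timestamps get counted as words too. Parse the JSON first and look only at the message text, something like this: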

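import re
import collections
import json


def words(s):
    # Raw string so that \w reaches the regex engine intact; for str patterns
    # in Python 3, \w already matches Unicode word characters
    return re.findall(r'\w+', s)

# Parse the file as JSON instead of scanning its raw text
with open('message.json', encoding="utf8") as file:
    data = json.load(file)

# Count only the words in each 'content' field (an entry without one counts as empty)
counts = collections.Counter(w.lower() for e in data for w in words(e.get('content', '')))
most_common = counts.most_common(50)
print(most_common)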

Output

[('siä', 1), ('ci', 1), ('podobajä', 1)]

The output is for a file with the following content (a list of JSON objects):

[{
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
}]

Explanation

json.load reads the content of the file into data, a list of dictionaries. The code then iterates over the elements of that list and counts the words of each 'content' field using the function words and Counter.

Further

  1. For removing words such as I, a and but see this
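
As a rough illustration of that kind of filtering, here is a minimal sketch that skips a hand-written set of stop words and any token shorter than a minimum length. STOP_WORDS, count_words and min_length are names made up for this example, and the word list is only a placeholder; a usable list for English or Polish would be much longer.

```python
import collections
import re

# Hypothetical, hand-picked stop words; extend this set for your own language.
STOP_WORDS = {"i", "a", "but", "you", "the", "and"}


def count_words(messages, stop_words=STOP_WORDS, min_length=2):
    """Count words in each 'content' field, skipping stop words and short tokens."""
    counts = collections.Counter()
    for message in messages:
        # Entries without a 'content' key contribute no words.
        for word in re.findall(r"\w+", message.get("content", "").lower()):
            if len(word) >= min_length and word not in stop_words:
                counts[word] += 1
    return counts


sample = [
    {"sender_name": "xxxxxx", "content": "I like you but you like tea"},
    {"sender_name": "aaa", "content": "Tea and a biscuit"},
    {"sender_name": "aaa", "type": "Generic"},  # no 'content' key at all
]
print(count_words(sample).most_common(3))
```

Because only the 'content' field is ever read, the key names and the 13-digit timestamps never reach the counter, so they need no special exclusion.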

UPDATE

Given the actual format of your file, you need to change the line data = json.load(file) to data = json.load(file)["messages"]. For the following content:

{
  "participants":[],
  "messages": [
    {
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
    },
    {
      "sender_name": "aaa",
      "timestamp_ms": 1540329382942,
      "content": "aaa",
      "type": "Generic"
    },
    {
      "sender_name": "aaa",
      "timestamp_ms": 1540329262248,
      "content": "aaa",
      "type": "Generic"
    }
  ]
}

The output is:

[('aaa', 2), ('siä', 1), ('podobajä', 1), ('ci', 1)]
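
The 'ä' in these results is a leftover of the encoding problem from the question. The escapes \u00c4\u0085 look like the two UTF-8 bytes of 'ą' (0xC4 0x85) written out as two separate code points. Assuming the whole export was mangled the same way, a round trip through Latin-1 should restore the text. This is only a sketch: fix_mojibake is a hypothetical helper name, and it returns the string unchanged when it does not fit that pattern.

```python
import json


def fix_mojibake(text):
    """Reinterpret UTF-8 bytes that were stored as individual Latin-1 code points."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mangled this way (or is already correct); keep it as-is.
        return text


# Same 'content' value as in the sample file, parsed from its JSON form.
raw = '{"content": "Podobaj\\u00c4\\u0085 ci si\\u00c4\\u0099"}'
message = json.loads(raw)
fixed = fix_mojibake(message["content"])
print(fixed)  # Podobają ci się
```

Applying it before tokenizing, for example words(fix_mojibake(e.get('content', ''))), would make the counter see 'podobają' and 'się' rather than the truncated 'podobajä' and 'siä'.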
Dani Mesejo answered Dec 05 '25 12:12


