I want to create a visualization of the most frequently used words between me and my gf on Facebook. I downloaded all messages directly from FB as a JSON file and I got the counter working,
BUT:
sender_name" or timestamps which are 13 digit numbers\u00c5, \u0082a, \u00c5, \u0082a hardcoded into the wordsHow do I exclude short meaningless words like 'you, I, a, but' etc?
For the first problem I tried creating a dictionary of words to exclude, but I have no idea how to even approach excluding them. The other issue is deleting the timestamp numbers, because they are not constant.
For the second problem I tried just opening the file in a word editor and replacing the symbol codes, but it crashes every time because of the size of the file (more than 1.5 million lines).
Here's the code that I used to print most frequent words:
import re
import collections
import json
file = open('message.json', encoding="utf8")
a = file.read()
words = re.findall(r'\w+', a)
most_common = collections.Counter(map(str.lower, words)).most_common(50)
print(most_common)
And JSON file structure looks like this:
{
"sender_name": "xxxxxx",
"timestamp_ms": 1540327935616,
"content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
"type": "Generic"
},
The problem is that you are running findall over the whole file, so keys like "sender_name" and the timestamps get counted too. Parse the JSON and count only the message text, something like this:
import re
import collections
import json
def words(s):
    # tokenize a message's text into words (raw string avoids an invalid-escape warning)
    return re.findall(r'\w+', s, re.UNICODE | re.IGNORECASE)

file = open('message.json', encoding="utf8")
data = json.load(file)

# count the lowercased words found in the 'content' field of every message
counts = collections.Counter(w.lower() for e in data for w in words(e.get('content', '')))
most_common = counts.most_common(50)
print(most_common)
Output
[('siä', 1), ('ci', 1), ('podobajä', 1)]
The output is for a file with the following content (a list of JSON objects):
[{
"sender_name": "xxxxxx",
"timestamp_ms": 1540327935616,
"content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
"type": "Generic"
}]
Explanation
With json.load, load the content of the file into data as a list of dictionaries, then iterate over the messages and count the words of each 'content' field using the words function and Counter.
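As for excluding short filler words ('you', 'I', 'a', 'but'), the snippet above does not handle that. A minimal sketch, assuming you keep your own hand-picked stopword set (the set below is only an illustration), is to filter the generator before it reaches the Counter:
import re
import collections
import json

# hand-picked stopword set -- just an illustration, extend it to taste
STOPWORDS = {'you', 'i', 'a', 'but', 'the', 'and', 'to', 'of'}

def words(s):
    return re.findall(r'\w+', s, re.UNICODE | re.IGNORECASE)

file = open('message.json', encoding="utf8")
data = json.load(file)

# skip any token that appears in the stopword set
counts = collections.Counter(
    w.lower()
    for e in data
    for w in words(e.get('content', ''))
    if w.lower() not in STOPWORDS
)
print(counts.most_common(50))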
UPDATE
Given the actual format of the file, you need to change the line data = json.load(file) to data = json.load(file)["messages"], because the content looks like the following:
{
"participants":[],
"messages": [
{
"sender_name": "xxxxxx",
"timestamp_ms": 1540327935616,
"content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
"type": "Generic"
},
{
"sender_name": "aaa",
"timestamp_ms": 1540329382942,
"content": "aaa",
"type": "Generic"
},
{
"sender_name": "aaa",
"timestamp_ms": 1540329262248,
"content": "aaa",
"type": "Generic"
}
]
}
The output is:
[('aaa', 2), ('siä', 1), ('podobajä', 1), ('ci', 1)]
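A side note on the odd tokens like 'siä' and 'podobajä': Facebook's export writes each UTF-8 byte as its own \u00xx escape, so Python reads the strings as mojibake. A commonly suggested workaround (an assumption on my part, not something the original answer covers) is to re-encode each content string as Latin-1 and decode it back as UTF-8 before tokenizing:
def fix_encoding(s):
    # reinterpret the code points as raw bytes, then decode those bytes as UTF-8
    try:
        return s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s  # leave strings that are already correct untouched

print(fix_encoding("Podobaj\u00c4\u0085 ci si\u00c4\u0099"))  # -> Podobają ci się
Wrapping the field access as words(fix_encoding(e.get('content', ''))) would make the counter report 'podobają' and 'się' instead of the truncated forms.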