I want to create a visualization of the most frequently used words between me and my gf on Facebook. I downloaded all messages directly from FB as a JSON file and I got the counter working,
BUT:
sender_name" or timestamps which are 13 digit numbers\u00c5, \u0082a, \u00c5, \u0082a hardcoded into the wordsHow do I exclude short meaningless words like 'you, I, a, but' etc?
For the first problem I tried creating a dictionary of words to exclude, but I have no idea how to even approach excluding them. The other issue is deleting the timestamp numbers, because they are not constant.
For the second problem I tried just opening the file in a word editor and replacing the symbol codes, but it crashes every time because of the size of the file (more than 1.5 million lines).
Here's the code that I used to print most frequent words:
import re
import collections
import json
file = open('message.json', encoding="utf8")
a = file.read()
words = re.findall(r'\w+', a)
most_common = collections.Counter(map(str.lower, words)).most_common(50)
print(most_common)
And JSON file structure looks like this:
{
"sender_name": "xxxxxx",
"timestamp_ms": 1540327935616,
"content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
"type": "Generic"
},
The problem is that you are running findall over the whole file, so keys like "sender_name" and the timestamps get counted too. Parse the JSON and count only the message text, something like this:
import re
import collections
import json
def words(s):
    # tokenize a message's text into words (raw string avoids an invalid-escape warning)
    return re.findall(r'\w+', s, re.UNICODE | re.IGNORECASE)

file = open('message.json', encoding="utf8")
data = json.load(file)

# count the lowercased words found in the 'content' field of every message
counts = collections.Counter(w.lower() for e in data for w in words(e.get('content', '')))
most_common = counts.most_common(50)
print(most_common)
Output
[('siä', 1), ('ci', 1), ('podobajä', 1)]
The output is for a file with the following content (a list of JSON objects):
[{
"sender_name": "xxxxxx",
"timestamp_ms": 1540327935616,
"content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
"type": "Generic"
}]
Explanation
With json.load, load the content of the file into data as a list of dictionaries, then iterate over the messages and count the words of each 'content' field using the words function and Counter.
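As for excluding short filler words ('you', 'I', 'a', 'but'), the snippet above does not handle that. A minimal sketch, assuming you keep your own hand-picked stopword set (the set below is only an illustration), is to filter the generator before it reaches the Counter:
import re
import collections
import json

# hand-picked stopword set -- just an illustration, extend it to taste
STOPWORDS = {'you', 'i', 'a', 'but', 'the', 'and', 'to', 'of'}

def words(s):
    return re.findall(r'\w+', s, re.UNICODE | re.IGNORECASE)

file = open('message.json', encoding="utf8")
data = json.load(file)

# skip any token that appears in the stopword set
counts = collections.Counter(
    w.lower()
    for e in data
    for w in words(e.get('content', ''))
    if w.lower() not in STOPWORDS
)
print(counts.most_common(50))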
UPDATE
Given the actual format of the file, you need to change the line data = json.load(file) to data = json.load(file)["messages"], because the content looks like the following:
{
"participants":[],
"messages": [
{
"sender_name": "xxxxxx",
"timestamp_ms": 1540327935616,
"content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
"type": "Generic"
},
{
"sender_name": "aaa",
"timestamp_ms": 1540329382942,
"content": "aaa",
"type": "Generic"
},
{
"sender_name": "aaa",
"timestamp_ms": 1540329262248,
"content": "aaa",
"type": "Generic"
}
]
}
The output is:
[('aaa', 2), ('siä', 1), ('podobajä', 1), ('ci', 1)]
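A side note on the odd tokens like 'siä' and 'podobajä': Facebook's export writes each UTF-8 byte as its own \u00xx escape, so Python reads the strings as mojibake. A commonly suggested workaround (an assumption on my part, not something the original answer covers) is to re-encode each content string as Latin-1 and decode it back as UTF-8 before tokenizing:
def fix_encoding(s):
    # reinterpret the code points as raw bytes, then decode those bytes as UTF-8
    try:
        return s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s  # leave strings that are already correct untouched

print(fix_encoding("Podobaj\u00c4\u0085 ci si\u00c4\u0099"))  # -> Podobają ci się
Wrapping the field access as words(fix_encoding(e.get('content', ''))) would make the counter report 'podobają' and 'się' instead of the truncated forms.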