I'm working with Twitter hashtags, and I've already counted how many times each one appears in my CSV file. My CSV file looks like this:
GilletsJaunes, 100
Macron, 50
gilletsjaune, 20
tax, 10
Now I would like to group together terms that are close to each other, such as "GilletsJaunes" and "gilletsjaune", using the fuzzywuzzy library. If the similarity score between two terms is greater than 80, their counts should be summed under one of the two terms and the other term deleted. This would give:
GilletsJaunes, 120
Macron, 50
tax, 10
This is how I use fuzzywuzzy:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
fuzz.ratio("GiletsJaunes", "giletsjaune")
82 #output
First, copy these two functions to be able to compute the argmax:
# given an iterable of pairs return the key corresponding to the greatest value
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]

# given an iterable of values return the index of the greatest value
def argmax_index(values):
    return argmax(enumerate(values))
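As a quick check of what these helpers return, here is a small sketch. The sample scores are made up for illustration; the two functions are repeated only so the snippet runs on its own.

```python
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]

def argmax_index(values):
    return argmax(enumerate(values))

# argmax returns the key paired with the largest value
best_key = argmax([("Macron", 31), ("GilletsJaunes", 80)])
print(best_key)  # GilletsJaunes

# argmax_index returns the position of the largest value
best_pos = argmax_index([31, 80, 12])
print(best_pos)  # 1

# on a tie, max() keeps the first maximum it encounters
tie_pos = argmax_index([5, 9, 9])
print(tie_pos)  # 1
```

Because ties resolve to the earliest index, a query that scores equally against two references is merged into whichever reference was added first.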
Second, load the content of your CSV into a Python dictionary and proceed as follows:
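from fuzzywuzzy import fuzz

# named "counts" rather than "input" to avoid shadowing the built-in input()
counts = {
    'GilletsJaunes': 100,
    'Macron': 50,
    'gilletsjaune': 20,
    'tax': 10,
}

# fuzz.ratio is case-sensitive, so 'GilletsJaunes' vs 'gilletsjaune' scores only
# about 80; the threshold is therefore set lower than the 80 from the question
threshold = 50

output = dict()
for query in counts:
    references = list(output.keys())  # important: this is output.keys(), not counts.keys()!
    scores = [fuzz.ratio(query, ref) for ref in references]
    if any(s > threshold for s in scores):
        # merge the count into the closest term already kept
        best_reference = references[argmax_index(scores)]
        output[best_reference] += counts[query]
    else:
        # no sufficiently close term yet: keep this one as a new reference
        output[query] = counts[query]

print(output)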
{'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
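The loading step is not shown above, so here is a minimal end-to-end sketch. It assumes the file has no header row and exactly two columns (hashtag, count). The names load_counts and merge_similar are hypothetical helpers, not part of fuzzywuzzy. Unlike the code above, it lowercases both strings before comparing, which is one way to keep the threshold of 80 from the question. If fuzzywuzzy is not installed, it falls back to difflib from the standard library as a stand-in scorer.

```python
import csv
import io

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Stand-in (assumption): difflib yields a comparable 0-100 similarity score.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def load_counts(lines):
    """Read 'hashtag, count' rows into a dict, keeping file order."""
    counts = {}
    for row in csv.reader(lines, skipinitialspace=True):
        if row:  # skip blank lines
            tag = row[0].strip()
            counts[tag] = counts.get(tag, 0) + int(row[1])
    return counts


def merge_similar(counts, threshold=80):
    """Fold each tag into the closest tag already kept, if close enough."""
    merged = {}
    for tag, count in counts.items():
        # Compare lowercased forms so that case differences do not lower the score.
        scored = [(ref, ratio(tag.lower(), ref.lower())) for ref in merged]
        best_ref, best_score = max(scored, key=lambda p: p[1], default=(None, 0))
        if best_score > threshold:
            merged[best_ref] += count
        else:
            merged[tag] = count
    return merged


# In-memory stand-in for the CSV file from the question.
sample = io.StringIO("GilletsJaunes, 100\nMacron, 50\ngilletsjaune, 20\ntax, 10\n")
result = merge_similar(load_counts(sample))
print(result)  # {'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
```

To read a real file instead, pass an open file object, for example open("hashtags.csv", newline="") where the filename is a placeholder. Since rows are processed in file order, the first spelling encountered is the one that is kept.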