
Python: Extract hashtags out of a text file

Tags: python, hashtag

I've written the code below to extract hashtags, as well as tags starting with '@', from a text file, count them, append them to a list, and sort them in descending order. The problem is that the text is not always cleanly formatted: there may be no spaces between individual hashtags, so runs like the following occur (you can check this with the commented-out print statement inside the for loop): #socality#thisismycommunity#themoderndayexplorer#modernoutdoors#mountaincultureelevated

The .split() method treats such a run as a single word. What is the best way to handle this?

Here is the .txt file

Grateful for your time.

name = input("Enter file:")
if len(name) < 1 : name = "tags.txt"
handle = open(name)
tags = dict()
lst = list()

for line in handle :
    hline = line.split()
    for word in hline:
        if word.startswith('@') : tags[word] = tags.get(word,0) + 1
        else :
            tags[word] = tags.get(word,0) + 1
        #print(word)

for k,v in tags.items() :
    tags_order = (v,k)
    lst.append(tags_order)

lst = sorted(lst, reverse=True)[:34]
print('Final Dictionary: ' , '\n')
for v,k in lst :
    print(k , v, '')
Rui Torres asked Mar 06 '23 23:03


1 Answer

Use a regular expression. Only two rules are needed: a tag must start with either # or @, and it runs until the next whitespace character, # or @.

This code

import re
tags = []
# Text mode in Python 3 uses universal newlines by default (newline=None)
with open('../Downloads/tags.txt', 'r') as f:
    for line in f:  # iterate over the file line by line
        tags += re.findall(r'[#@][^\s#@]+', line)

creates a list of all tags in the file. You can easily adjust it to store the found tags in your dictionary; instead of storing the result straight away in tags, loop over it and do with each item as you please.
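As a minimal sketch of that adjustment, the counting can be done with collections.Counter instead of a plain dictionary. The names TAG_PATTERN, count_tags and the sample lines are made up for illustration; the limit of 34 is taken from the question's code.

```python
import re
from collections import Counter

# Same pattern as above, compiled once (TAG_PATTERN is an illustrative name)
TAG_PATTERN = re.compile(r'[#@][^\s#@]+')

def count_tags(lines):
    """Count every #tag and @tag found in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(TAG_PATTERN.findall(line))
    return counts

# Hypothetical sample input; a real run would pass the open file object instead
sample = ["#hike#camp @rui", "#hike  @rui #hike"]
counts = count_tags(sample)

# most_common sorts by count, highest first
for tag, n in counts.most_common(34):
    print(tag, n)
```

Because a file object yields its lines when iterated, count_tags(f) should work unchanged inside the with block.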

The regex is built up from these two custom character classes:

  • [#@] - either the single character # or @ at the start
  • [^\s#@]+ - one or more characters that are not whitespace (\s matches space, tab, carriage return, newline and so on), #, or @; the + is greedy, so it matches as many characters as possible.

So findall starts matching at the start of any tag and then grabs as much as it can, stopping only when encountering any of the "not" characters.
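To see this on the run-together string from the question, here is a small check comparing .split() with the regex (the variable names are illustrative):

```python
import re

text = ("#socality#thisismycommunity#themoderndayexplorer"
        "#modernoutdoors#mountaincultureelevated")

# split() only breaks on whitespace, so the whole run stays one "word"
words = text.split()
print(len(words))  # 1

# The regex starts a new match at every # or @
tags = re.findall(r'[#@][^\s#@]+', text)
print(tags)
# ['#socality', '#thisismycommunity', '#themoderndayexplorer',
#  '#modernoutdoors', '#mountaincultureelevated']
```

Note that the pattern knows nothing about context: something like user@example.com would also yield the match @example.com, so extra filtering may be needed if the text contains e-mail addresses.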

findall returns a list of matching items, which you can immediately add to an existing list, or loop over the found items in turn:

for tag in re.findall(r'[#@][^\s#@]+', line):
    # process "tag" any way you want here, for example:
    print(tag)

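The source text file contains Windows-style \r\n line endings, and so I initially got a lot of empty "lines" on my Mac. Opening the text file in universal newline mode makes sure that is handled transparently by the line reading part of Python. In Python 3 this is the default for files opened in text mode (newline=None); the explicit 'U' mode flag is deprecated and was removed in Python 3.11.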
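The effect of newline translation can be checked without touching the file system. This sketch assumes Python 3 and wraps some made-up bytes with \r\n endings in io.TextIOWrapper, which takes the same newline argument as open(); the read_lines helper and the RAW data are illustrative only.

```python
import io

# Made-up file contents with Windows-style line endings
RAW = b"#one #two\r\n@three\r\n"

def read_lines(raw, newline):
    """Decode raw bytes as text and return the lines, using the given newline mode."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=newline)
    return list(text)

translated = read_lines(RAW, None)   # universal newlines: \r\n becomes \n
untranslated = read_lines(RAW, "")   # lines are split, but endings are left as-is

print(translated)    # ['#one #two\n', '@three\n']
print(untranslated)  # ['#one #two\r\n', '@three\r\n']
```

Either way the tag pattern itself is unaffected, since \s also matches \r and so a trailing carriage return never ends up inside a tag.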

Jongware answered Mar 30 '23 05:03