I've written the code below to extract hashtags and '@' tags from a text file, append them to a list, and sort them in descending order of count. The problem is that the text may not be perfectly formatted: there may be no spaces between individual hashtags, so several of them end up in a single word, as can be checked with the #print statement inside the for loop: #socality#thisismycommunity#themoderndayexplorer#modernoutdoors#mountaincultureelevated
The .split() method doesn't separate those. What would be the best way to handle this?
Here is the .txt file
Grateful for your time.
name = input("Enter file:")
if len(name) < 1 : name = "tags.txt"
handle = open(name)
tags = dict()
lst = list()
for line in handle :
    hline = line.split()
    for word in hline:
        if word.startswith('@') : tags[word] = tags.get(word,0) + 1
        else :
            tags[word] = tags.get(word,0) + 1
        #print(word)
for k,v in tags.items() :
    tags_order = (v,k)
    lst.append(tags_order)
lst = sorted(lst, reverse=True)[:34]
print('Final Dictionary: ' , '\n')
for v,k in lst :
    print(k , v, '')
Use a regular expression. There are only a few limits; a tag must start with either # or @, and it may not contain any spaces or other whitespace characters.
This code
import re
tags = []
# Python 3 text mode already translates \r\n line endings to \n
with open('../Downloads/tags.txt') as f:
    # iterate over the file line by line
    for line in f:
        tags += re.findall(r'[#@][^\s#@]+', line)
creates a list of all tags in the file. You can easily adjust it to store the found tags in your dictionary; instead of storing the result straight away in tags, loop over it and do with each item as you please.
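As a rough sketch of that adjustment, the tags could be counted with collections.Counter instead of a hand-rolled dictionary. The helper name count_tags and the sample lines (including @someone) are made up for illustration and are not part of the original code.

```python
import re
from collections import Counter

# Same pattern as in the answer: '#' or '@', then anything that is
# not whitespace, '#', or '@'.
TAG_PATTERN = re.compile(r'[#@][^\s#@]+')

def count_tags(lines):
    """Count every #tag and @tag found in an iterable of text lines.

    Hypothetical helper, named here only for illustration.
    """
    counts = Counter()
    for line in lines:
        counts.update(TAG_PATTERN.findall(line))
    return counts

# Made-up sample input; in practice pass the open file object instead.
sample = ['#socality#thisismycommunity @someone', 'more text #socality']
counts = count_tags(sample)

# Print the 34 most frequent tags, most frequent first.
for tag, n in counts.most_common(34):
    print(tag, n)
```

Note that most_common() keeps tags with equal counts in the order they were first seen, whereas sorting (count, tag) tuples in reverse, as in the question, also orders ties by tag name.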
The regex is built up from these two custom character classes:
[#@] - either the single character # or @ at the start
[^\s#@]+ - a sequence of not any single whitespace character (\s matches all whitespace such as space, tab, and returns), #, or @; at least one, and as many as possible.
So findall starts matching at the start of any tag and then grabs as much as it can, stopping only when encountering any of the "not" characters.
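To see the two character classes at work, here is a small check on a run-together string like the one in the question. The @some_user tag is invented for the example.

```python
import re

# Hashtags with no spaces between them, plus a made-up '@' tag.
text = '#socality#thisismycommunity#themoderndayexplorer @some_user'

# Each match starts at '#' or '@' and stops before the next
# whitespace character, '#', or '@'.
found = re.findall(r'[#@][^\s#@]+', text)
print(found)
# ['#socality', '#thisismycommunity', '#themoderndayexplorer', '@some_user']
```

One caveat of this simple pattern: it would also pick up the part after the @ in something like an e-mail address, so extra filtering may be needed if the text contains those.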
findall returns a list of matching items, which you can immediately add to an existing list, or loop over the found items in turn:
for tag in re.findall(r'[#@][^\s#@]+', line):
    # process "tag" any way you want here
    print(tag)
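The source text file contains Windows-style \r\n line endings, and so I initially got a lot of empty "lines" on my Mac. Opening the text file with universal newlines makes sure that is handled transparently by the line-reading part of Python. (In Python 3 this is the default for files opened in text mode; the old 'U' mode flag is deprecated and was removed in Python 3.11.)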
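A minimal sketch of that behaviour on Python 3, assuming the default newline handling of open(). The temporary file and its contents are invented purely for the demonstration.

```python
import os
import tempfile

# Write a small file with Windows-style \r\n line endings (made-up content).
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'#one#two\r\n@three\r\n')
    path = tmp.name

# Default text mode (newline=None) translates \r\n to \n while reading.
with open(path) as f:
    lines = list(f)

os.remove(path)
print(lines)
# ['#one#two\n', '@three\n']
```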