I am having a heck of a time taking the text of a tweet that includes hashtags and pulling each hashtag into an array using Python. I am embarrassed to even post what I have tried so far. For example, given "I love #stackoverflow because #people are very #helpful!", this should pull the 3 hashtags into an array.

Parsing a tweet to extract hashtags into an array

6 Answers

A simple regex should do the job:

>>> import re
>>> s = "I love #stackoverflow because #people are very #helpful!"
>>> re.findall(r"#(\w+)", s)
['stackoverflow', 'people', 'helpful']

Note, though, that as other answers point out, this may also match non-hashtags, such as the fragment identifier in a URL:

>>> re.findall(r"#(\w+)", "http://example.org/#comments")
['comments']

Another simple solution is to split on whitespace and keep only the words that start with # (this removes duplicates as a bonus):

>>> def extract_hash_tags(s):
...    return set(part[1:] for part in s.split() if part.startswith('#'))
...
>>> extract_hash_tags("#test http://example.org/#comments #test")
set(['test'])
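
The two approaches above can also be combined. The following is only a minimal sketch, assuming Python 3: a negative lookbehind requires the # to sit at the start of the text or right after whitespace, so URL fragments are skipped, while \w+ still stops at trailing punctuation. The names HASHTAG_RE and extract_hashtags are illustrative, not part of the original answer.

```python
import re

# '#' must be at the start of the text or preceded by whitespace,
# which rules out URL fragments such as "http://example.org/#comments".
HASHTAG_RE = re.compile(r"(?<!\S)#(\w+)")


def extract_hashtags(text):
    """Return the hashtags in `text`, without the leading '#', in order of appearance."""
    return HASHTAG_RE.findall(text)


tweet = "I love #stackoverflow because #people are very #helpful! http://example.org/#comments"
print(extract_hashtags(tweet))
```

Wrap the result in set() if duplicates should be dropped, as in the split-based version.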


answered Nov 03 '22 03:11

AndiDog

>>> s="I love #stackoverflow because #people are very #helpful!"
>>> [i for i in s.split() if i.startswith("#")]
['#stackoverflow', '#people', '#helpful!']
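
Note that the last element above keeps its trailing "!". One way to deal with that, sketched here under the assumption of Python 3, is to strip trailing ASCII punctuation with str.rstrip. The helper name hashtags_from_words is hypothetical, and because string.punctuation includes "_", a tag that ends in an underscore would lose it.

```python
import string


def hashtags_from_words(text):
    """Collect whitespace-separated words starting with '#', minus trailing ASCII punctuation."""
    tags = []
    for word in text.split():
        if word.startswith("#"):
            tag = word.rstrip(string.punctuation)
            if tag:  # a bare "#" strips down to an empty string; skip it
                tags.append(tag)
    return tags


s = "I love #stackoverflow because #people are very #helpful!"
print(hashtags_from_words(s))
```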

answered Nov 03 '22 02:11

ghostdog74

The best Twitter hashtag regular expression (it requires at least one letter, so a bare # and the all-digit #123 are skipped):

>>> import re
>>> text = "#promovolt #1st # promovolt #123"
>>> re.findall(r'\B#\w*[a-zA-Z]+\w*', text)
['#promovolt', '#1st']
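
To make the pieces of that pattern easier to read, here is the same expression written with re.VERBOSE and one comment per part. This is just an annotated sketch (Python 3 assumed); the constant name HASHTAG is made up for the example.

```python
import re

HASHTAG = re.compile(
    r"""
    \B          # the hash must not directly follow a word character
    \#          # literal hash sign (escaped, since a bare one starts a comment here)
    \w*         # optional leading word characters (letters, digits, underscore)
    [a-zA-Z]+   # at least one ASCII letter, so purely numeric tags are rejected
    \w*         # optional trailing word characters
    """,
    re.VERBOSE,
)

text = "#promovolt #1st # promovolt #123"
print(HASHTAG.findall(text))
```

Because "/" is not a word character, this pattern still matches the fragment in "http://example.org/#comments", so links may need to be filtered out separately.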


answered Nov 03 '22 03:11

korniichuk

Suppose you have to retrieve your #Hashtags from a sentence full of punctuation. Say #stackoverflow, #people and #helpful are each followed by different symbols; you want to retrieve them from text while avoiding repetitions:

>>> text = "I love #stackoverflow, because #people... are very #helpful! Are they really #helpful??? Yes #people in #stackoverflow are really really #helpful!!!"

If you try set([i for i in text.split() if i.startswith("#")]) alone, you get:

set(['#helpful???',
 '#people',
 '#stackoverflow,',
 '#stackoverflow',
 '#helpful!!!',
 '#helpful!',
 '#people...'])

which in my mind is redundant. A better solution uses the re module to strip the trailing punctuation from each word:

>>> import re
>>> set([re.sub(r"(\W+)$", "", j) for j in set([i for i in text.split() if i.startswith("#")])])
set(['#people', '#helpful', '#stackoverflow'])

Now it's OK for me.
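
The nested set([...]) calls can be flattened into a single set comprehension. The following is a sketch of the same idea, assuming Python 3; the function name unique_hashtags is only illustrative.

```python
import re


def unique_hashtags(text):
    """Return the distinct '#'-prefixed words in `text`, with trailing punctuation removed."""
    return {
        re.sub(r"\W+$", "", word)  # drop any run of non-word characters at the end
        for word in text.split()
        if word.startswith("#")
    }


text = ("I love #stackoverflow, because #people... are very #helpful! "
        "Are they really #helpful??? Yes #people in #stackoverflow "
        "are really really #helpful!!!")
print(sorted(unique_hashtags(text)))
```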

EDIT: UNICODE #Hashtags

Add the re.UNICODE flag if you want to strip punctuation while still preserving accented letters and other non-ASCII word characters, which matters if the #Hashtags are not expected to be only in English... maybe this is only an Italian guy's nightmare, maybe not! ;-) (On Python 3, str patterns are Unicode-aware by default, so the flag is only needed on Python 2.)

For example:

>>> text = u"I love #stackoverflòw, because #peoplè... are very #helpfùl! Are they really #helpfùl??? Yes #peoplè in #stackoverflòw are really really #helpfùl!!!"

is stored as the following unicode string:

u'I love #stackoverfl\xf2w, because #peopl\xe8... are very #helpf\xf9l! Are they really #helpf\xf9l??? Yes #peopl\xe8 in #stackoverfl\xf2w are really really #helpf\xf9l!!!'

and you can retrieve your (correctly encoded) #Hashtags in this way:

>>> set([re.sub(r"(\W+)$", "", j, flags=re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])
set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
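
To see what Unicode-aware matching changes, here is a small comparison, assuming Python 3, where str patterns are Unicode-aware by default and re.ASCII switches that off. The variable names are illustrative.

```python
import re

text = "I love #stackoverflòw, because #peoplè... are very #helpfùl!"

# Default on Python 3: \w matches accented letters as well.
unicode_tags = re.findall(r"#\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so each tag is cut at the accent.
ascii_tags = re.findall(r"#\w+", text, flags=re.ASCII)

print(unicode_tags)
print(ascii_tags)
```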

EDITx2: UNICODE #Hashtags and control for # repetitions

If you want to control for multiple repetitions of the # symbol, as in (forgive me if the text example has become almost unreadable):

>>> text = u"I love ###stackoverflòw, because ##################peoplè... are very ####helpfùl! Are they really ##helpfùl??? Yes ###peoplè in ######stackoverflòw are really really ######helpfùl!!!"
u'I love ###stackoverfl\xf2w, because ##################peopl\xe8... are very ####helpf\xf9l! Are they really ##helpf\xf9l??? Yes ###peopl\xe8 in ######stackoverfl\xf2w are really really ######helpf\xf9l!!!'

then you should collapse each run of # into a single #. One possible solution is to wrap the previous expression in another set() and use re.sub() to replace every run of one or more # with a single #:

>>> set([re.sub(r"#+", "#", k) for k in set([re.sub(r"(\W+)$", "", j, flags=re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])])
set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
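
The same normalization can be sketched with a single pattern instead of nested substitutions. This is an alternative under the assumption of Python 3, not the method used above: the regex consumes the whole run of # at the start of a word and captures only the tag text. The names REPEATED_HASH_RE and normalized_hashtags are hypothetical.

```python
import re

# One or more '#' at the start of a whitespace-separated word, then the tag itself.
REPEATED_HASH_RE = re.compile(r"(?<!\S)#+(\w+)")


def normalized_hashtags(text):
    """Return distinct hashtags, each rewritten with exactly one leading '#'."""
    return {"#" + tag for tag in REPEATED_HASH_RE.findall(text)}


text = ("I love ###stackoverflòw, because ##################peoplè... "
        "are very ####helpfùl! Are they really ##helpfùl???")
print(sorted(normalized_hashtags(text)))
```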

answered Nov 03 '22 01:11

Gabriele Pompa

AndiDog's answer will trip up on links and other stuff, so you may want to filter them out first. After that, use this code:

import re

# u'' (rather than Python 2's ur'') keeps these literals valid on Python 3 as well
UTF_CHARS = u'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
TAG_EXP = u'(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)

It may seem like overkill, but this was converted from http://github.com/mzsanford/twitter-text-java. It will handle about 99% of all hashtags the same way Twitter handles them.
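
As a rough usage sketch (Python 3 assumed), the tag text is the third capture group of each match; the first group holds whatever precedes the hash and the second the hash sign itself. The helper find_tags is hypothetical and not part of twitter-text.

```python
import re

# Same pattern as above, written with plain str literals (Python 3 assumed).
UTF_CHARS = 'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
TAG_EXP = '(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)


def find_tags(text):
    """Return the tag text (group 3) of every match, without the hash sign."""
    return [match.group(3) for match in TAG_REGEX.finditer(text)]


tweet = "I love #stackoverflow because #people are very #helpful! http://example.org/#comments"
print(find_tags(tweet))
```

Because "/" is excluded from the characters allowed before the hash, the URL fragment in this sample is not reported as a tag.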

For more converted Twitter regexes, check out: http://github.com/BonsaiDen/Atarashii/blob/master/atarashii/usr/share/pyshared/atarashii/formatter.py

EDIT:
Check out: http://github.com/BonsaiDen/AtarashiiFormat

answered Nov 03 '22 03:11

Ivo Wetzel

A simple gist (better than the chosen answer) that also works with Unicode hashtags: https://gist.github.com/mahmoud/237eb20108b5805aed5f

answered Nov 03 '22 01:11

Victor Gavro

Tags: python, arrays

Scott
