Better way to use re.sub

I'm cleaning a series of sources from a Twitter stream. Here is an example of the data:

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']


import re
for i in source:
    re.sub('<.*?>', '', re.sub(r'(<.*?>)(Twitter for)(\s+)', r'', i))

### This would be the expected output ###
'Android Tablets'
'Android'
'foursquare'
'web'
'iPhone'
'BlackBerry'

The latter is the code I have. It does the job but looks awful. I was hoping there is a better way of doing this, using re.sub() or some other function that would be more appropriate.

marbel asked Dec 25 '22 09:12


1 Answer

Another alternative is to use the BeautifulSoup HTML parser:

>>> from bs4 import BeautifulSoup
>>> for link in source:
...     print(BeautifulSoup(link, 'html.parser').text.replace('Twitter for', '').strip())
... 
Android Tablets
Android
foursquare
web
iPhone
BlackBerry
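
If you would rather stay with re.sub(), the two nested calls from the question can probably be folded into one pattern that uses alternation. This is only a rough sketch: the names TAG_OR_PREFIX and clean_source are made up for illustration, and it assumes the markup is as simple as in the sample data (for arbitrary HTML a parser is the safer choice). Like the replace() call above, it removes "Twitter for" wherever it appears in the text, not only at the start.

```python
import re

# One pattern with two alternatives: any HTML tag, or the literal
# prefix "Twitter for" followed by one or more whitespace characters.
# (TAG_OR_PREFIX and clean_source are hypothetical names.)
TAG_OR_PREFIX = re.compile(r'<[^>]*>|Twitter for\s+')


def clean_source(raw):
    """Strip tags and the 'Twitter for' prefix, then trim whitespace."""
    return TAG_OR_PREFIX.sub('', raw).strip()


source = [
    '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>',
    '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
    '<a href="http://foursquare.com" rel="nofollow">foursquare</a>',
    'web',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>',
]

cleaned = [clean_source(s) for s in source]
for name in cleaned:
    print(name)
```

Compiling the pattern once keeps the loop body short, and plain strings such as 'web' pass through unchanged because neither alternative matches them.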
alecxe answered Jan 04 '23 23:01