I'm cleaning a series of sources from a Twitter stream. Here is an example of the data:
source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']
import re
for i in source:
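    # Inner sub drops the opening tag plus the "Twitter for" prefix; outer sub drops any remaining tags
    print(re.sub('<.*?>', '', re.sub(r'(<.*?>)(Twitter for)(\s+)', r'', i)))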
### This would be the expected output ###
'Android Tablets'
'Android'
'foursquare'
'web'
'iPhone'
'BlackBerry'
The code above does the job but looks awful. I was hoping there is a better way of doing this, using re.sub() or another function that would be more appropriate.
Just another alternative, using BeautifulSoup's HTML parser:
>>> from bs4 import BeautifulSoup
>>> for link in source:
...     print(BeautifulSoup(link, 'html.parser').text.replace('Twitter for', '').strip())
... 
Android Tablets
Android
foursquare
web
iPhone
BlackBerry
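If you would rather stay with re.sub(), the two nested calls can probably be collapsed into one pattern that matches a tag and, optionally, a "Twitter for" prefix right after it. This is only a minimal sketch: the helper name clean_source is made up for illustration, and it assumes the tags never contain a literal ">" inside an attribute value, which holds for the sample data but not for arbitrary HTML.

```python
import re

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>',
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']

# A tag, optionally followed by the "Twitter for" prefix and its trailing whitespace.
# Compiled once because it is reused for every item.
TAG_AND_PREFIX = re.compile(r'<[^>]*>(?:Twitter for\s+)?')

def clean_source(raw):
    # Hypothetical helper: remove every tag (and the prefix, when present) in one pass.
    return TAG_AND_PREFIX.sub('', raw).strip()

for item in source:
    print(clean_source(item))
```

Plain strings such as 'web' contain no tag, so they pass through unchanged. For anything messier than this sample, the HTML parser approach is the safer choice.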