I have never had a very hard time with regular expressions up until now. I am hoping the solution is not obvious because I have probably spent a few hours on this problem.
This is my string:
<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'
I want to extract 'Soko', and 'Jacob Escobedo' as individual strings. If I takes two different patterns for the extractions that is okay with me.
I have tried "\s([A-Za-z0-9]{1}.+?)," and other alterations of that regex to get the data I want but I have had no success. Any help is appreciated.
The names never follow the same tag or the same symbol. The only thing that consistently precedes the names is a space (\s).
Here is another string as an example:
<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>
An alternative approach would be to parse the string with an HTML parser, like lxml.
For example, you can use the xpath to find everything between a b tag with Carson Daly text and br tag by checking preceding and following siblings:
from lxml.html import fromstring
l = [
"""<b>Carson Daly</b>: <a href="http://rads.stackoverflow.com/amzn/click/B009DA74O8">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
"""<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]
for html in l:
tree = fromstring(html)
results = ''
for element in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
if not isinstance(element, str):
results += element.text.strip()
else:
text = element.strip(':')
if text:
results += text.strip()
print results.split(', ')
It prints:
['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With