 

Regular Expressions: Find Names in String using Python

I have never had a very hard time with regular expressions up until now. I am hoping the solution is not obvious because I have probably spent a few hours on this problem.

This is my string:

<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'

I want to extract 'Soko' and 'Jacob Escobedo' as individual strings. If it takes two different patterns for the extractions, that is okay with me.

I have tried "\s([A-Za-z0-9]{1}.+?)," and other variations of that regex to get the data I want, but I have had no success. Any help is appreciated.

The names never follow the same tag or the same symbol. The only thing that consistently precedes the names is a space (\s).

Here is another string as an example:

<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>
Jake DeVries asked Apr 17 '26 16:04



1 Answer

An alternative approach would be to parse the string with an HTML parser, like lxml.

For example, you can use an XPath expression to select every node between the b tag containing the text Carson Daly and the br tag, by checking each node's preceding and following siblings:

from lxml.html import fromstring

l = [
    """<b>Carson Daly</b>: <a href="http://rads.stackoverflow.com/amzn/click/B009DA74O8">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]

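for html in l:
    tree = fromstring(html)
    results = ''
    # Select every node (element or text) after the <b>Carson Daly</b> tag and before a <br> tag.
    for element in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
        if not isinstance(element, str):
            # An element such as <a>: take its text.
            results += element.text.strip()
        else:
            # A text node: drop the colon that follows </b>.
            text = element.strip(':')
            if text:
                results += text.strip()

    # print() is a function in Python 3.
    print(results.split(', '))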

It prints:

['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
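
If lxml is not available, roughly the same result can be had with the standard library's re module. The following is only a minimal sketch, assuming the names always sit between `</b>:` and `<br>` as in the two sample strings. The helper name `extract_names` is hypothetical. Unlike the output above, it also drops a trailing parenthetical such as `(R 2/28/14)`, which gives 'Jacob Escobedo' on its own as the question asks.

```python
import re


def extract_names(html):
    """Return the comma-separated names found between '</b>:' and '<br>'.

    Hypothetical helper: assumes the layout of the two sample strings.
    """
    # Capture everything between the closing </b>: and the next <br>.
    match = re.search(r'</b>:(.*?)<br>', html, re.S)
    if match is None:
        return []
    # Remove any remaining tags (such as <a ...>) but keep their text.
    text = re.sub(r'<[^>]+>', '', match.group(1))
    # Drop a trailing parenthetical note, if there is one.
    text = re.sub(r'\s*\([^)]*\)\s*$', '', text)
    # Split on commas and discard surrounding whitespace and empty pieces.
    return [name.strip() for name in text.split(',') if name.strip()]
```

Calling `extract_names` on each sample string returns a list of names, so 'Soko' and 'Jacob Escobedo' can be picked out by index or by filtering. Stripping tags with a regex is fragile on arbitrary HTML, so the parser-based approach above is the safer choice when the markup varies.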
alecxe answered Apr 19 '26 06:04



