
How to extract parent html tag in Python by matching the string

I need to extract the parent tag of a matched string in HTML. I have many raw HTML sources, and each one contains the text "VIN:*", that is, "VIN:" followed by some characters. This text sits inside a different tag in each source, such as <ul> or <div>.

I then need to extract all the values that appear alongside that "VIN:*" string, which means I need to get its parent tag.

For example,

<div class="class1">

                            Stock Number:
                            Z2079
                            <br>
                            **VIN:
                            2T2HK31UX9C110701**
                            <br>
                            Model Code:
                            9424
                            <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>

Here is the "VIN" for one HTML source. The other HTML sources contain a VIN as well, but in different formats.

These values have to be extracted in Python.

Is there an effective way to extract the parent tag by matching the string in Python?

Nava asked Nov 23 '25 14:11


2 Answers

I would strongly recommend going with BeautifulSoup on this; it provides some incredibly convenient functionality for parsing HTML. Here, for example, is how I would go about finding every text node that contains "VIN", regardless of case:

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_here, 'html.parser')
vins = soup.find_all(string=lambda s: 'vin' in s.lower())

From there, you simply walk through that collection, grab each node's parent, grab said parent's contents, and parse them as you see fit:

for v in vins:
    parent_html = v.parent.contents
    # more code here
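
If installing a third-party package is not an option, the same "find the text node, then look at its parent" idea can be sketched with the standard library's html.parser instead of BeautifulSoup. This is only a minimal sketch under some assumptions: the class name VinParentFinder, the VOID_TAGS set and the VIN_RE pattern are made up for illustration, and the VIN is assumed to be the first whitespace-delimited token after "VIN:".

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Assumption: the VIN is the first run of non-space, non-asterisk characters.
VIN_RE = re.compile(r"VIN:\s*([^\s*]+)")


class VinParentFinder(HTMLParser):
    """Record the enclosing tag of every text node that contains 'VIN:'."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open (tag, attrs) pairs
        self.matches = []  # (parent tag, parent attrs, VIN) triples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        found = VIN_RE.search(data)
        if found and self.stack:
            tag, attrs = self.stack[-1]
            self.matches.append((tag, attrs, found.group(1)))


html = '''<div class="class1">
    Stock Number:
    Z2079
    <br>
    **VIN:
    2T2HK31UX9C110701**
    <br>
    Model Code:
    9424
    <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>'''

finder = VinParentFinder()
finder.feed(html)
finder.close()
print(finder.matches)
# [('div', {'class': 'class1'}, '2T2HK31UX9C110701')]
```

Because the stack always holds the currently open elements, the last entry is the closest enclosing tag, so nested markup reports the inner element rather than the outer one.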
ranksrejoined answered Nov 26 '25 02:11


For such a simple task, which consists of ANALYZING the string rather than PARSING it (parsing = building a tree representation of the text), you can do the following.

The text:

ss = '''
Humpty Dumpty sat on a wall
<div class="class1">
    Stock Number:
    Z2079
    <br>
        **VIN:
        2T2HK31UX9C110701**
    <br>
    Model Code:
    9424
    <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>

Humpty Dumpty had a great fall
<ul cat="zoo">
    Stock Number:
    ARDEN3125
    <br>
        **VIN:
        SHAKAMOSK-230478-UBUN**
    </br>
    Model Code:
    101
    <img class="imgcert" src="/images/Magana_cpo.jpg">
</ul>

All the king's horses and all the king's men
<artifice>
    <baradino>
        Stock Number:
        DERT5178
        <br>
            **VIN:
            Pandaia-67-Moro**
        <br>
        Model Code:
        1234
        <img class="imgcert" src="/images/Pertuis_cpo.jpg">
    </baradino>
    what what what who what
    <somerset who="maugham">
        Nothing to declare
    </somerset>
</artifice>

Couldn't put Humpty Dumpty again
<ending rtf="simi">
    Stock Number:
    ZZZ789
    <br>
        **VIN:
        0000012554-ENDENDEND**
    <br>
    Model Code:
    QS78-9
    <img class="imgcert" src="/images/Sunny_cpo.jpg">
</ending>

qsdjgqsjkdhfqjkdhgfjkqshgdfkjqsdjfkh''' 

The code:

import re

# Raw strings keep the backslashes intact for the regex engine.
regx = re.compile(r'<([^ >]+) ?([^>]*)>'
                  r'(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\1>)'
                  r'.*?\*\*VIN:(.+?)\*\*.+?</\1>',
                  re.DOTALL)

# Each match yields (tag name, tag attributes, VIN value).
li = [ (mat.group(1),mat.group(2),mat.group(3).strip(' \n\r\t'))
       for mat in regx.finditer(ss) ]

for el in li:
    print('(%-15r, %-25r, %-25r)' % el)

The result:

('div'          , 'class="class1"'         , '2T2HK31UX9C110701'      )
('ul'           , 'cat="zoo"'              , 'SHAKAMOSK-230478-UBUN'  )
('baradino'     , ''                       , 'Pandaia-67-Moro'        )
('ending'       , 'rtf="simi"'             , '0000012554-ENDENDEND'   )

re.DOTALL is needed so that the dot also matches newlines (by default, a dot in a regular expression pattern matches any character except a newline).
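
To see the effect of re.DOTALL in isolation, here is a small sketch on a made-up two-line string (the variable names are only illustrative):

```python
import re

text = "VIN:\n2T2HK31UX9C110701"

# Without re.DOTALL the dot cannot cross the newline right after "VIN:".
default = re.search(r"VIN:(.+)", text)
# With re.DOTALL the dot matches the newline too.
dotall = re.search(r"VIN:(.+)", text, re.DOTALL)

print(default)                # None
print(repr(dotall.group(1)))  # '\n2T2HK31UX9C110701'
```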

\1 (written '\\1' in an ordinary string literal, or r'\1' in a raw one) is a backreference: at that position the examined string must contain the same text that was captured by the first group, that is to say by the part ([^ >]+).
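
A backreference on its own can be tried out like this; the pattern and sample strings below are simplified illustrations, not the full regex from this answer:

```python
import re

# \1 must repeat whatever the first group captured (here, the tag name).
pattern = re.compile(r"<([^ >]+)>(.*?)</\1>", re.DOTALL)

m = pattern.search("<ul><li>one</li></ul>")
print(m.group(1), m.group(2))  # ul <li>one</li>

# The closing tag does not repeat the opening name, so nothing matches.
mismatch = pattern.search("<ul>one</li>")
print(mismatch)  # None
```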

The negative lookahead (?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\1>) says that, between the opening tag and the closing tag of an HTML element, no tag other than <br> may appear before the first <br> encountered.
This part is necessary to catch the closest tag preceding VIN, apart from <br>.
Without this part, the regex

regx = re.compile(r'<([^ >]+) ?([^>]*)>'
                  r'.*?\*\*VIN:(.+?)\*\*.+?</\1>',
                  re.DOTALL)

produces the following result:

('div'          , 'class="class1"'         , '2T2HK31UX9C110701'      )
('ul'           , 'cat="zoo"'              , 'SHAKAMOSK-230478-UBUN'  )
('artifice'     , ''                       , 'Pandaia-67-Moro'        )
('ending'       , 'rtf="simi"'             , '0000012554-ENDENDEND'   )

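The difference is 'artifice' instead of 'baradino': without the lookahead, the match starts at the outer tag rather than at the closest enclosing one.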
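
The inner piece (?!br>) is itself a small negative lookahead, and its behaviour is easier to see in isolation. The fragment below is a made-up sample, used only to show that <br> tags are skipped while other tag names are captured:

```python
import re

fragment = '<div class="class1">Stock<br>VIN<br><img src="x.jpg">'

# "<" followed by anything except "br>", then capture the tag name.
tags = re.findall(r"<(?!br>)([^ >]+)", fragment)
print(tags)  # ['div', 'img']
```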

eyquem answered Nov 26 '25 04:11


