
Regarding regular expressions and XML

Tags:

python

regex

xml

I have data in XML format; an example is shown below. I want to extract data from the <text> tag. Here is my XML data.

    <text>
    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

    <h1>The plot</h1>
    Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
    <h1>Cast</h1>

    <h1>Soundtrack</h1>

    <h1>External Links</h1>
    </text>

I need only the first paragraph: "The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex." Is it possible? Thanks.

no_freedom asked Dec 06 '25 10:12


2 Answers

Use an XML parser to parse XML. Using lxml:

import lxml.etree as ET

content='''\
<text>
    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

    <h1>The plot</h1>
    Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
    <h1>Cast</h1>

    <h1>Soundtrack</h1>

    <h1>External Links</h1>
</text>
'''

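# Parse the string; fromstring returns the root <text> element.
text = ET.fromstring(content)
# .text holds only the text that precedes the first child element (<h1>).
print(text.text)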

yields

    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.
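
The value of `.text` keeps the original newlines and indentation, so the result still spans two lines. A minimal sketch of collapsing it into a single sentence follows. It assumes the standard-library `xml.etree.ElementTree` as a stand-in for `lxml.etree` (both expose the same `.text` attribute), uses a shortened copy of the sample data, and the helper name `leading_text` is hypothetical.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

content = """\
<text>
    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

    <h1>The plot</h1>
    Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
</text>
"""


def leading_text(xml_string):
    """Return the text before the first child element, whitespace-collapsed.

    Hypothetical helper, not part of lxml or ElementTree.
    """
    root = ET.fromstring(xml_string)
    # root.text is None when the element starts with a child, hence the fallback.
    return " ".join((root.text or "").split())


print(leading_text(content))
```

Splitting on whitespace and rejoining with single spaces removes the line break and the indentation in one step, without touching the words themselves.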

unutbu answered Dec 08 '25 01:12


Don't use regular expressions to parse XML/HTML. Use a proper parser such as BeautifulSoup or lxml in Python.
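
To illustrate what a parser gives you that a pattern match does not, here is a rough sketch of how a tree-based parser splits mixed content. It assumes the standard-library `xml.etree.ElementTree` rather than BeautifulSoup or lxml, so that it runs without extra installs, and the one-line sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample in the same shape as the question's data.
content = "<text>Intro paragraph.<h1>The plot</h1>Plot summary.<h1>Cast</h1></text>"

root = ET.fromstring(content)

# Text before the first child element is stored on the parent.
intro = root.text

# Each child carries its own text plus the "tail" that follows its closing tag.
sections = [(child.tag, child.text, child.tail) for child in root]

print(intro)
print(sections)
```

Because the parser tracks element boundaries, the introduction, the headings, and the text after each heading arrive as separate values, with no need to guess where one ends and the next begins.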

Gabriel Ross answered Dec 08 '25 02:12