I have data in XML format, as shown in the example below. I want to extract the text from the <text> tag.
Here is my XML data:
<text>
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
<h1>The plot</h1>
Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
<h1>Cast</h1>
<h1>Soundtrack</h1>
<h1>External Links</h1>
</text>
I need only the opening text, before the first <h1>: "The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex." Is this possible? Thanks.
Use an XML parser to parse XML. Using lxml:
import lxml.etree as ET
content='''\
<text>
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
<h1>The plot</h1>
Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
<h1>Cast</h1>
<h1>Soundtrack</h1>
<h1>External Links</h1>
</text>
'''
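# Parse the string; the returned element is the root <text> tag.
text = ET.fromstring(content)
# .text holds only the text before the first child element (<h1>).
print(text.text)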
yields
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
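The reason this works is that an element's .text attribute covers only the text before its first child, while text that follows a child element is stored on that child's .tail. Below is a minimal sketch of that distinction. It swaps in the standard library's xml.etree.ElementTree, which exposes the same .text and .tail attributes as lxml for this case, and it joins the two lines into one, on the assumption that the single-line form in the question is what is wanted. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

content = """\
<text>
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
<h1>The plot</h1>
Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
<h1>Cast</h1>
<h1>Soundtrack</h1>
<h1>External Links</h1>
</text>
"""

root = ET.fromstring(content)

# .text is the text before the first child; split/join collapses
# the line breaks and surrounding whitespace into single spaces.
intro = " ".join(root.text.split())

# Text that follows a child element lives on that child's .tail.
first_h1 = root.find("h1")
after_plot = first_h1.tail.strip()

print(intro)
print(after_plot)
```

If the text after a later heading is needed instead, iterating over root.findall("h1") and reading each element's .tail is one way to reach it.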
Don't use regular expressions to parse XML/HTML. Use a proper parser such as BeautifulSoup or lxml in Python.