How can I get all the text content of an XML document, as a single string - like this Ruby/hpricot example but using Python.
I'd like to replace XML tags with a single whitespace.
You can use the XML parser to transform a string of XML text in UTF-8 encoding into an XML object representation of the string. The XML parser generates an output schema for the XML object, displayed in the flow editor as a tree structure that shows each element with its data type, as well as any attributes..
The page uses the XMLHttpRequest (JavaScript) object to fetch the XML file (sample. xml) then parses it in JavaScript and creates the chart. The function that parses the XML response and then uses the data to create the chart is shown below and called myXMLProcessor() (it's the XMLHttpRequest callback function).
Using stdlib xml.etree
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
print(ET.tostring(tree.getroot(), encoding='utf-8', method='text'))
I really like BeautifulSoup, and would rather not use regex on HTML if we can avoid it.
Adapted from: [this StackOverflow Answer], [BeautifulSoup documentation]
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt) # txt is simply the a string with your XML file
pageText = soup.findAll(text=True)
print ' '.join(pageText)
Though of course, you can (and should) use BeautifulSoup to navigate the page for what you are looking for.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With