Let's assume that I have an XML like this:
<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="N">
        <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]>
        </document>
        <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]>
        </document>
        <document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]>
        </document>
        <document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]>
        </document>
    </documents>
</author>

I would like to read this XML file and convert it to a pandas DataFrame:
key                                 type    language    feature    web                        data
e95a9a6c790ecb95e46cf15bee517651    XXX     EN          xx         www.foo_bar_exmaple.com    A large text with lots of strings and punctuations symbols [...]
bc360cfbafc39970587547215162f0db    XXX     EN          xx         www.foo_bar_exmaple.com    A large text with lots of strings and punctuations symbols [...]
19e71144c50a8b9160b3f0955e906fce    XXX     EN          xx         www.foo_bar_exmaple.com    A large text with lots of strings and punctuations symbols [...]
21d4af9021a174f61b884606c74d9e42    XXX     EN          xx         www.foo_bar_exmaple.com    A large text with lots of strings and punctuations symbols [...]

This is what I already tried, but I am getting some errors, and there is probably a more efficient way of doing this task:
from lxml import objectify
import pandas as pd

path = 'file_path'
xml = objectify.parse(open(path))
root = xml.getroot()
root.getchildren()[0].getchildren()
df = pd.DataFrame(columns=('key', 'type', 'language', 'feature', 'web', 'data'))

for i in range(0, len(xml)):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['key', 'type', 'language', 'feature', 'web', 'data'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

Could anybody provide me with a better approach for this problem?
The Pandas data analysis library provides functions to read and write data for most common file types; for example, it includes read_csv() and to_csv() for working with CSV files. However, older versions of Pandas do not include any built-in method to read and write XML files (pandas.read_xml() was only added in version 1.3).
As an alternative, the BeautifulSoup library (backed by an XML-capable parser such as lxml) can be used to parse XML files directly, as shown in the sketch below.
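Here is a minimal sketch of the BeautifulSoup route, assuming the XML from the question is available in a string variable called xml_data (the variable name and the choice of the lxml-backed "xml" parser are mine, not something the question prescribes):

from bs4 import BeautifulSoup
import pandas as pd

# hypothetical variable holding the XML string shown in the question
xml_data = '''YOUR XML STRING HERE'''

# the 'xml' parser requires lxml to be installed
soup = BeautifulSoup(xml_data, 'xml')

rows = []
author = soup.find('author')
for doc in author.find_all('document'):
    rows.append({
        'key': doc.get('KEY'),                  # document-level attribute
        'type': author.get('type'),             # author-level attributes
        'language': author.get('language'),
        'feature': author.get('feature'),
        'web': doc.get('web'),
        'data': doc.get_text(strip=True),       # the CDATA text of the document
    })

df = pd.DataFrame(rows)

Note that the XML parser preserves the case of attribute names (KEY stays uppercase), unlike BeautifulSoup's HTML parsers.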
You can easily use xml (from the Python standard library) to convert to a pandas.DataFrame.  Here's what I would do (when reading from a file replace xml_data with the name of your file or file object):
import pandas as pd
import xml.etree.ElementTree as ET
import io

def iter_docs(author):
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()   # start from the author-level attributes
        doc_dict.update(doc.attrib)     # add the document-level attributes
        doc_dict['data'] = doc.text     # the CDATA text of the document
        yield doc_dict

xml_data = io.StringIO(u'''YOUR XML STRING HERE''')

etree = ET.parse(xml_data)  # create an ElementTree object

doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))

If there are multiple authors in your original document, or the root of your XML is not an author, then I would add the following generator:
def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row

and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree))).
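For example, reading directly from a file instead of an io.StringIO object would look something like this (the file name authors.xml is just a placeholder for illustration):

import pandas as pd
import xml.etree.ElementTree as ET

# 'authors.xml' is a hypothetical file containing the XML shown above
etree = ET.parse('authors.xml')

# uses the iter_author and iter_docs generators defined above
doc_df = pd.DataFrame(list(iter_author(etree)))
print(doc_df.head())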
Have a look at the ElementTree tutorial provided in the xml library documentation.