Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to convert an XML file to nice pandas dataframe?

Let's assume that I have an XML like this:

<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">     <documents count="N">         <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]>         </document>         <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]>         </document>         <document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]>         </document>         <document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]>         </document>     </documents> </author> 

I would like to read this XML file and convert it to a pandas DataFrame:

key                                         type     language    feature            web                         data e95324a9a6c790ecb95e46cf15bE232ee517651      XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...] bc360cfbafc39970587547215162f0db             XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...] 19e71144c50a8b9160b3cvdf2324f0955e906fce     XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...] 21d4af9021a174f61b8erf284606c74d9e42         XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...] 

This is what I already tried, but I am getting some errors and probably there is a more efficient way of doing this task:

from lxml import objectify import pandas as pd  path = 'file_path' xml = objectify.parse(open(path)) root = xml.getroot() root.getchildren()[0].getchildren() df = pd.DataFrame(columns=('key','type', 'language', 'feature', 'web', 'data'))  for i in range(0,len(xml)):     obj = root.getchildren()[i].getchildren()     row = dict(zip(['key','type', 'language', 'feature', 'web', 'data'], [obj[0].text, obj[1].text]))     row_s = pd.Series(row)     row_s.name = i     df = df.append(row_s) 

Could anybody provide me a better aproach for this problem?

like image 752
eoriu Avatar asked Feb 01 '15 03:02

eoriu


People also ask

Can pandas read XML files?

The Pandas data analysis library provides functions to read/write data for most of the file types. For example, it includes read_csv() and to_csv() for interacting with CSV files. However, Pandas does not include any methods to read and write XML files.

Does beautiful soup work with XML?

Therefore, the BeautifulSoup class can also be used to parse XML files directly. The installation of BeautifulSoup has already been discussed at the end of the lesson on Setting up for Python programming.


1 Answers

You can easily use xml (from the Python standard library) to convert to a pandas.DataFrame. Here's what I would do (when reading from a file replace xml_data with the name of your file or file object):

import pandas as pd import xml.etree.ElementTree as ET import io  def iter_docs(author):     author_attr = author.attrib     for doc in author.iter('document'):         doc_dict = author_attr.copy()         doc_dict.update(doc.attrib)         doc_dict['data'] = doc.text         yield doc_dict  xml_data = io.StringIO(u'''YOUR XML STRING HERE''')  etree = ET.parse(xml_data) #create an ElementTree object  doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) 

If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:

def iter_author(etree):     for author in etree.iter('author'):         for row in iter_docs(author):             yield row 

and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))

Have a look at the ElementTree tutorial provided in the xml library documentation.

like image 91
JaminSore Avatar answered Sep 21 '22 23:09

JaminSore