Let's assume that I have an XML like this:
<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com"> <documents count="N"> <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document> <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document> <document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document> <document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document> </documents> </author>
I would like to read this XML file and convert it to a pandas DataFrame:
key type language feature web data e95324a9a6c790ecb95e46cf15bE232ee517651 XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...] bc360cfbafc39970587547215162f0db XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...] 19e71144c50a8b9160b3cvdf2324f0955e906fce XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...] 21d4af9021a174f61b8erf284606c74d9e42 XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
This is what I already tried, but I am getting some errors and probably there is a more efficient way of doing this task:
from lxml import objectify import pandas as pd path = 'file_path' xml = objectify.parse(open(path)) root = xml.getroot() root.getchildren()[0].getchildren() df = pd.DataFrame(columns=('key','type', 'language', 'feature', 'web', 'data')) for i in range(0,len(xml)): obj = root.getchildren()[i].getchildren() row = dict(zip(['key','type', 'language', 'feature', 'web', 'data'], [obj[0].text, obj[1].text])) row_s = pd.Series(row) row_s.name = i df = df.append(row_s)
Could anybody provide me a better aproach for this problem?
The Pandas data analysis library provides functions to read/write data for most of the file types. For example, it includes read_csv() and to_csv() for interacting with CSV files. However, Pandas does not include any methods to read and write XML files.
Therefore, the BeautifulSoup class can also be used to parse XML files directly. The installation of BeautifulSoup has already been discussed at the end of the lesson on Setting up for Python programming.
You can easily use xml
(from the Python standard library) to convert to a pandas.DataFrame
. Here's what I would do (when reading from a file replace xml_data
with the name of your file or file object):
import pandas as pd import xml.etree.ElementTree as ET import io def iter_docs(author): author_attr = author.attrib for doc in author.iter('document'): doc_dict = author_attr.copy() doc_dict.update(doc.attrib) doc_dict['data'] = doc.text yield doc_dict xml_data = io.StringIO(u'''YOUR XML STRING HERE''') etree = ET.parse(xml_data) #create an ElementTree object doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
If there are multiple authors in your original document or the root of your XML is not an author
, then I would add the following generator:
def iter_author(etree): for author in etree.iter('author'): for row in iter_docs(author): yield row
and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
to doc_df = pd.DataFrame(list(iter_author(etree)))
Have a look at the ElementTree
tutorial provided in the xml
library documentation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With