Can someone please help me convert the following XML file to a pandas DataFrame:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<bathrooms type="dict">
<n35237 type="number">1.0</n35237>
<n32238 type="number">3.0</n32238>
<n44699 type="number">nan</n44699>
</bathrooms>
<price type="dict">
<n35237 type="number">7020000.0</n35237>
<n32238 type="number">10000000.0</n32238>
<n44699 type="number">4128000.0</n44699>
</price>
<property_id type="dict">
<n35237 type="number">35237.0</n35237>
<n32238 type="number">32238.0</n32238>
<n44699 type="number">44699.0</n44699>
</property_id>
</root>
It should look like this:
OUTPUT
This is the code I have written:
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('real_state.xml')
root = tree.getroot()
dfcols = ['property_id', 'price', 'bathrooms']
df_xml = pd.DataFrame(columns=dfcols)
for node in root:
    property_id = node.attrib.get('property_id')
    price = node.attrib.get('price')
    bathrooms = node.attrib.get('bathrooms')
    df_xml = df_xml.append(
        pd.Series([property_id, price, bathrooms], index=dfcols),
        ignore_index=True)
print(df_xml)
I am getting None everywhere instead of the actual values. Can someone please tell me how this can be fixed? Thanks!
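A note on why the loop above yields None, as a minimal sketch rather than a full answer: `node.attrib` only holds the XML attributes of the element itself (here just `type="dict"`), while the numbers are stored as the text of the child elements. The snippet below inlines a trimmed copy of the XML from the question so it runs on its own; the names `XML` and `columns` are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question, inlined so the snippet is self-contained.
XML = """<root>
<bathrooms type="dict">
<n35237 type="number">1.0</n35237>
<n32238 type="number">3.0</n32238>
</bathrooms>
<price type="dict">
<n35237 type="number">7020000.0</n35237>
<n32238 type="number">10000000.0</n32238>
</price>
</root>"""

root = ET.fromstring(XML)

for node in root:
    # .attrib contains only the element's own attributes, so there is no 'price' key.
    print(node.tag, node.attrib, node.attrib.get('price'))
    # The values live in the text of the child elements.
    print([(child.tag, child.text) for child in node])

# One {child_tag: text} mapping per top-level element.
columns = {node.tag: {child.tag: child.text for child in node} for node in root}
```

A nested dict like `columns` can be passed to `pd.DataFrame(columns)`, which should use the outer keys as column names and the inner keys as the index. The values are still strings at this point.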
If the data is simple, like this, then you can do something like:
import pandas as pd
from lxml import objectify

xml = objectify.parse('Document1.xml')
root = xml.getroot()
# Each top-level element becomes a column: collect the text of its children.
bathrooms = [child.text for child in root['bathrooms'].getchildren()]
price = [child.text for child in root['price'].getchildren()]
property_id = [child.text for child in root['property_id'].getchildren()]
data = [bathrooms, price, property_id]
# Transpose so that each property is a row and each list is a column.
df = pd.DataFrame(data).T
df.columns = ['bathrooms', 'price', 'property_id']
  bathrooms       price property_id
0       1.0   7020000.0     35237.0
1       3.0  10000000.0     32238.0
2       nan   4128000.0     44699.0
If it is more complex, then a loop is better. You can do something like:
import pandas as pd
from lxml import objectify

xml = objectify.parse('Document1.xml')
root = xml.getroot()
data = []
# Build one list of child texts per top-level element.
for column in root.getchildren():
    data.append([child.text for child in column.getchildren()])
df = pd.DataFrame(data).T
df.columns = ['bathrooms', 'price', 'property_id']
I have had success using the parse function from the xmltodict package:
import pandas as pd
import xmltodict

# xmlData is the XML document as a string, e.g. read from a file.
xmlDict = xmltodict.parse(xmlData)
df = pd.DataFrame.from_dict(xmlDict)
What I like about this is that I can easily do some dictionary manipulation between parsing the XML and building my DataFrame. It also helps to explore the data as a dict if the structure is tricky.
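As a rough sketch of the kind of dictionary manipulation meant here: the dict below is hand-written and only assumed to match what xmltodict.parse returns for the question's XML (attributes under '@'-prefixed keys, element text under '#text'), so check the real structure on your own data first. The helper name `flatten` is hypothetical.

```python
# Hand-written stand-in for the parsed dict (an assumption about its shape):
# attributes appear under '@'-prefixed keys and element text under '#text'.
xml_dict = {
    'root': {
        'bathrooms': {
            '@type': 'dict',
            'n35237': {'@type': 'number', '#text': '1.0'},
            'n32238': {'@type': 'number', '#text': '3.0'},
            'n44699': {'@type': 'number', '#text': 'nan'},
        },
        'price': {
            '@type': 'dict',
            'n35237': {'@type': 'number', '#text': '7020000.0'},
            'n32238': {'@type': 'number', '#text': '10000000.0'},
            'n44699': {'@type': 'number', '#text': '4128000.0'},
        },
        'property_id': {
            '@type': 'dict',
            'n35237': {'@type': 'number', '#text': '35237.0'},
            'n32238': {'@type': 'number', '#text': '32238.0'},
            'n44699': {'@type': 'number', '#text': '44699.0'},
        },
    }
}


def flatten(parsed):
    """Return {column: {row_id: float}}, dropping the '@' attribute keys."""
    columns = {}
    for column, cells in parsed['root'].items():
        columns[column] = {
            row_id: float(cell['#text'])  # float('nan') is valid, so 'nan' converts too
            for row_id, cell in cells.items()
            if not row_id.startswith('@')
        }
    return columns


columns = flatten(xml_dict)
print(columns['price'])
```

Passing the flattened result to `pd.DataFrame(columns)` should then give one column per top-level element and one row per `n...` tag, with numeric values instead of nested dicts.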
Hello all, I found another really easy way to solve this problem. Reference: https://www.youtube.com/watch?v=WVrg5-cjr5k
import xml.etree.ElementTree as ET
import pandas as pd
import codecs

# Save your XML file as text.xml before running this.
with codecs.open('text.xml', 'r', encoding='utf8') as f:
    tt = f.read()

def xml2df(xml_data):
    root = ET.XML(xml_data)
    all_records = []
    # One record per top-level element, keyed by the tags of its children.
    for child in root:
        record = {}
        for sub_child in child:
            record[sub_child.tag] = sub_child.text
        all_records.append(record)
    return pd.DataFrame(all_records)

df_xml1 = xml2df(tt)
print(df_xml1)
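Note that xml2df builds one record per top-level element, so the resulting frame has bathrooms, price and property_id as rows and the `n...` tags as columns, with every value still a string. Calling `.T` on that frame flips it. Alternatively, here is a minimal standard-library sketch that groups the values per `n...` tag and converts them to floats; the function name `xml_to_rows` and the inlined `XML` string are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question, inlined so the snippet is self-contained.
XML = """<root>
<bathrooms type="dict">
<n35237 type="number">1.0</n35237>
<n44699 type="number">nan</n44699>
</bathrooms>
<price type="dict">
<n35237 type="number">7020000.0</n35237>
<n44699 type="number">4128000.0</n44699>
</price>
<property_id type="dict">
<n35237 type="number">35237.0</n35237>
<n44699 type="number">44699.0</n44699>
</property_id>
</root>"""


def xml_to_rows(xml_data):
    """Return {row_tag: {column_tag: float}}, i.e. one entry per n... element."""
    root = ET.fromstring(xml_data)
    rows = {}
    for column in root:
        for cell in column:
            # float() also accepts the string 'nan'.
            rows.setdefault(cell.tag, {})[column.tag] = float(cell.text)
    return rows


rows = xml_to_rows(XML)
for row_id, row in rows.items():
    print(row_id, row)
```

From there, `pd.DataFrame.from_dict(rows, orient='index')` should give one row per property with bathrooms, price and property_id as columns.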
For a better understanding of ET, you can use the code below to see what is inside your XML:
import xml.etree.ElementTree as ET
import pandas as pd
import codecs

with codecs.open('text.xml', 'r', encoding='utf8') as f:
    tt = f.read()

root = ET.XML(tt)
print(type(root))
print(root[0])
for ele in root[0]:
    print(ele.tag + '////' + ele.text)
print(root[0][0].tag)
Once you finish running the program, you can see the output below:
C:\Users\username\Documents\pycode\Scripts\python.exe C:/Users/username/PycharmProjects/DestinationLight/try.py
      n35237      n32238     n44699
0        1.0         3.0        nan
1  7020000.0  10000000.0  4128000.0
2    35237.0     32238.0    44699.0
<class 'xml.etree.ElementTree.Element'>
<Element 'bathrooms' at 0x00000285006B6180>
n35237////1.0
n32238////3.0
n44699////nan
n35237
Process finished with exit code 0