I am trying to convert MongoDB documents into a flat pandas DataFrame.
Here is an example document from my MongoDB collection:
data = collection.find_one({'ID':300})
print(data)
{'_id': "ObjectId('5cd932299f6b7d4c9b95af6c')",
 'ID': 300,
 'updated': 23424,
 'data': [
     {'meta': 8,
      'data': [
          {'value1': 1, 'value2': 2},
          {'value1': 3, 'value2': 4}
      ]
     },
     {'meta': 9,
      'data': [
          {'value1': 5, 'value2': 6}
      ]
     }
 ]
}
When I put this into a pandas DataFrame, I get:
df = pd.DataFrame(data)
print(df)
| _id | ID | updated | data |
|--------------------------|-----|---------|---------------------------------------------------|
| 5cd936779f6b7d4c9b95af6d | 300 | 23424 | {'meta': 8, 'data': [{'value1': 1, 'value2': 2... |
| 5cd936779f6b7d4c9b95af6d | 300 | 23424 | {'meta': 9, 'data': [{'value1': 5, 'value2': 6}]} |
When I expand the nested columns step by step with pd.concat, I get:
df.rename(columns={'data':'data1'}, inplace=True)
df2 = pd.concat([df, pd.DataFrame(list(df['data1']))], axis=1).drop(columns='data1')
df3 = pd.concat([df2, pd.DataFrame(list(df2['data']))], axis=1).drop(columns='data')
print(df3)
| _id | ID | updated | meta | 0 | 1 |
|--------------------------|-----|---------|------|----------------------------|----------------------------|
| 5cd936779f6b7d4c9b95af6d | 300 | 23424 | 8 | {'value1': 1, 'value2': 2} | {'value1': 3, 'value2': 4} |
| 5cd936779f6b7d4c9b95af6d | 300 | 23424 | 9 | {'value1': 5, 'value2': 6} | None |
The objects in the innermost array always have the same keys.
Therefore I want:
| ID | updated | meta | value1 | value2 |
|-----|---------|------|--------|--------|
| 300 | 23424 | 8 | 1 | 2 |
| 300 | 23424 | 8 | 3 | 4 |
| 300 | 23424 | 9 | 5 | 6 |
Am I on the wrong track?
What would be the most convenient way to solve this?
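One possible approach, as a minimal sketch: pd.json_normalize (pandas 1.0 or later) can walk the nested arrays through record_path and carry the outer fields along through meta. The doc literal below is the example document from the question with _id left out. The names doc and flat, the rename of 'data.meta' to 'meta', and the final column order are illustrative choices.

```python
import pandas as pd

# Example document from the question (the _id field is left out for brevity).
doc = {
    'ID': 300,
    'updated': 23424,
    'data': [
        {'meta': 8, 'data': [{'value1': 1, 'value2': 2},
                             {'value1': 3, 'value2': 4}]},
        {'meta': 9, 'data': [{'value1': 5, 'value2': 6}]},
    ],
}

# record_path points at the innermost array (data -> data);
# meta lists the outer fields to repeat on every resulting row.
flat = pd.json_normalize(
    doc,
    record_path=['data', 'data'],
    meta=['ID', 'updated', ['data', 'meta']],
)

# The nested meta field comes back as 'data.meta'; shorten it and reorder.
flat = flat.rename(columns={'data.meta': 'meta'})
flat = flat[['ID', 'updated', 'meta', 'value1', 'value2']]
print(flat)
```

json_normalize also accepts a list of such documents, so the same call should work on list(collection.find(...)) as long as every document has this shape.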
@sinB - You can improve this further by removing the for loop, which will cause problems on collections with many documents. The loop is unnecessary anyway, because the result can be converted into a pandas DataFrame with a single command.
Instead of this:
# add each doc as a new row in the dataframe
for doc in collection.aggregate(pipeline):
    df = df.append(doc, ignore_index=True)
You can use this:
# materialize the cursor, then flatten all documents in one call
query_result = list(collection.aggregate(pipeline))
df = pd.json_normalize(query_result)  # pd.io.json.json_normalize on pandas < 1.0
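To illustrate what the single call does, here is a small sketch that uses a plain list in place of list(collection.aggregate(pipeline)). The shape of the documents is an assumption: it is what a pipeline that unwinds both nested arrays might return for the question's example, since the actual pipeline is not shown here. pd.json_normalize is the current name of pd.io.json.json_normalize.

```python
import pandas as pd

# Hypothetical stand-in for list(collection.aggregate(pipeline)):
# one document per innermost record, with the nesting still in place.
query_result = [
    {'ID': 300, 'updated': 23424,
     'data': {'meta': 8, 'data': {'value1': 1, 'value2': 2}}},
    {'ID': 300, 'updated': 23424,
     'data': {'meta': 8, 'data': {'value1': 3, 'value2': 4}}},
    {'ID': 300, 'updated': 23424,
     'data': {'meta': 9, 'data': {'value1': 5, 'value2': 6}}},
]

# Nested dicts are flattened into dotted column names such as 'data.meta'.
df = pd.json_normalize(query_result)

# Keep only the last part of each dotted name (safe here because the
# shortened names do not collide).
df.columns = [name.split('.')[-1] for name in df.columns]
print(df)
```

Because the whole result is passed to pandas at once, no row-by-row append is needed.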