
Convert nested mongoDB document into flat pandas DataFrame (Array of objects within array of objects)

I am trying to convert mongoDB documents into a flat pandas dataframe structure.

An example of my mongoDB collection structure:

data = collection.find_one({'ID':300})
print(data)

{'_id': "ObjectId('5cd932299f6b7d4c9b95af6c')",
 'ID': 300,
 'updated': 23424,
 'data': [
     { 'meta': 8,
       'data': [
           {'value1': 1, 'value2': 2}, 
           {'value1': 3, 'value2': 4}
       ]
     },
     { 'meta': 9,
       'data': [
           {'value1': 5, 'value2': 6}
       ]
     }
  ]
}

When I put this into a pandas DataFrame, I get:

df = pd.DataFrame(data)
print(df)

| _id                      | ID  | updated | data                                              |
|--------------------------|-----|---------|---------------------------------------------------|
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | {'meta': 8, 'data': [{'value1': 1, 'value2': 2... |
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | {'meta': 9, 'data': [{'value1': 5, 'value2': 6}]} |

When I iterate through the dataframe with pd.concat I get

df.rename(columns={'data':'data1'}, inplace=True)
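# expand each nested level into columns, then drop the original nested column
# (drop(columns=...) replaces the positional axis argument, which pandas 2.x rejects)
df2 = pd.concat([df, pd.DataFrame(list(df['data1']))], axis=1).drop(columns='data1')
df3 = pd.concat([df2, pd.DataFrame(list(df2['data']))], axis=1).drop(columns='data')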
print(df3)

| _id                      | ID  | updated | meta | 0                          | 1                          |
|--------------------------|-----|---------|------|----------------------------|----------------------------|
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | 8    | {'value1': 1, 'value2': 2} | {'value1': 3, 'value2': 4} |
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | 9    | {'value1': 5, 'value2': 6} | None                       |

The objects in the innermost array always have the same keys.

Therefore I want:

| ID  | updated | meta | value1 | value2 |
|-----|---------|------|--------|--------|
| 300 | 23424   | 8    | 1      | 2      |
| 300 | 23424   | 8    | 3      | 4      |
| 300 | 23424   | 9    | 5      | 6      |

Am I on the wrong track?

What would be the most convenient way to solve this?

sinB asked Oct 29 '25 19:10


1 Answer

@sinB - You can improve this further by removing the for loop, which will cause problems on a collection with many documents. You don't need the loop anyway, because the result can be converted into a pandas DataFrame with a single command.
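The `pipeline` used in the snippets here is not shown on this page. As a rough sketch only, assuming it unwinds both array levels of the document structure from the question, it might look something like the following. The stage contents and the projected field names are my assumption based on the example document, not the original pipeline.

```python
# Hypothetical aggregation pipeline (an assumption, not the original one):
# unwind the outer 'data' array, then the inner 'data.data' array,
# then project the nested fields to top-level names.
pipeline = [
    {'$unwind': '$data'},        # one document per outer array element
    {'$unwind': '$data.data'},   # one document per inner array element
    {'$project': {
        '_id': 0,
        'ID': 1,
        'updated': 1,
        'meta': '$data.meta',
        'value1': '$data.data.value1',
        'value2': '$data.data.value2',
    }},
]
```

With a pipeline of this shape, each result document is already flat, so the list of results maps directly onto rows of a DataFrame.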

Instead of this:

#add each doc as a new row in dataframe
for doc in collection.aggregate(pipeline): 
    df = df.append(doc,ignore_index=True)

You can use this:

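# run the aggregation once and load all result documents into a DataFrame
# (pd.json_normalize is the current name; pd.io.json.json_normalize is deprecated)
query_result = list(collection.aggregate(pipeline))
df = pd.json_normalize(query_result)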
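If the documents are fetched without an aggregation, `pd.json_normalize` can also flatten the nested arrays on the pandas side. The following is a minimal sketch, assuming the document has exactly the shape shown in the question. The variable names `doc` and `flat` are mine, and the `_id` value is kept as a plain string, as in the printed example.

```python
import pandas as pd

# Example document copied from the question
doc = {
    '_id': "ObjectId('5cd932299f6b7d4c9b95af6c')",
    'ID': 300,
    'updated': 23424,
    'data': [
        {'meta': 8, 'data': [{'value1': 1, 'value2': 2},
                             {'value1': 3, 'value2': 4}]},
        {'meta': 9, 'data': [{'value1': 5, 'value2': 6}]},
    ],
}

# record_path walks down to the innermost array (data -> data);
# meta lists the outer fields to repeat on every resulting row.
flat = pd.json_normalize(
    doc,
    record_path=['data', 'data'],
    meta=['ID', 'updated', ['data', 'meta']],
)

# The nested meta field comes back as 'data.meta'; rename and reorder columns.
flat = flat.rename(columns={'data.meta': 'meta'})
flat = flat[['ID', 'updated', 'meta', 'value1', 'value2']]
print(flat)
```

This should give one row per innermost object, matching the layout requested in the question. For many documents, passing a list of documents instead of a single one is expected to work the same way.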

Karan Gautam answered Oct 31 '25 09:10


