Convert nested mongoDB document into flat pandas DataFrame (Array of objects within array of objects)

Question

I am trying to convert mongoDB documents into a flat pandas dataframe structure.

An example of my mongoDB collection structure:

data = collection.find_one({'ID':300})
print(data)

{'_id': "ObjectId('5cd932299f6b7d4c9b95af6c')",
 'ID': 300,
 'updated': 23424,
 'data': [
     { 'meta': 8,
       'data': [
           {'value1': 1, 'value2': 2}, 
           {'value1': 3, 'value2': 4}
       ]
     },
     { 'meta': 9,
       'data': [
           {'value1': 5, 'value2': 6}
       ]
     }
  ]
}

When i put this into a pandas dataframe, I get

df = pd.DataFrame(data)
print(df)

| _id                      | ID  | updated | data                                              
|
|--------------------------|-----|---------|------------------------ ---------------------------|
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | {'meta': 8, 'data': [{'value1': 1, 'value2': 2... |
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | {'meta': 9, 'data': [{'value1': 5, 'value2': 6}]} |

When I iterate through the dataframe with pd.concat I get

df.rename(columns={'data':'data1'}, inplace=True)
df2 = pd.concat([df, pd.DataFrame(list(df['data1']))], axis=1).drop('data1', 1)
df3 = pd.concat([df2, pd.DataFrame(list(df2['data']))], axis=1).drop('data', 1)
print(df3)

| _id                      | ID  | updated | meta | 0                          | 1                          |
|--------------------------|-----|---------|------|----------------------------|----------------------------|
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | 8    | {'value1': 1, 'value2': 2} | {'value1': 3, 'value2': 4} |
| 5cd936779f6b7d4c9b95af6d | 300 | 23424   | 9    | {'value1': 5, 'value2': 6} | None                       |

The lowest level objects of the lowest level array has always the same names.

Therefore I want:

| ID  | updated | meta | value1 | value2 |
|-----|---------|------|--------|--------|
| 300 | 23424   | 8    | 1      | 2      |
| 300 | 23424   | 8    | 3      | 4      |
| 300 | 23424   | 9    | 5      | 6      |

Am I on the wrong track?

What would be the most convenient way to solve this?

Karan Gautam · Accepted Answer

@sinB - You can further improve this by removing the for loop (It will cause issue when dealing with database with many documents). You don't need loop anyway as the result can be converted into pandas dataframe with single command.

Instead of this:

#add each doc as a new row in dataframe
for doc in collection.aggregate(pipeline): 
    df = df.append(doc,ignore_index=True)

You can use this

query_result = collection.aggregate(pipeline)
query_result = list(query_result)
df = pd.io.json.json_normalize(query_result)

Convert nested mongoDB document into flat pandas DataFrame (Array of objects within array of objects)

Tags:

python

json

pandas

mongodb

pymongo

sinB

1 Answers

Karan Gautam

Recent Activity

Donate For Us

Convert nested mongoDB document into flat pandas DataFrame (Array of objects within array of objects)

Tags:

python

json

pandas

mongodb

pymongo

sinB

1 Answers

Karan Gautam

Related questions

Recent Activity

Donate For Us