Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

A better way to load MongoDB data to a DataFrame using Pandas and PyMongo?

I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a dataframe. However, I get an error.

MemoryError:    

My code looks like this:

cursor = tweets.find() #Where tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns = tweet_fields)

I've tried the methods in the following answers, which at some point create a list of all the elements of the database before loading it.

  • https://stackoverflow.com/a/17805626/2297475
  • https://stackoverflow.com/a/16255680/2297475

However, in another answer which talks about list(), the person said that it's good for small data sets, because everything is loaded into memory.

  • https://stackoverflow.com/a/13215411/2297475

In my case, I think it's the source of the error. It's too much data to be loaded into memory. What other method can I use?

like image 395
blue_chip Avatar asked Jul 25 '14 19:07

blue_chip


People also ask

Is Pyarrow faster than pandas?

python - Pyarrow is slower than pandas for csv read in - Stack Overflow. Stack Overflow for Teams – Start collaborating and sharing organizational knowledge.

Which is the best way to get data in pandas?

pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame . pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_* .

What is PyMongo and what can you do with PyMongo?

In general, PyMongo provides a rich set of tools that you can use to communicate with a MongoDB server. It provides functionality to query, retrieve results, write and delete data, and run database commands.


2 Answers

I've modified my code to the following:

cursor = tweets.find(fields=['id'])
tweet_fields = ['id']
result = DataFrame(list(cursor), columns = tweet_fields)

By adding the fields parameter in the find() function I restricted the output. Which means that I'm not loading every field but only the selected fields into the DataFrame. Everything works fine now.

like image 93
blue_chip Avatar answered Sep 23 '22 20:09

blue_chip


The from_records classmethod is probably the best way to do it:

from pandas import pd
import pymongo

client = pymongo.MongoClient()
data = db.mydb.mycollection.find() # or db.mydb.mycollection.aggregate(pipeline)

df = pd.DataFrame.from_records(data)
like image 30
Edgar Ramírez Mondragón Avatar answered Sep 25 '22 20:09

Edgar Ramírez Mondragón