I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a dataframe. However, I get an error.
MemoryError:
My code looks like this:
cursor = tweets.find() #Where tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns = tweet_fields)
I've tried the methods from other answers, which at some point create a list of all the elements of the database before loading it.
However, in another answer which talks about list(), the person said that it's good for small data sets, because everything is loaded into memory.
In my case, I think that's the source of the error: it's too much data to load into memory. What other method can I use?
I've modified my code to the following:
cursor = tweets.find(fields=['id'])  # note: PyMongo 3+ renamed fields= to projection=
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)
By adding the fields parameter to the find() call I restricted the output, so only the selected fields (rather than every field) are loaded into the DataFrame. Everything works fine now.
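If the data still doesn't fit comfortably in memory even with the projection, another option is to build the DataFrame in chunks straight from the cursor instead of materializing one big list of documents first. This is a minimal sketch (not from the original answers), assuming a tweets collection reachable through a pymongo client and using a hypothetical chunked() helper:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
tweets = client.twitter_db.tweets  # hypothetical database/collection names

tweet_fields = ['id']
cursor = tweets.find({}, {'id': 1})  # projection: fetch only the id field

def chunked(cursor, size=10000):
    # Yield lists of documents from the cursor, `size` documents at a time
    chunk = []
    for doc in cursor:
        chunk.append(doc)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Convert each chunk to a small DataFrame so the full list of dicts is never held at once;
# the final concatenated frame still has to fit in memory, but rows are far lighter than dicts.
frames = [pd.DataFrame(chunk, columns=tweet_fields) for chunk in chunked(cursor)]
result = pd.concat(frames, ignore_index=True)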
The from_records classmethod is probably the best way to do it:
import pandas as pd
import pymongo

client = pymongo.MongoClient()
data = client.mydb.mycollection.find()  # or client.mydb.mycollection.aggregate(pipeline)
df = pd.DataFrame.from_records(data)
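Note that from_records also pulls the whole cursor into memory, so on a large collection it can hit the same MemoryError; combining it with a projection, as in the answer above, keeps the footprint down. A small hedged variant, assuming the same mydb.mycollection names:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
# Fetch only the fields you need and drop MongoDB's _id to keep the frame small
data = client.mydb.mycollection.find({}, {'_id': 0, 'id': 1})
df = pd.DataFrame.from_records(data)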