
Large Python dictionary: storing, loading, and writing to it

I have a large Python dictionary of values (around 50 GB), and I've stored it as a JSON file. I'm having efficiency issues when it comes to opening the file and writing to it. I know you can use ijson to read the file efficiently, but how can I write to it efficiently?

Should I even be using a Python dictionary to store my data? Is there a limit to how large a Python dictionary can be? (The dictionary will keep growing.)

The data basically stores the path length between nodes in a large graph. I can't store the data as a graph because searching for a connection between two nodes takes too long.
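For reference, reading the file with ijson currently looks something like this rough sketch (the file name and the flat {"node pair": length, ...} layout are assumptions):

import ijson

# Stream the top-level JSON object one key/value pair at a time
# instead of loading the whole 50 GB file into memory.
with open("path_lengths.json", "rb") as f:
    for node_pair, length in ijson.kvitems(f, ""):
        ...  # handle one entry at a time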

Any help would be much appreciated. Thank you!

asked Dec 25 '18 by kamykam


People also ask

How do I store a large dictionary in Python?

If you just want to work with a dictionary larger than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict but stores its contents on disk. shelve is backed by pickle (cPickle in Python 2), so be sure to set the protocol to something higher than 0.
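A minimal sketch of that approach (the file name and keys are assumptions):

import shelve

# A disk-backed dict; protocol >= 2 avoids the slow, verbose protocol-0 pickles.
with shelve.open("path_lengths.db", protocol=4) as db:
    db["a->b"] = 42          # keys must be strings; values can be any picklable object
    print(db.get("a->b"))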

What is the maximum size of a Python dictionary?

Such a test never displays its output because the computer runs out of memory before reaching 2^27 entries. So there is no built-in size limitation on a dictionary; available memory is the limit.

How much memory does a dictionary take Python?

Each entry takes at least 12 bytes on a 32-bit machine and 24 bytes on a 64-bit machine. The dictionary starts with 8 empty buckets and is resized by doubling the number of buckets whenever its capacity is reached.
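You can watch that growth with sys.getsizeof (a minimal sketch; the exact byte counts depend on the Python version and build):

import sys

d = {}
print(sys.getsizeof(d))                   # size of the empty dict object itself, in bytes
d.update({i: None for i in range(1000)})
print(sys.getsizeof(d))                   # larger, after several resizes of the bucket table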

How many items can I store in a dictionary in Python?

So the answer to your question is: a Python dictionary can hold as many items as your environment's memory allows.


2 Answers

Although it will ultimately depend on what operations you want to perform on your network dataset, you might want to consider storing it as a pandas DataFrame and then writing it to disk using Parquet or Arrow.

That data could then be loaded into networkx, or even into Spark (GraphX), for any network-related operations.

Parquet is compressed and columnar, which makes reading and writing files much faster, especially for large datasets.
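A minimal sketch of that workflow (the column names, file name, and compression choice are assumptions, and to_parquet requires pyarrow or fastparquet to be installed):

import pandas as pd
import networkx as nx

# Edge list: one row per node pair with its path length.
df = pd.DataFrame({
    "source": ["a", "a", "b"],
    "target": ["b", "c", "c"],
    "path_length": [1, 2, 1],
})

# Write a compressed, columnar Parquet file and read it back later,
# optionally loading only the columns that are needed.
df.to_parquet("path_lengths.parquet", compression="snappy")
df = pd.read_parquet("path_lengths.parquet", columns=["source", "target", "path_length"])

# Rebuild a graph view of the same data in networkx if needed.
G = nx.from_pandas_edgelist(df, source="source", target="target", edge_attr="path_length")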

From the Pandas Doc:

Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.

Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.

Read further here: Pandas Parquet

answered Sep 28 '22 by HakunaMaData


Try using pandas for this: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Convert a JSON string to pandas object

It is a lightweight and useful library for working with large datasets.
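For a file of this size, the lines and chunksize parameters are the important ones: they let you stream a JSON-lines file in pieces instead of loading it all at once. A minimal sketch (the file name, chunk size, and JSON-lines layout are assumptions):

import pandas as pd

# Iterate over a JSON-lines file in chunks; chunksize requires lines=True.
reader = pd.read_json("path_lengths.jsonl", lines=True, chunksize=100_000)
for chunk in reader:      # each chunk is a DataFrame
    ...                   # process the chunk here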

answered Sep 28 '22 by frankegoesdown