I have a large Python dictionary of values (around 50 GB) that I've stored as a JSON file, and I'm running into efficiency problems when opening the file and writing to it. I know ijson can read the file efficiently, but how can I write to it efficiently?
Should I even be using a Python dictionary to store my data? Is there a limit to how large a Python dictionary can be? (The dictionary will keep growing.)
The data basically stores the path length between nodes in a large graph. I can't store the data as a graph because searching for a connection between two nodes takes too long.
Any help would be much appreciated. Thank you!
If you just want to work with a dictionary larger than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict but stores itself on disk. shelve serializes its values with pickle (cPickle in Python 2), so be sure to set the protocol to something other than 0.
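A minimal sketch of that approach (the file name path_lengths.db and the string key format are assumptions made for illustration):

import shelve
import pickle

# Open (or create) a disk-backed dict; use a recent pickle protocol, not 0.
with shelve.open("path_lengths.db", protocol=pickle.HIGHEST_PROTOCOL) as db:
    db["nodeA->nodeB"] = 42          # shelve keys must be strings
    print(db.get("nodeA->nodeB"))    # reads the value back from disk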
In a test that kept adding entries to a dictionary, the machine ran out of memory before reaching 2^27 entries, so the dictionary itself imposes no size limit (a rough sketch of such a test is shown below).
Each entry stores a hash, a key reference, and a value reference, which sums to at least 12 bytes on a 32-bit machine and 24 bytes on a 64-bit machine. The dictionary starts with 8 empty slots and is resized by roughly doubling the number of slots whenever its capacity is reached.
So the answer to your question is: a Python dictionary can hold as much as your environment allows.
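A rough sketch of that kind of test (the 2^27 loop bound and the use of sys.getsizeof are illustrative assumptions; sys.getsizeof measures only the dict structure, not the keys and values it holds):

import sys

d = {}
i = 0
try:
    while i < 2**27:
        d[i] = i        # keep adding entries until memory runs out
        i += 1
    print("reached 2**27 entries, dict structure is", sys.getsizeof(d), "bytes")
except MemoryError:
    print("ran out of memory after", i, "entries")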
Although it will depend on what operations you want to perform on your network dataset, you might want to consider storing it as a pandas DataFrame and then writing it to disk using Parquet or Arrow.
That data could then be loaded into networkx or even into Spark (GraphX) for any network-related operations.
Parquet is compressed and columnar, which makes reading and writing files much faster, especially for large datasets.
From the Pandas Doc:
Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.
Read further here: Pandas Parquet
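A minimal sketch of round-tripping path-length data through Parquet (the column names and the file name are assumptions, and pyarrow or fastparquet must be installed):

import pandas as pd

# Hypothetical path-length table: one row per (source, target) pair.
df = pd.DataFrame({
    "source": ["a", "a", "b"],
    "target": ["b", "c", "c"],
    "path_length": [1, 2, 1],
})

df.to_parquet("path_lengths.parquet", compression="snappy")  # write
df2 = pd.read_parquet("path_lengths.parquet")                # read back
print(df2.head())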
Try using pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Convert a JSON string to pandas object
It is a very lightweight and useful library for working with large data.
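Note that the chunksize parameter only works with line-delimited JSON; a rough sketch, where the file name and chunk size are assumptions:

import pandas as pd

# Stream a large line-delimited JSON file instead of loading it all at once.
reader = pd.read_json("path_lengths.jsonl", lines=True, chunksize=100_000)
for chunk in reader:
    # each chunk is a DataFrame; process it here
    print(chunk.shape)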