I have a large Python dictionary of values (around 50 GB) that I've stored as a JSON file, and I'm running into efficiency problems when opening the file and writing to it. I know ijson can read the file efficiently, but how can I write to it efficiently?
Should I even be using a Python dictionary to store my data? Is there a limit to how large a Python dictionary can be? (The dictionary will keep growing.)
The data basically stores the path length between nodes in a large graph. I can't store the data as a graph because searching for a connection between two nodes takes too long.
Any help would be much appreciated. Thank you!
If you just want to work with a dictionary larger than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict but stores itself on disk. shelve serializes its values with pickle (cPickle in Python 2), so be sure to set the protocol to something other than 0.
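A minimal sketch of that approach (the file name path_lengths.db and the string key format are assumptions made for illustration):

import shelve
import pickle

# Open (or create) a disk-backed dict; use a recent pickle protocol, not 0.
with shelve.open("path_lengths.db", protocol=pickle.HIGHEST_PROTOCOL) as db:
    db["nodeA->nodeB"] = 42          # shelve keys must be strings
    print(db.get("nodeA->nodeB"))    # reads the value back from disk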
In a test that kept adding entries to a dictionary, the machine ran out of memory before reaching 2^27 entries, so the dictionary itself imposes no size limit (a rough sketch of such a test is shown below).
Each entry stores a hash, a key reference, and a value reference, which sums to at least 12 bytes on a 32-bit machine and 24 bytes on a 64-bit machine. The dictionary starts with 8 empty slots and is resized by roughly doubling the number of slots whenever its capacity is reached.
So the answer to your question is: a Python dictionary can hold as much as your environment allows.
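A rough sketch of that kind of test (the 2^27 loop bound and the use of sys.getsizeof are illustrative assumptions; sys.getsizeof measures only the dict structure, not the keys and values it holds):

import sys

d = {}
i = 0
try:
    while i < 2**27:
        d[i] = i        # keep adding entries until memory runs out
        i += 1
    print("reached 2**27 entries, dict structure is", sys.getsizeof(d), "bytes")
except MemoryError:
    print("ran out of memory after", i, "entries")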
Although it will depend on what operations you want to perform on your network dataset, you might want to consider storing it as a pandas DataFrame and then writing it to disk using Parquet or Arrow.
That data could then be loaded into networkx or even into Spark (GraphX) for any network-related operations.
Parquet is compressed and columnar, which makes reading and writing files much faster, especially for large datasets.
From the Pandas Doc:
Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.
Read further here: Pandas Parquet
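A minimal sketch of round-tripping path-length data through Parquet (the column names and the file name are assumptions, and pyarrow or fastparquet must be installed):

import pandas as pd

# Hypothetical path-length table: one row per (source, target) pair.
df = pd.DataFrame({
    "source": ["a", "a", "b"],
    "target": ["b", "c", "c"],
    "path_length": [1, 2, 1],
})

df.to_parquet("path_lengths.parquet", compression="snappy")  # write
df2 = pd.read_parquet("path_lengths.parquet")                # read back
print(df2.head())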
Try using pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Convert a JSON string to pandas object
It is a very lightweight and useful library for working with large data.
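Note that the chunksize parameter only works with line-delimited JSON; a rough sketch, where the file name and chunk size are assumptions:

import pandas as pd

# Stream a large line-delimited JSON file instead of loading it all at once.
reader = pd.read_json("path_lengths.jsonl", lines=True, chunksize=100_000)
for chunk in reader:
    # each chunk is a DataFrame; process it here
    print(chunk.shape)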