There are many solutions to serialize a small dictionary: json.loads/json.dumps, pickle, shelve, ujson, or even sqlite.
But when dealing with possibly 100 GB of data, it is no longer feasible to use modules like these, which may rewrite the whole data set when closing / serializing.
redis is not really an option because it uses a client/server scheme.
Question: Which serverless key:value stores, able to work with 100+ GB of data, are frequently used in Python?
I'm looking for a solution with a standard "Pythonic" d[key] = value syntax:

import mydb

d = mydb.mydb('myfile.db')
d['hello'] = 17        # able to use string or int or float as key
d[183] = [12, 14, 24]  # able to store lists as values (will probably internally jsonify it?)
d.flush()              # easy to flush on disk
Note: BsdDB (BerkeleyDB) seems to be deprecated. There seems to be a LevelDB binding for Python, but it doesn't seem well-known, and I haven't found a version ready to use on Windows. Which ones are the most commonly used?
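For reference, here is a rough sketch of what the LevelDB route looks like through the plyvel bindings (assuming plyvel is the package in question; it stores raw bytes, so values would have to be serialized by hand):

import plyvel

# LevelDB via plyvel: keys and values must be bytes.
db = plyvel.DB('./my_leveldb', create_if_missing=True)
db.put(b'hello', b'17')     # store
print(db.get(b'hello'))     # b'17'
db.delete(b'hello')         # remove
db.close()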
Linked questions: Use SQLite as a key:value store, Flat file NoSQL solution
Introduction to the Python Dictionary type
A key in a key-value pair must be immutable; in other words, the key cannot be changed: for example, a number, a string, or a tuple. Python uses curly braces {} to define a dictionary. Inside the curly braces, you can place zero, one, or many key-value pairs.
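For example, with nothing beyond the built-in dict:

# Keys must be immutable: numbers, strings, and tuples all work; lists do not.
d = {1: 'one', 'pi': 3.14159, (2, 3): [2, 3]}
print(d['pi'])      # 3.14159
print(d[(2, 3)])    # [2, 3]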
A key-value database is a type of nonrelational database that uses a simple key-value method to store data. A key-value database stores data as a collection of key-value pairs in which a key serves as a unique identifier. Both keys and values can be anything, ranging from simple objects to complex compound objects.
The key-value store is one of the least complex types of NoSQL databases, which is precisely what makes this model so attractive. It uses very simple functions to store, get, and remove data. Apart from those core operations, key-value store databases do not have a query language.
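As a concrete illustration of that minimal store/get/remove interface, the standard-library dbm module already works this way (a small sketch; keys and values are stored as bytes):

import dbm

# Open (and create if missing) a small on-disk key-value store.
with dbm.open('example_kv', 'c') as db:
    db['hello'] = 'world'   # store
    print(db['hello'])      # get -> b'world'
    del db['hello']         # remove
    print('hello' in db)    # False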
You can use sqlitedict, which provides a key-value interface to a SQLite database.
The SQLite limits page says that the theoretical maximum is 140 TB, depending on page_size and max_page_count. However, the default values for Python 3.5.2-2ubuntu0~16.04.4 (sqlite3 2.6.0) are page_size=1024 and max_page_count=1073741823. This gives a maximal database size of ~1100 GB, which fits your requirement.
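The ~1100 GB figure is simply the product of those two defaults:

# Maximum database size = page_size * max_page_count (SQLite defaults above).
page_size = 1024
max_page_count = 1073741823
print(page_size * max_page_count / 10**9)  # ~1099.5 GB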
You can use the package like this:

from sqlitedict import SqliteDict

mydict = SqliteDict('./my_db.sqlite', autocommit=True)
mydict['some_key'] = any_picklable_object
print(mydict['some_key'])
for key, value in mydict.items():
    print(key, value)
print(len(mydict))
mydict.close()
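Note that autocommit=True commits after every write, which is convenient but slow for bulk loads. A sketch of the alternative, committing explicitly once per batch (commit() is part of the sqlitedict API):

from sqlitedict import SqliteDict

# Bulk insert with one explicit commit instead of a commit per write.
with SqliteDict('./my_db.sqlite') as mydict:   # autocommit defaults to False
    for i in range(100000):
        mydict[str(i)] = {'value': i}
    mydict.commit()                            # flush everything to disk at once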
About memory usage: SQLite doesn't need your dataset to fit in RAM. By default it caches up to cache_size pages, which is barely 2 MiB (with the same Python as above). Here's a script you can use to check it with your data. Before running it:

pip install lipsum psutil matplotlib psrecord sqlitedict
sqlitedct.py
#!/usr/bin/env python3

import os
import random
from contextlib import closing

import lipsum
from sqlitedict import SqliteDict


def main():
    with closing(SqliteDict('./my_db.sqlite', autocommit=True)) as d:
        for _ in range(100000):
            v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
            d[os.urandom(10)] = v


if __name__ == '__main__':
    main()
Run it like:

./sqlitedct.py & psrecord --plot=plot.png --interval=0.1 $!

In my case it produces a chart of CPU and memory usage over time (plot.png), and this database file:
$ du -h my_db.sqlite
84M     my_db.sqlite
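If you want to verify the page_size / cache_size defaults claimed above on the produced file, the standard sqlite3 module can read them directly (a quick check against the file generated by the script):

import sqlite3

# Inspect the SQLite settings that bound file size and memory use.
conn = sqlite3.connect('./my_db.sqlite')
print(conn.execute('PRAGMA page_size').fetchone())       # e.g. (1024,)
print(conn.execute('PRAGMA max_page_count').fetchone())  # e.g. (1073741823,)
print(conn.execute('PRAGMA cache_size').fetchone())      # e.g. (-2000,) ~ 2 MiB
conn.close()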