Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Key: value store in Python for possibly 100 GB of data, without client/server [closed]

There are many solutions to serialize a small dictionary: json.loads/json.dumps, pickle, shelve, ujson, or even by using sqlite.

But when dealing with possibly 100 GB of data, it's not possible anymore to use such modules that would possibly rewrite the whole data when closing / serializing.

redis is not really an option because it uses a client/server scheme.

Question: Which key:value store, serverless, able to work with 100+ GB of data, are frequently used in Python?

I'm looking for a solution with a standard "Pythonic" d[key] = value syntax:

import mydb d = mydb.mydb('myfile.db') d['hello'] = 17          # able to use string or int or float as key d[183] = [12, 14, 24]    # able to store lists as values (will probably internally jsonify it?) d.flush()                # easy to flush on disk  

Note: BsdDB (BerkeleyDB) seems to be deprecated. There seems to be a LevelDB for Python, but it doesn't seem well-known - and I haven't found a version which is ready to use on Windows. Which ones would be the most common ones?


Linked questions: Use SQLite as a key:value store, Flat file NoSQL solution

like image 595
Basj Avatar asked Nov 11 '17 01:11

Basj


People also ask

How does Python store key-value?

Introduction to the Python Dictionary typeA key in the key-value pair must be immutable. In other words, the key cannot be changed, for example, a number, a string, a tuple, etc. Python uses the curly braces {} to define a dictionary. Inside the curly braces, you can place zero, one, or many key-value pairs.

Where do we store data as key-value?

A key-value database is a type of nonrelational database that uses a simple key-value method to store data. A key-value database stores data as a collection of key-value pairs in which a key serves as a unique identifier. Both keys and values can be anything, ranging from simple objects to complex compound objects.

What is key-value store in NoSQL?

The key-value store is one of the least complex types of NoSQL databases. This is precisely what makes this model so attractive. It uses very simple functions to store, get and remove data. Apart from those main functions, key-value store databases do not have querying language.


1 Answers

You can use sqlitedict which provides key-value interface to SQLite database.

SQLite limits page says that theoretical maximum is 140 TB depending on page_size and max_page_count. However, default values for Python 3.5.2-2ubuntu0~16.04.4 (sqlite3 2.6.0), are page_size=1024 and max_page_count=1073741823. This gives ~1100 GB of maximal database size which fits your requirement.

You can use the package like:

from sqlitedict import SqliteDict  mydict = SqliteDict('./my_db.sqlite', autocommit=True) mydict['some_key'] = any_picklable_object print(mydict['some_key']) for key, value in mydict.items():     print(key, value) print(len(mydict)) mydict.close() 

Update

About memory usage. SQLite doesn't need your dataset to fit in RAM. By default it caches up to cache_size pages, which is barely 2MiB (the same Python as above). Here's the script you can use to check it with your data. Before run:

pip install lipsum psutil matplotlib psrecord sqlitedict 

sqlitedct.py

#!/usr/bin/env python3  import os import random from contextlib import closing  import lipsum from sqlitedict import SqliteDict  def main():     with closing(SqliteDict('./my_db.sqlite', autocommit=True)) as d:         for _ in range(100000):             v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]             d[os.urandom(10)] = v  if __name__ == '__main__':     main() 

Run it like ./sqlitedct.py & psrecord --plot=plot.png --interval=0.1 $!. In my case it produces this chart: chart

And database file:

$ du -h my_db.sqlite  84M my_db.sqlite 
like image 154
saaj Avatar answered Oct 13 '22 00:10

saaj