Our team uses software that is heavily reliant on dumping NumPy data into files, which slows our code quite a lot. If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost.
Other performant methods of storing NumPy arrays in any database or searchable database-like structure are welcome, but PostgreSQL would be preferred.
My question is very similar to one asked previously. However, I am looking for a more robust and performant answer, and I wish to store arbitrary NumPy arrays.
PostgreSQL arrays can be used to denormalize data and avoid lookup tables. A good rule of thumb for using them this way is that you mostly work with the array as a whole, even if you occasionally search for individual elements. Heavier per-element processing is more cumbersome with an array column than with a lookup table.
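If your arrays are one-dimensional and numeric, a native array column is one option. Below is a minimal sketch assuming a hypothetical vectors table; the table name and sample data are invented for the example:

import numpy as np
import psycopg2

# Placeholder connection parameters -- substitute your own
conn = psycopg2.connect(dbname='<YOUR_DBNAME>', user='<YOUR_USRNAME>',
                        password='<YOUR_PWD>', host='<HOST>', port='<PORT>')
cur = conn.cursor()

# A table with a native DOUBLE PRECISION[] column (hypothetical name 'vectors')
cur.execute(
    "CREATE TABLE IF NOT EXISTS vectors (id TEXT PRIMARY KEY, data DOUBLE PRECISION[])"
)

# psycopg2 adapts a Python list to a PostgreSQL array
arr = np.random.rand(10)
cur.execute("INSERT INTO vectors (id, data) VALUES (%s, %s)", ('v1', arr.tolist()))
conn.commit()

# Reading it back yields a Python list, which converts straight to an ndarray
cur.execute("SELECT data FROM vectors WHERE id = %s", ('v1',))
restored = np.array(cur.fetchone()[0])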
Establishing a connection using Python: you can create new connections with the connect() function. It accepts the basic connection parameters, such as dbname, user, password, host, and port, and returns a connection object. Using this function, you can establish a connection to PostgreSQL.
Set autocommit to false and create a cursor object. Then build a list of the data to be inserted into the table, loop through the list inserting the values, and finally commit and close the connection.
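A rough sketch of that sequence, assuming a reachable PostgreSQL instance and a hypothetical points table; the table name and sample rows are invented for the example:

import psycopg2

# Placeholder connection parameters -- substitute your own
connection = psycopg2.connect(dbname='<YOUR_DBNAME>', user='<YOUR_USRNAME>',
                              password='<YOUR_PWD>', host='<HOST>', port='<PORT>')
connection.autocommit = False  # explicit transaction control
cursor = connection.cursor()

cursor.execute(
    "CREATE TABLE IF NOT EXISTS points (id INTEGER PRIMARY KEY, value DOUBLE PRECISION)"
)

rows = [(1, 0.5), (2, 1.25), (3, 2.0)]  # data to be inserted

# Loop through the list and insert the values
for row in rows:
    cursor.execute("INSERT INTO points (id, value) VALUES (%s, %s)", row)

# Commit and close the connection
connection.commit()
cursor.close()
connection.close()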
You can use the np alias to create an ndarray from a list using the array() method. The list is passed to the array() method, which returns a NumPy array with the same elements.
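For example:

import numpy as np

values = [1.5, 2.0, 3.25]    # plain Python list
arr = np.array(values)       # ndarray with the same elements
print(arr.shape, arr.dtype)  # (3,) float64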
Not sure if this is what you are after, but assuming you have read/write access to an existing postgres DB:
import numpy as np
import psycopg2 as psy
import pickle

# Connection parameters -- fill in your own credentials
db_connect_kwargs = {
    'dbname': '<YOUR_DBNAME>',
    'user': '<YOUR_USRNAME>',
    'password': '<YOUR_PWD>',
    'host': '<HOST>',
    'port': '<PORT>'
}

connection = psy.connect(**db_connect_kwargs)
connection.set_session(autocommit=True)
cursor = connection.cursor()

# One row per array: a unique identifier plus the serialized array as bytea
cursor.execute(
    """
    DROP TABLE IF EXISTS numpy_arrays;
    CREATE TABLE numpy_arrays (
        uuid VARCHAR PRIMARY KEY,
        np_array_bytes BYTEA
    )
    """
)
The gist of this approach is to store any NumPy array (of arbitrary shape and data type) as a row in the numpy_arrays table, where uuid is a unique identifier that lets you retrieve the array later. The actual array would be saved in the np_array_bytes column as bytes.
Inserting into the database:
# Serialize the array with pickle and insert it under a chosen identifier
some_array = np.random.rand(1500, 550)
some_array_uuid = 'some_array'

cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    (some_array_uuid, pickle.dumps(some_array))
)
Querying from the database:
uuid = 'some_array'

cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid=%s
    """,
    (uuid,)
)
# fetchone() returns a one-element tuple holding the bytea payload
some_array = pickle.loads(cursor.fetchone()[0])
Performance?
If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost.
I haven't benchmarked this approach in any way, so I can't confirm nor refute this...
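One way to get a feel for it is a rough timing sketch like the one below, reusing the connection, cursor, and numpy_arrays table defined above; the array size and iteration count are arbitrary:

import time

arr = np.random.rand(1500, 550)
n_runs = 10

# Time serializing + dumping to a .npy file
start = time.perf_counter()
for i in range(n_runs):
    np.save('bench_array.npy', arr)
file_time = time.perf_counter() - start

# Time serializing + inserting into PostgreSQL (autocommit is on)
start = time.perf_counter()
for i in range(n_runs):
    cursor.execute(
        "INSERT INTO numpy_arrays(uuid, np_array_bytes) VALUES (%s, %s)",
        ('bench_%d' % i, pickle.dumps(arr))
    )
db_time = time.perf_counter() - start

print('np.save: %.3fs, INSERT: %.3fs' % (file_time, db_time))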
Disk Space?
My guess is that this approach takes as much disk space as dumping the arrays to a file using np.save('some_array.npy', some_array). If this is an issue, consider compressing the bytes before insertion.
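For example, a minimal sketch layering zlib on top of the pickled bytes, using the same table and cursor as above:

import zlib

arr = np.random.rand(1500, 550)

# Compress the pickled bytes before inserting them
cursor.execute(
    "INSERT INTO numpy_arrays(uuid, np_array_bytes) VALUES (%s, %s)",
    ('compressed_array', zlib.compress(pickle.dumps(arr)))
)

# Decompress after reading the row back
cursor.execute(
    "SELECT np_array_bytes FROM numpy_arrays WHERE uuid=%s",
    ('compressed_array',)
)
restored = pickle.loads(zlib.decompress(bytes(cursor.fetchone()[0])))

How much this saves depends on the data: random arrays like the example barely compress, while arrays with structure or repeated values can shrink considerably.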