Best way to Insert Python NumPy array into PostgreSQL database

Our team uses software that is heavily reliant on dumping NumPy data into files, which slows our code quite a lot. If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost.

Other performant methods of storing NumPy arrays in any database or searchable database-like structure are welcome, but PostgreSQL would be preferred.

My question is very similar to one asked previously. However, I am looking for a more robust and performant answer, and I wish to store any arbitrary NumPy array.

Daniel Marchand asked Feb 18 '20

People also ask

Should you use arrays in Postgres?

Arrays can be used to denormalize data and avoid lookup tables. A good rule of thumb for using them that way is that you mostly use the array as a whole, even if you might at times search for elements in the array. Heavier processing is going to be more complex than a lookup table.

How does Python integrate with PostgreSQL?

You can create new connections using the connect() function. It accepts the basic connection parameters, such as dbname, user, password, host, and port, and returns a connection object. Using this function, you can establish a connection with PostgreSQL.
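For example, a minimal sketch (every connection value below is a placeholder):

import psycopg2

connection = psycopg2.connect(
    dbname='<YOUR_DBNAME>',
    user='<YOUR_USERNAME>',
    password='<YOUR_PWD>',
    host='<HOST>',
    port='<PORT>'
)
print(connection.closed)  # 0 means the connection is open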

How do I store a list in PostgreSQL using Python?

Set autocommit to false and create a cursor object. Then build a list of the data to be inserted into the table, loop through the list inserting each value, and finally commit and close the connection, as in the sketch below.
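A minimal sketch of that pattern with psycopg2, assuming a hypothetical items table with a single integer column:

import psycopg2

# Hypothetical table: CREATE TABLE items (value INTEGER);
connection = psycopg2.connect(dbname='<YOUR_DBNAME>', user='<YOUR_USERNAME>')
connection.autocommit = False  # stay inside an explicit transaction
cursor = connection.cursor()

values = [1, 2, 3]
for v in values:
    cursor.execute("INSERT INTO items (value) VALUES (%s)", (v,))

connection.commit()  # make the inserts permanent
cursor.close()
connection.close()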

How do I input a NumPy array in Python?

You can use the np alias to create an ndarray from a list using the array() method. The list is passed to array(), which returns a NumPy array with the same elements.
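A quick illustrative example:

import numpy as np

# Pass a plain Python list to array() to get an ndarray back.
arr = np.array([1, 2, 3])
print(type(arr), arr)  # <class 'numpy.ndarray'> [1 2 3]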


1 Answer

Not sure if this is what you are after, but assuming you have read/write access to an existing postgres DB:

import numpy as np
import psycopg2 as psy
import pickle

db_connect_kwargs = {
    'dbname': '<YOUR_DBNAME>',
    'user': '<YOUR_USRNAME>',
    'password': '<YOUR_PWD>',
    'host': '<HOST>',
    'port': '<PORT>'
}

connection = psy.connect(**db_connect_kwargs)
connection.set_session(autocommit=True)  # each statement is committed immediately
cursor = connection.cursor()

cursor.execute(
    """
    DROP TABLE IF EXISTS numpy_arrays;
    CREATE TABLE numpy_arrays (
        uuid VARCHAR PRIMARY KEY,
        np_array_bytes BYTEA
    )
    """
)

The gist of this approach is to store any numpy array (of arbitrary shape and data type) as a row in the numpy_arrays table, where uuid is a unique identifier that lets you retrieve the array later. The actual array is saved in the np_array_bytes column as bytes.
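As a side note not covered in the original answer: if you want to avoid pickle for non-object arrays, the same BYTEA column can hold the array serialized in NumPy's .npy format, which also preserves shape and dtype. A minimal sketch (the helper names are my own):

import io
import numpy as np

def array_to_bytes(arr):
    # Serialize in .npy format; shape and dtype travel with the data.
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()

def bytes_to_array(data):
    # Restore the array from .npy bytes (works on bytes or memoryview).
    return np.load(io.BytesIO(data), allow_pickle=False)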

Inserting into the database:

some_array = np.random.rand(1500, 550)
some_array_uuid = 'some_array'

cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    (some_array_uuid, pickle.dumps(some_array))
)

Querying from the database:

uuid = 'some_array'
cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid=%s
    """,
    (uuid,)
)
some_array = pickle.loads(cursor.fetchone()[0])
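As a quick sanity check, you can confirm the round trip preserved shape and dtype (the values below assume the insert from earlier):

# Expect the shape and dtype of the original array
print(some_array.shape, some_array.dtype)  # (1500, 550) float64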

Performance?

If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost.

I haven't benchmarked this approach in any way, so I can't confirm or refute this...

Disk Space?

My guess is that this approach takes about as much disk space as dumping the arrays to files with np.save('some_array.npy', some_array). If this is an issue, consider compressing the bytes before insertion.
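For example, a sketch using zlib from the standard library (the 'some_array_zlib' key is just an illustration; gzip or bz2 would work the same way):

import pickle
import zlib

compressed_bytes = zlib.compress(pickle.dumps(some_array))
cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    ('some_array_zlib', compressed_bytes)
)

# When reading back, decompress before unpickling:
# some_array = pickle.loads(zlib.decompress(bytes(cursor.fetchone()[0])))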

Vlad answered Oct 17 '22