Our team uses software that is heavily reliant on dumping NumPy data into files, which slows our code quite a lot. If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost.
Other performant methods of storing NumPy arrays in any database or searchable database-like structure are welcome, but PostgreSQL would be preferred.
My question is very similar to one asked previously. However, I am looking for a more robust and performant answer, and I wish to store arbitrary NumPy arrays.
PostgreSQL arrays can be used to denormalize data and avoid lookup tables. A good rule of thumb for using them this way is that you mostly work with the array as a whole, even if you occasionally search for individual elements. Heavier per-element processing is more cumbersome with an array column than with a lookup table.
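If your arrays are one-dimensional and numeric, a native array column is one option. Below is a minimal sketch assuming a hypothetical vectors table; the table name and sample data are invented for the example:

import numpy as np
import psycopg2

# Placeholder connection parameters -- substitute your own
conn = psycopg2.connect(dbname='<YOUR_DBNAME>', user='<YOUR_USRNAME>',
                        password='<YOUR_PWD>', host='<HOST>', port='<PORT>')
cur = conn.cursor()

# A table with a native DOUBLE PRECISION[] column (hypothetical name 'vectors')
cur.execute(
    "CREATE TABLE IF NOT EXISTS vectors (id TEXT PRIMARY KEY, data DOUBLE PRECISION[])"
)

# psycopg2 adapts a Python list to a PostgreSQL array
arr = np.random.rand(10)
cur.execute("INSERT INTO vectors (id, data) VALUES (%s, %s)", ('v1', arr.tolist()))
conn.commit()

# Reading it back yields a Python list, which converts straight to an ndarray
cur.execute("SELECT data FROM vectors WHERE id = %s", ('v1',))
restored = np.array(cur.fetchone()[0])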
Establishing a connection using Python: you can create new connections with the connect() function. It accepts the basic connection parameters, such as dbname, user, password, host, and port, and returns a connection object. Using this function, you can establish a connection to PostgreSQL.
Set autocommit to false and create a cursor object. Then build a list of the data to be inserted into the table, loop through the list inserting the values, and finally commit and close the connection.
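A rough sketch of that sequence, assuming a reachable PostgreSQL instance and a hypothetical points table; the table name and sample rows are invented for the example:

import psycopg2

# Placeholder connection parameters -- substitute your own
connection = psycopg2.connect(dbname='<YOUR_DBNAME>', user='<YOUR_USRNAME>',
                              password='<YOUR_PWD>', host='<HOST>', port='<PORT>')
connection.autocommit = False  # explicit transaction control
cursor = connection.cursor()

cursor.execute(
    "CREATE TABLE IF NOT EXISTS points (id INTEGER PRIMARY KEY, value DOUBLE PRECISION)"
)

rows = [(1, 0.5), (2, 1.25), (3, 2.0)]  # data to be inserted

# Loop through the list and insert the values
for row in rows:
    cursor.execute("INSERT INTO points (id, value) VALUES (%s, %s)", row)

# Commit and close the connection
connection.commit()
cursor.close()
connection.close()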
You can use the np alias to create an ndarray from a list using the array() method. The list is passed to the array() method, which returns a NumPy array with the same elements.
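For example:

import numpy as np

values = [1.5, 2.0, 3.25]    # plain Python list
arr = np.array(values)       # ndarray with the same elements
print(arr.shape, arr.dtype)  # (3,) float64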
Not sure if this is what you are after, but assuming you have read/write access to an existing postgres DB:
import numpy as np
import psycopg2 as psy
import pickle

# Connection parameters -- fill in your own credentials
db_connect_kwargs = {
    'dbname': '<YOUR_DBNAME>',
    'user': '<YOUR_USRNAME>',
    'password': '<YOUR_PWD>',
    'host': '<HOST>',
    'port': '<PORT>'
}

connection = psy.connect(**db_connect_kwargs)
connection.set_session(autocommit=True)
cursor = connection.cursor()

# One row per array: a unique identifier plus the serialized array as bytea
cursor.execute(
    """
    DROP TABLE IF EXISTS numpy_arrays;
    CREATE TABLE numpy_arrays (
        uuid VARCHAR PRIMARY KEY,
        np_array_bytes BYTEA
    )
    """
)
The gist of this approach is to store any NumPy array (of arbitrary shape and data type) as a row in the numpy_arrays table, where uuid is a unique identifier that lets you retrieve the array later. The actual array would be saved in the np_array_bytes column as bytes.
Inserting into the database:
# Serialize the array with pickle and insert it under a chosen identifier
some_array = np.random.rand(1500, 550)
some_array_uuid = 'some_array'

cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    (some_array_uuid, pickle.dumps(some_array))
)
Querying from the database:
uuid = 'some_array'

cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid=%s
    """,
    (uuid,)
)
# fetchone() returns a one-element tuple holding the bytea payload
some_array = pickle.loads(cursor.fetchone()[0])
Performance?
If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost.
I haven't benchmarked this approach in any way, so I can't confirm nor refute this...
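One way to get a feel for it is a rough timing sketch like the one below, reusing the connection, cursor, and numpy_arrays table defined above; the array size and iteration count are arbitrary:

import time

arr = np.random.rand(1500, 550)
n_runs = 10

# Time serializing + dumping to a .npy file
start = time.perf_counter()
for i in range(n_runs):
    np.save('bench_array.npy', arr)
file_time = time.perf_counter() - start

# Time serializing + inserting into PostgreSQL (autocommit is on)
start = time.perf_counter()
for i in range(n_runs):
    cursor.execute(
        "INSERT INTO numpy_arrays(uuid, np_array_bytes) VALUES (%s, %s)",
        ('bench_%d' % i, pickle.dumps(arr))
    )
db_time = time.perf_counter() - start

print('np.save: %.3fs, INSERT: %.3fs' % (file_time, db_time))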
Disk Space?
My guess is that this approach takes as much disk space as dumping the arrays to a file using np.save('some_array.npy', some_array). If this is an issue, consider compressing the bytes before insertion.
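For example, a minimal sketch layering zlib on top of the pickled bytes, using the same table and cursor as above:

import zlib

arr = np.random.rand(1500, 550)

# Compress the pickled bytes before inserting them
cursor.execute(
    "INSERT INTO numpy_arrays(uuid, np_array_bytes) VALUES (%s, %s)",
    ('compressed_array', zlib.compress(pickle.dumps(arr)))
)

# Decompress after reading the row back
cursor.execute(
    "SELECT np_array_bytes FROM numpy_arrays WHERE uuid=%s",
    ('compressed_array',)
)
restored = pickle.loads(zlib.decompress(bytes(cursor.fetchone()[0])))

How much this saves depends on the data: random arrays like the example barely compress, while arrays with structure or repeated values can shrink considerably.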