I need to join all of the PostgreSQL tables and convert the result into a Python dictionary. There are 72 tables in the database, with more than 1600 columns in total.
I wrote a simple Python script that joins several tables, but it fails to join all of them because of a memory error: all available memory gets used up while the script runs, even though I run it on a new virtual server with 128 GB of RAM and 8 CPUs. The failure happens during the execution of the lambda function.
How could the following code be improved so that it joins all of the tables?
from functools import reduce

from sqlalchemy import create_engine
import pandas as pd

auth = 'user:pass'
engine = create_engine('postgresql://' + auth + '@host.com:5432/db')
sql_tables = ['table0', 'table1', 'table3', ..., 'table72']

# Read every table into its own DataFrame
df_arr = [pd.read_sql_query('select * from "' + table + '"', con=engine)
          for table in sql_tables]

# Outer-join all the DataFrames on USER_ID (this is where it runs out of memory)
df_join = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['USER_ID']), df_arr)

# Replace NaN with a placeholder and convert the result to a dictionary
raw_dict = df_join.where(pd.notnull(df_join), 'no_data').to_dict()

print(df_join)
print(raw_dict)
print(len(df_arr))
Is it OK to use pandas for this purpose, or are there better solutions?
The ultimate goal is to denormalize the DB data so that it can be indexed into Elasticsearch as documents, one document per user.
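Roughly, the document I want per user would look something like this (the field names are only placeholders for illustration):

# Illustrative only: one flat document per user, with made-up column names
# and missing values filled with a 'no_data' placeholder
example_doc = {
    'USER_ID': 42,
    'table0_some_column': 'value from table0',
    'table1_other_column': 'no_data',
}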
Why don't you create a Postgres function instead of a script?
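For example, here is a rough sketch of that idea: let Postgres build one JSON object per user and have the script only stream the rows, so the full join never has to fit into the Python process. The table and column names below are placeholders, and the query joins just two tables for brevity; in practice it would cover all 72 tables or be wrapped in a Postgres function or view.

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:pass@host.com:5432/db')

# to_jsonb(row) || to_jsonb(row) merges the joined rows into one JSON object;
# coalesce() keeps the document intact when the LEFT JOIN finds no match
query = text('''
    SELECT t0."USER_ID",
           to_jsonb(t0) || coalesce(to_jsonb(t1), '{}'::jsonb) AS doc
    FROM "table0" t0
    LEFT JOIN "table1" t1 ON t1."USER_ID" = t0."USER_ID"
''')

with engine.connect() as conn:
    result = conn.execution_options(stream_results=True).execute(query)
    for user_id, doc in result:
        # psycopg2 decodes jsonb into a Python dict, so 'doc' can be sent
        # to Elasticsearch as-is
        print(user_id, doc)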
Here is some advice that could help you avoid the memory error:
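For example, one change that already helps is to merge one table at a time and release each source DataFrame as soon as it has been merged, instead of keeping all 72 DataFrames alive in df_arr while reduce() runs. A sketch, reusing the names from the question:

from sqlalchemy import create_engine
import gc
import pandas as pd

engine = create_engine('postgresql://user:pass@host.com:5432/db')
sql_tables = ['table0', 'table1', 'table3']   # ...all 72 tables, as in the question

# Start from the first table and fold the remaining tables in one by one
df_join = pd.read_sql_query('select * from "' + sql_tables[0] + '"', con=engine)
for table in sql_tables[1:]:
    right = pd.read_sql_query('select * from "' + table + '"', con=engine)
    df_join = pd.merge(df_join, right, how='outer', on=['USER_ID'])
    del right        # release the source frame before reading the next table
    gc.collect()

If the final 1600-column frame is still too large to hold at once, the database-side approach above avoids materializing it in Python at all.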