Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Join all PostgreSQL tables and make a Python dictionary

I need to join all PostgreSQL tables and convert them in a Python dictionary. There are 72 tables in the database. The total number of columns is greater than 1600.

I wrote a simple Python script that joins several tables but fails to join all of them due to the memory error. All memory is occupied during the script execution. And I run the script on a new virtual server with 128GB RAM and 8 CPU. It fails during the lambda function execution.

How could the following code be improved to execute all tables join?

from sqlalchemy import create_engine
import pandas as pd

auth = 'user:pass'
engine = create_engine('postgresql://' + auth + '@host.com:5432/db')

sql_tables = ['table0', 'table1', 'table3', ..., 'table72']        
df_arr = []
[df_arr.append(pd.read_sql_query('select * from "' + table + '"', con=engine)) for table in sql_tables]

df_join = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['USER_ID']), df_arr)
raw_dict = pd.DataFrame.to_dict(df_join.where((pd.notnull(df_join)), 'no_data'))

print(df_join)
print(raw_dict)
print(len(df_arr))

Is it ok to use Pandas for my purpose? Are there better solutions?

The ultimate goal is to denormalize DB data to be able to index it into Elasticsearch as documents, one document per user.

like image 306
srgbnd Avatar asked Oct 29 '22 20:10

srgbnd


1 Answers

Why don't you create a postgres function instead of script?

Here are some advises that could help you to avoid the memory error:

  • You can use WITH clause which makes better use of your memory.
  • You can create some physical tables for storing the information of different groups of tables of your database. These physical tables will avoid to use a great amount of memory. After that, all you have to do is joining only those physical tables. You can create a function for it.
  • You can create a Data Warehouse by denormalizing the tables you need.
  • Last but not least: Make sure you are using Indexes appropriately.
like image 93
Luis Teijon Avatar answered Nov 15 '22 05:11

Luis Teijon