 

PostgreSQL: Loading data into Star Schema efficiently

Imagine a table with the following structure on PostgreSQL 9.0:

create table raw_fact_table (text varchar(1000));

For the sake of simplicity I only mention one text column; in reality it has a dozen. This table has 10 billion rows and each column has lots of duplicates. The table is created from a flat file (CSV) using COPY FROM.

To increase performance I want to convert to the following star schema structure:

create table dimension_table (id int, text varchar(1000));

The raw fact table would then be replaced with a fact table like the following:

create table fact_table (dimension_table_id int);

My current method is to essentially run the following query to create the dimension table:

create table dimension_table (id serial, text varchar(1000), primary key(id));

Then, to fill the dimension table, I use:

insert into dimension_table (text) select text from raw_fact_table group by text;

Afterwards I need to run the following query:

select dimension_table.id as dimension_table_id into fact_table from dimension_table inner join raw_fact_table on (dimension_table.text = raw_fact_table.text);

Just imagine the horrible performance I get by comparing all strings to all other strings several times.

On MySQL I could run a stored procedure during the COPY FROM. It could compute a hash of each string, and all subsequent string comparisons would be done on the hash instead of the long raw string. This does not seem to be possible on PostgreSQL; what do I do then?
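
Conceptually, this is the kind of thing I am after (just a sketch in plain SQL; md5 is only an illustration, and the hashed table and its index are not part of my current schema):

-- hypothetical variant of the dimension table carrying a stored hash
create table hashed_dimension_table (
    id serial,
    text varchar(1000),
    text_md5 char(32),  -- hash of "text"; lookups compare this instead of the 1000-character string
    primary key(id)
);

create index hashed_dimension_table_md5_idx on hashed_dimension_table (text_md5);

-- a lookup is then driven by the short hash rather than the raw string
select id from hashed_dimension_table where text_md5 = md5('lots and lots of text');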

Sample data would be a CSV file containing something like this (I also put quotes around integers and doubles):

"lots and lots of text";"3";"1";"2.4";"lots of text";"blabla"
"sometext";"30";"10";"1.0";"lots of text";"blabla"
"somemoretext";"30";"10";"1.0";"lots of text";"fooooooo"
David, asked Jan 03 '11


2 Answers

Just imagine the horrible performance I get by comparing all strings to all other strings several times.

When you've been doing this a while, you stop imagining performance, and you start measuring it. "Premature optimization is the root of all evil."

What does "billion" mean to you? To me, in the USA, it means 1,000,000,000 (or 1e9). If that's also true for you, you're probably looking at between 1 and 7 terabytes of data.

My current method is to essentially run the following query to create the dimension table:

create table dimension_table (id serial, text varchar(1000), primary key(id));

How are you gonna fit 10 billion rows into a table that uses an integer for a primary key? Let's even say that half the rows are duplicates. How does that arithmetic work when you do it?
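
For what it's worth, a PostgreSQL integer is 4 bytes and tops out at 2,147,483,647; even if half of your 10 billion rows are duplicates, 5 billion distinct values don't fit. A sketch of what the key column would have to look like instead (purely illustrative):

-- illustrative only: 5 billion distinct values overflow a 4-byte int,
-- so the surrogate key needs to be 8 bytes
create table dimension_table (id bigserial, text varchar(1000), primary key(id));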

Don't imagine. Read first. Then test.

Read Data Warehousing with PostgreSQL. I suspect these presentation slides will give you some ideas.

Also read Populating a Database, and consider which suggestions to implement.

Test with a million (1e6) rows, following a "divide and conquer" process. That is, don't try to load a million at a time; write a procedure that breaks it up into smaller chunks. Run

EXPLAIN <sql statement>
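
As a very rough sketch of the "divide and conquer" idea (everything here is made up: it assumes the raw rows were staged with a serial row number, and that dimension_table keeps a serial or bigserial key so the id fills itself in):

-- assumption: the raw rows were loaded into a staging table with a row number
create table staging (row_id bigserial primary key, text varchar(1000));

-- fill the dimension table in chunks of 100,000 staging rows at a time
do $$
declare
    lo   bigint := 1;
    hi   bigint;
    step constant int := 100000;
begin
    select max(row_id) into hi from staging;
    while lo <= hi loop
        insert into dimension_table (text)
        select distinct s.text
        from staging s
        where s.row_id between lo and lo + step - 1
          and not exists (select 1 from dimension_table d where d.text = s.text);
        lo := lo + step;
    end loop;
end $$;

EXPLAIN the inner INSERT ... SELECT at your test sizes before letting anything loose on the full data set.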

You've said you estimate at least 99% duplicate rows. Broadly speaking, there are two ways to get rid of the dupes:

  1. Inside a database, not necessarily the same platform you use for production.
  2. Outside a database, in the filesystem, not necessarily the same filesystem you use for production.

If you still have the text files that you loaded, I'd consider first trying outside the database. This awk one-liner will output unique lines from each file. It's relatively economical, in that it makes only one pass over the data.

awk '!arr[$0]++' file_with_dupes > file_without_dupes

If you really have 99% dupes, by the end of this process you should have reduced your 1 to 7 terabytes down to about 50 gigs. And, having done that, you can also number each unique line and create a tab-delimited file before copying it into the data warehouse. That's another one-liner:

awk '{printf("%d\t%s\n", NR, $0);}' file_without_dupes > tab_delimited_file
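
From there, getting the numbered file into the dimension table is a single COPY; something like this, assuming the id/text layout from the question (the path is a placeholder):

-- text format is tab-delimited by default; the path is a placeholder
-- note: copying explicit ids does not advance a serial column's sequence
copy dimension_table (id, text) from '/path/to/tab_delimited_file';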

If you have to do this under Windows, I'd use Cygwin.

If you have to do this in a database, I'd try to avoid using your production database or your production server. But maybe I'm being too cautious. Moving several terabytes around is an expensive thing to do.

But I'd test

SELECT DISTINCT ...

before using GROUP BY. I might be able to do some tests on a large data set for you, but probably not this week. (I don't usually work with terabyte-sized files. It's kind of interesting. If you can wait.)
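
Concretely, the comparison on the test set is just the two forms of the same dedup query, written here against the simplified one-column table from the question:

explain analyze select distinct text from raw_fact_table;

explain analyze select text from raw_fact_table group by text;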

Mike Sherrill 'Cat Recall', answered Oct 02 '22


Just two questions: is it necessary to convert your data in one or two big steps? And may we modify the table while converting?

Running more, simpler queries may improve your performance (and reduce the server load while doing it).

One approach would be:

  1. Generate dimension_table (if I understand correctly, you don't have performance problems with this), maybe with an additional temporary boolean field...
  2. Repeat: choose one previously unprocessed entry from dimension_table, select every row from raw_fact_table containing it, and insert them into fact_table. Mark the dimension_table record as done, and move on to the next. You can write this as a stored procedure that converts your data in the background, eating minimal resources; see the sketch after this list.
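
A rough sketch of that loop as a PL/pgSQL function (it assumes the temporary boolean field is called done and defaults to false, and that fact_table has the dimension_table_id column from the question; all names are made up):

create or replace function convert_one_dimension_value() returns boolean as $$
declare
    dim record;
begin
    -- pick one dimension value that has not been converted yet
    select id, text into dim
    from dimension_table
    where not done
    limit 1;

    if not found then
        return false;  -- nothing left to convert
    end if;

    -- copy every matching raw row into the fact table
    insert into fact_table (dimension_table_id)
    select dim.id
    from raw_fact_table r
    where r.text = dim.text;

    update dimension_table set done = true where id = dim.id;
    return true;
end;
$$ language plpgsql;

Call it in a loop (or from a scheduled job) until it returns false.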

Or another (probably better):

  1. create fact_table with EVERY record from raw_fact_table AND one dimension_id (so it has both dimension_text and dimension_id columns)
  2. create dimension_table
  3. create a before insert trigger for fact_table which:
    • looks up dimension_text in dimension_table
    • if not found, creates a new record in dimension_table
    • sets dimension_id on the incoming row to this id
    (see the trigger sketch after this list)
  4. in a simple loop, insert every record from raw_fact_table into fact_table
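
A minimal sketch of that trigger in PL/pgSQL (the column names dimension_text and dimension_id come from the steps above; the table layouts and the serial key on dimension_table are my assumptions):

-- assumed layouts for this approach
create table dimension_table (id serial primary key, text varchar(1000));
create table fact_table (dimension_text varchar(1000), dimension_id int);

create or replace function fact_table_fill_dimension() returns trigger as $$
declare
    dim_id int;
begin
    select id into dim_id from dimension_table where text = new.dimension_text;
    if dim_id is null then
        insert into dimension_table (text)
        values (new.dimension_text)
        returning id into dim_id;
    end if;
    new.dimension_id := dim_id;
    return new;
end;
$$ language plpgsql;

create trigger fact_table_fill_dimension
before insert on fact_table
for each row execute procedure fact_table_fill_dimension();

Step 4 is then just: insert into fact_table (dimension_text) select text from raw_fact_table;
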
Dutow, answered Oct 02 '22