How to efficiently utilize 10+ computers to import data

We have flat files (CSV) with more than 200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the import process on a single computer and it takes around 15 hours. As this takes too long, we want to use something like 40 computers to do the importing.

My question

How can we efficiently utilize the 40 computers to do the importing? The main worry is that a lot of time will be spent replicating the dimension tables across all the nodes, since they need to be identical on every node. This could mean that if we used 1,000 servers to do the importing in the future, it might actually be slower than using a single one, due to the extensive network communication and coordination between the servers.

Does anyone have a suggestion?

EDIT:

The following is a simplification of the CSV files:

"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue" 
"bvalue";"evenanothervalue"
"avalue";"anothervalue"

After importing, the tables look like this:

dimension_table1

id  name
1   "avalue"
2   "bvalue"

dimension_table2

id  name
1   "anothervalue"
2   "evenanothervalue"

Fact table

dimension_table1_ID    dimension_table2_ID
1                      1
2                      2
1                      2
1                      2
2                      2
1                      1
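
In code, the mapping is roughly this (a sketch in Python, assuming sequential IDs are assigned in order of first appearance; only the first three sample rows are shown):

    # Sketch of the single-node import: each distinct string gets a
    # sequential dimension ID, and the fact table stores only ID pairs.
    rows = [("avalue", "anothervalue"), ("bvalue", "evenanothervalue"),
            ("avalue", "evenanothervalue")]

    dim1, dim2, facts = {}, {}, []
    for v1, v2 in rows:
        id1 = dim1.setdefault(v1, len(dim1) + 1)   # new value -> next free ID
        id2 = dim2.setdefault(v2, len(dim2) + 1)
        facts.append((id1, id2))

    # dim1  == {'avalue': 1, 'bvalue': 2}
    # dim2  == {'anothervalue': 1, 'evenanothervalue': 2}
    # facts == [(1, 1), (2, 2), (1, 2)]
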
asked Apr 12 '11 by Rohita Khatiwada


2 Answers

You could consider using a 64-bit hash function to produce a BIGINT ID for each string, instead of using sequential IDs.

With 64-bit hash codes, you can store about 2^25 (over 30 million) items in your hash table before there is a 0.0031% chance of a collision (by the birthday bound, the collision probability for n items is roughly n²/2^65).

This would allow you to have identical IDs on all nodes, with no communication whatsoever between servers between the 'dispatch' and the 'merge' phases.

You could even increase the number of bits to further lower the chance of collision; the drawback is that the resulting hash would no longer fit in a 64-bit integer database field.
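
A minimal sketch of the idea in Python, using FNV-1a as the 64-bit hash (any of the well-distributed hashes linked below would do; this is an illustration, not a tuned implementation):

    # Derive a stable 64-bit dimension ID directly from the value.
    # Every node hashing the same string gets the same ID, so no
    # coordination between servers is needed.
    FNV_OFFSET = 0xcbf29ce484222325   # FNV-1a 64-bit offset basis
    FNV_PRIME = 0x100000001b3         # FNV-1a 64-bit prime

    def dim_id(value: str) -> int:
        h = FNV_OFFSET
        for byte in value.encode("utf-8"):
            h ^= byte
            h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
        return h - 2**63              # shift into signed BIGINT range

    # Each node can emit fact rows independently:
    fact_row = (dim_id("avalue"), dim_id("anothervalue"))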

See:

http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash

http://code.google.com/p/smhasher/wiki/MurmurHash

http://www.partow.net/programming/hashfunctions/index.html

answered Sep 27 '22 by magma

Loading CSV data into a database is slow because it needs to read, split and validate the data.

So what you should try is this:

  1. Set up a local database on each computer. This will get rid of the network latency.

  2. Load a different part of the data on each computer. Try to give each computer an equally sized chunk. If that isn't easy for some reason, give each computer, say, 10,000 rows; when a computer is done, give it the next chunk (see the splitting sketch just after this list).

  3. Dump the data with the DB tools

  4. Load all dumps into a single DB
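
A minimal sketch of the splitting in step 2, in Python, assuming the workers pull fixed-size chunks from a shared input file (the file name and chunk size are made up):

    import itertools

    CHUNK_ROWS = 10_000   # assumed chunk size, as suggested above

    def chunks(path, chunk_rows=CHUNK_ROWS):
        """Yield the CSV file in fixed-size chunks of lines."""
        with open(path) as f:
            while True:
                chunk = list(itertools.islice(f, chunk_rows))
                if not chunk:
                    return
                yield chunk

    for i, chunk in enumerate(chunks("facts_raw.csv")):
        # Hand chunk i to worker i % 40, which bulk-loads it into its
        # local database (e.g. with LOAD DATA INFILE or COPY).
        pass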

Make sure that your loader tool can import data into a table which already contains data. If it can't, check your DB documentation for "remote tables". A lot of databases allow you to make a table from another DB server visible locally.

That allows you to run commands like:

    insert into TABLE (....) select .... from REMOTE_SERVER.TABLE

If you need primary keys (and you should), you will also have the problem of assigning PKs during the import into the local DBs. I suggest adding the PKs to the CSV file.
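
For example, a running PK can be prepended to every row before the file is split (a sketch; the file names are made up):

    # Prepend a running primary key to every row, so the per-machine
    # imports never have to coordinate ID ranges.
    with open("facts_raw.csv") as src, open("facts_with_pk.csv", "w") as dst:
        for pk, line in enumerate(src, start=1):
            dst.write(f"{pk};{line}")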

[EDIT] After reading your edit, here is what you should try:

  1. Write a small program which extracts the unique values in the first and second columns of the CSV file. That could be a simple script like this (shown here for the first column):

     # emit "id;value" pairs for the distinct values of column 1
     cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'


    This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.

  2. Write a program which reads the new ID-value files, caches them in memory, then reads the huge CSV files and replaces the values with the IDs (see the sketch at the end of this answer).

    If the ID-value files are too big, just do this step for the small files and load the huge ones into all 40 per-machine DBs.

  3. Split the huge file into 40 chunks and load one chunk on each machine.

    If you had huge ID-value files, you can use the tables created on each machine to replace the remaining values.

  4. Use backup/restore or remote tables to merge the results.

    Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.

See here for an introduction.
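
A minimal sketch of steps 1 and 2 in Python (the delimiter follows the sample data above; the file names are made up for illustration):

    import csv

    # Step 1 output: one "id;value" file per dimension column, e.g. as
    # produced by the cut | sort -u | nawk pipeline above.
    def load_id_map(path):
        with open(path, newline="") as f:
            return {value: int(id_) for id_, value in csv.reader(f, delimiter=";")}

    # Step 2: stream the huge CSV and replace each value with its dimension ID.
    def replace_values(in_path, out_path, id_maps):
        with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
            reader = csv.reader(src, delimiter=";")
            writer = csv.writer(dst, delimiter=";")
            for row in reader:
                writer.writerow(id_maps[i][value] for i, value in enumerate(row))

    id_maps = [load_id_map("dim1_ids.csv"), load_id_map("dim2_ids.csv")]
    replace_values("facts_raw.csv", "facts_with_ids.csv", id_maps)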

answered Sep 27 '22 by Aaron Digulla