 

Processing a large amount of data in parallel

I'm a Python developer with pretty good RDBMS experience. I need to process a fairly large amount of data (approx. 500 GB). The data is sitting in approximately 1200 CSV files in S3 buckets. I have written a script in Python and can run it on a server. However, it is way too slow. Based on the current speed and the amount of data, it will take approximately 50 days to get through all of the files (and of course, the deadline is WELL before that).

Note: the processing is sort of your basic ETL type of stuff - nothing terribly fancy. I could easily just pump it into a temp schema in PostgreSQL and then run scripts on top of it. But, again, from my initial testing, this would be way too slow.

Note: A brand new PostgreSQL 9.1 database will be its final destination.

So, I was thinking about spinning up a bunch of EC2 instances and running the files through them in batches (in parallel). But I have never done something like this before, so I've been looking around for ideas, etc.

Again, I'm a Python developer, so it seems like Fabric + boto might be promising. I have used boto from time to time, but I have no experience with Fabric.
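
For example, here is roughly the kind of boto (2.x) call I was imagining for launching the workers - the AMI ID, key pair, and instance type are just placeholders:

```python
# Rough sketch with boto 2.x: launch a batch of worker instances.
# The AMI ID, key pair, and instance type are placeholders.
import boto.ec2

def launch_workers(num_workers, ami_id='ami-xxxxxxxx'):
    conn = boto.ec2.connect_to_region('us-east-1')
    reservation = conn.run_instances(
        ami_id,
        min_count=num_workers,
        max_count=num_workers,
        instance_type='m1.large',
        key_name='my-keypair',
    )
    return reservation.instances
```

Fabric would then (I assume) handle pushing the processing script to each instance and kicking it off.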

I know from reading/research that this is probably a great job for Hadoop, but I don't know it and can't afford to hire it done, and the timeline doesn't allow for a learning curve or hiring someone. I should also note that it's kind of a one-time deal, so I don't need to build a really elegant solution. I just need it to work and to get through all of the data by the end of the year.

Also, I know this is not a simple Stack Overflow kind of question (something like "how can I reverse a list in Python"). But what I'm hoping for is someone to read this and say, "I do something similar and use XYZ... it's great!"

I guess what I'm asking is: does anybody know of anything out there that I could use to accomplish this task, given that I'm a Python developer who doesn't know Hadoop or Java, and that the tight timeline prevents me from learning a new technology like Hadoop or a new language?

Thanks for reading. I look forward to any suggestions.

asked Dec 22 '12 by David S

3 Answers

Did you do some performance measurements? Where are the bottlenecks? Is the job CPU bound, IO bound, or DB bound?

If it is CPU bound, you can try a Python JIT like PyPy.

If it is IO bound, you need more disks (and some striping, e.g. an md RAID, across them).

If it is DB bound, you can try dropping all the indexes and keys first.

Last week I imported the OpenStreetMap DB into a Postgres instance on my server. The input data were about 450 GB. The preprocessing (which was done in Java in that case) just created raw data files that could be imported with Postgres's COPY command. After importing, the keys and indices were generated.

Importing all the raw data took about one day - and then it took several days to build keys and indices.
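
From Python, the same drop-indexes / bulk-COPY / rebuild-indexes pattern might look roughly like this with psycopg2 (the connection string, table, index, and file names are invented for illustration):

```python
# Hedged sketch with psycopg2: drop indexes, bulk-load each prepared CSV
# with COPY, then rebuild the indexes.  The DSN, table, index, and file
# names are made up for illustration.
import psycopg2

conn = psycopg2.connect("dbname=etl_target user=loader")
cur = conn.cursor()

cur.execute("DROP INDEX IF EXISTS my_table_col_idx;")

with open("chunk_0001.csv") as f:
    # COPY ... FROM STDIN streams the file through the client connection
    cur.copy_expert("COPY my_table FROM STDIN WITH CSV", f)

cur.execute("CREATE INDEX my_table_col_idx ON my_table (some_col);")
conn.commit()
cur.close()
conn.close()
```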

answered Nov 14 '22 by Andreas Florath


I often use a combination of SQS/S3/EC2 for this type of batch work. Queue up messages in SQS for all of the work that needs to be performed (chunked into reasonably small pieces). Spin up N EC2 instances that are configured to read messages from SQS, perform the work, put the results into S3, and then, and only then, delete the message from SQS.

You can scale this to crazy levels and it has always worked really well for me. In your case, I don't know if you would store results in S3 or go right to PostgreSQL.
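
A minimal sketch of that worker loop with boto 2.x might look like the following; the queue and bucket names and the process_file() helper are placeholders for your own ETL step:

```python
# Rough sketch (boto 2.x) of the worker loop: read a message naming an S3
# key, process the file, upload the result, and only then delete the
# message.  Queue/bucket names and process_file() are placeholders.
import boto
import boto.sqs
from boto.s3.key import Key

sqs = boto.sqs.connect_to_region('us-east-1')
s3 = boto.connect_s3()

queue = sqs.get_queue('etl-work-queue')
in_bucket = s3.get_bucket('my-input-bucket')
out_bucket = s3.get_bucket('my-output-bucket')

while True:
    msgs = queue.get_messages(num_messages=1, visibility_timeout=3600)
    if not msgs:
        break                                    # queue drained, worker exits
    msg = msgs[0]
    key_name = msg.get_body()                    # each message names one CSV file
    in_bucket.get_key(key_name).get_contents_to_filename('/tmp/input.csv')

    process_file('/tmp/input.csv', '/tmp/output.csv')   # your ETL step (hypothetical)

    out_key = Key(out_bucket, key_name + '.out')
    out_key.set_contents_from_filename('/tmp/output.csv')

    queue.delete_message(msg)                    # only after the result is safely in S3
```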

answered Nov 14 '22 by garnaat


I did something like this some time ago, and my setup was like this:

  • one multicore instance (x-large or more) that converts the raw source files (XML/CSV) into an intermediate format. You can run (num-of-cores) copies of the converter script on it in parallel (see the sketch after this list). Since my target was Mongo, I used JSON as the intermediate format; in your case it will be SQL.

  • this instance has N volumes attached to it. Once a volume becomes full, it gets detached and attached to the second instance (via boto).

  • the second instance runs a DBMS server and a script that imports the prepared (SQL) data into the db. I don't know anything about Postgres, but I guess it has a bulk-import tool, just like MySQL and Mongo (mongoimport) do. If so, use that to make bulk inserts instead of issuing queries from a Python script.
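
As a rough illustration of the first bullet, the per-core conversion step could be driven with multiprocessing along these lines; convert_one() and the paths are placeholders, not the actual converter:

```python
# Minimal sketch of "run num-of-cores copies of the converter" using
# multiprocessing.  convert_one() stands in for whatever turns one raw CSV
# into a bulk-loadable SQL/COPY file; the paths are placeholders.
import glob
import multiprocessing

def convert_one(path):
    out_path = path + '.sql'
    # ... parse the raw file at `path` and write bulk-loadable output ...
    return out_path

if __name__ == '__main__':
    files = glob.glob('/data/raw/*.csv')
    pool = multiprocessing.Pool(multiprocessing.cpu_count())   # one worker per core
    for out_path in pool.imap_unordered(convert_one, files):
        print(out_path)
```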

answered Nov 14 '22 by georg