
Best practices for iterating over MASSIVE CSV files in PHP

Tags: php, csv

Ok, I'll try and keep this short, sweet and to-the-point.

We do massive GeoIP updates to our system by uploading a MASSIVE CSV file to our PHP-based CMS. This thing usually has more than 100k records of IP address information. Now, doing a simple import of this data isn't an issue at all, but we have to run checks against our current regional IP address mappings.

This means that we must validate the data, compare and split overlapping IP addresses, etc. And these checks must be made for each and every record.

Not only that, but I've just created a field mapping solution that would allow other vendors to implement their GeoIP updates in different formats. This is done by applying rules to the IP records within the CSV update.

For instance a rule might look like:

if 'countryName' == 'Australia' then send to the 'Australian IP Pool'

There might be multiple rules that have to be run and each IP record must apply them all. For instance, 100k records to check against 10 rules would be 1 million iterations; not fun.
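To give a rough idea, a stripped-down sketch of that per-record rule loop might look something like this in PHP (the field names, pool label and file name here are just placeholders, not our actual code):

    // Rough sketch only: each rule is a closure run against one parsed CSV row.
    // Field names, the pool label and the file name are placeholders.
    $rules = [
        function (array $row) {
            return $row['countryName'] === 'Australia' ? 'Australian IP Pool' : null;
        },
        // ... more vendor-specific rules ...
    ];

    $handle = fopen('geoip-update.csv', 'r');
    $header = fgetcsv($handle);

    while (($fields = fgetcsv($handle)) !== false) {
        $row = array_combine($header, $fields);
        foreach ($rules as $rule) {              // every rule runs for every record
            if (($pool = $rule($row)) !== null) {
                // queue $row for $pool
            }
        }
    }
    fclose($handle);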

We're finding that running 2 rules against 100k records takes up to 10 minutes to process. I'm fully aware that the bottleneck here is the sheer number of iterations that must occur for a successful import; I'm just not fully aware of any other options we may have to speed things up a bit.

Someone recommended splitting the file into chunks, server-side. I don't think this is a viable solution as it adds yet another layer of complexity to an already complex system. The file would have to be opened, parsed and split. Then the script would have to iterate over the chunks as well.

So, question is, considering what I just wrote, what would the BEST method be to speed this process up a bit? Upgrading the server's hardware JUST for this tool isn't an option unfortunately, but they're pretty high-end boxes to begin with.

Not as short as I thought, but yeah. Halps? :(

asked May 11 '09 by Wilhelm Murdoch


2 Answers

Perform a BULK IMPORT into a database (SQL Server is what I use). The BULK IMPORT literally takes seconds, and 100,000 records is peanuts for a database to crunch business rules on. I regularly perform similar data crunches on a table with over 4 million rows and it doesn't take the 10 minutes you listed.

EDIT: I should point out, yeah, I don't recommend PHP for this. You're dealing with raw DATA, so use a DATABASE. :P
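If you're tied to the PHP/MySQL stack, the MySQL counterpart of BULK INSERT is LOAD DATA. A rough sketch of the same idea (table name, columns and file path are made up for illustration): load the raw file into a staging table, then do the validation, overlap checks and routing rules as set-based SQL instead of row-by-row in PHP.

    // Sketch only: bulk-load the CSV into a staging table via MySQL's LOAD DATA.
    // Table name, columns and file path are assumptions.
    $pdo = new PDO('mysql:host=localhost;dbname=geoip', 'user', 'pass', [
        PDO::MYSQL_ATTR_LOCAL_INFILE => true,
    ]);

    $pdo->exec("
        LOAD DATA LOCAL INFILE '/tmp/geoip-update.csv'
        INTO TABLE geoip_staging
        FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
        LINES TERMINATED BY '\n'
        IGNORE 1 LINES
        (ip_start, ip_end, country_name)
    ");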

answered Nov 12 '22 by Some Canuck


The simple key to this is keeping as much work out of the inner loop as possible.

Simply put, anything you do in the inner loop is done "100K times", so doing nothing is best (but certainly not practical), and doing as little as possible is the next best bet.

If you have the memory, for example, and it's practical for the application, defer any "output" until after the main processing. Cache any input data if practical as well. This works best for summary data or occasional data.

Ideally, save for the reading of the CSV file, do as little I/O as possible during the main processing.

Does PHP offer any access to the Unix mmap facility? That is typically the fastest way to read files, particularly large ones.
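As far as I know, core PHP doesn't expose mmap directly, but you can at least keep the file handling isolated from the processing by streaming rows through a generator. A rough sketch (the file name is a placeholder):

    // Sketch: stream the CSV with a generator so the file handling stays out of
    // the rule-processing loop (the file name is a placeholder).
    function readCsv(string $path): Generator
    {
        $handle = fopen($path, 'r');
        $header = fgetcsv($handle);

        while (($fields = fgetcsv($handle)) !== false) {
            yield array_combine($header, $fields);
        }
        fclose($handle);
    }

    foreach (readCsv('geoip-update.csv') as $row) {
        // apply all the rules to $row here; buffer output and write it in batches
    }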

Another consideration is to batch your inserts. For example, it's straightforward to build up your INSERT statements as simple strings, and ship them to the server in blocks of 10, 50, or 100 rows. Most databases have some hard limit on the size of the SQL statement (like 64K, or something), so you'll need to keep that in mind. This will dramatically reduce your round trips to the DB.
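For illustration, batching with PDO placeholders might look roughly like this (the table, columns, $pdo connection and $processedRows array are assumptions; building the statements as plain strings, as described above, works the same way):

    // Sketch of batched multi-row INSERTs via PDO; table, columns, $pdo and
    // $processedRows are assumptions for illustration.
    $batchSize = 100;
    $batch = [];

    $flush = function (array $rows) use ($pdo) {
        if ($rows === []) {
            return;
        }
        $placeholders = implode(',', array_fill(0, count($rows), '(?, ?, ?)'));
        $stmt = $pdo->prepare(
            "INSERT INTO geoip (ip_start, ip_end, pool) VALUES $placeholders"
        );
        $stmt->execute(array_merge(...$rows));   // flatten the row arrays into one value list
    };

    foreach ($processedRows as $row) {           // $row = [ip_start, ip_end, pool]
        $batch[] = $row;
        if (count($batch) === $batchSize) {
            $flush($batch);
            $batch = [];
        }
    }
    $flush($batch);                              // don't forget the final partial batch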

If you're creating primary keys through simple increments, do that en masse (blocks of 1000, 10000, whatever). This is another thing you can remove from your inner loop.
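A rough sketch of what reserving keys outside the loop could look like (table and column names are made up, and this only holds if nothing else is inserting into the table at the same time):

    // Sketch: fetch the current max once and hand out IDs with a plain counter,
    // instead of a per-row database round trip. Table/column names are made up,
    // and this assumes no concurrent inserts.
    $nextId = (int) $pdo->query('SELECT COALESCE(MAX(id), 0) FROM geoip')->fetchColumn();

    foreach ($processedRows as &$row) {
        $row['id'] = ++$nextId;   // cheap in-memory increment inside the loop
    }
    unset($row);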

And, for sure, you should be processing all of the rules at once for each row, rather than running the records through once for each rule.

answered Nov 12 '22 by Will Hartung