 

PHP: Using fgetcsv on a huge CSV file

Tags: php, fgetcsv

Using fgetcsv, can I somehow do a destructive read where rows I've read and processed would be discarded so if I don't make it through the whole file in the first pass, I can come back and pick up where I left off before the script timed out?

Additional Details:

I'm getting a daily product feed from a vendor that comes across as a 200 MB .gz file. When I unpack the file, it turns into a 1.5 GB .csv with nearly 500,000 rows and 20 to 25 fields. I need to read this information into a MySQL database, ideally with PHP so I can schedule a cron job to run the script at my web hosting provider every day.

I have a hard timeout on the server set to 180 seconds by the hosting provider, and a max memory utilization limit of 128 MB for any single script. These limits cannot be changed by me.

My idea was to grab the information from the .csv using the fgetcsv function, but since I'm expecting to have to take multiple passes at the file because of the 3-minute timeout, I was thinking it would be nice to whittle away at the file as I process it so I wouldn't need to spend cycles skipping over rows that were already processed in a previous pass.
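
For illustration only, here is a minimal sketch of that workflow under one assumption of my own: the feed is saved as feed.csv.gz (a placeholder name). PHP's compress.zlib:// stream wrapper lets fgetcsv read the gzipped feed directly, so the 1.5 GB file never has to be unpacked to disk and only one row is held in memory at a time.

<?php
// Sketch only: "feed.csv.gz" is a placeholder for the vendor's daily file.
// The compress.zlib:// wrapper decompresses on the fly while reading.
$fh = fopen('compress.zlib://feed.csv.gz', 'r');
if ($fh === false) {
    exit(1); // could not open the feed
}

while (($row = fgetcsv($fh)) !== false) {
    // ... insert $row into MySQL here ...
}

fclose($fh);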

asked Oct 22 '13 by Robert82




3 Answers

From your problem description it really sounds like you need to switch hosts. Processing a 2 GB file with a hard time limit is not a very constructive environment. Having said that, deleting read lines from the file is even less constructive, since you would have to rewrite the entire 2 GB to disk minus the part you have already read, which is incredibly expensive.

Assuming you save how many rows you have already processed, you can skip rows like this:

$alreadyProcessed = 42; // for example

$fileHandle = fopen('my.csv', 'r');

$i = 0;
while (($row = fgetcsv($fileHandle)) !== false) {
    if ($i++ < $alreadyProcessed) {
        continue; // skip rows handled in a previous pass
    }

    // ... process $row ...
}

However, this means you're reading the entire 2 GB file from the beginning each time you go through it, which in itself already takes a while and you'll be able to process fewer and fewer rows each time you start again.

The best solution here is to remember the current position of the file pointer, for which ftell is the function you're looking for:

$lastPosition = file_exists('last_position.txt')
    ? (int) file_get_contents('last_position.txt')
    : 0;

$fh = fopen('my.csv', 'r');
fseek($fh, $lastPosition);

while (($row = fgetcsv($fh)) !== false) {
    // ... process $row ...

    file_put_contents('last_position.txt', ftell($fh));
}

This allows you to jump right back to the last position you were at and continue reading. You obviously want to add a lot of error handling here, so you're never in an inconsistent state no matter which point your script is interrupted at.
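
For illustration, here is a minimal sketch of one way to flesh that out, reusing the my.csv and last_position.txt names from above; the 1000-row checkpoint interval and the 170-second budget are arbitrary values chosen to stay under the 180-second hard limit, not requirements.

$startTime    = time();
$timeBudget   = 170;                  // stop shortly before the host's 180 s cap
$positionFile = 'last_position.txt';

$fh = fopen('my.csv', 'r');
if ($fh === false) {
    exit(1); // could not open the feed
}

// Resume from the byte offset saved by the previous run, if any.
if (is_readable($positionFile)) {
    fseek($fh, (int) file_get_contents($positionFile));
}

$rowsSinceCheckpoint = 0;
while (($row = fgetcsv($fh)) !== false) {
    // ... insert $row into MySQL here ...

    // Persist the position every 1000 rows instead of on every single row.
    if (++$rowsSinceCheckpoint >= 1000) {
        file_put_contents($positionFile, (string) ftell($fh));
        $rowsSinceCheckpoint = 0;
    }

    // Stop cleanly before the hard timeout; the next cron run resumes here.
    if (time() - $startTime >= $timeBudget) {
        break;
    }
}

// Final checkpoint so the next run starts exactly where this one stopped.
file_put_contents($positionFile, (string) ftell($fh));
fclose($fh);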

answered by deceze


You can avoid timeouts and memory errors to some extent by reading the file like a stream: read it line by line and insert each line into the database (or process it accordingly). That way only a single line is held in memory on each iteration. Note that you should not try to load the whole huge CSV file into an array; that really would consume a lot of memory.

if(($handle = fopen("yourHugeCSV.csv", 'r')) !== false)
{
    // Get the first row (Header)
    $header = fgetcsv($handle);

    // loop through the file line-by-line
    while(($data = fgetcsv($handle)) !== false)
    {
        // Process Your Data
        unset($data);
    }
    fclose($handle);
}
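
For what it's worth, here is a minimal sketch of what the "Process Your Data" step could look like with a PDO prepared statement; the DSN, credentials, the products table and its sku/name/price columns are placeholders, not anything taken from the question.

// Sketch only: connection details, table and column names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('INSERT INTO products (sku, name, price) VALUES (?, ?, ?)');

if (($handle = fopen("yourHugeCSV.csv", 'r')) !== false)
{
    $header = fgetcsv($handle); // skip the header row

    while (($data = fgetcsv($handle)) !== false)
    {
        // Only the current row is in memory; the prepared statement is reused.
        $stmt->execute([$data[0], $data[1], $data[2]]);
    }
    fclose($handle);
}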
answered by Jenson M John


I think a better solution (it would be phenomenally inefficient to continuously rewind and write to an open file stream) would be to track the file position of each record read (using ftell) and store it with the data you've read; then, if you have to resume, just fseek to the last position.

You could try loading the file directly using MySQL's LOAD DATA INFILE (which will likely be a lot faster), although I've had problems with this in the past and ended up writing my own PHP code.
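
For reference, a hedged sketch of that route using LOAD DATA LOCAL INFILE through PDO; the DSN, credentials, file path, table and column list are placeholders, and the hosting provider has to allow LOCAL INFILE (the PDO::MYSQL_ATTR_LOCAL_INFILE option here plus the server's local_infile setting) for it to work at all.

// Sketch only: all names and paths below are placeholders.
$pdo = new PDO(
    'mysql:host=localhost;dbname=shop;charset=utf8mb4',
    'user',
    'pass',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true] // required for LOCAL INFILE
);

$sql = <<<SQL
LOAD DATA LOCAL INFILE '/path/to/products.csv'
INTO TABLE products
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
(sku, name, price)
SQL;

$pdo->exec($sql); // lets the MySQL server parse and insert the rows itself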

I have a hard timeout on the server set to 180 seconds by the hosting provider, and a max memory utilization limit of 128 MB for any single script. These limits cannot be changed by me.

What have you tried?

Memory can be limited by means other than the php.ini file, but I can't imagine how anyone could actually prevent you from using a different execution time (even if ini_set is disabled, from the command line you could run php -d max_execution_time=3000 /your/script.php or php -c /path/to/custom/inifile /your/script.php).

Unless you are trying to fit the entire data file into memory, there should be no issue with a memory limit of 128 MB.

answered by symcbean