
How to process CSV with 100k+ lines in PHP?

Tags:

php

mysql

csv

I have a CSV file with more than 100,000 lines; each line has 3 values separated by semicolons. The total file size is approx. 5 MB.

CSV file is in this format:

stock_id;product_id;amount
==========================
1;1234;0
1;1235;1
1;1236;0
...
2;1234;3
2;1235;2
2;1236;13
...
3;1234;0
3;1235;2
3;1236;0
...

We have 10 stocks, indexed 1-10 in the CSV. In the database they are saved as 22-31.
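(For what it's worth, since the database IDs are just the CSV IDs shifted by 21, the conversion table in the code below could also be generated instead of hand-written:)

```php
<?php
// Build the CSV-stock-id => DB-stock-id table programmatically:
// 1 => 22, 2 => 23, ..., 10 => 31.
$stocks = array();
for ($i = 1; $i <= 10; $i++) {
    $stocks[$i] = $i + 21;
}
```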

The CSV is sorted by stock_id, then product_id, but I don't think it matters.

What I have

<?php

session_start();

require_once ('db.php');

echo '<meta charset="iso-8859-2">';

// convert table: `CSV stock id => DB stock id`
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

$sql = $mysqli->query("SELECT product_id FROM table WHERE fielddef_id = 1");

while ($row = $sql->fetch_assoc()) {
    $products[$row['product_id']] = 1;
}

$csv = file('export.csv');

// go thru CSV file and prepare SQL UPDATE query
foreach ($csv as $row) {
    $data = explode(';', $row);
    // $data[0] - stock_id
    // $data[1] - product_id
    // $data[2] - amount

    if (isset($products[$data[1]])) {
        // in CSV are products which aren't in database
        // there is echo which should show me queries
        echo "  UPDATE t 
                SET value = " . (int)$data[2] . " 
                WHERE   fielddef_id = " . (int)$stocks[$data[0]] . " AND 
                        product_id = '" . $data[1] . "' -- product_id isn't just numeric
                LIMIT 1<br>";
    }
}

The problem is that writing out 100k lines with echo is very slow; it takes several minutes. I'm not sure what MySQL will do, whether it will be faster or take about the same time. I have no testing machine here, so I'm worried about testing it on the production server.

My idea was to load the CSV file into several variables (or better, arrays) like below, but I'm not sure it would help.

$csv[0] = lines 0      - 10.000;
$csv[1] = lines 10.001 - 20.000;
$csv[2] = lines 20.001 - 30.000;
$csv[3] = lines 30.001 - 40.000;
etc. 

I found e.g. Efficiently counting the number of lines of a text file (200 MB+), but I'm not sure how it can help me.
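For reference, the splitting idea above is a one-liner with `array_chunk()`. A sketch (note this still loads the whole file into memory first, so it mainly helps if you then do something per chunk, such as one multi-row query per chunk):

```php
<?php
// Split the loaded CSV lines into chunks of 10,000 lines each.
$lines  = file('export.csv');
$chunks = array_chunk($lines, 10000);
// $chunks[0] holds lines 0 - 9,999, $chunks[1] lines 10,000 - 19,999, etc.
```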

When I replace the foreach with print_r, I get the dump in under 1 second. The task is to make the foreach loop with the database updates faster.

Any ideas how to update so many records in the database?
Thanks.

asked Feb 17 '26 by pavel

2 Answers

Something like this (please note this is 100% untested and off the top of my head, so it may need some tweaking to actually work :) )

// conversion table (there are probably better ways of doing this)
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

$handle = fopen("file.csv", "r"); // open file
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
    // loop through CSV

    $updatesql = "UPDATE t SET `value` = " . (int)$data[2] . " WHERE fielddef_id = " . (int)$stocks[$data[0]] . " AND product_id = '" . $mysqli->real_escape_string($data[1]) . "'";
    echo "$updatesql<br>"; // for debug only; comment out on live
}
fclose($handle);

There is no need to do your initial SELECT, since you only ever set your product data to 1 in your code anyway, and from your description your product IDs are always correct; it's just your fielddef column that has the mapping.

Also, for live, don't forget to actually run $updatesql through your mysqli execute command.
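One way to do that efficiently (an untested sketch, assuming the $mysqli connection from the question's db.php and its table t) is to prepare the statement once and only rebind the values inside the loop, so the SQL is parsed a single time instead of 100k times:

```php
<?php
// Prepare once, execute many times.
// Assumes $mysqli is the open connection from db.php in the question.
$stmt = $mysqli->prepare(
    "UPDATE t SET `value` = ? WHERE fielddef_id = ? AND product_id = ? LIMIT 1"
);
// bind_param binds by reference, so reassigning the variables below
// before each execute() is enough.
$stmt->bind_param('iis', $value, $fielddefId, $productId);

$handle = fopen('file.csv', 'r');
while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    $value      = (int)$data[2];
    $fielddefId = $stocks[$data[0]];
    $productId  = $data[1]; // bound as a string: product_id isn't always numeric
    $stmt->execute();
}
fclose($handle);
$stmt->close();
```

As a bonus, binding the values also removes the need for manual escaping.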

To give you a comparison with actual, in-use code (that I can benchmark against!), this is some code I use for an importer of an uploaded file (it's not perfect, but it does its job):

if (isset($_POST['action']) && $_POST['action']=="beginimport") {
            echo "<h4>Starting Import</h4><br />";
            // Ignore user abort and expand time limit 
            //ignore_user_abort(true);
            set_time_limit(60);
                if (($handle = fopen($_FILES['clientimport']['tmp_name'], "r")) !== FALSE) {
                    $row = 0;
                    //defaults 
                    $sitetype = 3;
                    $sitestatus = 1;
                    $startdate = "2013-01-01 00:00:00";
                    $enddate = "2013-12-31 23:59:59";
                    $createdby = 1;
                    //loop and insert
                    while (($data = fgetcsv($handle, 10000, ",")) !== FALSE) {  // loop through each line of CSV. Returns array of that line each time so we can hard reference it if we want.
                        if ($row>0) {
                            if (strlen($data[1])>0) {
                                $clientshortcode = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[0])));
                                $sitename = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[0]))." ".trim(stripslashes($data[1])));
                                $address = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[1])).",".trim(stripslashes($data[2])).",".trim(stripslashes($data[3])));
                                $postcode = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[4])));
                                //look up client ID
                                $client = $db->queryUniqueObject("SELECT ID FROM tblclients WHERE ShortCode='$clientshortcode'",ENABLE_DEBUG);

                                if ($client->ID>0 && is_numeric($client->ID)) {
                                    //got client ID so now check if site already exists we can trust the site name here since we only care about double matching against already imported sites.
                                    $sitecount = $db->countOf("tblsites","SiteName='$sitename'");
                                    if ($sitecount>0) {
                                        //site exists
                                        echo "<strong style=\"color:orange;\">SITE $sitename ALREADY EXISTS SKIPPING</strong><br />";
                                    } else {
                                        //site doesn't exist so do import
                                        $db->execute("INSERT INTO tblsites (SiteName,SiteAddress,SitePostcode,SiteType,SiteStatus,CreatedBy,StartDate,EndDate,CompanyID) VALUES 
                                        ('$sitename','$address','$postcode',$sitetype,$sitestatus,$createdby,'$startdate','$enddate',".$client->ID.")",ENABLE_DEBUG);
                                        echo "IMPORTED - ".$data[0]." - ".$data[1]."<br />";
                                    }
                                } else {
                                    echo "<strong style=\"color:red;\">CLIENT $clientshortcode NOT FOUND PLEASE ENTER AND RE-IMPORT</strong><br />";
                                }
                                flush(); // push progress output to the browser
                                set_time_limit(60); // reset timer on loop
                            }
                        } else {
                            $row++;
                        }
                    } 
                    echo "<br />COMPLETED<br />";
                }
                fclose($handle);
                unlink($_FILES['clientimport']['tmp_name']);
            echo "All Imports finished do not reload this page";
        }

That imported 150k rows in about 10 seconds.
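One reason row-by-row loops crawl on InnoDB is that every single statement commits (and flushes to disk) individually. If you do stay row-by-row, wrapping the whole import in one transaction usually gives a large speedup. A hedged sketch, assuming the $mysqli connection from the question and a hypothetical $updates array holding the UPDATE strings built while reading the CSV:

```php
<?php
// One commit for the whole batch instead of one commit per row.
// $updates is a hypothetical array of UPDATE statements collected in the loop.
$mysqli->begin_transaction();
foreach ($updates as $updatesql) {
    $mysqli->query($updatesql);
}
$mysqli->commit();
```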

answered Feb 20 '26 by Dave


Thanks to the answers and comments on the question, I have the solution. The base for it is from @Dave; I've only updated it to better fit the question.

<?php

require_once 'include.php';

// stock convert table (key is ID in CSV, value ID in database)
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

// product IDs in CSV (value) and Database (product_id) are different. We need to take both IDs from database and create an array of e-shop products
$sql = mysql_query("SELECT product_id, value FROM cms_module_products_fieldvals WHERE fielddef_id = 1") or die(mysql_error());

while ($row = mysql_fetch_assoc($sql)) {
    $products[$row['value']] = $row['product_id'];
}

$handle = fopen('import.csv', 'r');
$i = 1;

while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    $p_id = (int)$products[$data[1]];

    if ($p_id > 0) {
        // if the product exists in the database, continue. Without this condition it still works, but we send many useless queries (... WHERE product_id = 0 updates nothing, but still takes time)
        if ($i % 300 === 0) {
            // optional, we'll see what it does with the real traffic
            sleep(1);
        }

        $updatesql = "UPDATE table SET value = " . (int)$data[2] . " WHERE fielddef_id = " . $stocks[$data[0]] . " AND product_id = " . (int)$p_id . " LIMIT 1";
        echo "$updatesql<br>";//for debug only comment out on live
        $i++;
    }
}

// approx. 1.5 s to import 100,000+ records
fclose($handle);
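If the per-row UPDATEs ever become the bottleneck again, they can also be collapsed into one statement per stock with a CASE expression. An untested sketch against the same table, assuming a hypothetical $batch array filled as db_stock_id => [product_id => amount] while reading the CSV:

```php
<?php
// One UPDATE per stock instead of one per CSV row.
// $batch is a hypothetical map: db_stock_id => array(product_id => amount).
foreach ($batch as $dbStockId => $amounts) {
    $cases = array();
    $ids   = array();
    foreach ($amounts as $productId => $amount) {
        $pid     = (int)$productId;
        $cases[] = "WHEN " . $pid . " THEN " . (int)$amount;
        $ids[]   = $pid;
    }
    $sql = "UPDATE cms_module_products_fieldvals
            SET value = CASE product_id " . implode(' ', $cases) . " END
            WHERE fielddef_id = " . (int)$dbStockId . "
              AND product_id IN (" . implode(',', $ids) . ")";
    mysql_query($sql) or die(mysql_error());
}
```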
answered Feb 20 '26 by pavel


