
Importing large CSV files in MySQL using Laravel

I have a csv file that can range from 50k to over 100k rows of data.

I'm currently using Laravel w/ Laravel Forge, MySQL, and Maatwebsite Laravel Excel package.

This is to be used by an end-user and not myself so I have created a simple form on my blade view as such:

{!! Form::open(
    array(
        'route' => 'import.store', 
        'class' => 'form',
        'id' => 'upload',
        'novalidate' => 'novalidate', 
        'files' => true)) !!}

    <div class="form-group">
        <h3>CSV Product Import</h3>
        {!! Form::file('upload_file', null, array('class' => 'file')) !!}
    </div>

    <div class="form-group">
        {!! Form::submit('Upload Products', array('class' => 'btn btn-success')) !!}
    </div>
{!! Form::close() !!}

This stores the file on the server successfully, and I'm then able to iterate through the results using something like a foreach loop.
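
The store side is nothing fancy; simplified, it looks roughly like this (a sketch rather than my exact code, with the storage path just as an example):

// Controller action bound to the import.store route ($request is Illuminate\Http\Request)
public function store(Request $request)
{
    // Move the uploaded CSV somewhere the import can reach it later
    $file = $request->file('upload_file');
    $path = $file->move(storage_path('app/imports'), uniqid() . '.csv')->getPathname();

    // ...the import gets kicked off from here

    return redirect()->back();
}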

Now here are the issues I've run into, in chronological order, along with the fixes/attempts (using a 10k-row test CSV file):

  1. [issue] PHP times out.
  2. [remedy] Changed it to run asynchronously via a job command.
  3. [result] Imports up to 1500 rows.
  4. [issue] Server runs out of memory.
  5. [remedy] Added a 1GB swap drive.
  6. [result] Imports up to 3000 rows.
  7. [issue] Server runs out of memory.
  8. [remedy] Turned on chunking of results, 250 rows per chunk.
  9. [result] Imports up to 5000 rows.
  10. [issue] Server runs out of memory.
  11. [remedy] Removed some transposing/joined-tables logic.
  12. [result] Imports up to 7000 rows.

As you can see, the results are marginal and nowhere near 50k; I can barely even get near 10k.

I've read up and looked into possible suggestions such as:

  • Use a raw query to run LOAD DATA LOCAL INFILE.
  • Split the files before importing.
  • Store on the server, then have the server split it into files and have a cron process them.
  • Upgrade my 512MB DO droplet to 1GB as a last resort.

Going with LOAD DATA LOCAL INFILE may not work because my header columns can change per file; that's why I have logic to process/iterate through them.

Splitting the files before importing is fine under 10k, but for 50k or more? That would be highly impractical.

Store on the server, then have the server split it and process the pieces individually without troubling the end-user? Possibly, but I'm not even sure how to achieve this in PHP yet; I've only briefly read about it.
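
From the little I've read, the split itself could be as simple as streaming the file and writing every N lines to a new chunk file, something along these lines (paths and chunk size are arbitrary):

$handle = fopen(storage_path('app/imports/products.csv'), 'r');
$header = fgetcsv($handle);   // keep the header so each chunk file is self-contained
$chunkSize = 1000;            // arbitrary
$row = 0;
$chunk = null;

while ($line = fgetcsv($handle)) {
    if ($row % $chunkSize === 0) {
        if ($chunk) {
            fclose($chunk);
        }
        $chunk = fopen(storage_path('app/imports/chunk_' . ($row / $chunkSize) . '.csv'), 'w');
        fputcsv($chunk, $header);
    }
    fputcsv($chunk, $line);
    $row++;
}

if ($chunk) {
    fclose($chunk);
}
fclose($handle);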

Also, to note: my queue worker is set to time out after 10,000 seconds, which is also very impractical and bad practice, but it seemed to be the only way to keep it running before memory took a hit.

Now I could give in and just upgrade the memory to 1GB, but I feel that at best it may get me to 20k rows before it fails again. Something needs to process all these rows quickly and efficiently.

Lastly here is a glimpse of my table structure:

Inventory
+----+------------+-------------+-------+---------+
| id | profile_id | category_id |  sku  |  title  |
+----+------------+-------------+-------+---------+
|  1 |         50 |       51234 | mysku | mytitle |
+----+------------+-------------+-------+---------+

Profile
+----+---------------+
| id |     name      |
+----+---------------+
| 50 | myprofilename |
+----+---------------+

Category
+----+------------+--------+
| id | categoryId |  name  |
+----+------------+--------+
|  1 |      51234 | brakes |
+----+------------+--------+

Specifics
+----+---------------------+------------+-------+
| id | specificsCategoryId | categoryId | name  |
+----+---------------------+------------+-------+
|  1 |                  20 |      57357 | make  |
|  2 |                  20 |      57357 | model |
|  3 |                  20 |      57357 | year  |
+----+---------------------+------------+-------+

SpecificsValues
+----+-------------+-------+--------+
| id | inventoryId | name  | value  |
+----+-------------+-------+--------+
|  1 |           1 | make  | honda  |
|  2 |           1 | model | accord |
|  3 |           1 | year  | 1998   |
+----+-------------+-------+--------+

Full CSV Sample
+----+------------+-------------+-------+---------+-------+--------+------+
| id | profile_id | category_id |  sku  |  title  | make  | model  | year |
+----+------------+-------------+-------+---------+-------+--------+------+
|  1 |         50 |       51234 | mysku | mytitle | honda | accord | 1998 |
+----+------------+-------------+-------+---------+-------+--------+------+

So a quick run-through of my logic workflow, as simply as possible (a rough sketch of the loop follows the list):

  1. Load the file into Maatwebsite/Laravel-Excel and iterate through a chunked loop.
  2. Check that category_id and sku are not empty; otherwise skip the row and log an error to an array.
  3. Look up category_id and pull all the relevant column fields from the related tables it uses, then insert into the database if nothing is null.
  4. Generate a custom title with more logic, using the fields available in the file.
  5. Rinse and repeat.
  6. Lastly, export the errors array to a file and log it in the database so the errors can be downloaded and viewed at the end.
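
Roughly, that chunked loop has this shape (simplified, with my lookups and title logic trimmed; Laravel Excel 2.x chunk filter):

Excel::filter('chunk')->load($path)->chunk(250, function ($results) use (&$errors) {
    foreach ($results as $row) {
        // Step 2: skip rows missing the required keys and log the error
        if (empty($row->category_id) || empty($row->sku)) {
            $errors[] = 'Missing category_id or sku: ' . json_encode($row->toArray());
            continue;
        }

        // Step 3: look up the related category/specifics rows and insert
        // Step 4: build the custom title from the remaining columns
    }
});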

I hope someone can share some insight on how I should tackle this, keeping in mind that I'm using Laravel and that this isn't a simple upload: each line needs to be processed and put into different related tables, otherwise I'd just LOAD DATA INFILE it all at once.

Thanks!

asked Dec 29 '15 by dmotors

1 Answer

You seem to have already figured out the logic for interpreting the CSV lines and converting them to insert queries on the database, so I will focus on the memory exhaustion issue.

When working with large files in PHP, any approach that loads the entire file into memory will either fail, become unbearably slow, or require a lot more RAM than your Droplet has.

So my advice is:

Read the file line by line using fgetcsv

$handle = fopen('file.csv', 'r');
if ($handle) {
    // fgetcsv reads a single line per call, so only the current row is in memory
    while ($line = fgetcsv($handle)) {
        // Process this line and save to database
    }
    fclose($handle);
}

This way only one row at a time is loaded into memory. You can process it, save it to the database, and then move on to the next one.
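
Since you mentioned the header columns can change per file, one option (just a sketch, assuming the first line is always the header) is to read the header once and combine it with each row, so you can address columns by name:

$handle = fopen('file.csv', 'r');
$header = fgetcsv($handle);               // e.g. ['sku', 'category_id', 'title', ...]

while ($line = fgetcsv($handle)) {
    $row = array_combine($header, $line); // column name => value
    // $row['sku'], $row['category_id'], etc.
}
fclose($handle);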

Keep a separate file handle for logging

Your server is short on memory, so logging errors to an array may not be a good idea, since every error is kept in memory. That can become a problem if your CSV has lots of entries with empty SKUs and category IDs.

Laravel ships with Monolog out of the box, and you can try to adapt it to your needs. However, if it also ends up using too many resources or doesn't fit your needs, a simpler approach may be the solution.

$log = fopen('log.txt', 'w');

// Inside the import loop:
if ($someCondition) {
    fwrite($log, $text . PHP_EOL);   // written straight to disk, not kept in memory
}

fclose($log);   // once the import finishes

Then, at the end of the script you can store the log file wherever you want.
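
For example (a sketch, assuming the default local disk), you could hand the file to Laravel's Storage facade so the errors can be downloaded later, as you planned:

use Illuminate\Support\Facades\Storage;

// Copy the plain-text log into Laravel's storage for later download
Storage::put('import-logs/import-' . time() . '.txt', file_get_contents('log.txt'));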

Disable Laravel's query log

Laravel keeps all your queries stored in memory when the query log is enabled, and that's likely to be a problem for your application. Luckily, you can use the disableQueryLog method to free some precious RAM.

DB::connection()->disableQueryLog();

Use raw queries if needed

I think it's unlikely that you will run out of memory again if you follow these tips, but you can always sacrifice some of Laravel's convenience to extract that last drop of performance.

If you know your way around SQL, you can execute raw queries to the database.
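
For example (a sketch using the inventory columns from your question), batching several rows into a single raw INSERT cuts down the per-query overhead considerably:

use Illuminate\Support\Facades\DB;

// Collect a batch of parsed rows, then insert them in one statement
DB::insert(
    'insert into inventory (profile_id, category_id, sku, title) values (?, ?, ?, ?), (?, ?, ?, ?)',
    [50, 51234, 'mysku1', 'mytitle1', 50, 51234, 'mysku2', 'mytitle2']
);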


Edit:

As for the timeout issue, you should be running this code as a queued task regardless, as suggested in the comments. Inserting that many rows WILL take some time (especially if you have lots of indexes), and the user shouldn't be staring at an unresponsive page for that long.
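
A bare-bones shape for that job (class and property names are placeholders; the exact traits and dispatch call vary a bit between Laravel versions):

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class ImportProductsCsv implements ShouldQueue
{
    use InteractsWithQueue, Queueable, SerializesModels;

    protected $path;

    public function __construct($path)
    {
        $this->path = $path;   // path to the uploaded CSV on disk
    }

    public function handle()
    {
        // fgetcsv loop and inserts from above go here
    }
}

// From the controller, once the upload has been stored:
dispatch(new ImportProductsCsv($path));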

answered by Daniel Castro