I have created a script that reads data from a CSV file, checks whether each record already exists in the database, and imports it if it does not. If the record does exist (matched on its unique code), the rest of the information needs to be updated from the CSV file.
For example: my CSV file contains a member with code WTW-2LT, name Alex and surname Johnson. The script checks whether a member with code WTW-2LT, name Alex and surname Johnson already exists; if it does, the contact details and extra details need to be updated from the CSV (related details such as subject and lecturer also need to be checked; all details for a member are on one line in the CSV). If it doesn't exist, the new member simply has to be created.
Here is my script so far, with most other checks stripped out to avoid distraction:
$header = null;
while ($row = fgetcsv($fp, null, ";")) {
    if ($header === null) {
        $header = $row;
        continue;
    }
    $record = array_combine($header, $row);

    $member = $this->em->getRepository(Member::class)->findOneBy([
        'code' => $record['member_code'],
        'name' => $record['name'],
        'surname' => $record['surname'],
    ]);
    if (!$member) {
        $member = new Member();
        $member->setCode($record['member_code']);
        $member->setName($record['name']);
        $member->setSurname($record['surname']);
    }
    $member->setContactNumber($record['phone']);
    $member->setAddress($record['address']);
    $member->setEmail($record['email']);

    $subject = $this->em->getRepository(Subject::class)->findOneBy([
        'subject_code' => $record['subj_code'],
    ]);
    if (!$subject) {
        $subject = new Subject();
        $subject->setCode($record['subj_code']);
    }
    $subject->setTitle($record['subj_title']);
    $subject->setDescription($record['subj_desc']);
    $subject->setLocation($record['subj_loc']);

    $lecturer = $this->em->getRepository(Lecturer::class)->findOneBy([
        'subject' => $subject,
        'name' => $record['lec_name'],
        'code' => $record['lec_code'],
    ]);
    if (!$lecturer) {
        $lecturer = new Lecturer();
        $lecturer->setSubject($subject);
        $lecturer->setName($record['lec_name']);
        $lecturer->setCode($record['lec_code']);
    }
    $lecturer->setEmail($record['lec_email']);
    $lecturer->setContactNumber($record['lec_phone']);

    $member->setLecturer($lecturer);

    $validationErrors = $this->validator->validate($member);
    if (!count($validationErrors)) {
        $this->em->persist($member);
        $this->em->flush();
    } else {
        // handle validation errors
    }
}
As you can see, this script has to query the database three times to check whether one CSV line exists. In my case the files have 2000+ lines, so running three existence queries for every line is quite time-consuming.
Unfortunately, I also can't import the rows in batches: if a subject doesn't exist yet, it would be created over and over until the batch is flushed to the database, leaving me with duplicate records that serve no purpose.
How can I improve performance and speed as much as possible? For example, should I first fetch all records from the database and store them in arrays (memory-consuming?), then do the checks against those arrays, adding each processed line to them as I go?
Can somebody please help me find a way to improve this (with sample code, please)?
To be honest, I don't find 2000+ rows with three times that number of queries all that much. But since you are asking about performance, here are my two cents:
Using a framework always adds overhead, so writing this code in native PHP would already make it run quicker. I'm not familiar with Symfony, but I assume you store your data in MySQL, where you can use the command INSERT ... ON DUPLICATE KEY UPDATE. If you have set the three fields (code, name, surname) as a primary or unique key (which I assume), you can use that to insert the data, but update the values in the database instead if the key already exists. MySQL will do the checks for you to see whether the data has changed: if not, no disk write will happen.
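As a sketch, such an upsert could look like the following; the members table and its column names here are assumptions, so adjust them to your actual schema:

INSERT INTO members (code, name, surname, phone, address, email)
VALUES ('WTW-2LT', 'Alex', 'Johnson', '555-0100', '1 Main St', 'alex@example.com')
ON DUPLICATE KEY UPDATE
    phone = VALUES(phone),
    address = VALUES(address),
    email = VALUES(email);

MySQL only executes the UPDATE branch when the unique key already exists, so one statement covers both the insert and the update case.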
I'm quite certain you can run native SQL from Symfony, allowing you to keep the security the framework provides while speeding up your inserts.
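In Doctrine you can drop down to the DBAL connection for raw, parameterized SQL. A sketch, reusing the hypothetical members table and column names from above (on older DBAL versions the method is named executeUpdate() instead of executeStatement()):

$conn = $this->em->getConnection();
$sql = 'INSERT INTO members (code, name, surname, phone, address, email)
        VALUES (:code, :name, :surname, :phone, :address, :email)
        ON DUPLICATE KEY UPDATE
            phone = VALUES(phone),
            address = VALUES(address),
            email = VALUES(email)';
$conn->executeStatement($sql, [
    'code'    => $record['member_code'],
    'name'    => $record['name'],
    'surname' => $record['surname'],
    'phone'   => $record['phone'],
    'address' => $record['address'],
    'email'   => $record['email'],
]);

Bound parameters keep the query safe against SQL injection while avoiding the three lookup queries per row.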
Generally, if you want performance, my best experience has been to dump all the data into the database and then transform it in there using SQL statements; that way the DBMS is able to optimize all of your steps.
You can import CSV files directly into your MySQL database with the SQL command
LOAD DATA INFILE 'data.csv'
INTO TABLE tmp_import
The command has a lot of options where you can specify your CSV file's format.
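For example, for a semicolon-delimited file with a header row (a sketch; adjust the terminators and quoting to your actual file):

LOAD DATA INFILE 'data.csv'
INTO TABLE tmp_import
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

The IGNORE 1 LINES clause skips the header row, matching what the PHP loop does with its first iteration.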
If your data.csv is a full dump containing all old and new rows, then you can just replace your current table with the imported one, after you've fixed it up a bit.
For example, your CSV file (and import table) might look a bit like:
WTW-2LT, Alex, Johnson, subj_code1, ..., lec_name1, ...
WTW-2LT, Alex, Johnson, subj_code1, ..., lec_name2, ...
WTW-2LT, Alex, Johnson, subj_code2, ..., lec_name3, ...
WTW-2LU, John, Doe, subj_code3, ..., lec_name4, ...
You could then get the distinct rows via grouping:
SELECT member_code, name, surname
FROM tmp_import
GROUP BY member_code, name, surname
If member_code is a key, you can just GROUP BY member_code in MySQL. The DBS won't complain, even though I believe it's technically against the standard.
To get the rest of your data you do the same:
SELECT subj_code, subj_title, member_code
FROM tmp_import
GROUP BY subj_code
and
SELECT lec_code, lec_name, subj_code
FROM tmp_import
GROUP BY lec_code
assuming subj_code and lec_code are both keys for subjects and lecturers.
To actually save this result as a table you can use MySQL's CREATE TABLE ... SELECT syntax, for example:
CREATE TABLE tmp_import_members
SELECT member_code, name, surname
FROM tmp_import
GROUP BY member_code, name, surname
You can then do the inserts and updates in two queries:
INSERT INTO members (member_code, name, surname)
SELECT member_code, name, surname
FROM tmp_import_members
WHERE tmp_import_members.member_code NOT IN (
SELECT member_code FROM members WHERE member_code IS NOT NULL
);
UPDATE members
JOIN tmp_import_members ON
members.member_code = tmp_import_members.member_code
SET
members.name = tmp_import_members.name,
members.surname = tmp_import_members.surname;
and the same for subjects and lecturers, to your liking.
This all amounts to one bulk load plus a small, fixed number of set-based statements, no matter how many rows the CSV contains.
Again: if your CSV file contains all rows, you could just replace your existing tables and save the 3 inserts and 3 updates.
Make sure that you create indexes on the relevant columns of your temporary tables so that MySQL can speed up the NOT IN and JOIN in the above queries.
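For example (the index name is just a placeholder):

CREATE INDEX idx_member_code ON tmp_import_members (member_code);

With this index in place, the NOT IN subquery and the JOIN on member_code become index lookups instead of full table scans.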
You can first run a custom SQL query to get the counts from all three tables:
SELECT
(SELECT COUNT(*) FROM member WHERE someCondition) as memberCount,
(SELECT COUNT(*) FROM subject WHERE someCondition) as subjectCount,
(SELECT COUNT(*) FROM lecturer WHERE someCondition) as lecturerCount
Then, based on the counts, you can tell whether the data is already present in your tables. You don't have to run the queries multiple times for uniqueness checks if you go with native SQL.
Check out this link to learn how to run custom SQL in Doctrine:
Symfony2 & Doctrine: Create custom SQL-Query
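As a sketch, running that count query through Doctrine's DBAL connection might look like this (fetchAssociative() is the current DBAL method name; the someCondition placeholders stand in for whatever your real conditions are):

$sql = '
    SELECT
        (SELECT COUNT(*) FROM member WHERE someCondition) AS memberCount,
        (SELECT COUNT(*) FROM subject WHERE someCondition) AS subjectCount,
        (SELECT COUNT(*) FROM lecturer WHERE someCondition) AS lecturerCount
';
$counts = $this->em->getConnection()->fetchAssociative($sql);
if ($counts['memberCount'] == 0) {
    // member is not present yet, so create it
}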