 

Performance tips on importing large sets of data

I have a function that lets users import contacts (email address and name). Some users import files with around 70,000 contacts. The files can be either xls or csv. This is what I have now:

  1. The user selects a file (from their computer) that they want to import contacts from.
  2. I save the file on the server and create a database entry with a reference to the file location.
  3. Amazon SQS is used to handle this in the background.
  4. The first time the job runs I process the file, only saving rows that have an email address (and a name, if found). The data is saved to a json file in the same location and cached. Then I release the job back into the queue.
  5. Now the contacts are ready to be imported. I take 1000 contacts in each job and save each contact in its own row in the database. I use array_slice to skip the contacts in the json file that have already been handled; the skip count is saved to the database (see the sketch after this list).
  6. When there are no contacts left, the job is deleted and everything is done.
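
Roughly what steps 5 and 6 look like right now (a simplified sketch, not my actual code; the class and column names are placeholders):

```php
// Simplified sketch of steps 5 and 6 (placeholder class and column names).
class ImportContactsJob extends Job
{
    const CHUNK_SIZE = 1000;

    protected $importId;

    public function __construct($importId)
    {
        $this->importId = $importId;
    }

    public function handle()
    {
        $import   = Import::findOrFail($this->importId);
        $contacts = json_decode(file_get_contents($import->json_path), true);

        // Skip the contacts handled by previous runs, then take the next 1000.
        $chunk = array_slice($contacts, $import->skip_count, self::CHUNK_SIZE);

        foreach ($chunk as $contact) {
            // One row per contact; the email column decides uniqueness.
            Contact::firstOrCreate(
                ['email' => $contact['email']],
                ['name'  => $contact['name']]
            );
        }

        $newSkip = $import->skip_count + count($chunk);
        $import->update(['skip_count' => $newSkip]);

        if ($newSkip < count($contacts)) {
            $this->release(); // more contacts left, back into the queue
        }
    }
}
```

So every job reads the full JSON file and slices it again before inserting its 1000 rows.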

This is pretty much the whole process. I also have a check (a database lookup) for duplicates; only unique email addresses are allowed.
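
The check itself is nothing fancy, roughly one lookup per contact (again a simplified sketch, not my exact code):

```php
// Sketch of the duplicate check as described: one lookup per contact.
if (! Contact::where('email', $contact['email'])->exists()) {
    Contact::create([
        'email' => $contact['email'],
        'name'  => $contact['name'],
    ]);
}

// A unique index on contacts.email (e.g. $table->unique('email') in a
// migration) would let MySQL enforce the same rule on its own.
```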

The problem I have is that the jobs seem to take too much time and I get timeouts, which leads to the import taking too long.

So my question is: is there anything I can do better?

Let me know if you need anything else. I don't have much experience with big data and many users.

EDIT: I don't need code. What I'm after is this: is the problem a server issue? Would moving the database to its own server do the trick? Or should I just use a different approach?

EDIT 2: The user can see the progress of the import, so I need to calculate the number of contacts. To do that I first filter away rows with no email address and trim both the email and the name columns. When I did this, I found it easier to save the new set of data to a JSON file.
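
Roughly what that first pass does (a sketch; `$rows` is whatever the xls/csv parser returns and `$jsonPath` is the file location from step 2):

```php
// First pass (EDIT 2): keep only rows with an email address, trim both
// columns, and count the result so the user can see the import progress.
$contacts = [];

foreach ($rows as $row) {
    $email = isset($row['email']) ? trim($row['email']) : '';

    if ($email === '') {
        continue; // no email address, skip this row
    }

    $contacts[] = [
        'email' => $email,
        'name'  => isset($row['name']) ? trim($row['name']) : '',
    ];
}

$total = count($contacts); // shown to the user as the import progress total

file_put_contents($jsonPath, json_encode($contacts));
```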

EDIT 3: The timeouts happen when saving the contacts to the database, not in the initial processing and creation of the JSON file.

EDIT 4: One way to speed up the job might be to split the data into chunks from the beginning (in the first processing step). That way I wouldn't need to handle a skip counter and I wouldn't have to use array_slice on the large data set. Also, now that I think about it, it's kind of stupid to save it to a JSON file and then cache it. Why not cache the array from the beginning?
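
Something like this is what I have in mind (a sketch using the Laravel Cache facade; the key names are made up):

```php
use Illuminate\Support\Facades\Cache;

// EDIT 4 idea: split the cleaned contacts into chunks once, up front,
// and cache each chunk instead of array_slice-ing the whole set every job.
$chunks = array_chunk($contacts, 1000);

foreach ($chunks as $index => $chunk) {
    // Third argument is the lifetime (minutes or seconds depending on the
    // Laravel version).
    Cache::put("import.{$importId}.chunk.{$index}", $chunk, 60);
}

// A queued job then only has to read its own chunk:
$chunk = Cache::get("import.{$importId}.chunk.{$jobIndex}");
```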

asked Sep 16 '25 by tbleckert


1 Answer

I take 1000 contacts in each job and save each contact in its own row in the database.

I've faced this problem before too. In my case I needed to import about 50,000 employee attendance records, and I solved it with parallelization. You may have noticed the same thing, since you already take 1000 contacts per queued job. The real problem is the process timeout we hit when a single job takes on too much work, right?

So my solution is to create more child processes to do the work. If I create one job that imports 1000 records, it takes more time and runs slower. Instead, I create 100 queued jobs, each importing 100 records, and run them together. With this method your CPU load will increase, but that's not a problem if you have a high-spec machine.
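
For example (a sketch; `ImportContactChunkJob` is a made-up job class, and you need several queue workers running for the jobs to actually run in parallel):

```php
use Illuminate\Support\Facades\Queue;

// Instead of a few big jobs, push many small ones: 100 records per job.
// With several queue workers running, the chunks are imported in parallel.
$chunks = array_chunk($contacts, 100);

foreach ($chunks as $chunk) {
    Queue::push(new ImportContactChunkJob($chunk)); // made-up job class
}
```

The total amount of work is the same, but no single job runs long enough to hit the timeout.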

My proposal is:

  1. Create more queued jobs to do the import.
  2. Avoid too much looping inside a single job.
  3. If possible, store your data in memcached, because it will speed up the process. I guess you've thought about that too. Read about APC as well (see the sketch below).
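
For point 3, a small sketch using PHP's Memcached extension (the key name and TTL are just examples; APCu works the same way with apcu_store / apcu_fetch):

```php
// Keep the parsed contacts in memory so each job doesn't re-read the file.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

// Store for one hour (example TTL).
$memcached->set('import:contacts', $contacts, 3600);

// Any job can then fetch them straight from memory.
$contacts = $memcached->get('import:contacts');
```

Keep in mind that memcached's default item size limit is 1 MB, so a 70,000-contact array may have to be stored in chunks.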

You can read there how to store your data in memory. Hope this helps you a little :)

answered Sep 18 '25 by Eko Junaidi Salam