
Best practices for importing large CSV files

Tags: import, csv

My company gets a set of CSV files full of bank account info each month that I need to import into a database. Some of these files can be pretty big. For example, one is about 33MB and about 65,000 lines.

Right now I have a symfony/Doctrine app (PHP) that reads these CSV files and imports them into a database. My database has about 35 different tables, and during the import I take each row, split it up into its constituent objects, and insert them into the database. It all works beautifully, except that it's slow (each row takes about a quarter of a second) and it uses a lot of memory.

The memory use is so bad that I have to split up my CSV files. A 20,000-line file barely makes it in. By the time it's near the end, I'm at like 95% memory usage. Importing that 65,000 line file is simply not possible.
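
For reference, here's a stripped-down sketch of the kind of loop I'm describing, assuming a Doctrine EntityManager in $em. The entity names are made up, and the real code spreads each row across the ~35 tables:

<?php
// Simplified version of the import loop; Account and Transaction are
// stand-ins for the real entities, not the actual schema.
$handle = fopen('/path/to/transactions.csv', 'r');

while (($fields = fgetcsv($handle)) !== false) {
    $account = new Account();
    $account->setNumber($fields[0]);

    $transaction = new Transaction();
    $transaction->setAccount($account);
    $transaction->setAmount($fields[2]);

    $em->persist($account);
    $em->persist($transaction);
    $em->flush(); // writes the row, but the entities stay managed in
                  // Doctrine's identity map until $em->clear() is called,
                  // which as far as I can tell is where the memory goes
}

fclose($handle);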

I've found symfony to be an exceptional framework for building applications and I normally wouldn't consider using anything else, but in this case I'm willing to throw all my preconceptions out the window in the name of performance. I'm not committed to any specific language, DBMS, or anything.

Stack Overflow doesn't like subjective questions, so I'm going to try to make this as un-subjective as possible: for those of you who have not just an opinion but actual experience importing large CSV files, what tools/practices have you used in the past that have been successful?

For example, do you just use Django's ORM/OOP and have never had any problems? Or do you read the entire CSV file into memory and prepare a few humongous INSERT statements?
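
(By "humongous INSERT statements" I mean roughly this kind of thing, just as a sketch against a made-up flat table, with rows chunked so a single statement doesn't blow past max_allowed_packet:)

<?php
// Sketch only: batch rows into multi-row INSERTs through PDO instead of
// going through the ORM. Table and column names are made up.
$pdo = new PDO('mysql:host=localhost;dbname=bank', 'user', 'pass');
$chunkSize = 1000;

$insertChunk = function (array $rows) use ($pdo) {
    // one "(?, ?, ?)" placeholder group per row
    $placeholders = implode(', ', array_fill(0, count($rows), '(?, ?, ?)'));
    $stmt = $pdo->prepare(
        "INSERT INTO transactions (account_number, posted_on, amount) VALUES $placeholders"
    );
    $stmt->execute(array_merge(...$rows)); // flatten the row arrays into one parameter list
};

$handle = fopen('/path/to/transactions.csv', 'r');
$buffer = [];

while (($fields = fgetcsv($handle)) !== false) {
    $buffer[] = [$fields[0], $fields[1], $fields[2]];
    if (count($buffer) === $chunkSize) {
        $insertChunk($buffer);
        $buffer = [];
    }
}
if ($buffer) {
    $insertChunk($buffer); // last partial chunk
}
fclose($handle);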

Again, I want not just an opinion, but something that's actually worked for you in the past.

Edit: I'm not just importing an 85-column CSV spreadsheet into one 85-column database table. I'm normalizing the data and putting it into dozens of different tables. For this reason, I can't just use LOAD DATA INFILE (I'm using MySQL) or any other DBMS's feature that just reads in CSV files.

Also, I can't use any Microsoft-specific solutions.

asked Nov 12 '10 16:11 by Jason Swett


1 Answer

Forgive me if I'm not understanding your issue correctly, but it seems like you're just trying to get a large amount of CSV data into a SQL database. Is there any reason why you want to use a web app or other code to process the CSV data into INSERT statements? I've had success importing large amounts of CSV data into SQL Server Express (the free version) using SQL Server Management Studio and BULK INSERT statements. A simple bulk insert would look like this:

BULK INSERT [Company].[Transactions]
    FROM "C:\Bank Files\TransactionLog.csv"
    WITH
    (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR = '\n',
        MAXERRORS = 0,
        DATAFILETYPE = 'widechar',
        KEEPIDENTITY
    )
GO
answered Sep 24 '22 06:09 by Jeff Camera