How do you typically import data from a spreadsheet to multiple database columns?

For whatever reason, I have a lot of clients that have existing data that's stored in spreadsheets. Often there are hundreds, if not thousands of items in each spreadsheet, and asking the client to manually enter them through a website (or heaven forbid importing them that way myself) is out of the question. Typically, this data doesn't simply map spreadsheet column to database column. That would be too easy. Often, the data needs to be manipulated before going into the database (data needs to be split by commas, etc) or the data needs to be spread out across multiple tables. Or both.

I'm asking this question, not because I don't know of a multitude of ways to do it, but because I haven't settled on a way that doesn't feel like it takes more work than it should. So far I've taken all of the following approaches (and probably more that I've forgotten):

  • Using Excel to modify the data, so it's a little bit easier to import
  • Importing the entire spreadsheet into a temporary table and then importing with SQL
  • Writing a script and importing the data with it (I've used VBScript, C# and now Ruby)

So far, using a script has been the way that seemed most flexible, but it still feels a little clunky. I have to perform this task enough that I've even contemplated writing a little DSL for it, just to speed things up.
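To make the "split by commas, spread across tables" case concrete, here's a minimal sketch of the kind of Ruby script I mean (the file name, column layout, and table names are all invented for the example). It reads a CSV export, splits a comma-delimited column, and fans the rows out across a parent and a child table by emitting plain INSERT statements:

```ruby
require 'csv'

# Hypothetical layout: products.csv has "name", "price" and "tags" columns,
# where "tags" is itself a comma-separated list that belongs in a child table.
def quote(value)
  "'#{value.to_s.gsub("'", "''")}'" # naive SQL escaping, fine for a sketch
end

CSV.foreach('products.csv', headers: true).with_index(1) do |row, id|
  puts "INSERT INTO products (id, name, price) " \
       "VALUES (#{id}, #{quote(row['name'])}, #{row['price'].to_f});"

  # One spreadsheet cell fans out into several child rows.
  row['tags'].to_s.split(',').map(&:strip).reject(&:empty?).each do |tag|
    puts "INSERT INTO product_tags (product_id, tag) VALUES (#{id}, #{quote(tag)});"
  end
end
```

Emitting SQL keeps the script dependency-free; I can pipe the output straight into the database client, or swap the puts calls for a real database driver when the job is bigger.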

But before I do that, I'm curious, is there a better way?

asked Dec 24 '08 by Matt Ephraim


2 Answers

You have to set boundaries, if you can. Try to provide a template for the client to fill in with the expected data, including file type (Excel, CSV, etc.), column names, valid values, and so on. Then let the user browse for the file and upload it on your page/form.

Once the file is uploaded, you need to validate and import it. You can use ADO.NET, file streams, DTS/SSIS, or Office Automation to do this (if you are on the Microsoft stack). In the validation step, tell the user exactly what they did wrong or need to change. This might mean showing the actual data in a datagrid on the validation page, with red labels marking the errors on the exact row/column. If you use Office Automation, you can give them the exact cell number, but the Office PIA is a pain in the neck.
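To make that validation step concrete, here is a rough sketch in Ruby (the column spec and file name are invented for the example; the same idea maps straight onto ADO.NET). Every failure is recorded with its row and column, which is exactly what you need to drive those red labels:

```ruby
require 'csv'

# Hypothetical template spec: expected columns and a validity check for each.
SPEC = {
  'name'  => ->(v) { !v.to_s.strip.empty? },
  'price' => ->(v) { v.to_s.match?(/\A\d+(\.\d+)?\z/) },
  'email' => ->(v) { v.to_s.include?('@') }
}.freeze

errors = []
# Start counting at 2 so the reported row matches the spreadsheet (row 1 is the header).
CSV.foreach('upload.csv', headers: true).with_index(2) do |row, line|
  SPEC.each do |column, valid|
    errors << { row: line, column: column, value: row[column] } unless valid.call(row[column])
  end
end

if errors.empty?
  puts 'File accepted; ready to import.'
else
  # In a web app these become red labels on the datagrid instead of console output.
  errors.each { |e| puts "Row #{e[:row]}, column #{e[:column]}: bad value #{e[:value].inspect}" }
end
```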

Once validation is accepted, you can import the information however you like. I prefer putting it into a staging table and using a stored proc to load it, but that's just me. Some prefer to use the object model, but this can be very slow if you have a lot of data.
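As a sketch of that staging pattern (the table, file path, and procedure names are all hypothetical, and the SQL is SQL Server flavored): the script's only job is to bulk-load the raw rows, and the stored proc owns the real transformation:

```ruby
# The script loads the spreadsheet verbatim into a staging table;
# all the reshaping happens set-based in SQL, which is much faster
# than pushing rows through the object model one at a time.
LOAD_AND_TRANSFORM = <<~SQL
  BULK INSERT staging_orders
  FROM 'C:\\uploads\\orders.csv'
  WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');

  EXEC usp_import_orders; -- the stored proc moves staging rows into the real tables
SQL

puts LOAD_AND_TRANSFORM # run this through your usual SQL client or driver
```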

If you are personally loading these files manually and having to go in and manipulate them, I would suggest finding the commonality among them and coming up with a standard to follow. Once you have that, you can either let the users do it themselves or do it a lot faster yourself.

Yes, this is a lot of work, but in the long run, when there is a program that works 95% of the time, everybody wins.

If this is going to be a situation that just can't be automated, then you will probably have to fall back on a vanilla staging table and have SQL do the import. Load the data into one staging table, do the basic manipulation, and then load it into the staging table that your SQL expects.

I've built so many imports and ETL tools, and there really is no easy way to handle it. The only way is to come up with a standard that is reasonable, stick to it, and program around that.

answered Nov 29 '22 by Charles Graham


yeah.. that just sucks.

I would go with the script. I assume you have repeating columns that have to match a single row in another table. Do reasonable matching, and when the script encounters a row it can't deal with, log it and have someone handle that one manually.
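Something like this rough Ruby sketch (lookup file, columns, and table names all made up for illustration): build the match index once, then log any row the script can't match so a human can deal with it later:

```ruby
require 'csv'

# Hypothetical: customers.csv lists existing customers keyed by email,
# and each incoming order row must be matched to one of them.
customers = {}
CSV.foreach('customers.csv', headers: true) do |row|
  customers[row['email'].to_s.downcase.strip] = row['id']
end

unmatched = []
CSV.foreach('orders.csv', headers: true).with_index(2) do |row, line|
  customer_id = customers[row['email'].to_s.downcase.strip]
  if customer_id
    puts "INSERT INTO orders (customer_id, total) VALUES (#{customer_id}, #{row['total'].to_f});"
  else
    unmatched << "line #{line}: #{row.to_h}" # leave these for a human
  end
end

File.write('unmatched.log', unmatched.join("\n")) unless unmatched.empty?
```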

answered Nov 29 '22 by Arthur Thomas