For whatever reason, I have a lot of clients with existing data stored in spreadsheets. Often there are hundreds, if not thousands, of items in each spreadsheet, and asking the client to enter them manually through a website (or, heaven forbid, importing them that way myself) is out of the question. Typically, this data doesn't simply map spreadsheet column to database column. That would be too easy. Often the data needs to be manipulated before going into the database (split on commas, etc.), or it needs to be spread out across multiple tables. Or both.
I'm asking this question, not because I don't know of a multitude of ways to do it, but because I haven't settled on a way that doesn't feel like it takes more work than it should. So far I've taken all of the following approaches (and probably more that I've forgotten):
So far, using a script has been the way that seemed most flexible, but it still feels a little clunky. I have to perform this task enough that I've even contemplated writing a little DSL for it, just to speed things up.
But before I do that, I'm curious: is there a better way?
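For concreteness, here's roughly the shape of script I keep writing. The file, table, and column names below are made up, but the split-a-cell-into-child-rows pattern is typical:

```python
# A minimal sketch of the kind of one-off import script I keep writing.
# File, table, and column names are hypothetical.
import csv
import sqlite3

conn = sqlite3.connect("app.db")
cur = conn.cursor()

with open("clients.csv", newline="") as f:
    for row in csv.DictReader(f):
        # One spreadsheet row becomes a row in `companies`...
        cur.execute(
            "INSERT INTO companies (name, city) VALUES (?, ?)",
            (row["Company Name"].strip(), row["City"].strip()),
        )
        company_id = cur.lastrowid

        # ...and a comma-separated cell fans out into a child table.
        for tag in row["Tags"].split(","):
            tag = tag.strip()
            if tag:
                cur.execute(
                    "INSERT INTO company_tags (company_id, tag) VALUES (?, ?)",
                    (company_id, tag),
                )

conn.commit()
conn.close()
```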
You have to set boundaries, if you can. You should try to provide a template for them to use with the expected data, which includes file type (Excel, CSV, etc.), column names, valid values, and so on. You should allow the user to browse for the file and upload it on your page/form.
Once the file is uploaded, you need to do validation and importation. You can use ADO.NET, file streams, DTS/SSIS, or Office Automation to do this (if you are using the Microsoft stack). In the validation portion, you should tell the user exactly what they did wrong or need to change. This might mean showing the actual data in a datagrid on the validation page and flagging errors with red labels on the exact row/column. If you use Office Automation, you can give them the exact cell number, but the Office PIA is a pain in the neck.
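As a rough sketch of that validation step (the required columns and rules here are just examples), something like this collects the exact row and column of each problem so you can show it back to the user:

```python
# Rough sketch: validate an uploaded CSV and report the exact row/column of
# each problem. The required columns and rules are examples only.
import csv

REQUIRED_COLUMNS = ["Company Name", "City", "Email"]

def validate(path):
    errors = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"Missing column(s): {', '.join(missing)}"]

        for line_no, row in enumerate(reader, start=2):  # row 1 is the header
            if not row["Company Name"].strip():
                errors.append(f"Row {line_no}, column 'Company Name': value is required")
            if "@" not in row["Email"]:
                errors.append(f"Row {line_no}, column 'Email': '{row['Email']}' is not a valid address")
    return errors

# Surface these next to the grid so the user can fix the file and re-upload.
for message in validate("upload.csv"):
    print(message)
```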
Once validation passes, you can import the information however you like. I prefer putting it into a staging table and using a stored proc to load it, but that's just me. Some prefer to use the object model, but this can be very slow if you have a lot of data.
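A minimal sketch of the staging-table approach, assuming SQL Server via pyodbc and a staging table and stored proc that already exist (the DSN, table, and proc names are placeholders):

```python
# Sketch: bulk-insert validated rows into a staging table, then let a stored
# procedure move them into the real tables. DSN, table, and proc names are assumptions.
import pyodbc

conn = pyodbc.connect("DSN=MyWarehouse")  # hypothetical DSN
cur = conn.cursor()
cur.fast_executemany = True

rows = [("Acme", "Portland"), ("Globex", "Austin")]  # already-validated data
cur.executemany(
    "INSERT INTO staging_companies (name, city) VALUES (?, ?)",
    rows,
)

# The stored proc owns the transformation/merge logic.
cur.execute("{CALL usp_LoadCompaniesFromStaging}")
conn.commit()
conn.close()
```

Keeping the merge logic in the proc means a bad batch only ever touches the staging table, and you can reload without touching application code.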
If you are personally loading these files manually and having to go in and manipulate them, I would suggest finding the commonality among them and coming up with a standard to follow. Once you have that, you can make it so the user can do it themselves, or you can do it a lot faster yourself.
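One way to capture that standard is a declarative column map that a single generic loader reads. This is just a sketch, and the mapping format and names are things I made up:

```python
# Sketch: a declarative mapping that encodes the agreed-upon standard,
# so one generic loader can handle every file. Names and rules are hypothetical.
COMPANY_IMPORT_SPEC = {
    "target_table": "companies",
    "columns": {
        "Company Name": {"column": "name", "required": True},
        "City":         {"column": "city", "required": False},
        "Tags":         {"column": "tag", "child_table": "company_tags", "split_on": ","},
    },
}

def map_row(spec, row):
    """Turn one spreadsheet row into (parent_values, child_rows) per the spec."""
    parent, children = {}, []
    for source, rule in spec["columns"].items():
        value = (row.get(source) or "").strip()
        if rule.get("required") and not value:
            raise ValueError(f"'{source}' is required")
        if "child_table" in rule:
            children += [{rule["column"]: v.strip()}
                         for v in value.split(rule["split_on"]) if v.strip()]
        else:
            parent[rule["column"]] = value
    return parent, children
```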
Yes, this is a lot of work, but in the long run, when there is a program that works 95% of the time, everybody wins.
If this is a situation that just can't be automated, then you will probably have to have a vanilla staging table and use SQL to do the import. Load the data into one staging table, do the basic manipulation, and then load it into the staging table that your SQL expects.
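A sketch of that two-step flow driven from a script (SQLite here just to keep it self-contained; the table and column names are placeholders):

```python
# Sketch of the two-step staging flow: raw rows land in a vanilla staging table,
# get cleaned up with SQL, then feed the staging table the real load expects.
# All table/column names are placeholders.
import sqlite3

conn = sqlite3.connect("staging.db")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE IF NOT EXISTS staging_raw (col1 TEXT, col2 TEXT, col3 TEXT);
    CREATE TABLE IF NOT EXISTS staging_companies (name TEXT, city TEXT, signed_up_on TEXT);
""")

# Step 1: everything lands as text in the vanilla staging table.
cur.execute("INSERT INTO staging_raw (col1, col2, col3) VALUES (?, ?, ?)",
            ("Acme", " Portland ", "2023-01-05"))

# Step 2: basic manipulation in SQL, into the staging table the real load expects.
cur.execute("""
    INSERT INTO staging_companies (name, city, signed_up_on)
    SELECT TRIM(col1), TRIM(col2), DATE(col3)
    FROM staging_raw
""")
conn.commit()
conn.close()
```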
I’ve done a lot of imports and worked with a lot of ETL tools, and there really is no easy way to handle it. The only way is to come up with a standard that is reasonable, stick to it, and program around that.
Yeah... that just sucks.
I would go with the script. I assume you have repeating columns that have to match a single row in another table. I would do reasonable matching, and if you encounter a row that the script can't deal with, log it and have someone handle it manually.
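Something along these lines (the matching rule, file, and table names are made up):

```python
# Sketch: try to match each incoming row to an existing parent row; rows the
# script can't confidently handle get logged for manual follow-up.
# Matching rule, file, and table names are made up.
import csv
import logging
import sqlite3

logging.basicConfig(filename="import_failures.log", level=logging.WARNING)

conn = sqlite3.connect("app.db")
cur = conn.cursor()

with open("orders.csv", newline="") as f:
    for line_no, row in enumerate(csv.DictReader(f), start=2):
        cur.execute("SELECT id FROM companies WHERE lower(name) = lower(?)",
                    (row["Company"].strip(),))
        matches = cur.fetchall()

        if len(matches) == 1:
            cur.execute("INSERT INTO orders (company_id, amount) VALUES (?, ?)",
                        (matches[0][0], row["Amount"]))
        else:
            # Zero or multiple matches: don't guess, hand it to a human.
            logging.warning("Row %d: %d match(es) for company '%s', skipped",
                            line_no, len(matches), row["Company"])

conn.commit()
conn.close()
```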