I have requirements from our client where we basically have to 'parse' PDF files from various different sources.
The solution we have come with, as 'phase 1' (as we have short time to market and will save them a huge amount of time) is
Manually use Able2Extract application to pull out the columns you need from the PDF file, and spit out an Excel file. This excel file is still very 'dirty' as it contains tons of header information, extra fields that we don't need, etc..
Run our application, feeding it this excel file, which will do the remaining of the cleanup. It takes this 'dirty' Excel file and then gives them a very clean excel file which just has the 3 or 4 columns they need all lines up very neatly.
The first solution we are exploring is using VBA/Excel for step 2). They take their dirty output, paste it in Excel, then run our cleanup macro. Excel is great for this sort of stuff - shifting around and scrubbing data that's already in an Excel spreadsheet. We did a proof of concept with one specific 'source' file, and it came out great. Tooks around half a day to develop this one 'scrubbing script'...
Simple enough huh? Not really. This script only works for one specific file type from one specific source. We will have 10 different sources each with possible 3-10 different file types. That means in the end, we may wind up with a huge Excel macro that has 120 of these very specific 'scrubbing scripts'. So my worry is about long term maintainability here. We might also bump into files that we had never seen before that might 'break' our scrubbing script and have to do a quick re-deply / change to a scrubbing script... I've never used Visual Studio Tools for Office and have minimal experience with VBA Excel Macros - but it seems like this might be a good case here.
Any words of wisdom from someone who might have done something similar to this before? Are huge VBA Macros like what could result here nightmares to maintain? Is VSTFO a good alternative that will give me that 'easy to shift/scrub data' functionality, but with scalability and robustness? To be honest; my first instinct was a pure .NET solution with dynamically compiled scripts pulled from the database, using our Syncfusion Excel API to do the cleaning/scrubbing... but perhaps this is overkill.
First, you are going to need 'n' scrubbing programs no matter what. The fact is, that Excel/VBA is not much worse for maintaining this functionality than many other platforms.
You could add an interface using Userform, or play the auto-detect game, spitting out any 'new' file formats that it doesn't understand. There are several robust error handling schemes available as well, so there's no need to worry that things will get broken.
One Oil company paid me to write an Excel application using 4 Userforms and over 5000 lines of VBA as a tool to assist its accountants in doing monthly joint venture reporting. The application was used for 4 years past its End-of-life because the interface was so familiar and easy to use.
...sorry for rambling on about this, but there is a tendency to 'look down' on VBA because so few 'real programmers' use it...
VBA is much much easier to deal with than VSTO. OK, VBA may not be such a nice language to work with, but at least it gives on-the-metal access to the Excel object model. And a solution based on VBA is likely to be much more stable than one built on VSTO.
I'd say go with VBA, and if you're concerned about the maintainability, think about storing the "scrubbing scripts" in separate files. You can either
(a) have one Excel file per scrubbing script, each with a single macro with the same name; your add-in can load (and execute code in) the appropriate Excel file for any given input file
(b) have one text file per scrubbing script, each with the text of the same macro as above; your add-in can create import this as a new module at run-time - either into itself or into a temporary workbook. This is less efficient, but plays better with version control systems, since you can diff between versions of text files but it's not so easy to diff the modules in two Excel workbooks.
In both of these cases, you can store the scrubbing scripts in a shared folder so that you have centralized update if you need to modify a script.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With