
Designing file processing that handles many file formats, parsing, validation, and persistence

If you had to design a file processing component or system that could take in a wide variety of file formats (including proprietary formats such as Excel), then parse, validate, and store the data in a database, how would you do it?

NOTE: 95% of the time, one line of input data equals one record in the database, but not always.

Currently I'm using custom software I designed to parse, validate, and store customer data in our database. The system identifies a file by its location in the file system (from an FTP drop) and then loads an XML "definition" file. The correct XML is chosen based on where the input file was dropped.

The XML specifies things like the file layout (delimited or fixed width) and field-specific items: length, data type (numeric, alpha, or alphanumeric), and which DB column to store the field in.

   <delimiter><![CDATA[ ]]></delimiter>
   <numberOfItems>12</numberOfItems>
   <dataItems>
    <item>
     <name>Member ID</name>
     <type>any</type>
     <minLength>0</minLength>
     <maxLength>0</maxLength>
     <validate>false</validate>
     <customValidation/>
     <dbColumn>MembershipID</dbColumn>
    </item>

Because of this design, the input files must be text (fixed width or delimited) and must have a one-to-one relation between input data fields and DB columns.
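To make the definition-driven approach concrete, here is a minimal Python sketch of how such an XML file could drive parsing and validation of one delimited line. It is not the actual system: the `<definition>` root element, the `|` delimiter, the second `Age` item, and the function names `load_definition` and `parse_line` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed definition. The <definition> root, the "|"
# delimiter, and the "Age" item are assumptions, not part of the real file.
DEFINITION = """
<definition>
  <delimiter><![CDATA[|]]></delimiter>
  <numberOfItems>2</numberOfItems>
  <dataItems>
    <item>
      <name>Member ID</name>
      <type>any</type>
      <minLength>0</minLength>
      <maxLength>0</maxLength>
      <validate>false</validate>
      <customValidation/>
      <dbColumn>MembershipID</dbColumn>
    </item>
    <item>
      <name>Age</name>
      <type>numeric</type>
      <minLength>1</minLength>
      <maxLength>3</maxLength>
      <validate>true</validate>
      <customValidation/>
      <dbColumn>Age</dbColumn>
    </item>
  </dataItems>
</definition>
"""


def load_definition(xml_text):
    """Return (delimiter, list of field specs) from a definition document."""
    root = ET.fromstring(xml_text.strip())
    items = []
    for item in root.find("dataItems").findall("item"):
        items.append({
            "name": item.findtext("name"),
            "type": item.findtext("type"),
            "min": int(item.findtext("minLength")),
            "max": int(item.findtext("maxLength")),
            "validate": item.findtext("validate") == "true",
            "column": item.findtext("dbColumn"),
        })
    if len(items) != int(root.findtext("numberOfItems")):
        raise ValueError("numberOfItems does not match the items listed")
    return root.findtext("delimiter"), items


def parse_line(line, delimiter, items):
    """Split one input line and map each field to its DB column."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(items):
        raise ValueError(f"expected {len(items)} fields, got {len(fields)}")
    record = {}
    for value, item in zip(fields, items):
        if item["validate"]:
            if item["type"] == "numeric" and not value.isdigit():
                raise ValueError(f"{item['name']}: not numeric: {value!r}")
            if not item["min"] <= len(value) <= item["max"]:
                raise ValueError(f"{item['name']}: bad length: {value!r}")
        record[item["column"]] = value
    return record


delimiter, items = load_definition(DEFINITION)
print(parse_line("A17|42", delimiter, items))
```

The one-field-to-one-column limitation is visible in `parse_line`: each value is written straight to the column named by `dbColumn`, with no step in between where a field could be split, combined, or fanned out into several records.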

I'd like to extend the capabilities of our file processing system to take in Excel, or other file formats.

There are at least a half dozen ways I can proceed but I'm stuck right now because I don't have anyone to really bounce the ideas off of.

Again: if you had to design a file processing component that could take in a wide variety of file formats (including proprietary formats such as Excel), then parse, validate, and store the data in a database, how would you do it?

BoxOfNotGoodery asked Nov 15 '22 14:11


1 Answer

Well, a straightforward design is something like...

+-----------+
| reader1   |
|           |---
+-----------+   \---
                    \---   +----------------+               +-------------+
                        \--|  validation    |               |  DB         |
                       /---|                |---------------|             |
+-----------+    /-----    +----------------+               +-------------+
| reader2   |----
|           |
+-----------+

Readers take care of file validation (does the data exist?) and parsing, the validation stage takes care of any business logic, and the DB... is a DB.
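The diagram can be sketched in a few lines of Python. This is only an illustration under assumptions: the `members` table, the `member_id` and `age` fields, and every function name here are hypothetical, and SQLite stands in for whatever database is actually used. An Excel reader would be one more function that yields the same kind of dict, built on a third-party spreadsheet library.

```python
import csv
import io
import sqlite3


def read_delimited(stream, delimiter=","):
    """Reader 1: one dict per row, keyed by the header row."""
    yield from csv.DictReader(stream, delimiter=delimiter)


def read_fixed_width(stream, layout):
    """Reader 2: layout is a list of (name, start, end) column slices."""
    for line in stream:
        line = line.rstrip("\n")
        if line:
            yield {name: line[start:end].strip() for name, start, end in layout}


def validate(record):
    """Business rules, independent of where the record came from."""
    errors = []
    if not record.get("member_id"):
        errors.append("member_id is required")
    if not (record.get("age") or "").isdigit():
        errors.append("age must be numeric")
    return errors


def store(conn, record):
    conn.execute(
        "INSERT INTO members (member_id, age) VALUES (?, ?)",
        (record["member_id"], int(record["age"])),
    )


def process(conn, records):
    """Validate each record, store the good ones, return the rejects."""
    rejected = []
    for record in records:
        errors = validate(record)
        if errors:
            rejected.append((record, errors))
        else:
            store(conn, record)
    conn.commit()
    return rejected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (member_id TEXT, age INTEGER)")
csv_input = io.StringIO("member_id,age\nA1,30\n,x\n")
print(process(conn, read_delimited(csv_input)))
```

Because `validate` and `store` only ever see dicts, adding a format means adding a reader and nothing else.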

So part of what you'd have to design is the generic reader-to-validator data container (call it GR2V). That's more of a business-logic kind of container. I suspect you want the same kind of data regardless of the input format, so designing GR2V is not going to be too hard.
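One possible shape for that container, as a rough sketch: a small dataclass that every reader fills in the same way. The class name `ParsedRecord`, its attributes, and the `check` function are hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedRecord:
    """What a reader hands to validation, whatever the input format was."""
    source: str        # file the record came from
    line_number: int   # position in the source, for error reporting
    fields: dict       # DB column name -> raw string value
    errors: list = field(default_factory=list)

    @property
    def is_valid(self):
        return not self.errors


def check(record):
    """Example rule: MembershipID must be present (an assumed rule)."""
    if not record.fields.get("MembershipID"):
        record.errors.append(
            f"{record.source}:{record.line_number}: MembershipID is required"
        )
    return record


ok = check(ParsedRecord("members.txt", 1, {"MembershipID": "A17"}))
bad = check(ParsedRecord("members.xlsx", 2, {"MembershipID": ""}))
print(ok.is_valid, bad.errors)
```

Keeping `source` and `line_number` on the record means a rejection can be traced back to the customer's file without the validator knowing anything about file formats.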

You can make this polymorphic by designing a GR2V superclass with the Validator method and the data members; each reader then subclasses GR2V and fills in the data with its own ReadParseFile method. That introduces a bit more coupling than a strictly procedural approach, though. I'd go procedural for this, since the data is being processed procedurally in the conceptual design.
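For comparison, here is a small sketch of both options side by side. The pipe-delimited format, the `member_id` rule, the `.psv` extension, and all names other than GR2V are assumptions for illustration.

```python
import io
from abc import ABC, abstractmethod


# Option 1: polymorphic. Each reader inherits the data and the validator.
class GR2V(ABC):
    def __init__(self):
        self.rows = []

    @abstractmethod
    def read_parse_file(self, stream):
        """Fill self.rows from the input stream."""

    def validate(self):
        """Return the rows that break the (assumed) member_id rule."""
        return [row for row in self.rows if not row.get("member_id")]


class PipeReader(GR2V):
    def read_parse_file(self, stream):
        for line in stream:
            member_id, name = line.rstrip("\n").split("|")
            self.rows.append({"member_id": member_id, "name": name})


# Option 2: procedural. Readers are plain functions picked from a table,
# and validation is a separate function that knows nothing about readers.
def read_pipe(stream):
    keys = ("member_id", "name")
    return [dict(zip(keys, line.rstrip("\n").split("|"))) for line in stream]


def find_invalid(rows):
    return [row for row in rows if not row.get("member_id")]


READERS = {".psv": read_pipe}  # hypothetical extension-to-reader table


def process(suffix, stream):
    rows = READERS[suffix](stream)
    return rows, find_invalid(rows)


data = "1|Ann\n|Bob\n"
reader = PipeReader()
reader.read_parse_file(io.StringIO(data))
print(reader.validate())
print(process(".psv", io.StringIO(data))[1])
```

In the first option a change to the validator touches every reader's base class; in the second the readers and the validator only share the shape of a row.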

Paul Nathan answered Dec 06 '22 11:12