How to validate a CSV file?




How can we validate a CSV file?

I have a CSV file with the following structure:

and so on, for roughly 80,000 rows.

How can I validate this CSV file before I start parsing it with fgetcsv?

I would not try to validate the file beforehand. I would rather go through it line by line, dealing with each line separately:

  • Reading one line
  • Verifying it's OK
  • Using the data
  • Going on to the next line

Now, what could "verifying it's OK" mean?

  • At the very least: make sure I can read the line as CSV with my normal set of functions (maybe fgetcsv, maybe some other function specific to my project). If the function that reads hundreds of other lines cannot read this one, there is probably a problem on that line.
  • Then, check the number of fields
  • Then, for each field, check that it contains "valid" data
    • mandatory? optional?
    • numeric?
    • string?
    • date?
    • and so on
  • Then, for each field, do some more careful checks
    • for instance, for a "code" field: does it correspond to a value that's legal for my application?

If all of that goes OK, there's not much more to do except use the data ;-)
And when you're done with one line, just repeat for the next one.
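
As a rough sketch of that loop in PHP: the function names `validateRow()` and `processCsv()`, the three-field layout, and the "must be an integer" rules are assumptions on my part, so adapt them to what your file really contains.

```php
<?php
// Hypothetical per-row validator: returns a list of error messages (empty = OK).
// The field count (3) and the rules per field are assumptions for illustration.
function validateRow(array $row): array
{
    if (count($row) !== 3) {
        return ['expected 3 fields, got ' . count($row)];
    }
    [$date, $code, $amount] = $row;
    $errors = [];
    if (trim($date) === '') {
        $errors[] = 'first field is mandatory';
    }
    if (!ctype_digit($code)) {
        $errors[] = 'second field must be an integer';
    }
    if (!ctype_digit($amount)) {
        $errors[] = 'third field must be an integer';
    }
    return $errors;
}

// Reads the file one line at a time, hands valid rows to $useRow, and
// returns the errors found, keyed by line number.
function processCsv(string $path, callable $useRow): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $problems = [];
    $lineNo = 0;
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $lineNo++;
        if ($row === [null]) {
            continue; // fgetcsv returns [null] for a blank line
        }
        $errors = validateRow($row);
        if ($errors !== []) {
            $problems[$lineNo] = $errors;
            continue;
        }
        $useRow($row);
    }
    fclose($handle);
    return $problems;
}
```

Since only one line is held at a time, memory use should stay flat however many rows the file has.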

Of course, if you want to accept or reject the whole file before writing anything to the database (or doing anything else with the data), you'll have to:

  • parse the file, line by line, applying the "verifying" ideas
  • store the data of each line in memory
  • and, when the whole file has been read into memory,
    • either start using the data
    • or, if there's been an error on one line, reject everything.
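
A minimal sketch of that all-or-nothing variant could look like this. The name `loadAllOrNothing()` is hypothetical, and the `$validate` callback (returning `null` when a row is fine, or an error message otherwise) is a convention I'm assuming here, not something fgetcsv gives you.

```php
<?php
// Reads the whole file into memory, validating every row on the way.
// Returns [$rows, $errors]: $rows is only filled when no line failed.
function loadAllOrNothing(string $path, callable $validate): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $rows = [];
    $errors = [];
    $lineNo = 0;
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $lineNo++;
        if ($row === [null]) {
            continue; // blank line
        }
        $error = $validate($row); // null when OK, a message otherwise
        if ($error !== null) {
            $errors[$lineNo] = $error;
        } else {
            $rows[] = $row;
        }
    }
    fclose($handle);
    // One bad line is enough to reject everything.
    return $errors === [] ? [$rows, []] : [[], $errors];
}
```

The caller would then check that the second element is empty before using the rows; otherwise it can report every failing line number at once.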

In your specific case, you have three kinds of fields:


From what I can guess:

  • The first one must be a date
    • Validating that with a regex will not be easy: months don't all have the same number of days, and the number of days in February depends on the year.
    • In such a case, I would probably try to parse the date with something like strtotime (not sure it's OK for the format you're using, though)
    • Or I would just explode the string, making sure
      • that there are three parts
      • that the third one is 2 digits
      • that the second one is one of Jan, Feb, Mar, ...
      • that the first one is a valid day number, given the two others
  • The second one:
    • must be an integer
    • must be a valid value that exists in your database?
      • If so, a simple SQL query will allow you to check that
  • For the third one, I'm not really sure...
    • I'm guessing it has to be an integer?
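
To illustrate the "explode" approach for the date, here is a small sketch. The function name `isValidDate()`, the `-` separator, and mapping a two-digit year to 20xx are all assumptions, since I can only guess at your exact format. PHP's built-in `checkdate()` takes care of the days-per-month and leap-year rules.

```php
<?php
// Hypothetical validator for dates shaped like "29-Feb-08".
// Assumptions: "-" separator, English month abbreviations, years are 20xx.
function isValidDate(string $value, string $separator = '-'): bool
{
    $parts = explode($separator, $value);
    if (count($parts) !== 3) {
        return false;
    }
    [$day, $month, $year] = $parts;

    $months = [
        'Jan' => 1, 'Feb' => 2, 'Mar' => 3, 'Apr' => 4,
        'May' => 5, 'Jun' => 6, 'Jul' => 7, 'Aug' => 8,
        'Sep' => 9, 'Oct' => 10, 'Nov' => 11, 'Dec' => 12,
    ];
    if (!isset($months[$month])) {
        return false;
    }
    // The year must be exactly two digits, the day one or two digits.
    if (strlen($year) !== 2 || !ctype_digit($year)) {
        return false;
    }
    if (strlen($day) > 2 || !ctype_digit($day)) {
        return false;
    }
    // checkdate() knows how many days each month has, leap years included.
    return checkdate($months[$month], (int) $day, 2000 + (int) $year);
}
```

With those assumptions, `isValidDate('29-Feb-08')` is accepted (2008 was a leap year) while `isValidDate('29-Feb-09')` and `isValidDate('31-Apr-09')` are rejected.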
