
Understanding this CSV header

Tags: java, parsing, csv

I need to parse a CSV file which has this header:

Company;Registered office;Notifying party;Domicile or Registered office;Holdings of voting rights;;;;;;Publication

;;;;directly held;;additionally counted;;total;;in Germany;;in foreign countries

;;;;percentage;single rights;percentage;single rights;percentage;single rights;Official stock exchange

I was wondering whether this is a standard header format. I expected all the fields to be listed one after another in the first row, like "Holdings of voting rights-directly held-percentage;Holdings of voting rights-directly held-single rights", but instead that information is spread over three lines.

Currently my file has six header lines (the three shown and three more in another language). How can I detect it if, one day, more header lines are added? The file continues with the following line (the first data row) and so on. The first line of real data isn't always the same.

BBS Kraftfahrzeugtechnik AG;Schiltach;Baumgartner, Heinrich;Deutschland;62,5;;37,5;;100,0;;Börsenzeitung;04.04.2002

I'm also looking for Java libraries that can parse CSV files.

cdarwin asked Dec 13 '22 16:12



2 Answers

I disagree with those who claim that only the comma is allowed. Wikipedia, for example, describes German CSV files that use semicolons as the field separator (because commas are used as the decimal separator). I think MS Excel is also fairly flexible about which delimiters it accepts. It's just programmers' minds that gravitate towards the simplest case.
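To illustrate the semicolon/decimal-comma combination, here is a minimal sketch using only the JDK. The class and method names (`SemicolonCsvLine`, `splitFields`, `parseGermanDecimal`) are made up for this example, and the plain `split` assumes that no field contains a quoted semicolon, which a real CSV library would handle for you.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical helper, not part of any library.
class SemicolonCsvLine {

    // Splits one line on ';'. The -1 limit keeps trailing empty fields,
    // which matters for lines that end in ";;".
    // Assumption: fields never contain quoted semicolons.
    static String[] splitFields(String line) {
        return line.split(";", -1);
    }

    // Parses a number written with a decimal comma, e.g. "62,5".
    static double parseGermanDecimal(String text) {
        NumberFormat format = NumberFormat.getInstance(Locale.GERMANY);
        try {
            return format.parse(text.trim()).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    public static void main(String[] args) {
        // The data line from the question ("\u00f6" is the o-umlaut).
        String line = "BBS Kraftfahrzeugtechnik AG;Schiltach;Baumgartner, Heinrich;"
                + "Deutschland;62,5;;37,5;;100,0;;B\u00f6rsenzeitung;04.04.2002";
        String[] fields = splitFields(line);
        System.out.println(fields.length);                 // 12
        System.out.println(parseGermanDecimal(fields[4])); // 62.5
    }
}
```

Note that the comma inside "Baumgartner, Heinrich" is left alone, which is exactly why a semicolon separator is convenient for this kind of data.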

For CSV parsing I recommend Ostermiller Utils.

Q> How can I detect it if, one day, more header lines are added?
A> You can't. The only things you can rely on are either a dynamic layout (where you know the column names in advance) or a static layout (where you assume that a given column is always the n-th).
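As a rough sketch of the dynamic-layout idea, the three header rows from the question can be merged into one name per column, so that a column is then looked up by name instead of by position. `HeaderFlattener` and `flatten` are names invented for this example, and the rule that an empty cell in an upper row inherits the nearest label to its left is my assumption about how the header is meant to be read. The number of header lines still has to be supplied by the caller; this does not detect it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class HeaderFlattener {

    // Returns the trimmed cell at col, or "" if the row is shorter than that.
    private static String cell(String[] row, int col) {
        return col < row.length ? row[col].trim() : "";
    }

    // Builds one name per column by joining the labels of all header rows with '-'.
    // Empty cells in every row except the last inherit the label to their left.
    static List<String> flatten(List<String> headerLines, String delimiter) {
        List<String[]> rows = new ArrayList<>();
        int width = 0;
        for (String line : headerLines) {
            String[] row = line.split(Pattern.quote(delimiter), -1);
            rows.add(row);
            width = Math.max(width, row.length);
        }
        int last = rows.size() - 1;
        String[] carry = new String[Math.max(last, 0)];
        Arrays.fill(carry, "");

        List<String> names = new ArrayList<>();
        for (int col = 0; col < width; col++) {
            List<String> parts = new ArrayList<>();
            for (int r = 0; r < last; r++) {
                String value = cell(rows.get(r), col);
                if (!value.isEmpty()) {
                    carry[r] = value;
                    // A new group starts here, so labels carried in deeper rows are dropped.
                    for (int d = r + 1; d < last; d++) {
                        carry[d] = "";
                    }
                }
                if (!carry[r].isEmpty()) {
                    parts.add(carry[r]);
                }
            }
            // The last header row is taken as-is, without inheriting.
            String leaf = cell(rows.get(last), col);
            if (!leaf.isEmpty()) {
                parts.add(leaf);
            }
            names.add(String.join("-", parts));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
            "Company;Registered office;Notifying party;Domicile or Registered office;"
                + "Holdings of voting rights;;;;;;Publication",
            ";;;;directly held;;additionally counted;;total;;in Germany;;in foreign countries",
            ";;;;percentage;single rights;percentage;single rights;percentage;single rights;"
                + "Official stock exchange");
        List<String> names = flatten(header, ";");
        System.out.println(names.get(5));
        // Holdings of voting rights-directly held-single rights
        System.out.println(names.indexOf("Holdings of voting rights-total-percentage"));
        // 8
    }
}
```

If a lookup by a name you expect returns -1, that is at least a cheap signal that the header layout has changed.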

mindas answered Dec 25 '22 00:12



Despite CSV (Comma-Separated Values) files having the word "comma" in their name, I've seen some very weird stuff in the enterprise world.

I would suggest creating your own representation of the data. It sounds like you may be reading in multiple files all formatted a bit differently?

I would approach the problem in a modular fashion. Have importers for the different formats, and bring the data into a normalized representation that you then do what you want with.

This is all assuming that these files contain the same type of data and that you have no control over the files you are receiving.

Even if this is not the case, abstracting the data from its representation and putting that in a separate project would be useful.
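To make the importer idea a bit more concrete, here is a rough sketch. All type names (`VotingRightsRecord`, `VotingRightsImporter`, `SemicolonImporter`) and the chosen fields are hypothetical, the column positions are taken from the sample line in the question, and the header line count is passed in rather than detected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Normalized representation; the selection of fields is only an example.
final class VotingRightsRecord {
    final String company;
    final String notifyingParty;
    final String totalPercentage; // raw text such as "100,0", not yet parsed

    VotingRightsRecord(String company, String notifyingParty, String totalPercentage) {
        this.company = company;
        this.notifyingParty = notifyingParty;
        this.totalPercentage = totalPercentage;
    }
}

// One implementation per file format; the rest of the code only sees records.
interface VotingRightsImporter {
    List<VotingRightsRecord> importLines(List<String> lines);
}

// Importer for the semicolon-separated layout shown in the question.
final class SemicolonImporter implements VotingRightsImporter {
    private final int headerLineCount;

    SemicolonImporter(int headerLineCount) {
        this.headerLineCount = headerLineCount;
    }

    @Override
    public List<VotingRightsRecord> importLines(List<String> lines) {
        List<VotingRightsRecord> records = new ArrayList<>();
        for (int i = headerLineCount; i < lines.size(); i++) {
            String[] fields = lines.get(i).split(";", -1);
            if (fields.length < 9) {
                continue; // too short to be a data row in this layout
            }
            // Static layout: company = column 0, notifying party = 2, total % = 8.
            records.add(new VotingRightsRecord(fields[0], fields[2], fields[8]));
        }
        return records;
    }
}

class ImporterDemo {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "header line",
            "BBS Kraftfahrzeugtechnik AG;Schiltach;Baumgartner, Heinrich;"
                + "Deutschland;62,5;;37,5;;100,0;;X;04.04.2002");
        VotingRightsImporter importer = new SemicolonImporter(1);
        for (VotingRightsRecord record : importer.importLines(lines)) {
            System.out.println(record.company + " -> " + record.totalPercentage);
        }
    }
}
```

A differently formatted file would get its own `VotingRightsImporter` implementation, while everything downstream keeps working with `VotingRightsRecord`.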

I would also recommend using OpenCSV.

Casey answered Dec 24 '22 23:12
