 

Parsing poorly formatted log files?

I'm working with some log files that are very poorly formatted: the column delimiter (often) appears inside a field, and it isn't escaped. For example:

sam,male,september,brown,blue,i like cats, and i like dogs

Where:

name,gender,month,hair,eyes,about

As you can see, the about field contains the column delimiter, so a single split on the delimiter won't work: it would break the about text into two separate columns. Now imagine this with a chat system... you can visualize the issues, I'm sure.

So, theoretically, what's the best approach to solving this? I'm not looking for a language-specific implementation, more a general pointer in the right direction, or some ideas on how others have solved it... without doing it manually.

Edit:

I should clarify: my actual logs are in a much worse state. Fields containing delimiter characters appear everywhere, and there is no pattern that I can find.

sam asked Dec 29 '22 06:12



1 Answer

If only the last column has unescaped commas, then most languages' implementations of string split can limit the number of splits made, e.g. in Python s.split(',',5)
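As a minimal sketch of that idea, assuming the six-column layout from the question and that only the final about field can contain stray commas (the FIELDS list and the parse_line helper are hypothetical names used for illustration):

```python
# Column names taken from the question; only the last one may contain commas.
FIELDS = ["name", "gender", "month", "hair", "eyes", "about"]


def parse_line(line, fields=FIELDS):
    """Split a line into len(fields) parts, leaving extra commas in the last part."""
    # maxsplit = number of fields - 1, so everything after the 5th comma
    # stays together as the final field.
    parts = line.rstrip("\n").split(",", len(fields) - 1)
    return dict(zip(fields, parts))


line = "sam,male,september,brown,blue,i like cats, and i like dogs"
record = parse_line(line)
print(record["about"])
```

This only helps when the unescaped delimiters are confined to the last column; if they can turn up in any field, a split limit alone cannot tell the columns apart.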

If you want to parse the file with a CSV (comma-separated values) parser, then I think the best approach would be to run a fixer that does proper escaping before passing it to the CSV parser.
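One possible shape for such a fixer, sketched with Python's standard csv module. It assumes the simple case where the column count is known (six here) and only the last column holds unescaped commas; fix_lines is a hypothetical helper name, and messier logs would need their own rule for locating field boundaries.

```python
import csv
import io


def fix_lines(lines, n_fields=6):
    """Rewrite raw lines as properly quoted CSV text.

    Assumption: only the last of n_fields columns contains stray commas.
    """
    out = io.StringIO()
    writer = csv.writer(out)  # quotes any field that contains a comma
    for line in lines:
        writer.writerow(line.rstrip("\n").split(",", n_fields - 1))
    return out.getvalue()


raw = ["sam,male,september,brown,blue,i like cats, and i like dogs\n"]
fixed = fix_lines(raw)

# The fixed text can now go through an ordinary CSV parser.
rows = list(csv.reader(io.StringIO(fixed)))
print(rows[0][5])
```

Keeping the fixer as a separate pass means the repair rule can change without touching whatever code consumes the clean CSV.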

Lie Ryan answered Jan 13 '23 22:01
