I'm working with some log files that are very poorly formatted: the column delimiter often appears inside the fields themselves, and it isn't escaped. For example:
sam,male,september,brown,blue,i like cats, and i like dogs
Where:
name,gender,month,hair,eyes,about
As you can see, the about field contains the column delimiter, so a simple split on the delimiter won't work: it would break the about text into two separate columns. Now imagine this with a chat system... you can visualize the issues, I'm sure.
So, theoretically, what's the best approach to solving this? I'm not looking for a language-specific implementation, more a general pointer in the right direction, or some ideas on how others have solved it... without doing it manually.
Edit:
I should clarify: my actual logs are in a much worse state. Delimiter characters appear inside fields everywhere, and there is no pattern that I can find.
If only the last column has unescaped commas, then most languages' string split implementations let you limit the number of splits made, e.g. in Python s.split(',', 5)
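As a minimal sketch of that idea, using the sample line and column names from the question: the helper name parse_line and the FIELDS list are made up for illustration, and this only holds if the stray commas are confined to the last column (which the question's edit says is not true of the real logs).

```python
# Column names taken from the question; the last one may contain raw commas.
FIELDS = ["name", "gender", "month", "hair", "eyes", "about"]

def parse_line(line):
    """Split one log line into a dict, leaving commas in the last field alone."""
    # maxsplit = number of columns - 1, so everything after the fifth comma
    # stays together as the final field.
    parts = line.rstrip("\n").split(",", len(FIELDS) - 1)
    return dict(zip(FIELDS, parts))

record = parse_line("sam,male,september,brown,blue,i like cats, and i like dogs")
print(record["about"])
```

If a line has fewer than six fields, zip silently drops the missing columns, so a real script would probably want to check len(parts) first.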
If you want to parse the file with a CSV (comma-separated values) parser, then I think the best approach would be to run a fixer that does proper escaping first, and then pass its output to the CSV parser.
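One way such a fixer could look, as a rough sketch rather than a general solution: it assumes a fixed column count of six (from the question's example) and that only the last column carries unescaped commas. The names fix_lines and NUM_COLUMNS are hypothetical. Python's csv.writer takes care of the quoting, so the result can be read back by any standard CSV parser.

```python
import csv
import io

NUM_COLUMNS = 6  # assumption: name, gender, month, hair, eyes, about

def fix_lines(raw_lines, num_columns=NUM_COLUMNS):
    """Rewrite raw log lines as properly quoted CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in raw_lines:
        # Keep everything past the last expected delimiter as one field;
        # csv.writer then quotes it because it contains a comma.
        writer.writerow(line.rstrip("\n").split(",", num_columns - 1))
    return buffer.getvalue()

raw = ["sam,male,september,brown,blue,i like cats, and i like dogs\n"]
fixed = fix_lines(raw)

# The fixed text now parses cleanly with a normal CSV reader.
rows = list(csv.reader(io.StringIO(fixed)))
print(rows[0][-1])
```

When delimiters can show up in any column, as in the edit to the question, the split step would have to be replaced by whatever per-column rules can be found (for example, fields with a known set of values), but the "fix first, parse second" structure stays the same.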