
How do you go about batch processing poorly formatted text files?

Tags:

parsing

People complain a lot about XML, but compared to EDI and some of the proprietary file formats I've dealt with in my career, I think XML is bliss. The work I did on importing data files from Automotive Comparative Raters, each with its own creative and nightmarish file format, still gives me nightmares.

That being said, I'm curious how other programmers approach automated parsing of poorly formatted text files. Do you have a language preference? Are there any automation tools that you find invaluable? How do you make your code reusable?

MyItchyChin asked May 06 '26 13:05



2 Answers

A solution I learned about quite recently is using a standalone lexer. You get to use structured regular expressions, and you avoid the constraints of a full-blown parser generator.

Here are some examples with ocamllex (the lexer generator provided with OCaml):

  • an ocamllex tutorial with some examples.
  • processing of loosely formatted GenBank text files (another link illustrates the point better, but it is hindered by a JavaScript dialog).

Obviously lexer generators are also available in other languages if using OCaml is an issue for you.
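
For readers who do not use OCaml, the same idea can be roughly sketched in Python with the standard re module: each token gets a named regular expression, and a catch-all rule keeps the lexer from choking on garbage. The token names and the sample input here are hypothetical, chosen only for illustration, and this is a hand-rolled stand-in rather than a real lexer generator such as ocamllex.

```python
import re

# Hypothetical token rules for a loosely formatted record file.
# Order matters: earlier alternatives win when several could match.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("WORD",    r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NEWLINE", r"\n"),
    ("SKIP",    r"[ \t]+"),
    ("JUNK",    r"."),  # catch-all so malformed input never stops the lexer
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Yield (kind, value, line_number) tuples, skipping whitespace."""
    line = 1
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "NEWLINE":
            line += 1
        elif kind != "SKIP":
            # JUNK tokens are reported rather than dropped, so the caller
            # can log where the file deviates from the expected format.
            yield (kind, match.group(), line)


if __name__ == "__main__":
    for token in tokenize("LOCUS  AB123 42\n??  3.5"):
        print(token)
```

Reporting junk as ordinary tokens with a line number, instead of raising an error, makes it easier to see how badly a given file deviates before deciding how to handle it.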

bltxd answered May 10 '26 14:05



Perl / Python, build up functionality slowly, keep the worst ones as test cases, lots of coffee.
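
One way this workflow might look in Python, as a minimal sketch: a tolerant line parser plus a list of the ugliest lines seen so far, each paired with the result it should produce. The "name, separator, amount" line format, the sample lines, and the names parse_line, REGRESSION_CASES, and failing_cases are all assumptions made up for illustration.

```python
import re

# Hypothetical format: a name, one of several separators, then an amount
# that may carry a dollar sign and thousands separators.
LINE_RE = re.compile(
    r"^\s*(?P<name>[A-Za-z][A-Za-z ]*?)\s*[:;,|]\s*\$?\s*"
    r"(?P<amount>\d[\d,]*(?:\.\d+)?)\s*$"
)


def parse_line(line):
    """Return (name, amount) for a recognised line, or None otherwise."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group("name"), float(match.group("amount").replace(",", ""))


# Every nasty line found in real files gets added here with its expected
# result, so later changes to the parser cannot silently break it.
REGRESSION_CASES = [
    ("Smith: 1200.50", ("Smith", 1200.5)),
    ("  Jones ;$1,200", ("Jones", 1200.0)),
    ("Van Dyke | $ 75", ("Van Dyke", 75.0)),
    ("### page 3 ###", None),  # page furniture should be rejected, not parsed
]


def failing_cases():
    """Return (raw, expected, actual) for every case the parser gets wrong."""
    results = [(raw, expected, parse_line(raw)) for raw, expected in REGRESSION_CASES]
    return [item for item in results if item[1] != item[2]]


if __name__ == "__main__":
    for raw, expected, actual in failing_cases():
        print(f"FAIL {raw!r}: expected {expected!r}, got {actual!r}")
```

Each time a new file breaks the parser, the offending line goes into the list first, and the regular expression is only widened afterwards.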

Martin Beckett answered May 10 '26 16:05



