
Parsing Large Text Files in Real-time (Java)

I'm interested in parsing a fairly large text file in Java (1.6.x) and was wondering what approach(es) would be considered best practice?

The file will probably be about 1 MB in size and will consist of thousands of entries along these lines:

Entry
{
    property1=value1
    property2=value2
    ...
}

etc.

My first instinct is to use regular expressions, but I have no prior experience of using Java in a production environment, so I'm unsure how well the java.util.regex classes perform.

To clarify a bit, my application is going to be a web app (JSP) which parses the file in question and displays the various values it retrieves. There is only ever the one file which gets parsed (it resides in a 3rd party directory on the host).

The app will have a fairly low usage (maybe only a handful of users using it a couple of times a day), but it is vital that when they do use it, the information is retrieved as quickly as possible.

Also, are there any precautions to take around loading the file into memory every time it is parsed?

Can anyone recommend an approach to take here?

Thanks

Chris McAtackney asked Apr 23 '09 11:04



2 Answers

If it's going to be about 1MB and literally in the format you state, then it sounds like you're overengineering things.

Unless your server is a ZX Spectrum or something, just use regular expressions to parse it, whack the data in a hash map (and keep it there), and don't worry about it. It'll take up a few megabytes in memory, but so what...?
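As a rough illustration of the regex-plus-map idea, here is a minimal sketch. It assumes the file really is in the format from the question, that values never contain a closing brace, and that each property sits on its own line. The class name EntryParser and both patterns are made up for this example, and it sticks to Java 1.6 syntax.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
public class EntryParser {
    // One "Entry { ... }" block; group 1 is everything between the braces.
    // Assumption: values never contain '}'.
    private static final Pattern ENTRY =
            Pattern.compile("Entry\\s*\\{([^}]*)\\}");

    // One "key=value" line inside a block; (?m) makes ^ and $ match per line.
    private static final Pattern PROPERTY =
            Pattern.compile("(?m)^[ \\t]*([^=\\s]+)[ \\t]*=[ \\t]*(.*?)[ \\t]*$");

    // Returns one map per entry, in file order; lines without '=' are skipped.
    public static List<Map<String, String>> parse(CharSequence text) {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        Matcher entry = ENTRY.matcher(text);
        while (entry.find()) {
            Map<String, String> props = new LinkedHashMap<String, String>();
            Matcher prop = PROPERTY.matcher(entry.group(1));
            while (prop.find()) {
                props.put(prop.group(1), prop.group(2));
            }
            entries.add(props);
        }
        return entries;
    }
}
```

You would read the file into a String once, call EntryParser.parse(...) on it, and hold on to the resulting list (for example in application scope) rather than re-parsing on every request.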

Update: to give you a concrete idea of performance, some measurements I took of String.split() (which uses regular expressions) show that on a 2GHz machine, it takes milliseconds to split 10,000 100-character strings (in other words, about 1 megabyte of data; actually nearer 2MB in raw bytes, since Java Strings use 2 bytes per char). Obviously, that's not quite the operation you're performing, but you get my point: things aren't that bad...
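If you want to try a measurement like that yourself, something along these lines would do as a crude check. This is only a sketch and not the code behind the figures above: the class name SplitTiming, the synthetic input and the delimiter are all invented for illustration. It splits on a character class because some JVMs may skip the regex engine for plain single-character delimiters. A single timed pass like this is not a rigorous benchmark.

```java
// Hypothetical timing sketch, not the original measurement code.
public class SplitTiming {
    // Builds `count` identical strings of `length` chars,
    // with a comma in every 10th position and 'x' elsewhere.
    static String[] buildInput(int count, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(i % 10 == 9 ? ',' : 'x');
        }
        String[] lines = new String[count];
        for (int i = 0; i < count; i++) {
            lines[i] = sb.toString();
        }
        return lines;
    }

    // Splits every line with a regex delimiter and returns the total token count.
    static int splitAll(String[] lines) {
        int tokens = 0;
        for (String line : lines) {
            tokens += line.split("[,;]").length;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] lines = buildInput(10000, 100);
        splitAll(lines); // warm-up pass so the JIT has a chance to compile the loop
        long start = System.nanoTime();
        int tokens = splitAll(lines);
        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        System.out.println(tokens + " tokens in " + elapsedMs + " ms");
    }
}
```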

Neil Coffey answered Oct 19 '22 05:10



If the format has a proper grammar, use a parser builder such as the GOLD Parsing System. It lets you specify the format and use an efficient generated parser to get the tokens you need, with error handling almost for free.

Lucero answered Oct 19 '22 03:10
