 

Working with huge text files in Java

I was given an English vocabulary assignment by my teacher.

1. Choose a random letter, say 'a'.
2. Write a word starting with that letter, say 'apple'.
3. Take the word's last letter, 'e'.
4. Write a word starting with 'e', say 'elephant'.
5. Continue from 't', and so on. No repetition allowed.

Make a list of 500 words. Mail the list to the teacher. :)

So instead of doing it myself, I am writing Java code that will do my homework for me. The code seems simple.

The core of the algorithm (a sketch follows below):

- Pick a random word from the dictionary that satisfies the requirement, perhaps by calling seek() on a RandomAccessFile.
- Try to put it in a Set that preserves insertion order (maybe a LinkedHashSet).

But the problem is the huge size of the dictionary, with 300,000+ entries. :| Brute-force random algorithms won't work.
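For illustration, here is a minimal sketch of that approach; the file name `dictionary.txt` and the one-lowercase-word-per-line format are assumptions. Loading the whole file once and indexing words by first letter avoids both disk seeking and brute-force retries, and removing each pick from its bucket guarantees no repetition:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

public class WordChain {
    public static void main(String[] args) throws IOException {
        // Assumed: a plain-text dictionary, one lowercase word per line.
        List<String> dictionary = Files.readAllLines(Paths.get("dictionary.txt"));

        // Index words by their first letter so each pick is cheap.
        Map<Character, List<String>> byFirstLetter = new HashMap<>();
        for (String word : dictionary) {
            if (!word.isEmpty()) {
                byFirstLetter.computeIfAbsent(word.charAt(0), k -> new ArrayList<>()).add(word);
            }
        }

        Random random = new Random();
        Set<String> chain = new LinkedHashSet<>();      // preserves insertion order
        char next = (char) ('a' + random.nextInt(26));  // random starting letter

        while (chain.size() < 500) {
            List<String> candidates = byFirstLetter.get(next);
            if (candidates == null || candidates.isEmpty()) {
                break; // dead end: no unused word starts with this letter
            }
            // Remove the pick from its bucket so it can never be chosen twice.
            String pick = candidates.remove(random.nextInt(candidates.size()));
            chain.add(pick);
            next = pick.charAt(pick.length() - 1);
        }
        chain.forEach(System.out::println);
    }
}
```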

What could be the best, quickest and most efficient way out?

**UPDATE:** Now that I have written the code and it's working: how can I make it more efficient, so that it chooses common words? Are there any text files around containing lists of common words?

asked Aug 01 '10 by Nitish Upreti

People also ask

How does Java handle large amounts of data?

Provide more memory to your JVM (usually via -Xmx / -Xms) or don't load all the data into memory at once. For many operations on huge amounts of data there are algorithms that don't need access to all of it simultaneously; one such class is divide-and-conquer algorithms.
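As a concrete example of the first option, heap size is set on the command line when launching the JVM (the class name here is hypothetical; -Xms sets the initial heap, -Xmx the maximum):

```
java -Xms256m -Xmx2g WordChain
```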

How do you process large files?

Process the large file in chunks with a BufferedInputStream, using the same buffer size you would use with a FileChannel. Reading and writing large files in chunks this way performs similarly to using a Scanner.
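A minimal sketch of chunked reading with BufferedInputStream; the file name and the 1 MB buffer size are arbitrary choices, not requirements:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkReader {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[1024 * 1024]; // 1 MB chunks; tune to taste
        try (InputStream in = new BufferedInputStream(
                new FileInputStream("huge.txt"), buffer.length)) {
            long total = 0;
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read; // process the chunk here; this sketch just counts bytes
            }
            System.out.println("bytes read: " + total);
        }
    }
}
```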

How do you search for a specific word in a large text file in Java?

Use the Scanner method findWithinHorizon(). Scanner will internally create a FileChannel to read the file, and for pattern matching it ends up using a Boyer-Moore algorithm for efficient string searching.
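For example (the file name and search word are placeholders):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class FindWord {
    public static void main(String[] args) throws FileNotFoundException {
        try (Scanner scanner = new Scanner(new File("huge.txt"))) {
            // A horizon of 0 means "search to the end of the input";
            // \b anchors the pattern to word boundaries.
            String match = scanner.findWithinHorizon("\\belephant\\b", 0);
            System.out.println(match != null ? "found: " + match : "not found");
        }
    }
}
```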


1 Answer

Either look for a data structure allowing you to keep a compacted dictionary in memory, or simply give your process more memory. Three hundred thousand words is not that much.
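As a rough illustration (the dictionary file name is a placeholder): 300,000 words at roughly ten characters each is only a few megabytes of text, and a sorted array is about as compact as plain Java objects get:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class CompactDictionary {
    public static void main(String[] args) throws IOException {
        // No per-entry node or hash-bucket overhead; lookups are O(log n).
        String[] words = Files.readAllLines(Paths.get("dictionary.txt")).toArray(new String[0]);
        Arrays.sort(words);

        System.out.println("entries: " + words.length);
        System.out.println("contains 'apple': " + (Arrays.binarySearch(words, "apple") >= 0));
    }
}
```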

answered Sep 25 '22 by Thorbjørn Ravn Andersen