
Java - How to read a big file word by word instead of line by line?

Tags:

java

I'd like to read the "text8" corpus in Java and reformat some words. The problem is that all the words in this 100MB corpus are on a single line. So if I load it with BufferedReader and readLine, the whole corpus is read at once, which uses too much memory, and I can't then split the words into a list or array.

So my question: is it possible in Java to read a corpus word by word instead of line by line? For example, since all the words are on one line, could I read 100 words per iteration?

Rainflow asked Nov 04 '15 10:11


2 Answers

You can try using Scanner and set the delimiter to whatever suits you:

Scanner input = new Scanner(myFile); // myFile: a java.io.File pointing at the corpus
input.useDelimiter(" +"); // delimiter is one or more spaces

while (input.hasNext()) {
  System.out.println(input.next());
}
input.close();
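Since the question asks for something like 100 words per iteration, here is a minimal sketch of how the Scanner approach could be batched. The class name WordBatches, the helper nextBatch, and the fallback sample text are all made up for illustration, and it relies on Scanner's default whitespace delimiter rather than " +".

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WordBatches {
    // Hypothetical helper: pulls up to batchSize tokens from the scanner.
    // Returns fewer tokens (or an empty list) once the input runs out.
    static List<String> nextBatch(Scanner in, int batchSize) {
        List<String> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && in.hasNext()) {
            batch.add(in.next());
        }
        return batch;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the corpus path is passed as the first argument.
        // With no argument, a tiny in-memory sample is used so the sketch still runs.
        Reader source = args.length > 0
                ? new FileReader(args[0])
                : new StringReader("anarchism originated as a term of abuse");
        try (Scanner in = new Scanner(source)) {
            List<String> batch;
            while (!(batch = nextBatch(in, 100)).isEmpty()) {
                // Reformat or store the words here; the loop only holds this batch.
                System.out.println(batch.size() + " words, first: " + batch.get(0));
            }
        }
    }
}
```

The point of the batching is that only the current 100 words are referenced at any time, so the full corpus never has to sit in a single String or list.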
nafas answered Oct 26 '22 01:10


I would suggest using a character stream with FileReader.

Here is the example code from http://www.tutorialspoint.com/java/java_files_io.htm:

import java.io.*;

public class CopyFile {
   public static void main(String args[]) throws IOException
   {
      FileReader in = null;
      FileWriter out = null;

      try {
         in = new FileReader("input.txt");
         out = new FileWriter("output.txt");

         int c;
         while ((c = in.read()) != -1) {
            out.write(c);
         }
      }finally {
         if (in != null) {
            in.close();
         }
         if (out != null) {
            out.close();
         }
      }
   }
}

Each call to read() returns a single 16-bit character, so it doesn't matter that your text is all on one line.

Since you want to process the text word by word, you can simply read until you hit a space, and there's your word.
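The "read until you hit a space" idea might look roughly like the sketch below. The class name WordReader and the helper nextWord are hypothetical, the sample text is invented, and treating any whitespace (not just a space) as a separator is an assumption. For a real file you would probably pass a BufferedReader wrapped around a FileReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class WordReader {
    // Hypothetical helper: returns the next whitespace-delimited word,
    // or null once the end of the input has been reached.
    static String nextWord(Reader in) throws IOException {
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (word.length() > 0) {
                    return word.toString(); // a separator ends the current word
                }
                // otherwise: skip leading or repeated separators
            } else {
                word.append((char) c);
            }
        }
        // End of input: hand back the last word if there is one.
        return word.length() > 0 ? word.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample so the sketch runs as-is; swap in
        // new BufferedReader(new FileReader(path)) for an actual corpus.
        try (Reader in = new BufferedReader(new StringReader(" read  word by word "))) {
            String word;
            while ((word = nextWord(in)) != null) {
                System.out.println(word);
            }
        }
    }
}
```

Only one word is buffered at a time, so memory use stays small no matter how long the single line is.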

MiKE answered Oct 26 '22 03:10