Java

Question

I'd like to read the "text8" corpus in Java and reformat some words. The problem is that all words in this 100MB corpus are on a single line. So if I try to load it with BufferedReader and readLine, the whole file is pulled into memory at once, and my program can't cope with splitting all the words into one list/array.

So my question: is it possible in Java to read a corpus word by word instead of line by line? For example, since all the words are on one line, could I read 100 words per iteration?

nafas · Accepted Answer

You can try using Scanner and set the delimiter to whatever suits you:

Scanner input = new Scanner(myFile); // myFile is a java.io.File; may throw FileNotFoundException
input.useDelimiter(" +"); // delimiter is one or more spaces

while (input.hasNext()) {
  System.out.println(input.next());
}
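To get the "100 words per iteration" behaviour from the question, the Scanner loop can collect words into fixed-size batches. This is only a minimal sketch: the class name WordBatchReader, the method readInBatches, and the file name "text8" in main are made up for illustration, and it assumes the corpus is UTF-8 and that the default whitespace delimiter is good enough.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;

// Hypothetical helper, not part of the original answer.
public class WordBatchReader {

    // Reads whitespace-separated words and passes them to the handler in
    // lists of at most batchSize words. Returns the total number of words.
    static long readInBatches(Path file, int batchSize, Consumer<List<String>> handler)
            throws IOException {
        long total = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             Scanner input = new Scanner(reader)) {
            List<String> batch = new ArrayList<>(batchSize);
            while (input.hasNext()) {
                batch.add(input.next());
                total++;
                if (batch.size() == batchSize) {
                    handler.accept(batch);
                    batch = new ArrayList<>(batchSize); // start a fresh batch
                }
            }
            if (!batch.isEmpty()) {
                handler.accept(batch); // last, possibly shorter, batch
            }
            // Scanner hides read errors; rethrow one if it occurred.
            IOException error = input.ioException();
            if (error != null) {
                throw error;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // "text8" is an assumed default path; pass the real one as an argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "text8");
        long total = readInBatches(file, 100, batch ->
                System.out.println(batch.size() + " words, first: " + batch.get(0)));
        System.out.println("total words: " + total);
    }
}
```

Only one batch is held at a time, so memory use stays small no matter how long the single line is, as long as the handler does not keep the batches around.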

MiKE · Answer

I would suggest using a character stream with FileReader.

Here is the example code from http://www.tutorialspoint.com/java/java_files_io.htm:

import java.io.*;

public class CopyFile {
   public static void main(String args[]) throws IOException
   {
      FileReader in = null;
      FileWriter out = null;

      try {
         in = new FileReader("input.txt");
         out = new FileWriter("output.txt");

         int c;
         while ((c = in.read()) != -1) {
            out.write(c);
         }
      }finally {
         if (in != null) {
            in.close();
         }
         if (out != null) {
            out.close();
         }
      }
   }
}

FileReader reads the file one 16-bit character (a Java char) at a time, so it doesn't matter that your text is all on one line.

Since you're trying to go through the text word by word, you can simply read until you hit a space, and there's your word.
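The "read until you hit a space" idea could look roughly like the sketch below. The class name CharWordReader and the method forEachWord are invented for this example. It treats any whitespace character as a word boundary, wraps the file in a buffered reader so each read() call does not go to disk, and assumes the input is UTF-8; "input.txt" is just the file name from the example above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;

// Hypothetical helper, not part of the original answer.
public class CharWordReader {

    // Reads one character at a time, builds up the current word, and passes
    // each finished word to the handler. Returns the number of words seen.
    static int forEachWord(Reader in, Consumer<String> handler) throws IOException {
        StringBuilder word = new StringBuilder();
        int count = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (word.length() > 0) {       // end of a word
                    handler.accept(word.toString());
                    count++;
                    word.setLength(0);         // reuse the buffer
                }
            } else {
                word.append((char) c);
            }
        }
        if (word.length() > 0) {               // last word with no trailing space
            handler.accept(word.toString());
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "input.txt";
        try (BufferedReader in = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            int n = forEachWord(in, w -> System.out.println(w));
            System.out.println("words: " + n);
        }
    }
}
```

Only the current word is kept in memory, so the length of the line does not matter.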

Java - How to read a big file word by word instead of line by line?

Rainflow
