I'd like to read the "text8" corpus in Java and reformat some words. The problem is that in this 100 MB corpus all words are on one line. So if I try to load it with BufferedReader and readLine, it pulls the whole file into memory at once and can't cope with splitting all the words into one list/array.
So my question: is it possible in Java to read a corpus word by word instead of line by line? For example, since all words are on one line, could I read 100 words per iteration?
You can try using Scanner and set the delimiter to whatever suits you:
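// myFile is a java.io.File; needs import java.util.Scanner
Scanner input = new Scanner(myFile);
input.useDelimiter(" +"); // delimiter is one or more spaces
while (input.hasNext()) {
    // next() returns one word at a time, so the whole line is never held in memory
    System.out.println(input.next());
}
input.close();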
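To get the "100 words per iteration" behaviour from the question, the same Scanner loop can be wrapped in a small helper that collects tokens into a batch. This is only a sketch: the class name BatchWordReader, the helper nextBatch, and the file path "text8" are made up for illustration, and it uses Scanner's default whitespace delimiter rather than " +".

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class BatchWordReader {

    // Hypothetical helper: pulls at most batchSize tokens from the scanner.
    // Returns an empty list once the input is exhausted.
    static List<String> nextBatch(Scanner input, int batchSize) {
        List<String> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && input.hasNext()) {
            batch.add(input.next());
        }
        return batch;
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Assumption: the corpus path is passed as the first argument,
        // falling back to "text8" in the working directory.
        File corpus = new File(args.length > 0 ? args[0] : "text8");
        try (Scanner input = new Scanner(corpus)) {
            List<String> batch;
            while (!(batch = nextBatch(input, 100)).isEmpty()) {
                // Reformat or process the current 100 words here.
                System.out.println(batch.size() + " words, first: " + batch.get(0));
            }
        }
    }
}
```

Only one batch is held in memory at a time. Scanner is convenient but not especially fast, so for a 100 MB file it may be worth measuring against a plain Reader loop.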
I would suggest using a character stream with FileReader.
Here is the example code from http://www.tutorialspoint.com/java/java_files_io.htm
import java.io.*;

public class CopyFile {
    public static void main(String args[]) throws IOException
    {
        FileReader in = null;
        FileWriter out = null;
        try {
            in = new FileReader("input.txt");
            out = new FileWriter("output.txt");
            int c;
            // read() returns one character at a time, or -1 at end of file
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        } finally {
            if (in != null) {
                in.close();
            }
            if (out != null) {
                out.close();
            }
        }
    }
}
It reads the file one character at a time (a Java char is a 16-bit UTF-16 code unit), so it doesn't matter that your text is all on one line.
Since you're trying to process the text word by word, you can simply keep reading until you hit a space, and there's your word.
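A rough sketch of that "read until a space" idea follows. The class name WordByWordReader and the method forEachWord are invented for this example, and it treats any whitespace character as a separator, not just a space. It takes any Reader, so for the real corpus you would pass something like new BufferedReader(new FileReader("text8")), where the path is an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class WordByWordReader {

    // Hypothetical helper: reads one character at a time and hands each
    // completed word to the handler, so only the current word is buffered.
    static void forEachWord(Reader reader, Consumer<String> handler) throws IOException {
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isWhitespace(c)) {
                // A separator ends the current word, if there is one.
                if (word.length() > 0) {
                    handler.accept(word.toString());
                    word.setLength(0);
                }
            } else {
                word.append((char) c);
            }
        }
        // Emit the last word when the input does not end with whitespace.
        if (word.length() > 0) {
            handler.accept(word.toString());
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> words = new ArrayList<>();
        // A StringReader stands in for the file to keep the demo self-contained.
        try (Reader in = new StringReader("anarchism originated  as a term")) {
            forEachWord(in, words::add);
        }
        System.out.println(words); // prints [anarchism, originated, as, a, term]
    }
}
```

Wrapping the FileReader in a BufferedReader matters here: calling read() on an unbuffered FileReader for every character of a 100 MB file is likely to be noticeably slower.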