
Splitting text file without reading it

Tags: java, file

Is there a way to split a text file in Java without reading it?

I want to process a large text file (several GB), so I want to split it into smaller parts, process each part in its own thread, and combine the results.

Since I will be reading the small parts anyway, splitting the file by reading it doesn't make sense: I would have to read the same data twice, which would hurt performance.

RamIndani asked Nov 24 '11 11:11


3 Answers

Your threading approach is ill-conceived. If you have to do significant processing on the file data, consider the following threading structure:

1 Reader Thread (reads the file and feeds the workers)

  • Queue with read chunks

1..n Worker Threads (n depends on your CPU cores; they process the data chunks from the reader thread)

  • Queue or dictionary with processed chunks

1 Writer Thread (writes results to some file)

Maybe you could combine the reader and writer into one thread, because it doesn't make much sense to parallelize I/O on the same physical hard disk.

You clearly need some synchronization between the threads. For the queues in particular, consider semaphores.
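The structure above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `ChunkPipeline`, the method `countWords`, and the word count used as stand-in "processing" are all hypothetical. It also swaps hand-written semaphores for `java.util.concurrent.ArrayBlockingQueue`, which does the blocking and synchronization internally, and it folds the writer step into a single shared counter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: one reader feeds n workers through a bounded queue. */
class ChunkPipeline {
    // Sentinel ("poison pill") telling a worker that no more input will arrive.
    private static final List<String> END = new ArrayList<>();

    /** Counts words in the input; the word count stands in for real processing. */
    static long countWords(BufferedReader source, int workers, int chunkSize)
            throws IOException, InterruptedException {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(2 * workers);
        AtomicLong total = new AtomicLong();

        List<Thread> pool = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    // Identity comparison: only the sentinel object ends the loop.
                    for (List<String> batch = queue.take(); batch != END; batch = queue.take()) {
                        long words = 0;
                        for (String line : batch) {
                            String trimmed = line.trim();
                            if (!trimmed.isEmpty()) {
                                words += trimmed.split("\\s+").length;
                            }
                        }
                        total.addAndGet(words);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            pool.add(t);
        }

        // The calling thread acts as the single reader.
        List<String> chunk = new ArrayList<>(chunkSize);
        try {
            for (String line = source.readLine(); line != null; line = source.readLine()) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    queue.put(chunk); // blocks while the workers are behind
                    chunk = new ArrayList<>(chunkSize);
                }
            }
            if (!chunk.isEmpty()) {
                queue.put(chunk);
            }
        } finally {
            // One sentinel per worker, sent even if reading failed.
            for (int i = 0; i < workers; i++) {
                queue.put(END);
            }
        }
        for (Thread t : pool) {
            t.join();
        }
        return total.get();
    }

    public static void main(String[] args) throws Exception {
        String text = "one two three\nfour five\n\nsix\n";
        System.out.println(countWords(new BufferedReader(new StringReader(text)), 2, 2)); // prints 6
    }
}
```

The bounded queue means the reader stalls instead of buffering the whole file when the workers cannot keep up, so memory use stays proportional to the queue capacity times the chunk size.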

Thomas Maierhofer answered Oct 17 '22 23:10


You can't do that without reading the content of the file. It is not possible.

KV Prajapati answered Oct 17 '22 23:10


I don't think this is possible for the following reasons:

  1. How do you write a file without "reading" it?
  2. You'll need to read the text to know where the character boundaries are (a character is not necessarily encoded as 1 byte). This means that you cannot treat the file as binary.
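The boundary problem in point 2 can be illustrated with a small sketch. It assumes the file is UTF-8, which the question does not state; the class `Utf8SplitDemo` and its methods are hypothetical names. Cutting at an arbitrary byte offset can land inside a multi-byte character, and even in UTF-8, where continuation bytes are recognizable by their bit pattern, finding a safe cut point still means reading bytes around the offset. Other encodings may not allow this kind of local check at all.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch, assuming UTF-8 input. */
class Utf8SplitDemo {
    /** True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Moves a proposed split offset forward until it no longer falls inside a character. */
    static int alignSplit(byte[] utf8, int offset) {
        while (offset < utf8.length && isContinuation(utf8[offset])) {
            offset++;
        }
        return offset;
    }

    public static void main(String[] args) {
        // "na\u00EFve" contains U+00EF, which UTF-8 encodes as two bytes.
        byte[] data = "na\u00EFve".getBytes(StandardCharsets.UTF_8);

        int blind = 3; // falls between the two bytes of U+00EF
        String brokenHead = new String(data, 0, blind, StandardCharsets.UTF_8);
        String brokenTail = new String(data, blind, data.length - blind, StandardCharsets.UTF_8);
        // Both halves now contain a replacement character instead of the original letter.
        System.out.println(brokenHead + " | " + brokenTail);

        int aligned = alignSplit(data, blind); // 4: the first byte after the two-byte character
        String head = new String(data, 0, aligned, StandardCharsets.UTF_8);
        String tail = new String(data, aligned, data.length - aligned, StandardCharsets.UTF_8);
        System.out.println(head + " | " + tail);
    }
}
```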

Is it really not possible to read the file line by line and process it like that? That also saves the additional space that the split files would take up alongside the original. For your reference, reading a text file is simply:

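import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public static void loadFileFromInputStream(InputStream in) throws IOException {
  // Decodes with the default charset; pass an explicit charset to InputStreamReader if the encoding is known.
  BufferedReader reader = new BufferedReader(new InputStreamReader(in));

  // readLine() returns null once the end of the stream is reached.
  String record = reader.readLine();
  while (record != null) {
    // do something with the record
    // ...
    record = reader.readLine();
  }
}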

You're only reading one line at a time, so memory use does not grow with the size of the file. You can also stop at any point. If you're adventurous, you can hand the lines to separate threads to speed up processing. That way, I/O can continue churning along while you process your data.
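Handing lines to other threads could look something like the sketch below. It is only an illustration under assumptions: `LineDispatcher`, `process`, and `processAll` are hypothetical names, the "processing" is just a line-length sum, and lines are grouped into batches so that each task carries enough work to be worth scheduling.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: the reading thread submits batches of lines to a thread pool. */
class LineDispatcher {
    /** Placeholder for real per-record work; here it just measures the line. */
    static int process(String record) {
        return record.length();
    }

    /** Wraps one batch of lines in a task that returns the batch's partial result. */
    static Callable<Long> task(List<String> batch) {
        return () -> {
            long sum = 0;
            for (String record : batch) {
                sum += process(record);
            }
            return sum;
        };
    }

    static long processAll(BufferedReader reader, int threads, int batchSize)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            List<String> batch = new ArrayList<>(batchSize);
            String record;
            while ((record = reader.readLine()) != null) {
                batch.add(record);
                if (batch.size() == batchSize) {
                    results.add(pool.submit(task(batch))); // reading continues while this runs
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                results.add(pool.submit(task(batch)));
            }
            // Combine the partial results in submission order.
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new StringReader("ab\ncde\n\nf\n"));
        System.out.println(processAll(reader, 2, 2)); // prints 6
    }
}
```

One caveat with this shape: the pool created by `Executors.newFixedThreadPool` queues pending tasks without a bound, so if reading is much faster than processing, pending batches can pile up in memory. For very large files a bounded hand-off is the safer choice.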

Good luck! If, for some reason, you do find a solution, please post it here. Thanks!

Jaco Van Niekerk answered Oct 17 '22 21:10