Read large files in Java

I need advice from someone who knows Java well and understands its memory issues. I have a large file (around 1.5 GB) and I need to split it into many smaller files (100 files, for example).

I know generally how to do it (using a BufferedReader), but I would like to know whether you have any advice regarding memory, or tips on how to do it faster.

My file contains text, not binary data, and there are about 20 characters per line.

asked Mar 01 '10 by CC



8 Answers

To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign it to variables outside the loop). Just process the output immediately as soon as the input comes in.

It really doesn't matter whether you use BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest. At most it will cost a few percent of performance. The same applies to using NIO: it only improves scalability, not memory use. It only becomes interesting when you have hundreds of threads running on the same file.

Just loop through the file, write every line immediately to another file as you read it in, count the lines, and when the count reaches 100, switch to the next file, and so on.

Kickoff example:

String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;

try {
    reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
    int count = 0;
    for (String line; (line = reader.readLine()) != null;) {
        // Every `maxlines` lines, close the current output file and open the next one.
        if (count++ % maxlines == 0) {
            close(writer);
            writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
        }
        writer.write(line);
        writer.newLine();
    }
} finally {
    close(writer);
    close(reader);
}

// Null-safe close helper used above, so the snippet is complete.
private static void close(Closeable resource) {
    if (resource != null) {
        try {
            resource.close();
        } catch (IOException ignore) {
            // A failure to close is not interesting here.
        }
    }
}
answered by BalusC


First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along linebreaks, then using BufferedReader is OK (assuming the file contains lines of a sensible length).

Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).

If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io.
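For illustration, here is a minimal sketch of setting up either reader type with an explicitly large buffer. The file names are placeholder assumptions, and the 1 MB figure simply follows the suggestion above; treat it as a sketch, not a drop-in solution.

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;

public class LargeBufferSetup {
    public static void main(String[] args) throws IOException {
        int bufferSize = 1024 * 1024; // ~1 MB, so the disk mostly does sequential I/O

        // Text data: BufferedReader with an explicit buffer size.
        try (BufferedReader reader = new BufferedReader(new FileReader("bigfile.txt"), bufferSize)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // split/handle the line here
            }
        }

        // Binary data: BufferedInputStream instead, so nothing is converted to String.
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("bigfile.bin"), bufferSize)) {
            byte[] chunk = new byte[8192];
            while (in.read(chunk) != -1) {
                // handle the raw bytes here
            }
        }
    }
}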

answered by Michael Borgwardt


You can consider using memory-mapped files, via FileChannel.

Generally this is a lot faster for large files, but there are performance trade-offs that could make it slower, so YMMV.

Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness
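As a rough sketch only of the memory-mapped approach: the file name is a placeholder assumption, and because a single MappedByteBuffer is limited to Integer.MAX_VALUE bytes, a file larger than about 2 GB would have to be mapped in several chunks rather than once, as done here for simplicity.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("bigfile.txt", "r");
             FileChannel channel = raf.getChannel()) {
            // Map (up to) the first 2 GB of the file directly into memory.
            long mapSize = Math.min(channel.size(), Integer.MAX_VALUE);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, mapSize);
            while (buffer.hasRemaining()) {
                byte b = buffer.get();
                // process b, e.g. look for '\n' to find line boundaries
            }
        }
    }
}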

answered by Ryan Emerle


This is a very good article: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/

In summary, for great performance, you should:

  1. Avoid accessing the disk.
  2. Avoid accessing the underlying operating system.
  3. Avoid method calls.
  4. Avoid processing bytes and characters individually.

For example, to reduce the access to disk, you can use a large buffer. The article describes various approaches.
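As a hedged illustration of points 1 and 4, the following sketch contrasts unbuffered byte-at-a-time reading with reading large chunks into a buffer; the file name and buffer size are assumptions, not values taken from the article.

import java.io.FileInputStream;
import java.io.IOException;

public class BufferVsByteAtATime {
    public static void main(String[] args) throws IOException {
        // Slow: FileInputStream is unbuffered, so each read() is a separate native call.
        try (FileInputStream in = new FileInputStream("bigfile.txt")) {
            int b;
            while ((b = in.read()) != -1) {
                // process one byte at a time
            }
        }

        // Much faster: one read fills a large buffer, then the bytes are processed in memory.
        try (FileInputStream in = new FileInputStream("bigfile.txt")) {
            byte[] buffer = new byte[1024 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    // process buffer[i]
                }
            }
        }
    }
}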

answered by b.roth


Does it have to be done in Java? I.e., does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted, you could execute this command from your Java program. While I haven't tested it, I imagine it would perform faster than whatever Java IO implementation you could come up with.
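A sketch of what invoking split from Java could look like; the -l 100 option (100 lines per output file), the input name, and the output prefix are assumptions, and of course this only works on systems where the split command is available.

import java.io.IOException;

public class SplitViaShell {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Equivalent of running: split -l 100 bigfile.txt smallfile_
        ProcessBuilder pb = new ProcessBuilder("split", "-l", "100", "bigfile.txt", "smallfile_");
        pb.inheritIO(); // forward split's stdout/stderr to this process's console
        int exitCode = pb.start().waitFor();
        System.out.println("split finished with exit code " + exitCode);
    }
}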

answered by Mike


You can use java.nio, which is faster than the classical Input/Output streams:

http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
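The linked guide covers the whole package; as one hedged example, FileChannel.transferTo can let the operating system move bytes between files without routing them through a Java-side buffer. Note that this copies a file rather than splitting it on line boundaries, and the file names and chunk size are placeholder assumptions.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioTransfer {
    public static void main(String[] args) throws IOException {
        try (FileChannel source = FileChannel.open(Paths.get("bigfile.txt"), StandardOpenOption.READ);
             FileChannel target = FileChannel.open(Paths.get("copy.txt"),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long chunkSize = 16L * 1024 * 1024; // 16 MB per transferTo call
            long position = 0;
            long size = source.size();
            while (position < size) {
                // transferTo may hand the copy off to the OS entirely.
                position += source.transferTo(position, chunkSize, target);
            }
        }
    }
}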

answered by Kartoch


Yes. I also think that using read() with arguments, i.e. read(char[], int off, int len), is a better way to read such a large file (e.g. read(buffer, 0, buffer.length)).

I also experienced the problem of missing values when using a BufferedReader instead of a BufferedInputStream for a binary data input stream, so using BufferedInputStream is much better in a case like this.
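A minimal sketch of that read(char[], int, int) pattern, with the buffer size and file name as placeholder assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ChunkedCharRead {
    public static void main(String[] args) throws IOException {
        char[] buffer = new char[64 * 1024];
        try (BufferedReader reader = new BufferedReader(new FileReader("bigfile.txt"))) {
            int read;
            // Each call fills up to buffer.length characters instead of reading one char at a time.
            while ((read = reader.read(buffer, 0, buffer.length)) != -1) {
                // process buffer[0..read)
            }
        }
    }
}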

answered by Namalak


package all.is.well;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * @author Naresh Bhabat
 *
 * The following implementation helps to deal with extra large files in Java.
 * This program was tested with a 2 GB input file.
 * There are some points where extra logic can be added in the future.
 *
 * Please note: if we want to deal with a binary input file, then instead of reading
 * lines we need to read bytes from the file object.
 *
 * It uses RandomAccessFile, which is almost like a streaming API.
 *
 * ****************************************
 * Notes regarding the executor framework and its timings.
 * Please note: ExecutorService executor = Executors.newFixedThreadPool(10);
 *
 *   for 10 threads:    total time required for reading and writing the text: 349.317 seconds
 *   for 100 threads:   total time required for reading and writing the text: 464.042 seconds
 *   for 1000 threads:  total time required for reading and writing the text: 466.538 seconds
 *   for 10000 threads: total time required for reading and writing the text: 479.701 seconds
 */
public class DealWithHugeRecordsinFile {

	static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
	static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
	static volatile RandomAccessFile fileToWrite;
	static volatile RandomAccessFile file;
	static volatile int position = 0;

	public static void main(String[] args) throws IOException, InterruptedException {
		long start = System.currentTimeMillis();

		try {
			fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw"); // shared output file
			file = new RandomAccessFile(FILEPATH, "r");               // shared input file
			seriouslyReadProcessAndWriteAsynch();
		} catch (IOException e) {
			e.printStackTrace();
		}
		System.out.println(Thread.currentThread().getName());
		double elapsedSeconds = (System.currentTimeMillis() - start) / 1000.0;
		System.out.println("Total time required for reading and writing the text in seconds " + elapsedSeconds);
	}

	/**
	 * Reads the input file line by line and hands each line to a worker thread.
	 *
	 * @throws IOException if reading or closing a file fails
	 * @throws InterruptedException if waiting for the workers is interrupted
	 */
	public static void seriouslyReadProcessAndWriteAsynch() throws IOException, InterruptedException {
		ExecutorService executor = Executors.newFixedThreadPool(10); // see the timing notes in the class comment
		while (true) {
			final String readLine = file.readLine();
			if (readLine == null) {
				break;
			}
			executor.execute(new Runnable() {
				@Override
				public void run() {
					// Do the hard per-record processing here; this example just burns
					// some time and demonstrates per-record failure handling in writeToFile.
					writeToFile(FILEPATH_WRITE, readLine);
				}
			});
		}
		executor.shutdown();
		// Wait for the workers to finish instead of busy-spinning on isTerminated().
		executor.awaitTermination(1, TimeUnit.DAYS);
		System.out.println("Finished all threads");
		file.close();
		fileToWrite.close();
	}

	/**
	 * Writes one record. Synchronized because all worker threads share the same RandomAccessFile.
	 *
	 * @param filePath target path (kept for signature compatibility; the shared writer is used)
	 * @param data     the record to write
	 */
	private static synchronized void writeToFile(String filePath, String data) {
		try {
			data = "\n" + data;
			// Example filter: only lines containing this keyword are written.
			if (!data.contains("Randomization")) {
				return;
			}
			System.out.println("Let us do something time consuming to make this thread busy " + (position++) + "   :" + data);
			// Simulate some CPU-bound work on the record.
			int i = 1000;
			while (i > 0) {
				i--;
			}
			fileToWrite.write(data.getBytes());
			// Deliberately thrown to show that a failing record does not stop the other workers;
			// the catch block below could be used to mark the record as failed.
			throw new Exception();
		} catch (Exception exception) {
			System.out.println("exception was thrown but still we are able to proceed further"
					+ " \n This can be used for marking failure of the records");
		}
	}
}
answered by RAM