I am reading a 50 GB file containing millions of rows separated by the newline character. Presently I am using the following code to read the file:
String line = null;
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("FileName")));
while ((line = br.readLine()) != null)
{
// Processing each line here
// All processing is done in memory. No IO required here.
}
Since the file is very large, it is taking 2 hours to process the whole file. Can I improve the reading of the file from the hard disk so that the I/O (reading) operation takes minimal time? The restriction on my code is that I have to process each line in sequential order.
Using BufferedReader and Java Streams: BufferedReader can expose the file as a lazy Stream of strings via its lines() method, which makes it possible to process a very large file (10 GB or more) one line at a time without loading it all into memory.
BufferedReader is a class that simplifies reading text from a character input stream. It buffers the characters to enable efficient reading of text data.
Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines. The buffer size may be specified, or the default size may be used. The default is large enough for most purposes.
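The stream-based approach described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the file path, the UTF-8 charset, and the non-empty-line count standing in for the real per-line processing are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamLines {
    // Streams the file line by line; only the reader's buffer is held
    // in memory at any moment, never the whole file.
    static long countNonEmpty(Path file) throws IOException {
        // try-with-resources closes the reader (and underlying stream) automatically
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return br.lines()                            // lazy Stream<String>
                     .filter(line -> !line.isEmpty())    // placeholder "processing"
                     .count();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countNonEmpty(Path.of(args[0])));
    }
}
```

Because lines() is lazy, each line is read from disk only when the downstream stream operations ask for it, so this keeps the same sequential, constant-memory behavior as the readLine() loop.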
it is taking 2 Hrs to process the whole file.
50 GB / 2 hours equals approximately 7 MB/s. That's not a bad rate at all, and a good (modern) hard disk should be capable of sustaining a higher rate continuously, so maybe your bottleneck is not the I/O. You're already using BufferedReader, which, as the name says, buffers (in memory) what it reads. You could experiment with creating the reader with a somewhat bigger buffer than the default size (8192 bytes), like so:
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream("FileName")), 100000);
Note that with the default 8192-byte buffer and 7 MB/s throughput, the BufferedReader re-fills its buffer almost 1000 times per second, so enlarging the buffer could really help cut down some of that overhead. But if the processing you're doing, rather than the I/O, is the bottleneck, then no I/O trick is going to help you much. You should maybe consider making it multi-threaded, but whether that's doable, and how, depends on what "processing" means here.
Your only hope is to parallelize the reading and processing of what's inside. Your strategy should be to never require the entire file contents to be in memory at once.
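If the per-line work is CPU-bound and each line can be transformed independently, one common pattern keeps the read strictly sequential but farms the work out to a thread pool, draining the oldest batch first so results are still consumed in file order. The sketch below is one possible shape for that idea, not the answerer's code; the batch size, the pending-batch cap, and the processLine placeholder (here just the line length) are all assumptions for illustration.

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLines {
    static final int BATCH = 10_000;   // lines per task (tuning assumption)
    static final int MAX_PENDING = 16; // cap on in-flight batches; bounds memory use

    // Placeholder for the real per-line work (here: just the line length)
    static int processLine(String line) {
        return line.length();
    }

    // Reads strictly sequentially; processing runs on a thread pool, and
    // batches are drained oldest-first so results keep file order.
    static long processFile(Path file) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        Queue<Future<List<Integer>>> pending = new ArrayDeque<>();
        long total = 0;
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            List<String> batch = new ArrayList<>(BATCH);
            String line;
            while ((line = br.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH) {
                    submit(pool, pending, batch);
                    batch = new ArrayList<>(BATCH);
                    // Only a bounded window of batches is ever in memory
                    if (pending.size() >= MAX_PENDING) {
                        total += drain(pending.poll());
                    }
                }
            }
            if (!batch.isEmpty()) {
                submit(pool, pending, batch);
            }
            while (!pending.isEmpty()) {
                total += drain(pending.poll());
            }
        } finally {
            pool.shutdown();
        }
        return total;
    }

    static void submit(ExecutorService pool,
                       Queue<Future<List<Integer>>> pending, List<String> batch) {
        pending.add(pool.submit(() -> {
            List<Integer> out = new ArrayList<>(batch.size());
            for (String l : batch) {
                out.add(processLine(l));
            }
            return out;
        }));
    }

    static long drain(Future<List<Integer>> f)
            throws InterruptedException, ExecutionException {
        long sum = 0;
        for (int v : f.get()) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processFile(Path.of(args[0])));
    }
}
```

Whether this actually beats the single-threaded loop depends entirely on how expensive processLine is; if the work is trivial, the batching and synchronization overhead can easily outweigh the gain.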
Start by profiling the code you have to see where the time is being spent. Rewrite the part that takes the most time and re-profile to see if it improved. Keep repeating until you get an acceptable result.
I'd think about Hadoop and a distributed solution. Data sets that are larger than yours are processed routinely now. You might need to be a bit more creative in your thinking.