Why does Java read a big file faster than C++?

Tags: java, c++, file

I have a 2 GB file (inputfile.txt) in which every line is a single word, like this:

    apple
    red
    beautiful
    smell
    spark
    input

I need to write a program that reads every word in the file and prints the word count. I wrote it in both Java and C++, and the result is surprising: Java runs 2.3 times faster than C++. My code is as follows:

C++:

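    #include <cstdio>
    #include <ctime>
    #include <fstream>
    #include <iostream>
    #include <string>

    using namespace std;

    // Nanoseconds per second (the definition is not shown in the original post).
    #define NANO 1000000000.0

    int main() {
        struct timespec ts, te;
        double cost;
        clock_gettime(CLOCK_REALTIME, &ts);

        ifstream fin("inputfile.txt");
        string word;
        int count = 0;
        // operator>> skips whitespace and extracts one word at a time.
        while (fin >> word) {
            count++;
        }
        cout << count << endl;

        clock_gettime(CLOCK_REALTIME, &te);
        cost = te.tv_sec - ts.tv_sec + (double)(te.tv_nsec - ts.tv_nsec) / NANO;
        printf("Run time: %-15.10f s\n", cost);

        return 0;
    }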

Output:

    5e+08
    Run time: 69.311 s

Java:

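    import java.io.BufferedReader;
    import java.io.FileReader;

    // The enclosing class is not shown in the original post; the name is illustrative.
    public class WordCount {
        public static void main(String[] args) throws Exception {
            long startTime = System.currentTimeMillis();

            FileReader reader = new FileReader("inputfile.txt");
            BufferedReader br = new BufferedReader(reader);
            String str = null;
            int count = 0;
            // readLine splits on line terminators only.
            while ((str = br.readLine()) != null) {
                count++;
            }
            System.out.println(count);

            long endTime = System.currentTimeMillis();
            System.out.println("Run time : " + (endTime - startTime) / 1000 + "s");
        }
    }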

Output:

    5.0E8
    Run time: 29 s

Why is Java faster than C++ in this situation, and how do I improve the performance of C++?

581

asked Apr 09 '14 07:04

dodolong

2 Answers

You aren't comparing the same thing. The Java program reads whole lines, splitting only on newlines, while the C++ program reads whitespace-delimited "words", which is a little extra work.

Try istream::getline.
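As a rough sketch of that suggestion, the two functions below count lines and words from any std::istream so the two approaches can be compared like for like. Note the assumptions: the function names are made up for illustration, and the sketch uses the free function std::getline with a std::string rather than the member istream::getline, which needs a fixed-size char buffer.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // only needed to try the functions on an in-memory std::istringstream
#include <string>

// Count lines: std::getline only looks for '\n', like Java's readLine loop.
// (count_lines is an illustrative name, not from the original post.)
std::size_t count_lines(std::istream& in) {
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        ++count;
    }
    return count;
}

// Count words: operator>> skips leading whitespace and stops at the next
// whitespace character, as in the question's C++ program.
std::size_t count_words(std::istream& in) {
    std::string word;
    std::size_t count = 0;
    while (in >> word) {
        ++count;
    }
    return count;
}
```

For the file in the question, you would open std::ifstream fin("inputfile.txt") and pass fin to count_lines. With one word per line, both functions should return the same number; only the per-character work differs.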

Later

You might also try and do an elementary read operation to read a byte array and scan this for newlines.

Even later

On my old Linux notebook, jdk1.7.0_21 and don't-tell-me-it's-old 4.3.3 take about the same time when the C++ program uses getline. (We have established that reading words is slower.) There isn't much difference between -O0 and -O2, which doesn't surprise me, given the simplicity of the code in the loop.

Last note

As I suggested, fin.read(buffer, LEN) with LEN = 1 MB and using memchr to scan for '\n' results in another speed improvement of about 20%, which makes C (there isn't any C++ left by now) faster than Java.
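The block-read approach might look something like the sketch below. It is not the code that was actually timed; the function name and the default buffer size parameter are assumptions made for illustration. It also counts '\n' characters rather than lines, so a final line without a trailing newline is not counted.

```cpp
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>  // only needed to try the function on an in-memory std::istringstream
#include <vector>

// Read the stream in large chunks and count '\n' bytes with memchr.
// (count_newlines and buf_size are illustrative names; 1 MB matches the LEN
// mentioned in the answer.)
std::size_t count_newlines(std::istream& in, std::size_t buf_size = 1 << 20) {
    std::vector<char> buffer(buf_size);
    std::size_t count = 0;
    while (true) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        // gcount() reports how many bytes the last read delivered,
        // including a short final chunk at end of file.
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) {
            break;
        }
        const char* p = buffer.data();
        const char* end = p + got;
        // Jump from newline to newline inside the chunk.
        while ((p = static_cast<const char*>(
                    std::memchr(p, '\n', static_cast<std::size_t>(end - p)))) != nullptr) {
            ++count;
            ++p;  // continue just past the newline that was found
        }
    }
    return count;
}
```

Because newlines are single bytes, a chunk boundary can never split one, so no state needs to be carried from one chunk to the next.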

117

answered Sep 21 '22 12:09

laune

There are a number of significant differences in the way the languages handle I/O, all of which can make a difference, one way or another.

Perhaps the first (and most important) question is: how is the data encoded in the text file? If it is single-byte characters (ISO 8859-1 or UTF-8), then Java has to convert it into UTF-16 before processing; depending on the locale, C++ may (or may not) also convert or do some additional checking.

As has been pointed out (partially, at least), in C++, >> uses a locale-specific isspace, whereas getline simply compares against '\n', which is probably faster. (Typical implementations of isspace use a bitmap, which means an additional memory access for each character.)
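To make the difference in per-character work concrete, here is a hand-written sketch of the two checks. It only approximates what a real standard library does internally (real implementations cache the locale facet, for instance), and both function names are invented for illustration.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Roughly the test operator>> applies to every character: ask the locale
// whether it is whitespace. (Illustrative helper, not library code.)
std::size_t count_words_by_hand(const std::string& text,
                                const std::locale& loc = std::locale::classic()) {
    std::size_t words = 0;
    bool in_word = false;
    for (char c : text) {
        if (std::isspace(c, loc)) {
            in_word = false;      // whitespace ends the current word
        } else if (!in_word) {
            in_word = true;       // first character of a new word
            ++words;
        }
    }
    return words;
}

// The test getline needs per character: one comparison against '\n'.
std::size_t count_newlines_by_hand(const std::string& text) {
    std::size_t newlines = 0;
    for (char c : text) {
        if (c == '\n') {
            ++newlines;
        }
    }
    return newlines;
}
```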

Optimization levels and specific library implementations may also vary. It's not unusual in C++ for one library implementation to be 2 or 3 times faster than another.

Finally, a most significant difference: C++ distinguishes between text files and binary files. You've opened the file in text mode; this means that it will be "preprocessed" at the lowest level, before even the extraction operators see it. This depends on the platform: for Unix platforms, the "preprocessing" is a no-op; on Windows, it will convert CRLF pairs into '\n', which will have a definite impact on performance. If I recall correctly (I've not used Java for some years), Java expects higher level functions to handle this, so functions like readLine will be slightly more complicated. Just guessing here, but I suspect that the additional logic at the higher level costs less in runtime than the buffer preprocessing at the lower level. (If you are testing under Windows, you might experiment with opening the file in binary mode in C++. This should make no difference in the behavior of the program when you use >>; any extra CR will be considered white space. With getline, you'll have to add logic to remove any trailing '\r' to your code.)
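If you try the binary-mode experiment with getline, the extra logic for the trailing '\r' could be sketched as follows. The helper names are hypothetical, and the sketch assumes lines end in either "\n" or "\r\n".

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>  // only needed to try next_line on an in-memory std::istringstream
#include <string>

// Read one line and drop a single trailing '\r', which is what is left of a
// CRLF line ending once std::getline has consumed the '\n'.
// (next_line is an illustrative name.)
bool next_line(std::istream& in, std::string& line) {
    if (!std::getline(in, line)) {
        return false;
    }
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
    }
    return true;
}

// Open the file in binary mode so the runtime does no CRLF translation,
// then count lines using the helper above.
std::size_t count_lines_binary(const std::string& path) {
    std::ifstream fin(path, std::ios::in | std::ios::binary);
    std::string line;
    std::size_t count = 0;
    while (next_line(fin, line)) {
        ++count;
    }
    return count;
}
```

For a plain line count the '\r' removal does not change the result, but it matters as soon as the line contents are used, for example as keys in a map.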

answered Sep 21 '22 12:09

James Kanze
