split very large text file by max rows

Tags: java, java-8, nio2

I want to split a huge file containing strings into a set of new (smaller) files and tried to use NIO.2.

I do not want to load the whole file into memory, so I tried it with BufferedReader.

The smaller text files should be limited by the number of text rows.

The solution works, but I would like to know whether there is a better-performing solution using Java 8 (maybe lambdas with the Stream API?) and NIO.2:

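import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    int i = 1; // number of the current output file
    try (BufferedReader reader = Files.newBufferedReader(bigFile)) {
        String line;
        int lineNum = 1;

        Path splitFile = Paths.get(i + "split.txt");
        BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
        try {
            while ((line = reader.readLine()) != null) {
                // The current output file is full: close it and start the next one.
                if (lineNum > maxRows) {
                    writer.close();
                    lineNum = 1;
                    i++;
                    splitFile = Paths.get(i + "split.txt");
                    writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
                }

                writer.append(line);
                writer.newLine();
                lineNum++;
            }
        } finally {
            // Close the last writer even if reading or writing fails.
            writer.close();
        }
    }
}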


asked Aug 28 '14 16:08

nimo23

1 Answer

Beware of the difference between using InputStreamReader/OutputStreamWriter (and their subclasses) directly and using the Reader/Writer factory methods of Files. In the former case, the system’s default encoding is used when no explicit charset is given, whereas the latter always default to UTF-8. So I strongly recommend always specifying the desired charset, even if it is Charset.defaultCharset() or StandardCharsets.UTF_8, to document your intention and to avoid surprises if you switch between the various ways of creating a Reader or Writer.
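To illustrate the recommendation, here is a minimal sketch that opens a reader in both ways with the charset passed explicitly. The class name ExplicitCharset and the method names openClassic and openNio are made up for this example; only the JDK calls are real.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class ExplicitCharset {

    // java.io style: new InputStreamReader(in) without a charset argument
    // would fall back to the platform default charset.
    static BufferedReader openClassic(Path file, Charset cs) throws IOException {
        return new BufferedReader(new InputStreamReader(Files.newInputStream(file), cs));
    }

    // NIO.2 style: Files.newBufferedReader(file) without a charset argument
    // would use UTF-8.
    static BufferedReader openNio(Path file, Charset cs) throws IOException {
        return Files.newBufferedReader(file, cs);
    }
}
```

With the charset spelled out, both methods decode the same file identically, so switching from one style to the other does not silently change the result.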

If you want to split at line boundaries, there is no way around looking into the file’s contents. So you can’t optimize it the way you can when merging.

If you are willing to sacrifice portability, you could try some optimizations. If you know that the charset encoding unambiguously maps '\n' to (byte)'\n', as is the case for most single-byte encodings as well as for UTF-8, you can scan for line breaks at the byte level to get the file positions for the split and avoid any data transfer from your application to the I/O system.

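import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import static java.nio.file.StandardOpenOption.*;

public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    MappedByteBuffer bb;
    // The mapping stays valid after the channel has been closed.
    try(FileChannel in = FileChannel.open(bigFile, READ)) {
        bb=in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
    }
    for(int start=0, pos=0, end=bb.remaining(), i=1, lineNum=1; pos<end; lineNum++) {
        // advance pos just past the next '\n' (or to the end of the file)
        while(pos<end && bb.get(pos++)!='\n');
        // keep scanning until maxRows lines are collected or the file ends
        if(lineNum < maxRows && pos<end) continue;
        Path splitFile = Paths.get(i++ + "split.txt");
        // if you want to overwrite existing files use CREATE, TRUNCATE_EXISTING
        try(FileChannel out = FileChannel.open(splitFile, CREATE_NEW, WRITE)) {
            // write the bytes of the range [start, pos) to the new file
            bb.position(start).limit(pos);
            while(bb.hasRemaining()) out.write(bb);
            bb.clear();
            start=pos;
            lineNum = 0;
        }
    }
}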

The drawbacks are that it doesn’t work with encodings like UTF-16 or EBCDIC and that, unlike BufferedReader.readLine(), it doesn’t support a lone '\r' as line terminator, as used in old Mac OS 9.

Further, it only supports files smaller than 2 GB; the limit is likely even lower on 32-bit JVMs due to the limited virtual address space. For files larger than the limit, it would be necessary to iterate over chunks of the source file and map them one after another.
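A rough sketch of what such a chunk-wise variant could look like follows. It is not the code I measured: the names ChunkedSplitter, split, copy, chunkSize and targetDir are invented for this example, and it copies each part with FileChannel.transferTo instead of writing the mapped buffer. It relies on the same assumption that '\n' maps to a single byte.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, for illustration only.
final class ChunkedSplitter {

    // Splits bigFile into parts of at most maxRows lines each, mapping at most
    // chunkSize bytes at a time. Returns the number of parts written.
    static int split(Path bigFile, int maxRows, int chunkSize, Path targetDir)
            throws IOException {
        if (maxRows < 1 || chunkSize < 1) {
            throw new IllegalArgumentException("maxRows and chunkSize must be positive");
        }
        int part = 0;
        try (FileChannel in = FileChannel.open(bigFile, StandardOpenOption.READ)) {
            long size = in.size();
            long start = 0; // file offset at which the current part begins
            int lines = 0;  // line breaks seen so far in the current part
            for (long chunkStart = 0; chunkStart < size; chunkStart += chunkSize) {
                int len = (int) Math.min(chunkSize, size - chunkStart);
                MappedByteBuffer bb = in.map(FileChannel.MapMode.READ_ONLY, chunkStart, len);
                for (int p = 0; p < len; p++) {
                    if (bb.get(p) == '\n' && ++lines == maxRows) {
                        long end = chunkStart + p + 1; // just past the line break
                        copy(in, start, end, targetDir.resolve(++part + "split.txt"));
                        start = end;
                        lines = 0;
                    }
                }
            }
            if (start < size) { // remaining lines, possibly without a final line break
                copy(in, start, size, targetDir.resolve(++part + "split.txt"));
            }
        }
        return part;
    }

    // Copies the byte range [from, to) of the source channel into a new file.
    private static void copy(FileChannel in, long from, long to, Path target)
            throws IOException {
        try (FileChannel out = FileChannel.open(target,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            for (long pos = from; pos < to; ) {
                pos += in.transferTo(pos, to - pos, out);
            }
        }
    }
}
```

Because the part boundaries are tracked as long file offsets, a part may span several mapped chunks, and the chunk size only limits how much of the file is mapped at once.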

These issues could be fixed, but that would raise the complexity of this approach. Given that the speed improvement is only about 15% on my machine (I didn’t expect much more, as the I/O dominates here) and would be even smaller as the complexity rises, I don’t think it’s worth it.

The bottom line is that for this task the Reader/Writer approach is sufficient, but you should pay attention to the Charset used for the operation.
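For completeness, a sketch of the Reader/Writer approach with the charset made explicit might look like this. The class name LineSplitter and the extra parameters cs and targetDir are my own additions for the example, not part of the question’s code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, for illustration only.
final class LineSplitter {

    // Splits bigFile into files of at most maxRows lines each, reading and
    // writing with the given charset. Returns the number of files written.
    static int split(Path bigFile, int maxRows, Charset cs, Path targetDir)
            throws IOException {
        if (maxRows < 1) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        int part = 0;
        try (BufferedReader reader = Files.newBufferedReader(bigFile, cs)) {
            String line = reader.readLine();
            while (line != null) {
                Path target = targetDir.resolve(++part + "split.txt");
                // try-with-resources closes each writer even if an exception occurs
                try (BufferedWriter writer = Files.newBufferedWriter(target, cs,
                        StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING,
                        StandardOpenOption.WRITE)) {
                    for (int n = 0; n < maxRows && line != null; n++) {
                        writer.write(line);
                        writer.newLine(); // platform line separator
                        line = reader.readLine();
                    }
                }
            }
        }
        return part;
    }
}
```

A call such as LineSplitter.split(bigFile, 1000, StandardCharsets.UTF_8, Paths.get(".")) would then write 1split.txt, 2split.txt and so on into the current directory. An output file is only created once there is at least one line to put into it, so an empty input produces no files.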

answered Sep 20 '22 14:09

Holger

