Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reverse massive text file in Java

Tags:

java

file

file-io

What would be the best approach to reverse a large text file that is uploaded asynchronously to a servlet that reverses this file in a scalable and efficient way?

  • text file can be massive (gigabytes long)
  • can assume mulitple server/clustered environment to do this in a distributed manner.
  • open source libraries are encouraged to consider

I was thinking of using Java NIO to treat file as an array on disk (so that I don't have to treat the file as a string buffer in memory). Also, I am thinking of using MapReduce to break up the file and process it in separate machines.

like image 927
DanJanson Avatar asked Apr 27 '10 23:04

DanJanson


2 Answers

If it is uploaded to you and you can get the length at the beginning, you could just create an empty full-sized file up front and write to it starting from the back and working your way to the front using seek

You'd probably want to define a block size (like 1K?) and reverse that much in memory before writing it out to the file.

like image 199
Bill K Avatar answered Sep 28 '22 06:09

Bill K


That's a pretty tough task. If you can ensure that the HTTP Content-Length and Content-Type headers are present in the upload request (or in the multipart body when it's a multipart/form-data request), then it would be an easy job with help of RandomAccessFile. The content length is mandatory so that the RandomAccessFile knows how long the file will be and write the character at the position you want it to be. The character encoding (which is usually present as an attribute of the content type header) is mandatory to know how many bytes a character will take into account (because RandomAccessFile is byte based and for example UTF-8 encoding is variable-byte-length).

Here's a kickoff example (leaving obvious exception handling aside):

package com.stackoverflow.q2725897;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Test {

    public static void main(String... args) throws Exception {

        // Stub input. You need to gather it yourself from your sources.
        File file = new File("/file.txt");
        long length = file.length(); // Get it from HTTP request header using file upload API in question (Commons FileUpload?).
        String encoding = "UTF-8"; // Get it from HTTP request header using file upload API in question (Commons FileUpload?).
        InputStream content = new FileInputStream(file); // Get it from HTTP request body using file upload API in question (Commons FileUpload?).

        // Now the real job.
        Reader input = new InputStreamReader(content, encoding);
        RandomAccessFile output = new RandomAccessFile(new File("/filereversed.txt"), "rwd");
        CharsetEncoder encoder = Charset.forName(encoding).newEncoder();

        for (int data; (data = input.read()) != -1;) {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(new char[] { (char) data }));
            length -= bytes.limit();
            output.seek(length);
            output.write(bytes.array());
        }

        // Should actually be done in finally.
        input.close();
        output.close();
    }

}

If those headers are not present (especially Content-length is important), then you'll really need to store it on disk first until end of stream and then re-read and reverse it the same way with help of RandomAccessFile.

Update: it would actually be tougher than it look like. Is the character encoding of the input always guaranteed to be the same? If so, what one would it be? Also, what would you like to do with for example surrogate characters and newlines? The above example doesn't take that into account correctly. But it at least gives the base idea.

like image 32
BalusC Avatar answered Sep 28 '22 08:09

BalusC