
Split large file into chunks

Tags:

java

I have a method that accepts a file and a chunk size and returns a list of chunk files. The main problem is that a line in the file can be broken across two chunks. For example, suppose the main file contains these lines:

|1|aaa|bbb|ccc|
|2|ggg|ddd|eee|

After the split, one file could contain:

|1|aaa|bbb

In another file:

|ccc|
|2|ggg|ddd|eee|

Here is the code:

public static List<File> splitFile(File file, int sizeOfFileInMB) throws IOException {
  int counter = 1;
  List<File> files = new ArrayList<>();

  int sizeOfChunk = 1024 * 1024 * sizeOfFileInMB;
  byte[] buffer = new byte[sizeOfChunk];

  try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))) {
    String name = file.getName();

    int tmp = 0;
    while ((tmp = bis.read(buffer)) > 0) {
        File newFile = new File(file.getParent(), name + "."
                + String.format("%03d", counter++));
        try (FileOutputStream out = new FileOutputStream(newFile)) {
            out.write(buffer, 0, tmp);
        }

        files.add(newFile);
    }
  }

  return files;
}

Should I use the RandomAccessFile class for this? The main file is really big (more than 5 GB).

Iurii asked Sep 14 '15 00:09


2 Answers

If you don't mind chunks of different lengths (each at most sizeOfChunk, but as close to it as whole lines allow), then here is the code:

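// Requires: java.io.*, java.nio.charset.Charset, java.util.ArrayList, java.util.List
public static List<File> splitFile(File file, int sizeOfFileInMB) throws IOException {
    int counter = 1;
    List<File> files = new ArrayList<File>();
    // long arithmetic avoids int overflow for chunk sizes of 2048 MB and above
    long sizeOfChunk = 1024L * 1024L * sizeOfFileInMB;
    // readLine() strips the line terminator, so every line is rewritten
    // with the platform's separator
    String lineSeparator = System.lineSeparator();
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        String name = file.getName();
        String line = br.readLine();
        while (line != null) {
            File newFile = new File(file.getParent(), name + "."
                    + String.format("%03d", counter++));
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream(newFile))) {
                long fileSize = 0;
                while (line != null) {
                    byte[] bytes = (line + lineSeparator).getBytes(Charset.defaultCharset());
                    // Always write at least one line per file; otherwise a line longer
                    // than sizeOfChunk would create empty files forever
                    if (fileSize > 0 && fileSize + bytes.length > sizeOfChunk)
                        break;
                    out.write(bytes);
                    fileSize += bytes.length;
                    line = br.readLine();
                }
            }
            files.add(newFile);
        }
    }
    return files;
}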

The only problem here is the file charset, which in this example is the system default. If you want to be able to change it, let me know and I'll add a third parameter to the "splitFile" function for it.
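
For readers who need that charset parameter, here is a minimal sketch of how it might look. It is not the answer author's code: the class name `LineAwareSplitter`, the `Charset` parameter, and the `splitFileByBytes` helper (which takes the chunk size in bytes so the sketch can be tried on small files) are all assumptions made for illustration. It also assumes that rewriting every line with the platform line separator is acceptable, and it writes at least one line per chunk so that a line longer than the chunk size cannot stall the loop.

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class LineAwareSplitter {

    // Hypothetical three-argument variant: chunk size in MB plus an explicit charset.
    public static List<File> splitFile(File file, int sizeOfFileInMB, Charset charset)
            throws IOException {
        return splitFileByBytes(file, 1024L * 1024L * sizeOfFileInMB, charset);
    }

    // Hypothetical helper: the same split, with the chunk size given in bytes.
    public static List<File> splitFileByBytes(File file, long sizeOfChunk, Charset charset)
            throws IOException {
        int counter = 1;
        List<File> files = new ArrayList<>();
        String lineSeparator = System.lineSeparator();
        // Decode the input with the caller's charset instead of the system default.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            String line = br.readLine();
            while (line != null) {
                File newFile = new File(file.getParent(),
                        file.getName() + "." + String.format("%03d", counter++));
                try (OutputStream out = new BufferedOutputStream(new FileOutputStream(newFile))) {
                    long fileSize = 0;
                    while (line != null) {
                        // Encode with the same charset so the byte count matches what is written.
                        byte[] bytes = (line + lineSeparator).getBytes(charset);
                        // Stop before exceeding the limit, but never leave a chunk empty.
                        if (fileSize > 0 && fileSize + bytes.length > sizeOfChunk) {
                            break;
                        }
                        out.write(bytes);
                        fileSize += bytes.length;
                        line = br.readLine();
                    }
                }
                files.add(newFile);
            }
        }
        return files;
    }
}
```

A call such as `LineAwareSplitter.splitFile(new File("data.txt"), 100, StandardCharsets.UTF_8)` would then produce `data.txt.001`, `data.txt.002`, and so on next to the input file, each holding only whole lines.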

rsutormin answered Sep 21 '22 21:09


In case anyone is interested, here is a Kotlin version. It creates an iterator of ByteArray chunks:

    import java.io.ByteArrayOutputStream
    import java.io.InputStream
    
    class ByteArrayReader(val input: InputStream, val chunkSize: Int, val bufferSize: Int = 1024*8): Iterator<ByteArray> {
    
        var eof: Boolean = false
    
        init {
            if ((chunkSize % bufferSize) != 0) {
                throw RuntimeException("ChunkSize(${chunkSize}) should be a multiple of bufferSize (${bufferSize})")
            }
        }
        override fun hasNext(): Boolean = !eof
    
        override fun next(): ByteArray {
            val buffer = ByteArray(bufferSize)
            val chunkWriter = ByteArrayOutputStream(chunkSize) // no need to close - implementation is empty
            var offset = 0
            while (offset < chunkSize) {
                // read() may return fewer bytes than requested, so cap each read
                // at the space left in the chunk instead of assuming full buffers
                val bytesRead = input.read(buffer, 0, minOf(bufferSize, chunkSize - offset))
                if (bytesRead <= 0) {
                    eof = true
                    break
                }
                chunkWriter.write(buffer, 0, bytesRead)
                offset += bytesRead
            }
            // The final chunk may be shorter than chunkSize, or empty when the
            // input length is an exact multiple of chunkSize
            return chunkWriter.toByteArray()
        }
    
    }

Seb answered Sep 20 '22 21:09