
How to split csv file into multiple files by size

In a Java project I generate a big CSV file (about 500 MB), and I need to split it into multiple files of at most 10 MB each. I found a lot of similar posts, but none of them answers my question: in all of them the Java code splits the original file into chunks of exactly 10 MB and (obviously) truncates records. I need every record to stay complete and intact; no record should be truncated. If copying a record from the big CSV file into the current output file would push that file over 10 MB, I should be able to skip the copy, close that file, create a new file, and write the record to the new one. Is this possible? Can someone help me? Thank you!

I tried this code:

File f = new File("/home/luca/Desktop/test/images.csv");
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(f));
FileOutputStream out;
String name = f.getName();
int partCounter = 1;
int sizeOfFiles = 10 * 1024 * 1024; // 10 MB
byte[] buffer = new byte[sizeOfFiles];
int tmp = 0;
while ((tmp = bis.read(buffer)) > 0) {
 File newFile=new File("/home/luca/Desktop/test/"+name+"."+String.format("%03d", partCounter++));
 newFile.createNewFile();
 out = new FileOutputStream(newFile);
 out.write(buffer,0,tmp);
 out.close();
}

But obviously it doesn't work. This code splits the source file into n files of 10 MB each, cutting records wherever the 10 MB boundary happens to fall. My CSV file has 16 columns, so with the procedure above the last record of a part may have, for example, only 5 columns populated. The rest are cut off.

SOLUTION: Here is the code I wrote.

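import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.Date;

BufferedReader bufferedReader = new BufferedReader(new FileReader("/home/luca/Desktop/test/images.csv"));
String line;
long maxSize = (long) (9.5 * 1024 * 1024); // safety margin below the 10 MB limit
long fileSize = 0;                         // bytes written to the current part so far
BufferedWriter fos = new BufferedWriter(new FileWriter("/home/luca/Desktop/test/images_" + new Date().getTime() + ".csv", true));
while ((line = bufferedReader.readLine()) != null) {
    // Size of the record on disk: the line's bytes (platform default charset,
    // the same one FileWriter uses) plus the "\n" written after it.
    int lineSize = line.getBytes().length + 1;
    if (fileSize + lineSize > maxSize) {
        // The record would not fit: close the current part and start a new one.
        fos.close(); // close() also flushes
        fos = new BufferedWriter(new FileWriter("/home/luca/Desktop/test/images_" + new Date().getTime() + ".csv", true));
        fileSize = 0;
    }
    fos.write(line + "\n");
    fileSize += lineSize;
}
fos.close();
bufferedReader.close();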

This code reads a CSV file and splits it into n files. Each file is at most 10 MB, and each CSV line is either copied completely or not copied at all.

lucavenanzetti asked Oct 20 '22 22:10



1 Answer

In principle this is very simple.

You create a 10 MB buffer (byte[]) and read as many bytes as you can from the source. Then you search backwards from the end of the buffer for a line feed. The portion from the beginning of the buffer up to and including that line feed becomes a new file. You keep the excess bytes you read past the line feed and copy them to the start of the buffer (offset 0). Then you repeat everything until the source is exhausted.
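The steps above could look roughly like the sketch below. It is only one possible reading of the idea, not tested against the asker's real data. The class name CsvSplitter, the method split and the part naming (source name plus a three-digit counter, as in the question's first attempt) are made up for illustration. It assumes that records end with a line feed byte, that no quoted field contains an embedded line break, that the encoding is single-byte or UTF-8 (so the byte 0x0A only ever means a line feed), and that no single record is larger than the buffer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class CsvSplitter {

    /** Splits source into parts of at most maxBytes each, cutting only after a line feed. */
    static List<Path> split(Path source, Path targetDir, int maxBytes) throws IOException {
        List<Path> parts = new ArrayList<>();
        byte[] buffer = new byte[maxBytes];
        int filled = 0; // number of valid bytes currently held in the buffer
        try (InputStream in = Files.newInputStream(source)) {
            boolean eof = false;
            while (true) {
                // Top up the buffer until it is full or the source is exhausted.
                while (!eof && filled < buffer.length) {
                    int n = in.read(buffer, filled, buffer.length - filled);
                    if (n < 0) {
                        eof = true;
                    } else {
                        filled += n;
                    }
                }
                if (filled == 0) {
                    break; // nothing left to write
                }
                int cut; // number of bytes that go into the next part
                if (eof) {
                    cut = filled; // the remainder fits in one part
                } else {
                    cut = lastLineFeed(buffer, filled) + 1;
                    if (cut == 0) {
                        throw new IOException("A single record exceeds " + maxBytes + " bytes");
                    }
                }
                String name = source.getFileName() + "." + String.format("%03d", parts.size() + 1);
                Path part = targetDir.resolve(name);
                try (OutputStream out = Files.newOutputStream(part)) {
                    out.write(buffer, 0, cut);
                }
                parts.add(part);
                // Move the bytes read past the cut to the start of the buffer.
                System.arraycopy(buffer, cut, buffer, 0, filled - cut);
                filled -= cut;
            }
        }
        return parts;
    }

    /** Returns the index of the last '\n' within the first length bytes, or -1 if there is none. */
    private static int lastLineFeed(byte[] buffer, int length) {
        for (int i = length - 1; i >= 0; i--) {
            if (buffer[i] == '\n') {
                return i;
            }
        }
        return -1;
    }
}
```

For the file in the question, a call might be CsvSplitter.split(Paths.get("/home/luca/Desktop/test/images.csv"), Paths.get("/home/luca/Desktop/test"), 10 * 1024 * 1024). Compared with the line-by-line version in the question, this works on raw bytes, so the size limit is exact and the original line endings are preserved, at the cost of holding one 10 MB buffer in memory.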

Durandal answered Oct 23 '22 23:10
