
Best Practices to Create and Download a huge ZIP (from several BLOBs) in a WebApp

I will need to perform a massive download of files from my Web Application.

It is obviously expected to be a long-running action (it'll be used once-per-year[-per-customer]), so time is not a problem (unless it hits some timeout, but I can handle that by creating some form of keepalive heartbeat). I know how to create a hidden iframe and use it with content-disposition: attachment to attempt to download the file instead of opening it inside the browser, and how to set up client-server communication to draw a progress meter.

The actual size of the download (and the number of files) is unknown, but for simplicity we can consider it to be 1 GB, composed of 100 files of 10 MB each.

Since this should be a one-click operation, my first thought was to group all the files, while reading them from the database, into a dynamically generated ZIP, then ask the user to save the ZIP.

The question is: what are the best practices, and what are the known drawbacks and traps, in creating a huge archive from multiple small byte arrays in a WebApp?

The question can roughly be split into:

  • should each byte array be converted into a physical temp file, or can they be added to the ZIP in memory?
  • if so, I know I'll have to handle possible name collisions (the files can have the same name in different records in the database, but not inside the same file system nor ZIP): are there any other possible problems that come to mind (assuming the file system always has enough physical space)?
  • since I can't rely on having enough RAM to perform the whole operation in memory, I guess the ZIP should be created and written to the file system before being sent to the user; is there any way to do it differently (e.g. with WebSockets), like asking the user where to save the file and then starting a constant flow of data from the server to the client (sci-fi, I guess)?
  • any other related known problems or best practices that cross your mind would be greatly appreciated.
Asked May 16 '13 by Andrea Ligios


1 Answer

Kick-off example of a totally dynamic ZIP file created by streaming each BLOB from the database directly to the client's File System.

Tested with huge archives, with the following performance:

  • Server disk space cost: 0 megabytes
  • Server RAM cost: ~ xx megabytes. The memory consumption is not reliably measurable (or at least I don't know how to do it properly), because I got different, apparently random results from running the same routine multiple times (sampling Runtime.getRuntime().freeMemory() before, during and after the loop; a rough measurement sketch follows below). However, the memory consumption is lower than with byte[], and that's enough.
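
For reference, this is roughly how the sampling could be done (a sketch, not part of the answer's code; the helper name is made up). The numbers fluctuate between runs because garbage collection may or may not happen between samples, which is why the measurements above were inconclusive.

// Rough heap sampling around the streaming loop; results vary run-to-run
// because the GC may or may not kick in between samples.
private static long usedHeapMb() {
    Runtime rt = Runtime.getRuntime();
    return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
}

// usage:
// long before = usedHeapMb();
// ... run the BLOB-to-ZIP streaming loop ...
// long after = usedHeapMb();
// System.out.println("Approx. heap delta: " + (after - before) + " MB");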


FileStreamDto.java using InputStream instead of byte[]

import java.io.InputStream;
import java.io.Serializable;
import lombok.Getter;
import lombok.Setter;
public class FileStreamDto implements Serializable {
    @Getter @Setter private String filename;
    @Getter @Setter private InputStream inputStream;
}


Java Servlet (or Struts2 Action)

/* Read the amount of data to be streamed from Database to File System,
   summing the size of all Oracle's BLOBs, PostgreSQL's BYTEAs, etc.:
   SELECT sum(length(my_blob_field)) FROM my_table WHERE my_conditions
*/          
Long overallSize = getMyService().precalculateZipSize();

// Tell the browser it's a ZIP
response.setContentType("application/zip"); 
// Tell the browser the filename, and that it needs to be downloaded instead of opened
response.addHeader("Content-Disposition", "attachment; filename=\"myArchive.zip\"");        
// Tell the browser the overall size, so it can show a realistic progressbar
// (note: this is the sum of the raw BLOB sizes, so it's only an approximation of the final ZIP size)
response.setHeader("Content-Length", String.valueOf(overallSize));      

ServletOutputStream sos = response.getOutputStream();       
ZipOutputStream zos = new ZipOutputStream(sos);

// Set-up a list of filenames to prevent duplicate entries
HashSet<String> entries = new HashSet<String>();

/* Read all the ID from the interested records in the database, 
   to query them later for the streams: 
   SELECT my_id FROM my_table WHERE my_conditions */           
List<Long> allId = getMyService().loadAllId();

for (Long currentId : allId){
    /* Load the record relative to the current ID:         
       SELECT my_filename, my_blob_field FROM my_table WHERE my_id = :currentId            
       Use resultset.getBinaryStream("my_blob_field") while mapping the BLOB column */
    FileStreamDto fileStream = getMyService().loadFileStream(currentId);

    // Create a zipEntry with a non-duplicate filename, and add it to the ZipOutputStream
    ZipEntry zipEntry = new ZipEntry(getUniqueFileName(entries,fileStream.getFilename()));
    zos.putNextEntry(zipEntry);

    // Use Apache Commons to transfer the InputStream from the DB to the OutputStream
    // on the File System; at this moment, your file is ALREADY being downloaded and growing
    IOUtils.copy(fileStream.getInputStream(), zos);

    zos.flush();
    zos.closeEntry();

    fileStream.getInputStream().close();                    
}

zos.close();
sos.close();    
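
For completeness, this is one way getMyService().loadFileStream(currentId) could be implemented with plain JDBC (a sketch; the dataSource field and the exact mapping are assumptions, not part of the original answer). Note the trap: the returned InputStream is backed by the open ResultSet/Connection, so those resources must stay open until IOUtils.copy(...) has finished for that entry, and be closed afterwards.

// Hypothetical DAO sketch: stream the BLOB column instead of materializing a byte[]
public FileStreamDto loadFileStream(Long id) throws SQLException {
    String sql = "SELECT my_filename, my_blob_field FROM my_table WHERE my_id = ?";
    Connection con = dataSource.getConnection();   // must stay open while the stream is consumed; close once the copy is done
    PreparedStatement ps = con.prepareStatement(sql);
    ps.setLong(1, id);
    ResultSet rs = ps.executeQuery();
    rs.next();
    FileStreamDto dto = new FileStreamDto();
    dto.setFilename(rs.getString("my_filename"));
    // getBinaryStream avoids loading the whole BLOB into memory
    dto.setInputStream(rs.getBinaryStream("my_blob_field"));
    return dto;
}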


Helper method for handling duplicate entries

private String getUniqueFileName(HashSet<String> entries, String completeFileName){                         
    if (entries.contains(completeFileName)){                                                
        int extPos = completeFileName.lastIndexOf('.');
        String extension = extPos>0 ? completeFileName.substring(extPos) : "";          
        String partialFileName = extension.length()==0 ? completeFileName : completeFileName.substring(0,extPos);
        int x=1;
        while (entries.contains(completeFileName = partialFileName + "(" + x + ")" + extension))
            x++;
    } 
    entries.add(completeFileName);
    return completeFileName;
}
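
Similarly, the size pre-calculation used at the top of the servlet could look like this (again a sketch with an assumed dataSource; as noted above, the sum of the raw BLOB lengths is only an estimate of the final ZIP size, since entry headers and compression change it).

// Hypothetical sketch of precalculateZipSize(), running the SUM query from the servlet comment
public Long precalculateZipSize() throws SQLException {
    String sql = "SELECT sum(length(my_blob_field)) FROM my_table WHERE my_conditions";
    try (Connection con = dataSource.getConnection();
         PreparedStatement ps = con.prepareStatement(sql);
         ResultSet rs = ps.executeQuery()) {
        return rs.next() ? rs.getLong(1) : 0L;
    }
}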



Thanks a lot to @prunge for giving me the idea of direct streaming.

Answered Oct 22 '22 by Andrea Ligios