How to properly handle multiple binary files in Python?

Tags: python, pycurl

I'm currently working on a multi-threaded downloader with the help of the PycURL module. I download the parts of a file separately and merge them afterwards.

The parts are downloaded by multiple threads and written to temporary files in binary mode, but when I merge them into a single file (in the correct order), the checksums do not match.

This only happens on Linux. The same script works flawlessly on Windows.

This is the part of the script that merges the files:

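import shutil

# Excerpt from a method: concatenate the part files, in order, into the final file.
with open(filename, 'wb') as outfile:
    print('Merging temp files ...')
    for tmpfile in self.tempfile_arr:
        # Stream each part in binary mode; copyfileobj copies in chunks.
        with open(tmpfile, 'rb') as infile:
            shutil.copyfileobj(infile, outfile)
    print('Done!')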

I tried the write() method as well, but it results in the same issue, and it would take a lot of memory for large files.
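
On the memory concern: write() only needs a lot of memory if each part is read in a single call. A minimal sketch, assuming a hypothetical helper named append_in_chunks and an arbitrary 1 MiB chunk size (neither comes from the original script), that keeps memory use bounded:

```python
def append_in_chunks(src_path, outfile, chunk_size=1024 * 1024):
    """Append src_path to the open binary file object outfile, one chunk at a time.

    Hypothetical helper, not part of the original script.
    Returns the number of bytes copied.
    """
    copied = 0
    with open(src_path, 'rb') as infile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:  # b'' signals end of file
                break
            outfile.write(chunk)
            copied += len(chunk)
    return copied
```

The returned byte count can be compared with the size of each part file to check that nothing was dropped during the merge.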

If I manually cat the part files into a single file on Linux, the file's checksum matches, so the issue is with how Python merges the files.

EDIT:
Here are the files and checksums (SHA-256) that I used to reproduce the issue:

  • Original file
    • HASH: 158575ed12e705a624c3134ffe3138987c64d6a7298c5a81794ccf6866efd488
  • File merged by the script
    • HASH: c3e5a0404da480f36d37b65053732abe6d19034f60c3004a908b88d459db7d87
  • File merged manually using cat
    • HASH: 158575ed12e705a624c3134ffe3138987c64d6a7298c5a81794ccf6866efd488
    • Command used:

      for i in /tmp/pycurl_*_{0..7}; do cat $i >> manually_merged.tar.gz; done

  • Part files - numbered at the end, from 0 through 7
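
To narrow down where the two merged files diverge, the same comparison can be done from Python. A minimal sketch, assuming hypothetical helper names (sha256_of, first_difference) that are not part of the original script; it hashes a file in chunks and reports the first byte offset at which two files differ:

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b, chunk_size=1024 * 1024):
    """Return the first byte offset where the files differ, or None if identical.

    If one file is a prefix of the other, the length of the shorter file is returned.
    """
    offset = 0
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                # Locate the first mismatching byte inside this chunk.
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # Common prefix matches, so one file simply ended earlier.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:  # both files exhausted without a mismatch
                return None
            offset += len(chunk_a)
```

Running first_difference on the script-merged file and the cat-merged file would show whether they differ in length (missing or extra bytes) or in content at a particular offset, which in turn points at a specific part file.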

asked Dec 28 '19 16:12 by Saumyakanta Sahoo

1 Answer

A minimal reproducible example would help, but I suspect universal newlines are the issue: by default, if your files contain Windows-style line endings (\r\n), they are translated to Unix-style line endings (\n) on reading. Those Unix-style line endings are then written to the output file instead of the Windows-style ones you expected. That would explain the divergence between Python and cat, which performs no translation at all.

Try running your script with newline='' (the empty string) passed to open.
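
The effect of the newline argument can be checked in isolation. A minimal sketch, assuming a throwaway temporary file and made-up variable names. Note that newline applies only to text-mode opens: in Python 3, binary modes such as 'rb' and 'wb' perform no translation and reject the argument with a ValueError, so it cannot be added unchanged to the open() calls shown in the question.

```python
import os
import tempfile

# Write a file with Windows-style line endings, byte for byte.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'a\r\nb\r\n')

# Text mode with the default newline=None: '\r\n' is translated to '\n' on read.
with open(path, 'r', encoding='ascii') as f:
    translated = f.read()

# Text mode with newline='': line endings are returned untranslated.
with open(path, 'r', encoding='ascii', newline='') as f:
    untranslated = f.read()

# Binary mode: bytes are returned exactly as stored.
with open(path, 'rb') as f:
    raw = f.read()

# Binary mode does not accept a newline argument at all.
try:
    open(path, 'rb', newline='')
    binary_accepts_newline = True
except ValueError:
    binary_accepts_newline = False

os.remove(path)
print(repr(translated), repr(untranslated), raw, binary_accepts_newline)
```

If the text-mode read is the only one that changes the data, newline translation can only matter where a file is opened without the 'b' flag.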

answered Sep 24 '22 18:09 by Masklinn