
How to unzip multiple gz files in Python using multithreading?

I have multiple gz files with a total size of around 120GB. I want to decompress (gunzip) those files into the same directory and remove the original gz files. Currently we do this manually, and it takes a long time with gzip -d <filename>.
Is there a way to decompress those files in parallel, with a Python script or any other technique? The files are on a Linux machine.

Satheesh asked Dec 24 '15 10:12



1 Answer

You can do this very easily with multiprocessing Pools:

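import gzip
import multiprocessing
import os
import shutil

filenames = [
    'a.gz',
    'b.gz',
    'c.gz',
    ...
]

def uncompress(path):
    # Cut off exactly the '.gz' suffix (assumes path ends with '.gz').
    # str.rstrip('.gz') strips any trailing '.', 'g' or 'z' characters,
    # so it would turn 'log.gz' into 'lo'.
    dest_path = path[:-len('.gz')]
    with gzip.open(path, 'rb') as src, open(dest_path, 'wb') as dest:
        shutil.copyfileobj(src, dest)
    # Remove the original archive once it has been fully decompressed.
    os.remove(path)

# The guard keeps worker processes from re-running the pool setup
# when they import this module (needed with the spawn start method).
if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
            pass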

This code starts a pool of worker processes (by default, one per CPU), and each worker extracts one file at a time.

Here I've chosen chunksize=1, to avoid stalling processes if some files are bigger than average.
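If the file list should come from the directory rather than being typed out, the same idea can be sketched as below. This is a minimal sketch, not a tested production script: the names gunzip_one and gunzip_all and the workers parameter are made up for illustration, and it assumes every archive ends in .gz and that the decompressed file should sit next to it.

```python
import glob
import gzip
import multiprocessing
import os
import shutil


def gunzip_one(path):
    """Decompress one .gz file next to itself, then delete the archive."""
    if not path.endswith(".gz"):
        raise ValueError(f"not a .gz file: {path}")
    dest_path = path[:-len(".gz")]  # cut the suffix only, not trailing characters
    with gzip.open(path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    # Only reached if decompression succeeded, so a failed file keeps its archive.
    os.remove(path)
    return dest_path


def gunzip_all(directory, workers=None):
    """Decompress every *.gz file in `directory` using a process pool.

    workers=None lets multiprocessing pick its default worker count.
    """
    paths = sorted(glob.glob(os.path.join(directory, "*.gz")))
    with multiprocessing.Pool(processes=workers) as pool:
        # imap_unordered yields each result as soon as that file finishes.
        return list(pool.imap_unordered(gunzip_one, paths, chunksize=1))
```

To use it, call something like gunzip_all('/data/dumps') (a hypothetical path) from under an if __name__ == '__main__': guard. With 120GB of data the disk may become the bottleneck before the CPU does, so it can be worth trying a smaller workers value.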

Andrea Corbellini answered Sep 28 '22 06:09