I have multiple gz files with a total size of around 120 GB. I want to unzip (gunzip) those files into the same directory and remove the existing gz files. Currently we are doing it manually, and it takes a long time to decompress them using gzip -d <filename>.
Is there a way I can unzip those files in parallel, by creating a Python script or with any other technique? Currently these files are on a Linux machine.
One approach: first get the list of all files, then iterate over each file and extract it using the zipfile library, appending to a result file. You could use tempfile to avoid handling the temporary zip file yourself. To unzip a ZIP file in Python, use the ZipFile.extractall() method. extractall() takes path, members and pwd as arguments and extracts all the contents of the archive. (Note that zipfile handles .zip archives; plain .gz files are handled by the gzip module instead.)
How can I extract multiple gzip files in a directory and its subdirectories? gunzip extracts each file under its original name and stores it alongside the compressed file (here, in the current user's home directory, /home/username). gunzip *.gz will also work.
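If you'd rather gather those files from a directory tree in Python (for example, to feed the multiprocessing answer further down), here is a minimal sketch using pathlib; the directory path is only a placeholder:
import pathlib

# Collect every .gz file under the given directory, including subdirectories.
filenames = [str(p) for p in pathlib.Path('/home/username').rglob('*.gz')]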
ZIP is an archive file format that supports lossless data compression; it is used to lessen storage requirements and to improve transfer speed over standard connections. The ZipFile class provides a member function, extractall(), to extract all the data from a ZIP archive.
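A minimal sketch of that usage, assuming a placeholder archive name and destination directory:
import zipfile

# Open the archive and extract every member into output_dir.
with zipfile.ZipFile('archive.zip') as archive:
    archive.extractall('output_dir')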
The zipfile package can be used to extract files from a ZIP archive in Python, as shown above. For tar/tar.gz files we can use the tarfile module instead; it distinguishes the two types in order to use the proper extraction mode.
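The tar code referenced here wasn't included, so the following is only a rough sketch of what such a tarfile-based helper might look like, choosing the mode from the file extension:
import tarfile

def extract_tar(path, dest='.'):
    # Use the gzip-aware mode for .tar.gz / .tgz archives, plain mode for .tar.
    mode = 'r:gz' if path.endswith(('.tar.gz', '.tgz')) else 'r:'
    with tarfile.open(path, mode) as archive:
        archive.extractall(dest)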
If you'd like to extract only one file from it, e.g. indexProcessed.csv, you can use the next Python snippet:
import zipfile

path = '/home/myuser/Downloads/'
archive = zipfile.ZipFile(f'{path}archive.zip')
for file in archive.namelist():
    if file.startswith('indexProcessed.csv'):
        archive.extract(file, path)
You can do this very easily with multiprocessing Pools:
import gzip
import multiprocessing
import shutil

filenames = [
    'a.gz',
    'b.gz',
    'c.gz',
    ...  # the rest of your .gz files
]

def uncompress(path):
    # Write the decompressed data to the same path minus the '.gz' suffix.
    # (str.rstrip('.gz') would strip any trailing '.', 'g' or 'z' characters,
    # so slice the suffix off instead.)
    with gzip.open(path, 'rb') as src, open(path[:-len('.gz')], 'wb') as dest:
        shutil.copyfileobj(src, dest)

with multiprocessing.Pool() as pool:
    for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
        pass
This code will spawn a pool of worker processes (by default, one per CPU), and each process will extract one file at a time. Here I've chosen chunksize=1 to avoid stalling processes if some files are bigger than average.
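The question also asks to remove the original gz file afterwards, which the code above doesn't do. A small, hedged extension of uncompress that deletes the source only after the decompressed copy has been written (the function name is mine; os.remove is the standard-library call for deletion):
import gzip
import os
import shutil

def uncompress_and_remove(path):
    # Decompress first; only delete the original .gz once the copy has succeeded.
    with gzip.open(path, 'rb') as src, open(path[:-len('.gz')], 'wb') as dest:
        shutil.copyfileobj(src, dest)
    os.remove(path)

It can be passed to pool.imap_unordered exactly like uncompress above.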