How to retrieve a single file from a 7-Zip archive without extracting all of it in Python 3.x?

In Python, I want to browse all the subdirectories inside a 7z archive and selectively extract files only after checking their content. I do not want to extract everything, but I should be able to peek into the contents iteratively or recursively.

The main concern is that the .7z archive is 15 GB, but it unpacks to 225 GB, and my hard disk is only 160 GB. Of those 225 GB I might need only about 60 GB of valid data. I can find that data only by going through the individual files. Is there an os.walk-like function for a .7z file?

https://dumps.wikimedia.org/other/static_html_dumps/current/en/*.7z is the file I am exploring.

7z l *.7z

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz (406E3),ASM,AES-NI)

Scanning the drive for archives:
1 file, 15363543213 bytes (15 GiB)

Listing archive: wikipedia-en-html.tar.7z

--
Path = wikipedia-en-html.tar.7z
Type = 7z
Physical Size = 15363543213
Headers Size = 100
Method = LZMA:22
Solid = -
Blocks = 1

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2008-06-18 23:32:15 ..... 223674511360  15363543113  wikipedia-en-html.tar
------------------- ----- ------------ ------------  ------------------------
2008-06-18 23:32:15       223674511360  15363543113  1 files

My attempt with the lzma module, which fails with the error below:

import lzma

f7file = r"C:\Users\padmaraj.bhat\OneDrive - Accenture\Downloads\wiki-html\wikipedia-en-html.tar.7z"

f = lzma.open(f7file, 'rb')
for line in f:
    lzma.decompress(line)
    break

Traceback (most recent call last)
  <ipython-input-5-d1a496a0c194> in <module>()
      4 
      5 f = lzma.open(f7file, 'rb')
----> 6 for line in f:
      7     lzma.decompress(line)
      8     break

  ~\AppData\Local\Continuum\anaconda3\lib\lzma.py in readline(self, size)
    220         """
    221         self._check_can_read()
--> 222         return self._buffer.readline(size)
    223 
    224     def write(self, data):

  ~\AppData\Local\Continuum\anaconda3\lib\_compression.py in readinto(self, b)
     66     def readinto(self, b):
     67         with memoryview(b) as view, view.cast("B") as byte_view:
---> 68             data = self.read(len(byte_view))
     69             byte_view[:len(data)] = data
     70         return len(data)

  ~\AppData\Local\Continuum\anaconda3\lib\_compression.py in read(self, size)
    101                 else:
    102                     rawblock = b""
--> 103                 data = self._decompressor.decompress(rawblock, size)
    104             if data:
    105                 break

LZMAError: Input format not supported by decoder
Padmaraj Bhat asked Oct 16 '25 19:10

1 Answer

When I had to do something like that, I called the 7z CLI via the subprocess module. That way, you can obtain both file lists and file contents from the archive.
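As a rough sketch of the listing half (not necessarily how I did it back then), the snippet below runs `7z l -slt` and collects the `Path = ...` lines of the entries. The helper names `parse_slt_paths` and `list_archive` and the `exe` parameter are made up for illustration. It assumes a `7z` binary on PATH and that the entry records come after the `----------` separator line in the `-slt` output.

```python
import subprocess


def parse_slt_paths(listing_text):
    """Return entry paths from the text printed by `7z l -slt` (hypothetical helper)."""
    paths = []
    in_entries = False
    for line in listing_text.splitlines():
        if line.startswith("----------"):
            # Assumption: everything before this separator describes the archive itself.
            in_entries = True
        elif in_entries and line.startswith("Path = "):
            paths.append(line[len("Path = "):])
    return paths


def list_archive(archive_path, exe="7z"):
    """Run the 7z CLI and return the entry paths it reports (hypothetical helper)."""
    result = subprocess.run(
        [exe, "l", "-slt", archive_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if 7z exits with an error
    )
    return parse_slt_paths(result.stdout)
```

For the archive in the question this would return only the inner wikipedia-en-html.tar, since the listing shows a single entry, so the .tar itself still has to be walked separately.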

For example, to extract files directly to stdout, you can use the -so option.
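Because the archive in the question wraps a single .tar, one possible way to use -so is to pipe 7z's stdout into tarfile in streaming mode ("r|") and keep only the members you want. This is a sketch under assumptions, not a tested recipe for the 15 GB dump: `iter_matching_members`, `iter_from_7z`, `wanted` and `exe` are hypothetical names, and a `7z` binary is assumed to be on PATH. Nothing is unpacked to disk, but each pass still decompresses the stream sequentially from the start.

```python
import subprocess
import tarfile


def iter_matching_members(stream, wanted):
    """Yield (name, data) for regular files in a tar stream whose name passes `wanted`."""
    # Mode "r|" reads the tar strictly front to back, so the stream need not be seekable.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if member.isfile() and wanted(member.name):
                # In streaming mode the data must be read before moving to the next member.
                yield member.name, tar.extractfile(member).read()


def iter_from_7z(archive_path, wanted, exe="7z"):
    """Stream the .tar inside a .7z via `7z x -so` and filter its members."""
    proc = subprocess.Popen(
        [exe, "x", "-so", archive_path],
        stdout=subprocess.PIPE,
    )
    try:
        yield from iter_matching_members(proc.stdout, wanted)
    finally:
        # Closing the pipe lets 7z stop early if the caller abandons the loop.
        proc.stdout.close()
        proc.wait()
```

Usage would look something like `for name, data in iter_from_7z("wikipedia-en-html.tar.7z", lambda n: n.endswith(".html")): ...`, where the predicate decides which entries are worth keeping, and you write out only the data you actually need.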

glglgl answered Oct 18 '25 07:10