
Reading first lines of bz2 files in python

Tags: python, bz2

I am trying to extract the first 10,000 lines from a bz2 file.

   import bz2       
   file = "file.bz2"
   file_10000 = "file.txt"

   output_file = codecs.open(file_10000,'w+','utf-8')

   source_file = bz2.open(file, "r")
   count = 0
   for line in source_file:
       count += 1
       if count < 10000:
           output_file.writerow(line)

But I get the error "'module' object has no attribute 'open'". Do you have any ideas? Or maybe I could save the first 10,000 lines to a txt file in some other way? I am on Windows.

student asked May 11 '16 20:05




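1 Answer

The error occurs because bz2.open only exists in Python 3.3 and later; bz2.BZ2File is available on older versions as well. Here is a fully working example that writes and then reads a test file much smaller than your 10000 lines. It's nice to have working examples in questions so we can test easily.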

import bz2
import itertools
import codecs

file = "file.bz2"
file_10000 = "file.txt"

# write test file with 9 lines
with bz2.BZ2File(file, "w") as fp:
    # BZ2File works on bytes, so encode the test text before writing
    fp.write('\n'.join('123456789').encode('ascii'))

# the original script using BZ2File ... and 3 lines for test
# ...and fixing bugs:
#     1) it only writes 9999 instead of 10000
#     2) files don't do writerow
#     3) close the files

output_file = codecs.open(file_10000,'w+','utf-8')

source_file = bz2.BZ2File(file, "r")
count = 0
for line in source_file:
    count += 1
    if count <= 3:
        # BZ2File yields bytes; decode before handing to the utf-8 writer
        output_file.write(line.decode('utf-8'))
source_file.close()
output_file.close()

# show what you got
print('---- Test 1 ----')
print(repr(open(file_10000).read()))   

A more efficient way to do it is to stop reading as soon as you have the lines you want, instead of looping over the whole file. You can even use iterators to slim down the code, like so:

# a faster way to read first 3 lines
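with bz2.BZ2File(file) as source_file,\
        codecs.open(file_10000,'w+','utf-8') as output_file:
    first_lines = itertools.islice(source_file, 3)
    # decode each bytes line so the utf-8 writer receives text
    output_file.writelines(line.decode('utf-8') for line in first_lines)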

# show what you got
print('---- Test 2 ----')
print(repr(open(file_10000).read()))   
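
If you are able to run Python 3.3 or later, bz2.open does exist and can open the archive in text mode, so no manual decoding is needed. The following is only a sketch under that assumption: the helper name copy_first_lines and its parameters are made up for illustration, and it assumes the file is UTF-8 text.

```python
import bz2
import itertools


def copy_first_lines(src_path, dst_path, n):
    """Copy at most n lines from a bz2 file to a plain-text file.

    Hypothetical helper; needs Python 3.3+, where bz2.open() exists.
    Returns the number of lines actually written.
    """
    written = 0
    # "rt" and "w" are text modes, so lines arrive as str.
    # islice stops iterating after n lines, so the loop never walks
    # through the rest of the archive.
    with bz2.open(src_path, "rt", encoding="utf-8") as source_file, \
            open(dst_path, "w", encoding="utf-8") as output_file:
        for line in itertools.islice(source_file, n):
            output_file.write(line)
            written += 1
    return written
```

For the question as asked, the call would be copy_first_lines("file.bz2", "file.txt", 10000). If the archive holds fewer lines than requested, the helper just copies what is there.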
tdelaney answered Oct 04 '22 05:10