I am trying to extract the first 10,000 lines from a bz2 file.
import bz2
import codecs
file = "file.bz2"
file_10000 = "file.txt"
output_file = codecs.open(file_10000,'w+','utf-8')
source_file = bz2.open(file, "r")
count = 0
for line in source_file:
    count += 1
    if count < 10000:
        output_file.writerow(line)
But I get the error "'module' object has no attribute 'open'". Do you have any ideas? Or maybe I could save the first 10,000 lines to a txt file in some other way? I am on Windows.
bz2.open(): This function opens a bzip2-compressed file and returns a file object. The file can be opened in binary or text mode, for reading or writing. When writing, the compresslevel argument, an integer from 1 to 9, sets the compression level.
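As a small sketch of the behaviour described above, assuming Python 3.3 or later (where bz2.open exists), the following writes a text file through bz2.open and reads it back. The file name demo_open.bz2 is made up for illustration.

```python
import bz2
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo_open.bz2")

# "wt" = write in text mode; compresslevel must be an integer from 1 to 9.
with bz2.open(path, "wt", compresslevel=9, encoding="utf-8") as fp:
    fp.write("first line\nsecond line\n")

# "rt" = read in text mode, so the file yields str lines rather than bytes.
with bz2.open(path, "rt", encoding="utf-8") as fp:
    lines = fp.readlines()
```

Opening with "rb" instead would yield bytes, which then have to be decoded before being written to a text file.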
The bzgrep utility is used to invoke the grep utility on bzip2 compressed files. All options specified are passed directly to grep. If no file is specified, the standard input is decompressed if necessary and fed to grep. Otherwise, the given files are decompressed (if necessary) and fed to grep.
The bz2.decompress(s) function decompresses bzip2-compressed bytes back into the original data. Return: the decompressed data (a bytes object in Python 3).
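A minimal round trip may make this clearer. The sample data below is invented for the illustration; bz2.compress and bz2.decompress both take and return bytes.

```python
import bz2

# Invented sample data; both functions operate on bytes, not on str.
original = b"line 1\nline 2\nline 3\n"

compressed = bz2.compress(original)
restored = bz2.decompress(compressed)

# Decode only after decompressing, if text is what you need.
text = restored.decode("utf-8")
```

Note that bz2.decompress works on the whole input at once, so for taking only the first lines of a large file, reading from a file object line by line is likely the better fit.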
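bz2.open was only added in Python 3.3, so on an older Python the module has no such attribute; bz2.BZ2File is available in both Python 2 and 3. Here is a fully working example that writes and then reads a test file much smaller than your 10,000 lines. It's nice to have working examples in questions so we can test easily.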
import bz2
import itertools
import codecs
file = "file.bz2"
file_10000 = "file.txt"
# write test file with 9 lines (BZ2File takes bytes, hence the encode)
with bz2.BZ2File(file, "w") as fp:
    fp.write('\n'.join('123456789').encode('utf-8'))
# the original script using BZ2File ... and 3 lines for test
# ...and fixing bugs:
#     1) it only writes 9999 instead of 10000
#     2) files don't do writerow
#     3) close the files
output_file = codecs.open(file_10000,'w+','utf-8')
source_file = bz2.BZ2File(file, "r")
count = 0
for line in source_file:
    count += 1
    if count <= 3:
        # BZ2File yields bytes; decode before handing to the utf-8 writer
        output_file.write(line.decode('utf-8'))
source_file.close()
output_file.close()
# show what you got
print('---- Test 1 ----')
print(repr(open(file_10000).read()))
A more efficient way to do it is to break out of the for loop after reading the lines you want. You can even leverage iterators to slim down the code like so:
# a faster way to read first 3 lines
with bz2.BZ2File(file) as source_file,\
        codecs.open(file_10000,'w+','utf-8') as output_file:
    # islice stops after 3 lines; decode because BZ2File yields bytes
    output_file.writelines(
        line.decode('utf-8') for line in itertools.islice(source_file, 3))
# show what you got
print('---- Test 2 ----')
print(repr(open(file_10000).read()))
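If you can run Python 3.3 or later, the same idea can be sketched with bz2.open in text mode, which avoids codecs altogether. The helper name copy_first_lines and the sample file names are made up for this illustration, and the UTF-8 encoding is an assumption about your data.

```python
import bz2
import itertools
import os
import tempfile


def copy_first_lines(src_path, dst_path, n):
    """Copy the first n lines of a bz2 file to a plain text file.

    Hypothetical helper for illustration; needs Python 3.3+ for bz2.open.
    Returns the number of lines actually written.
    """
    written = 0
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice ends the loop after n lines, so reading stops early.
        for line in itertools.islice(src, n):
            dst.write(line)
            written += 1
    return written


# Small self-check with a throwaway 9-line archive.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "sample.bz2")
dst = os.path.join(tmp_dir, "sample.txt")
with bz2.open(src, "wt", encoding="utf-8") as fp:
    fp.write("\n".join("123456789"))

count = copy_first_lines(src, dst, 3)
with open(dst, encoding="utf-8") as fp:
    result = fp.read()
```

For the original question the call would be copy_first_lines("file.bz2", "file.txt", 10000). If the archive has fewer lines than requested, the helper simply writes what is there and returns the smaller count.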