
Pythonic way to send contents of a file to a pipe and count # lines in a single step

Given the >4 GB file myfile.gz, I need to zcat it into a pipe for consumption by Teradata's fastload. I also need to count the number of lines in the file. Ideally, I want to make only a single pass through the file. I currently use awk to echo each line ($0) to stdout and, in awk's END clause, write the number of rows (awk's NR variable) to a separate file (outfile).

I've managed to do this using awk but I'd like to know if a more pythonic way exists.

#!/usr/bin/env python
from subprocess import Popen, PIPE
from os import path

the_file = "/path/to/file/myfile.gz"

outfile = "/tmp/%s.count" % path.basename(the_file)
# With shell=True, pass the command as a single string; the shell supplies its own "-c".
cmd = 'zcat %s | awk \'{print $0} END {print NR > "%s"}\'' % (the_file, outfile)
zcat_proc = Popen(cmd, stdout=PIPE, shell=True)

The pipe is later consumed by a call to Teradata's fastload, which reads from

"/dev/fd/" + str(zcat_proc.stdout.fileno())

This works, but I'd like to know if it's possible to skip awk and take better advantage of Python. I'm also open to other methods. I have multiple large files that I need to process in this manner.

Neil Kodner asked Dec 22 '22 04:12



2 Answers

There's no need for either zcat or awk. Counting the lines in a gzipped file can be done with

import gzip

nlines = sum(1 for ln in gzip.open("/path/to/file/myfile.gz"))

If you want to do something else with the lines, such as pass them to a different process, do

nlines = 0
for ln in gzip.open("/path/to/file/myfile.gz"):
    nlines += 1
    # pass the line to the other process
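
One way to fill in the placeholder comment above is to write each line to the other process's stdin. The following is only a minimal sketch, assuming Python 3 and a consumer that reads its input from stdin; the function name `count_and_forward` and the consumer command are hypothetical, and how fastload itself is invoked is not covered here.

```python
import gzip
import subprocess


def count_and_forward(gz_path, consumer_cmd):
    """Stream the decompressed lines of gz_path to consumer_cmd's stdin.

    Returns the number of lines forwarded. consumer_cmd is a hypothetical
    argument list for whatever process should receive the data.
    """
    nlines = 0
    proc = subprocess.Popen(consumer_cmd, stdin=subprocess.PIPE)
    try:
        with gzip.open(gz_path, "rb") as f:
            for line in f:
                proc.stdin.write(line)  # lines are bytes in "rb" mode
                nlines += 1
    finally:
        proc.stdin.close()  # signal end of input to the consumer
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError("consumer exited with status %d" % returncode)
    return nlines
```

The file is read once: each line is counted as it is forwarded, so no second pass is needed.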
Fred Foo answered May 09 '23 21:05



Counting lines and unzipping gzip-compressed files can be easily done with Python and its standard library. You can do everything in a single pass:

import gzip, subprocess, os
fifo_path = "path/to/fastload-fifo"
os.mkfifo(fifo_path)
# Start the reader first: opening a FIFO for writing blocks until a reader opens it.
fastload = subprocess.Popen(["fastload", "--read-from", fifo_path])
nlines = 0
# gzip.open yields bytes, so the FIFO is opened for binary writing.
with gzip.open("/path/to/file/myfile.gz") as f, open(fifo_path, "wb") as fastload_fifo:
    for nlines, line in enumerate(f, 1):
        fastload_fifo.write(line)
fastload.wait()
print("Number of lines: %d" % nlines)
os.unlink(fifo_path)

I don't know how to invoke Fastload, so substitute the correct parameters in the invocation.
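
If fastload has to keep reading from a "/dev/fd/N" path, as in the question, an anonymous pipe can take the place of the named FIFO. This is a rough sketch rather than a tested fastload recipe: it assumes Python 3 on a POSIX system that provides /dev/fd, and the function name `stream_gzip_via_fd` and the `make_cmd` callback (which builds the consumer's argument list from the path) are hypothetical.

```python
import gzip
import os
import subprocess


def stream_gzip_via_fd(gz_path, make_cmd):
    """Feed the decompressed contents of gz_path to a consumer via /dev/fd/N.

    make_cmd is a hypothetical callback that takes the /dev/fd path and
    returns the consumer's argument list.
    Returns (line_count, consumer_exit_status).
    """
    read_fd, write_fd = os.pipe()
    # pass_fds keeps read_fd open under the same number in the child.
    proc = subprocess.Popen(make_cmd("/dev/fd/%d" % read_fd),
                            pass_fds=(read_fd,))
    os.close(read_fd)  # the child holds its own copy of the read end
    nlines = 0
    with gzip.open(gz_path, "rb") as src, os.fdopen(write_fd, "wb") as dst:
        for line in src:
            dst.write(line)
            nlines += 1
    # Leaving the with block closes the write end, so the consumer sees EOF.
    return nlines, proc.wait()
```

Compared with the FIFO version, nothing is created on the filesystem, so there is no path to clean up afterwards.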

Sven Marnach answered May 09 '23 23:05
