Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to concatenate multiple files for stdin of Popen

I'm porting a bash script to python 2.6, and want to replace some code:

cat $( ls -tr xyz_`date +%F`_*.log ) | filter args > bzip2

I guess I want something similar to the "Replacing shell pipe line" example at http://docs.python.org/release/2.6/library/subprocess.html, ala...

p1 = Popen(["filter", "args"], stdin=*?WHAT?*, stdout=PIPE)
p2 = Popen(["bzip2"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]

But, I'm not sure how best to provide p1's stdin value so it concatenates the input files. Seems I could add...

p0 = Popen(["cat", "file1", "file2"...], stdout=PIPE)
p1 = ... stdin=p0.stdout ...

...but that seems to be crossing beyond use of (slow, inefficient) pipes to call external programs with significant functionality. (Any decent shell performs the cat internally.)

So, I can imagine a custom class that satisfies the file object API requirements and can therefore be used for p1's stdin, concatenating arbitrary other file objects. (EDIT: existing answers explain why this isn't possible)

Does python 2.6 have a mechanism addressing this need/want, or might another Popen to cat be considered perfectly fine in python circles?

Thanks.

like image 625
Tony Delroy Avatar asked May 18 '11 09:05

Tony Delroy


People also ask

What does the popen () return on success?

If successful, popen() returns a pointer to an open stream that can be used to read or write to a pipe. If unsuccessful, popen() returns a NULL pointer and sets errno to one of the following values: Error Code. Description.

Do you need to close Popen?

Popen do we need to close the connection or subprocess automatically closes the connection? Usually, the examples in the official documentation are complete. There the connection is not closed. So you do not need to close most probably.

How does subprocess Popen work?

The subprocess module defines one class, Popen and a few wrapper functions that use that class. The constructor for Popen takes arguments to set up the new process so the parent can communicate with it via pipes. It provides all of the functionality of the other modules and functions it replaces, and more.

What is Popen in Python?

Python method popen() opens a pipe to or from command. The return value is an open file object connected to the pipe, which can be read or written depending on whether mode is 'r' (default) or 'w'.


2 Answers

You can replace everything that you're doing with Python code, except for your external utility. That way your program will remain portable as long as your external util is portable. You can also consider turning the C++ program into a library and using Cython to interface with it. As Messa showed, date is replaced with time.strftime, globbing is done with glob.glob and cat can be replaced with reading all the files in the list and writing them to the input of your program. The call to bzip2 can be replaced with the bz2 module, but that will complicate your program because you'd have to read and write simultaneously. To do that, you need to either use p.communicate or a thread if the data is huge (select.select would be a better choice but it won't work on Windows).

import sys
import bz2
import glob
import time
import threading
import subprocess

output_filename = '../whatever.bz2'
input_filenames = glob.glob(time.strftime("xyz_%F_*.log"))
p = subprocess.Popen(['filter', 'args'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
output = open(output_filename, 'wb')
output_compressor = bz2.BZ2Compressor()

def data_reader():
    for filename in input_filenames:
        f = open(filename, 'rb')
        p.stdin.writelines(iter(lambda: f.read(8192), ''))
    p.stdin.close()

input_thread = threading.Thread(target=data_reader)
input_thread.start()

with output:
    for chunk in iter(lambda: p.stdout.read(8192), ''):
        output.write(output_compressor.compress(chunk))

    output.write(output_compressor.flush())

input_thread.join()
p.wait()

Addition: How to detect file input type

You can use either the file extension or the Python bindings for libmagic to detect how the file is compressed. Here's a code example that does both, and automatically chooses magic if it is available. You can take the part that suits your needs and adapt it to your needs. The open_autodecompress should detect the mime encoding and open the file with the appropriate decompressor if it is available.

import os
import gzip
import bz2
try:
    import magic
except ImportError:
    has_magic = False
else:
    has_magic = True


mime_openers = {
    'application/x-bzip2': bz2.BZ2File,
    'application/x-gzip': gzip.GzipFile,
}

ext_openers = {
    '.bz2': bz2.BZ2File,
    '.gz': gzip.GzipFile,
}


def open_autodecompress(filename, mode='r'):
    if has_magic:
        ms = magic.open(magic.MAGIC_MIME_TYPE)
        ms.load()
        mimetype = ms.file(filename)
        opener = mime_openers.get(mimetype, open)
    else:
        basepart, ext = os.path.splitext(filename)
        opener = ext_openers.get(ext, open)
    return opener(filename, mode)
like image 121
Rosh Oxymoron Avatar answered Oct 22 '22 18:10

Rosh Oxymoron


If you look inside the subprocess module implementation, you will see that std{in,out,err} are expected to be fileobjects supporting fileno() method, so a simple concatinating file-like object with python interface (or even a StringIO object) is not suitable here.

If it were iterators, not file objects, you could use itertools.chain.

Of course, sacrificing the memory consumption you can do something like this:

import itertools, os

# ...

files = [f for f in os.listdir(".") if os.path.isfile(f)]
input = ''.join(itertools.chain(open(file) for file in files))
p2.communicate(input)
like image 26
abbot Avatar answered Oct 22 '22 17:10

abbot