how to concatenate multiple files for stdin of Popen

Tags:

I'm porting a bash script to python 2.6, and want to replace some code:

cat $( ls -tr xyz_`date +%F`_*.log ) | filter args > bzip2

I guess I want something similar to the "Replacing shell pipe line" example at http://docs.python.org/release/2.6/library/subprocess.html, ala...

p1 = Popen(["filter", "args"], stdin=*?WHAT?*, stdout=PIPE)
p2 = Popen(["bzip2"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]

But, I'm not sure how best to provide p1's stdin value so it concatenates the input files. Seems I could add...

p0 = Popen(["cat", "file1", "file2"...], stdout=PIPE)
p1 = ... stdin=p0.stdout ...

...but that seems to be crossing beyond use of (slow, inefficient) pipes to call external programs with significant functionality. (Any decent shell performs the cat internally.)

So, I can imagine a custom class that satisfies the file object API requirements and can therefore be used for p1's stdin, concatenating arbitrary other file objects. (EDIT: existing answers explain why this isn't possible)

Does python 2.6 have a mechanism addressing this need/want, or might another Popen to cat be considered perfectly fine in python circles?

Thanks.

625

asked May 18 '11 09:05

Tony Delroy

2 Answers

You can replace everything that you're doing with Python code, except for your external utility. That way your program will remain portable as long as your external util is portable. You can also consider turning the C++ program into a library and using Cython to interface with it. As Messa showed, date is replaced with time.strftime, globbing is done with glob.glob and cat can be replaced with reading all the files in the list and writing them to the input of your program. The call to bzip2 can be replaced with the bz2 module, but that will complicate your program because you'd have to read and write simultaneously. To do that, you need to either use p.communicate or a thread if the data is huge (select.select would be a better choice but it won't work on Windows).

import sys
import bz2
import glob
import time
import threading
import subprocess

output_filename = '../whatever.bz2'
input_filenames = glob.glob(time.strftime("xyz_%F_*.log"))
p = subprocess.Popen(['filter', 'args'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
output = open(output_filename, 'wb')
output_compressor = bz2.BZ2Compressor()

def data_reader():
    for filename in input_filenames:
        f = open(filename, 'rb')
        p.stdin.writelines(iter(lambda: f.read(8192), ''))
    p.stdin.close()

input_thread = threading.Thread(target=data_reader)
input_thread.start()

with output:
    for chunk in iter(lambda: p.stdout.read(8192), ''):
        output.write(output_compressor.compress(chunk))

    output.write(output_compressor.flush())

input_thread.join()
p.wait()

Addition: How to detect file input type

You can use either the file extension or the Python bindings for libmagic to detect how the file is compressed. Here's a code example that does both, and automatically chooses magic if it is available. You can take the part that suits your needs and adapt it to your needs. The open_autodecompress should detect the mime encoding and open the file with the appropriate decompressor if it is available.

import os
import gzip
import bz2
try:
    import magic
except ImportError:
    has_magic = False
else:
    has_magic = True


mime_openers = {
    'application/x-bzip2': bz2.BZ2File,
    'application/x-gzip': gzip.GzipFile,
}

ext_openers = {
    '.bz2': bz2.BZ2File,
    '.gz': gzip.GzipFile,
}


def open_autodecompress(filename, mode='r'):
    if has_magic:
        ms = magic.open(magic.MAGIC_MIME_TYPE)
        ms.load()
        mimetype = ms.file(filename)
        opener = mime_openers.get(mimetype, open)
    else:
        basepart, ext = os.path.splitext(filename)
        opener = ext_openers.get(ext, open)
    return opener(filename, mode)

121

answered Oct 22 '22 18:10

Rosh Oxymoron

If you look inside the subprocess module implementation, you will see that std{in,out,err} are expected to be fileobjects supporting fileno() method, so a simple concatinating file-like object with python interface (or even a StringIO object) is not suitable here.

If it were iterators, not file objects, you could use itertools.chain.

Of course, sacrificing the memory consumption you can do something like this:

import itertools, os

# ...

files = [f for f in os.listdir(".") if os.path.isfile(f)]
input = ''.join(itertools.chain(open(file) for file in files))
p2.communicate(input)

answered Oct 22 '22 17:10

abbot

Related questions
                            
                                Why shouldn't I use async (evented) IO
                            
                                Fast way to read interleaved data?
                            
                                Twisted, gevent eventlet - When would I use them
                            
                                in Python, how to separate Local Hard Drives from Network and Floppy in Windows?
                            
                                How to lazy load a data structure (python)
                            
                                Django Python Garbage Collection woes
                            
                                Installing pyOpenSSL on Amazon Linux (EC2)
                            
                                Comparing/Clustering Trajectories (GPS data of (x,y) points) and Mining the data
                            
                                Build HTTP GET request with port number and parameters
                            
                                Trying to get Scrapy into a project to run Crawl command
                            
                                When using lxml, can the XML be rendered without namespace attributes?
                            
                                "getaddrinfo" error when trying to establish an SSH connection using Python Paramiko
                            
                                Close urllib2 connection
                            
                                Python: Tokenizing with phrases
                            
                                Creating a stream class in Python
                            
                                how to parse big datasets using RDFLib?
                            
                                How to fix shebang flags that are not recognized on some systems
                            
                                What is the standard Python module for SQL queries?
                            
                                Buildout tries to update system-wide Distribute installation and refuses to run
                            
                                how to share global variables across threads in python?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

how to concatenate multiple files for stdin of Popen

Tags:

python

concatenation

pipe

popen