Without resorting to ''.join, is there a Pythonic way to use PyYAML's yaml.load_all with fileinput.input() for easy streaming of multiple documents from multiple sources?
I'm looking for something like the following (non-working example):
# example.py
import fileinput
import yaml
for doc in yaml.load_all(fileinput.input()):
    print(doc)
Expected output:
$ cat >pre.yaml <<<'--- prefix-doc'
$ cat >post.yaml <<<'--- postfix-doc'
$ python example.py pre.yaml - post.yaml <<<'--- hello'
prefix-doc
hello
postfix-doc
Of course, yaml.load_all expects either a string, bytes, or a file-like object, and fileinput.input() is none of those things, so the above example does not work.
Actual output:
$ python example.py pre.yaml - post.yaml <<<'--- hello'
...
AttributeError: FileInput instance has no attribute 'read'
You can make the example work with ''.join, but that's cheating. I'm looking for a way that does not read the entire stream into memory at once.
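For reference, here is roughly what the ''.join workaround looks like; it only works because it concatenates every line from every input into a single string before parsing, which is exactly what I want to avoid:
# cheating: reads all of the input into memory before parsing
import fileinput
import yaml
for doc in yaml.load_all(''.join(fileinput.input())):
    print(doc)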
We might rephrase the question as: is there some way to emulate a string, bytes, or file-like object that proxies to an underlying iterator of strings? However, I doubt that yaml.load_all actually needs the entire file-like interface, so that phrasing would ask for more than is strictly necessary.
Ideally I'm looking for the minimal adapter that would support something like this:
for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)
As background: the fileinput.input() function takes a list of filenames to process (defaulting to sys.argv[1:]), with '-' or an empty list meaning standard input, and returns an iterator that yields individual lines from the files being processed, one file after the next.
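A quick sketch of that line-oriented interface (the filenames are just the ones from the example above):
import fileinput
# prints each line together with the file it came from; '-' is stdin
for line in fileinput.input(['pre.yaml', '-', 'post.yaml']):
    print(fileinput.filename(), repr(line))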
The problem with fileinput.input is that the resulting object doesn't have a read method, which is what yaml.load_all is looking for. If you're willing to give up fileinput, you can just write your own class that will do what you want:
import sys
import yaml

class BunchOFiles(object):
    # presents a sequence of files (with '-' meaning stdin) as one read()-able stream
    def __init__(self, *files):
        self.files = files
        self.fditer = self._fditer()
        self.fd = next(self.fditer)

    def _fditer(self):
        # yield one open file object at a time, closing each when the next is requested
        for fn in self.files:
            with sys.stdin if fn == '-' else open(fn, 'r') as fd:
                yield fd

    def read(self, size=-1):
        while True:
            data = self.fd.read(size)
            if data:
                break
            else:
                # current file is exhausted; advance to the next one
                try:
                    self.fd = next(self.fditer)
                except StopIteration:
                    self.fd = None
                    break
        return data

bunch = BunchOFiles(*sys.argv[1:])
for doc in yaml.load_all(bunch):
    print(doc)
The BunchOFiles class gets you an object with a read method that will happily iterate over a list of files until everything is exhausted. Given the above code and your sample input, we get exactly the output you're looking for.
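As a usage sketch, nothing forces you to go through sys.argv; the constructor accepts any mix of filenames and '-' for stdin (the names below are just the ones from the question):
bunch = BunchOFiles('pre.yaml', '-', 'post.yaml')
for doc in yaml.load_all(bunch):
    print(doc)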
Your minimal_adapter should take a fileinput.FileInput as a parameter and return an object which load_all can use. load_all either takes a string as an argument, which would require concatenating the input, or expects the argument to have a read() method.
Since your minimal_adapter needs to preserve some state, I find it clearest/easiest to implement it as an instance of a class that has a __call__ method, and have that method store its argument for future use and return the instance. Implemented that way, the class should also have a read() method, as this will be called after handing the instance to load_all:
import fileinput
import ruamel.yaml

class MinimalAdapter:
    def __init__(self):
        self._fip = None
        self._buf = None  # storage of read but unused material, maximum one line

    def __call__(self, fip):
        self._fip = fip  # store for future use
        self._buf = ""
        return self

    def read(self, size):
        if len(self._buf) >= size:
            # enough in buffer from last read, just cut it off and return
            tmp, self._buf = self._buf[:size], self._buf[size:]
            return tmp
        for line in self._fip:
            self._buf += line
            if len(self._buf) > size:
                break
        else:
            # ran out of lines, return what we have
            tmp, self._buf = self._buf, ''
            return tmp
        tmp, self._buf = self._buf[:size], self._buf[size:]
        return tmp

minimal_adapter = MinimalAdapter()
for doc in ruamel.yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)
With this, running your example invocation exactly gives the output that you want.
This is probably only more memory efficient for larger files. load_all tries to read 1024-byte blocks at a time (easily found out by putting a print statement in MinimalAdapter.read()) and fileinput does some buffering as well (use strace if you're interested to find out how it behaves).
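If you want to see those block sizes yourself, one way (sketched here as a hypothetical subclass of the MinimalAdapter above) is to log each request before delegating to the original read():
class TracingAdapter(MinimalAdapter):
    # hypothetical helper: print the size of every block load_all requests
    def read(self, size):
        print('read() called with size', size)
        return MinimalAdapter.read(self, size)

minimal_adapter = TracingAdapter()
for doc in ruamel.yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)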
This was done using ruamel.yaml, a YAML 1.2 parser of which I am the author. It should work with PyYAML, of which ruamel.yaml is a derived superset, as well.
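For the PyYAML variant, only the import and the load_all call should need to change; something along these lines (using safe_load_all here, my choice, to avoid constructing arbitrary Python objects) ought to behave the same:
import fileinput
import yaml

minimal_adapter = MinimalAdapter()  # the adapter class defined above
for doc in yaml.safe_load_all(minimal_adapter(fileinput.input())):
    print(doc)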