How to use yaml.load_all with fileinput.input?

Tags: python, pyyaml

Without resorting to ''.join, is there a Pythonic way to use PyYAML's yaml.load_all with fileinput.input() for easy streaming of multiple documents from multiple sources?

I'm looking for something like the following (non-working example):

# example.py
import fileinput

import yaml

for doc in yaml.load_all(fileinput.input()):
    print(doc)

Expected output:

$ cat >pre.yaml <<<'--- prefix-doc'
$ cat >post.yaml <<<'--- postfix-doc'
$ python example.py pre.yaml - post.yaml <<<'--- hello'
prefix-doc
hello
postfix-doc

Of course, yaml.load_all expects a string, bytes, or a file-like object, and fileinput.input() returns none of those, so the above example does not work.

Actual output:

$ python example.py pre.yaml - post.yaml <<<'--- hello'
...
AttributeError: FileInput instance has no attribute 'read'

You can make the example work with ''.join, but that's cheating. I'm looking for a way that does not read the entire stream into memory at once.

We might rephrase the question as: "Is there some way to emulate a string, bytes, or file-like object that proxies to an underlying iterator of strings?" However, I doubt that yaml.load_all actually needs the entire file-like interface, so that phrasing would ask for more than is strictly necessary.

Ideally I'm looking for the minimal adapter that would support something like this:

for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)
CJ Gaconnet asked Sep 06 '16 23:09




2 Answers

The problem with fileinput.input is that the resulting object doesn't have a read method, which is what yaml.load_all is looking for. If you're willing to give up fileinput, you can just write your own class that will do what you want:

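import sys
import yaml


class BunchOFiles(object):
    """File-like object whose read() walks through several files in turn."""

    def __init__(self, *files):
        self.files = files
        self.fditer = self._fditer()
        # None means every file is exhausted (or no files were given).
        self.fd = next(self.fditer, None)

    def _fditer(self):
        for fn in self.files:
            if fn == '-':
                # Yield stdin as-is so that it is not closed behind our back.
                yield sys.stdin
            else:
                with open(fn, 'r') as fd:
                    yield fd

    def read(self, size=-1):
        while self.fd is not None:
            data = self.fd.read(size)
            if data:
                return data
            # The current file is exhausted; move on to the next one.
            self.fd = next(self.fditer, None)
        return ''


bunch = BunchOFiles(*sys.argv[1:])
# safe_load_all needs no Loader argument, which newer PyYAML requires for load_all.
for doc in yaml.safe_load_all(bunch):
    print(doc)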

The BunchOFiles class gets you an object with a read method that will happily iterate over a list of files until everything is exhausted. Given the above code and your sample input, we get exactly the output you're looking for.
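
To see how such a chained read() behaves at file boundaries, here is a small sketch that uses in-memory streams instead of real files. ChainedReader is a hypothetical name, not part of PyYAML or the standard library, and it assumes the streams are already open. Note that a single read() never spans two streams, so a chunk can come back shorter than requested; that is fine for any consumer that keeps calling read() until it gets an empty result.

```python
import io


class ChainedReader:
    """Hypothetical helper: read() from several open streams, one after another."""

    def __init__(self, *streams):
        self._streams = iter(streams)
        self._current = next(self._streams, None)

    def read(self, size=-1):
        while self._current is not None:
            data = self._current.read(size)
            if data:
                return data
            # Current stream is exhausted, so fall through to the next one.
            self._current = next(self._streams, None)
        return ""  # every stream is exhausted


# Three sources, the middle one empty, read in chunks of four characters.
reader = ChainedReader(
    io.StringIO("--- a\n"), io.StringIO(""), io.StringIO("--- b\n")
)
chunks = []
while True:
    chunk = reader.read(4)
    if not chunk:
        break
    chunks.append(chunk)

print(chunks)
```

Joining the chunks gives back the concatenation of all sources, which is what the YAML parser ends up seeing.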

larsks answered Sep 19 '22 00:09



Your minimal_adapter should take a fileinput.FileInput as a parameter and return an object which load_all can use. load_all accepts either a string, which would require concatenating the input, or an object that has a read() method.

Since your minimal_adapter needs to preserve some state, I find it clearest/easiest to implement it as an instance of a class that has a __call__ method, and have that method return the instance and store its argument for future use. Implemented that way, the class should also have a read() method, as this will be called after handing the instance to load_all:

import fileinput

from ruamel.yaml import YAML


class MinimalAdapter:
    def __init__(self):
        self._fip = None
        self._buf = None  # storage of read but unused material, maximum one line

    def __call__(self, fip):
        self._fip = fip  # store for future use
        self._buf = ""
        return self

    def read(self, size):
        if len(self._buf) >= size:
            # enough in buffer from last read, just cut it off and return
            tmp, self._buf = self._buf[:size], self._buf[size:]
            return tmp
        for line in self._fip:
            self._buf += line
            if len(self._buf) >= size:
                # enough material collected, stop pulling lines
                break
        else:
            # ran out of lines, return what we have
            tmp, self._buf = self._buf, ''
            return tmp
        tmp, self._buf = self._buf[:size], self._buf[size:]
        return tmp


minimal_adapter = MinimalAdapter()

# Newer ruamel.yaml releases dropped the module-level load_all();
# loading goes through a YAML() instance instead.
yaml = YAML(typ='safe')

for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)

With this, running your example invocation gives exactly the output that you want.

This is probably only more memory efficient for larger files. load_all tries to read 1024-byte blocks at a time (easily found out by putting a print statement in MinimalAdapter.read()), and fileinput does some buffering as well (use strace if you're interested in finding out how it behaves).
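
If you would rather not edit the adapter to add a print statement, a thin wrapper can record the sizes for you. The following is a minimal sketch: LoggingReader and pyyaml_read_sizes are hypothetical names introduced here, and the sketch assumes PyYAML is installed and uses its safe_load_all. The sizes you see depend on the installed version and on whether the C extension is in use, so treat the block size as something to measure rather than rely on.

```python
import io


class LoggingReader:
    """Hypothetical wrapper: forwards read() and records each requested size."""

    def __init__(self, stream):
        self._stream = stream
        self.sizes = []

    def read(self, size=-1):
        self.sizes.append(size)
        return self._stream.read(size)


def pyyaml_read_sizes(text):
    """Parse `text` with PyYAML and return (documents, requested read sizes)."""
    import yaml  # PyYAML; imported here so the wrapper above stays stdlib-only

    reader = LoggingReader(io.StringIO(text))
    docs = list(yaml.safe_load_all(reader))
    return docs, reader.sizes
```

Calling pyyaml_read_sizes("--- hello\n--- world\n") returns the parsed documents together with the list of sizes the parser asked for. The same wrapper can be placed around any object that has a read() method, including the adapter above.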


This was done using ruamel.yaml, a YAML 1.2 parser of which I am the author. It should work for PyYAML as well, since ruamel.yaml is a derived superset of it.

Anthon answered Sep 20 '22 00:09
