Without resorting to ''.join, is there a Pythonic way to use PyYAML's yaml.load_all with fileinput.input() for easy streaming of multiple documents from multiple sources?
I'm looking for something like the following (non-working example):
# example.py
import fileinput
import yaml
for doc in yaml.load_all(fileinput.input()):
    print(doc)
Expected output:
$ cat >pre.yaml <<<'--- prefix-doc'
$ cat >post.yaml <<<'--- postfix-doc'
$ python example.py pre.yaml - post.yaml <<<'--- hello'
prefix-doc
hello
postfix-doc
Of course, yaml.load_all expects either a string, bytes, or a file-like object, and fileinput.input() is none of those things, so the above example does not work.
Actual output:
$ python example.py pre.yaml - post.yaml <<<'--- hello'
...
AttributeError: FileInput instance has no attribute 'read'
You can make the example work with ''.join, but that's cheating. I'm looking for a way that does not read the entire stream into memory at once.
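For reference, here is roughly what the ''.join workaround looks like; it only works because it concatenates every line from every input into a single string before parsing, which is exactly what I want to avoid:
# cheating: reads all of the input into memory before parsing
import fileinput
import yaml
for doc in yaml.load_all(''.join(fileinput.input())):
    print(doc)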
We might rephrase the question as: is there some way to emulate a string, bytes, or file-like object that proxies to an underlying iterator of strings? However, I doubt that yaml.load_all actually needs the entire file-like interface, so that phrasing would ask for more than is strictly necessary.
Ideally I'm looking for the minimal adapter that would support something like this:
for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)
As background: the fileinput.input() function takes a list of filenames to process (defaulting to sys.argv[1:]), with '-' or an empty list meaning standard input, and returns an iterator that yields individual lines from the files being processed, one file after the next.
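A quick sketch of that line-oriented interface (the filenames are just the ones from the example above):
import fileinput
# prints each line together with the file it came from; '-' is stdin
for line in fileinput.input(['pre.yaml', '-', 'post.yaml']):
    print(fileinput.filename(), repr(line))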
The problem with fileinput.input is that the resulting object doesn't have a read method, which is what yaml.load_all is looking for. If you're willing to give up fileinput, you can just write your own class that will do what you want:
import sys
import yaml

class BunchOFiles(object):
    # presents a sequence of files (with '-' meaning stdin) as one read()-able stream
    def __init__(self, *files):
        self.files = files
        self.fditer = self._fditer()
        self.fd = next(self.fditer)

    def _fditer(self):
        # yield one open file object at a time, closing each when the next is requested
        for fn in self.files:
            with sys.stdin if fn == '-' else open(fn, 'r') as fd:
                yield fd

    def read(self, size=-1):
        while True:
            data = self.fd.read(size)
            if data:
                break
            else:
                # current file is exhausted; advance to the next one
                try:
                    self.fd = next(self.fditer)
                except StopIteration:
                    self.fd = None
                    break
        return data

bunch = BunchOFiles(*sys.argv[1:])
for doc in yaml.load_all(bunch):
    print(doc)
The BunchOFiles class gets you an object with a read method that will happily iterate over a list of files until everything is exhausted. Given the above code and your sample input, we get exactly the output you're looking for.
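As a usage sketch, nothing forces you to go through sys.argv; the constructor accepts any mix of filenames and '-' for stdin (the names below are just the ones from the question):
bunch = BunchOFiles('pre.yaml', '-', 'post.yaml')
for doc in yaml.load_all(bunch):
    print(doc)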
Your minimal_adapter should take a fileinput.FileInput as a parameter and return an object which load_all can use. load_all either takes a string as an argument, which would require concatenating the input, or expects the argument to have a read() method.
Since your minimal_adapter needs to preserve some state, I find it clearest/easiest to implement it as an instance of a class that has a __call__ method, and have that method store its argument for future use and return the instance. Implemented that way, the class should also have a read() method, as this will be called after handing the instance to load_all:
import fileinput
import ruamel.yaml

class MinimalAdapter:
    def __init__(self):
        self._fip = None
        self._buf = None  # storage of read but unused material, maximum one line

    def __call__(self, fip):
        self._fip = fip  # store for future use
        self._buf = ""
        return self

    def read(self, size):
        if len(self._buf) >= size:
            # enough in buffer from last read, just cut it off and return
            tmp, self._buf = self._buf[:size], self._buf[size:]
            return tmp
        for line in self._fip:
            self._buf += line
            if len(self._buf) > size:
                break
        else:
            # ran out of lines, return what we have
            tmp, self._buf = self._buf, ''
            return tmp
        tmp, self._buf = self._buf[:size], self._buf[size:]
        return tmp

minimal_adapter = MinimalAdapter()
for doc in ruamel.yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)
With this, running your example invocation exactly gives the output that you want.
This is probably only more memory efficient for larger files. load_all tries to read 1024-byte blocks at a time (easily found out by putting a print statement in MinimalAdapter.read()) and fileinput does some buffering as well (use strace if you're interested to find out how it behaves).
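If you want to see those block sizes yourself, one way (sketched here as a hypothetical subclass of the MinimalAdapter above) is to log each request before delegating to the original read():
class TracingAdapter(MinimalAdapter):
    # hypothetical helper: print the size of every block load_all requests
    def read(self, size):
        print('read() called with size', size)
        return MinimalAdapter.read(self, size)

minimal_adapter = TracingAdapter()
for doc in ruamel.yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)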
This was done using ruamel.yaml, a YAML 1.2 parser of which I am the author. It should work with PyYAML, of which ruamel.yaml is a derived superset, as well.
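For the PyYAML variant, only the import and the load_all call should need to change; something along these lines (using safe_load_all here, my choice, to avoid constructing arbitrary Python objects) ought to behave the same:
import fileinput
import yaml

minimal_adapter = MinimalAdapter()  # the adapter class defined above
for doc in yaml.safe_load_all(minimal_adapter(fileinput.input())):
    print(doc)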