I was coding my first project in Python 3.6 using Spotify's Luigi to arrange some Natural Language Processing Tasks in a pipeline.
I noticed that the output() function of a Task class always returns some kind of Target object, which is just some file somewhere, be it local or remote. Because my Tasks produce more complex data structures like parse trees, it's pretty awkward for me to write them into files as strings and read them back in afterwards.
Therefore I would like to ask if there is any possibility to pass Python objects between the tasks within a pipeline?
Luigi is a Python package that manages long-running batch processing, which is the automated running of data processing jobs on batches of items. Luigi allows you to define a data processing job as a set of dependent tasks. For example, task B depends on the output of task A.
Short answer: No.
Luigi's built-in parameters are meant for simple values such as date/datetime objects, strings, ints and floats, not for arbitrary Python objects. See docs for reference.
That means that you need to serialize your complex data structure as a string (using json, msgpack, whatever serializer you like, and even compress it) and pass it as a string parameter.
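As a rough illustration of that serialization step, here is a minimal sketch using only the standard library. The helper names encode_tree and decode_tree and the nested-list shape of the parse tree are assumptions made for this example, not part of Luigi:

```python
import base64
import json
import zlib


def encode_tree(tree):
    """Serialize a JSON-compatible structure to a compact ASCII string."""
    raw = json.dumps(tree, separators=(",", ":")).encode("utf-8")
    # Compress, then base64-encode so the result is plain text
    # that can be passed around as an ordinary string parameter.
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_tree(text):
    """Inverse of encode_tree: turn the string back into the structure."""
    raw = zlib.decompress(base64.b64decode(text.encode("ascii")))
    return json.loads(raw.decode("utf-8"))


# Hypothetical parse tree represented as nested lists.
example_tree = ["S", ["NP", "Luigi"], ["VP", ["V", "runs"], ["NP", "tasks"]]]
encoded = encode_tree(example_tree)
restored = decode_tree(encoded)
```

Keep in mind that JSON only round-trips lists, dicts, strings, numbers, booleans and None, so a tree built from custom classes would first have to be converted to such a form.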
Of course, you may write a custom Parameter subclass, but then you need to implement its serialize and parse methods yourself.
But take into account: if you use parameters instead of saving your calculated data to a target, you will be losing one key advantage of using Luigi. If the parent task in the tree (the one that consumes the data) fails more times than the retry count you specify, you'll need to run the task that calculates that complex data structure again. If your task computes complex data, takes a considerable amount of time or consumes a lot of resources, you should save its output as a target so that you don't have to repeat all that expensive computation.
And looking beyond: another task may need that data too, so why not save it?
Also, notice that targets are not only files: you may save your data to a database table, Redis, Hadoop, an Elasticsearch index, and many more: http://luigi.readthedocs.io/en/stable/api/luigi.contrib.html#submodules