Persist Completed Pipeline in Luigi Visualiser

Tags: python, luigi

I'm starting to port a nightly data pipeline from a visual ETL tool to Luigi, and I really enjoy that there is a visualiser to see the status of jobs. However, I've noticed that a few minutes after the last job (named MasterEnd) completes, all of the nodes disappear from the graph except for MasterEnd. This is a little inconvenient, as I'd like to see that everything is complete for the day/past days.

Further, if I go directly to the last job's URL in the visualiser, it can't find any record that the job ran: Couldn't find task MasterEnd(date=2015-09-17, base_url=http://aws.east.com/, log_dir=/home/ubuntu/logs/). I have verified that it ran successfully this morning.

One thing to note is that I have a cron job that runs this pipeline every 15 minutes to check for a file on S3. If the file exists, the pipeline runs; otherwise it stops. I'm not sure whether that is causing the removal of tasks from the visualiser. I've noticed it generates a new PID every run, but I couldn't find a way in the docs to persist one PID per day.

So, my questions: Is it possible to persist the completed graph for the current day in the visualiser? And is there a way to see what has happened in the past?

Appreciate all the help

jpavs asked Sep 17 '15 17:09


1 Answer

I'm not 100% sure this is correct, but this is what I would try first. When you call luigi.run, pass it --scheduler-remove-delay. I'm guessing this is how long the scheduler waits before forgetting a task after all of its dependents have completed. If you look through Luigi's source, the default is 600 seconds, which would match the nodes disappearing a few minutes after MasterEnd completes. For example, to keep tasks for a full day (86400 seconds):

luigi.run(["--workers", "8", "--scheduler-remove-delay", "86400"], main_task_cls=task_name)
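To avoid hard-coding the number of seconds, the argument list could be built from a timedelta. This is only a sketch: build_luigi_args is a hypothetical helper name, not part of Luigi, and it just produces the same list of strings passed to luigi.run above.

```python
from datetime import timedelta


def build_luigi_args(workers, keep_for):
    """Hypothetical helper: build the argument list shown in the answer.

    workers  -- number of worker processes
    keep_for -- timedelta for how long finished tasks should stay visible
    """
    # --scheduler-remove-delay is given in seconds
    remove_delay = int(keep_for.total_seconds())
    return [
        "--workers", str(workers),
        "--scheduler-remove-delay", str(remove_delay),
    ]


args = build_luigi_args(workers=8, keep_for=timedelta(days=1))
print(args)

# With Luigi installed, the list would then be passed along as in the answer:
# luigi.run(args, main_task_cls=task_name)
```

If the pipeline talks to a central scheduler daemon rather than a local one, the delay may need to be set in that daemon's own configuration instead; I haven't verified that, so treat it as an assumption.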
Charlie Haley answered Oct 22 '22 23:10