
Bokeh: Repeated plotting in jupyter lab increases (browser) memory usage

I am using Bokeh to plot many time-series (>100) with many points (~20,000) within a Jupyter Lab Notebook. When executing the cell multiple times in Jupyter the memory consumption of Chrome increases per run by over 400mb. After several cell executions Chrome tends to crash, usually when several GB of RAM usage are accumulated. Further, the plotting tends to get slower after each execution.

A "Clear [All] Outputs" or "Restart Kernel and Clear All Outputs..." in Jupyter does not free any memory either. The issue also occurs in a classic Jupyter Notebook, and with Firefox or Edge.

Minimal version of my .ipynb:

import numpy as np
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
import bokeh
output_notebook() # See e.g.: https://github.com/bokeh/bokeh-notebooks/blob/master/tutorial/01%20-%20Basic%20Plotting.ipynb
# Just create a list of numpy arrays with random-walks as dataset
ts_length = 20000
n_lines = 100
np.random.seed(0)
dataset = [np.cumsum(np.random.randn(ts_length)) + i*100 for i in range(n_lines)]
# Plot exactly the same linechart every time
plot = figure(x_axis_type="linear")
for data in dataset:
    plot.line(x=range(ts_length), y=data)
show(plot)

This 'memory-leak' behavior continues, even if I execute the following cell every time before re-executing the (plot) cell above:

bokeh.io.curdoc().clear()  
bokeh.io.state.State().reset()
bokeh.io.reset_output()

output_notebook() # has to be done again because output was reset


Are there any additional mechanisms in Bokeh that I might have overlooked which would allow me to clean up the plot and free the memory (in the browser/JS/client)?

Do I have to plot (or show the plot) differently within a Jupyter Notebook to avoid this issue? Or is this simply a bug in Bokeh/Jupyter?


Installed Versions on my System (Windows 10):

  • Python 3.6.6 : Anaconda custom (64-bit)
  • bokeh: 1.4.0
  • Chrome: 78.0.3904.108
  • jupyter:
    • core: 4.6.1
    • lab: 1.1.4
    • ipywidgets: 7.5.1
    • labextensions:
      • @bokeh/jupyter_bokeh: v1.1.1
      • @jupyter-widgets/jupyterlab-manager: v1.0.*
Asked Dec 04 '19 by pch


1 Answer

TL;DR: This is probably worth making an issue (or issues) for.

Memory usage

Just some notes about different aspects:

Clear/Reset functions

First, note that these calls:

bokeh.io.curdoc().clear()  
bokeh.io.state.State().reset()
bokeh.io.reset_output()

only affect data structures in the Python process (i.e. the Jupyter kernel). They never have any effect on the browser's memory usage or footprint.

One-time Memory footprint

Based on the data alone, I'd expect somewhere in the neighborhood of ~64MB:

20000 * 100 * 2 * 2  * 8 = 64MB

That's 100 lines with 20k (x, y) points each, which are also converted to (sx, sy) screen coordinates, all stored in float64 (8-byte) typed arrays. However, Bokeh also constructs a spatial index over all the data to support things like hover tools. I expect you are blowing up this index with this data. It is probably worth making this feature configurable, so that folks who do not need hit testing do not have to pay for it. A feature-request issue to discuss this would be appropriate.
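That back-of-envelope figure can be checked in plain Python (a sketch; the factor of two copies reflects the data-space plus screen-space arrays described above):

```python
# Rough estimate of raw glyph data held by BokehJS, assuming each point is
# stored as float64 both in data space (x, y) and screen space (sx, sy).
ts_length = 20_000
n_lines = 100
floats_per_point = 2   # (x, y)
copies = 2             # data-space + screen-space arrays
bytes_per_float = 8    # float64

total_bytes = ts_length * n_lines * floats_per_point * copies * bytes_per_float
print(total_bytes)  # 64000000, i.e. ~64 MB
```

Note this counts only the raw coordinate arrays, not the spatial index or any other per-glyph overhead.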

Repeated Execution

There are supposed to be DOM event triggers that clean up when a notebook cell is re-executed. Perhaps these have become broken? Maintaining integrations between three large hybrid Python/JS tools (including the classic Notebook) with a tiny team is unfortunately an ongoing challenge. A bug-report issue would be appropriate so that this can be tracked and investigated.

Other options

What can you do, right now?

More optimal usage

At least for the specific case you have here, with timeseries all of the same length, the code above is structured in a very suboptimal way. You should try putting everything in a single ColumnDataSource instead:

import numpy as np
from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
output_notebook()

ts_length = 20000
n_lines = 100
np.random.seed(0)

source = ColumnDataSource(data=dict(x=np.arange(ts_length)))
for i in range(n_lines):
    source.data[f"y{i}"] = np.cumsum(np.random.randn(ts_length)) + i*100

plot = figure()
for i in range(n_lines):
    plot.line(x='x', y=f"y{i}", source=source)
show(plot)

By passing sequence literals to line, your code results in the creation of 99 unnecessary CDS objects (one per line call). It also does not re-use the x data, so 99 * 20k extra points are sent to BokehJS unnecessarily. And by sending a plain list instead of a NumPy array, these all get encoded with the less efficient (in time and space) default JSON encoding, instead of the efficient binary encoding that is available for NumPy arrays.

That said, this is not causing all the issues here, and is probably not a solution on its own. But I wanted to make sure to point it out.
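To make the encoding cost concrete, here is a small stdlib-only sketch (the `array` module stands in for a NumPy float64 array) comparing the size of a JSON-encoded plain list of random-walk values against the equivalent binary float64 buffer:

```python
import json
import random
from array import array

random.seed(0)
n = 20_000

# Build a random walk, similar to the data in the question.
total = 0.0
ys = []
for _ in range(n):
    total += random.gauss(0, 1)
    ys.append(total)

json_size = len(json.dumps(ys).encode("utf-8"))   # text encoding of a plain list
binary_size = len(array("d", ys).tobytes())       # raw float64 buffer, 8 bytes/value

print(json_size, binary_size)
```

The binary buffer is exactly 8 * 20,000 = 160,000 bytes, while the JSON text for typical random-walk values is substantially larger, and must additionally be parsed into typed arrays on the JS side.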

Datashader

For this many points, you might consider using Datashader in conjunction with Bokeh. The HoloViews library also integrates Bokeh and Datashader automatically at a high level. By pre-rendering images on the Python side, Datashader is effectively a bandwidth-compression tool (among other things).

PNG export

Bokeh tilts its trade-offs towards affording various kinds of interactivity. But if you don't actually need that interactivity, then you are paying extra costs for nothing. If that's your situation, you could consider generating static PNGs instead:

from bokeh.io.export import get_screenshot_as_png
p = get_screenshot_as_png(plot)


You'll need to install the additional optional dependencies listed in Exporting Plots, and if you are creating many plots you might want to consider saving and re-using a webdriver explicitly for each call.

Answered Oct 12 '22 by bigreddot