
Plotting Large Datasets in IPython Notebook (Bokeh)

I have a large dataset that I would like to plot in an IPython notebook.

I read the ~0.5 GB .csv file into a Pandas DataFrame using read_csv, which takes about two minutes. Then I try to plot the data.

import pandas as pd
from bokeh.io import output_notebook
from bokeh.plotting import figure, show

data = pd.read_csv('large.csv')
output_notebook()
p1 = figure()
p1.circle(data.index, data['myDataset'])
show(p1)

My browser spins and does not show me any plots. I have tried the following:

  1. output_file() instead of output_notebook()
  2. Graphing using a ColumnDataSource object as the source argument to circle()
  3. Downsampling my data to something more manageable.

Bokeh claims on its website to offer "high-performance interactivity over very large or streaming datasets". How do I visualize these large datasets without my computer grinding to a halt?

Dylan Kirkby asked Dec 20 '15 05:12




1 Answer

The question is too broad to offer any specific code suggestions. It would help to know how far you downsampled the data. Bokeh's default HTML canvas backend can definitely accommodate tens of thousands of circles. Beyond that, there are a few options:

  • For simple scatters and lines of hundreds of thousands of points, there is a WebGL backend that may be useful.

    http://docs.bokeh.org/en/latest/docs/user_guide/webgl.html

  • Using the Bokeh server, create a Bokeh app that downsamples the data before rendering it. There are some app examples here:

    https://github.com/bokeh/bokeh/tree/master/examples/app

  • The Datashader library can be used to downsample large data sets (hundreds of millions to billions of points), and integrates very well with Bokeh.
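
As a rough illustration of the first two ideas (downsample first, then render with WebGL), here is a minimal sketch. It is not taken from the Bokeh docs or from the question: `minmax_downsample` and `plot_downsampled` are hypothetical helper names, the bucket count of 5000 is an arbitrary starting point, and the plotting part assumes a reasonably recent Bokeh release where `figure(output_backend="webgl")` and `scatter()` are available. The downsampler keeps the minimum and maximum of each bucket so that spikes are not averaged away.

```python
def minmax_downsample(xs, ys, n_buckets):
    """Reduce (xs, ys) to at most 2 * n_buckets points.

    Each bucket contributes its minimum and maximum y value, so outliers
    stay visible. xs and ys are plain sequences of equal length.
    """
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    n = len(ys)
    if n <= 2 * n_buckets:
        # Already small enough; return copies unchanged.
        return list(xs), list(ys)

    out_x, out_y = [], []
    for b in range(n_buckets):
        start = b * n // n_buckets
        stop = (b + 1) * n // n_buckets
        indices = range(start, stop)
        lo = min(indices, key=lambda i: ys[i])
        hi = max(indices, key=lambda i: ys[i])
        # Emit the extremes in their original order (once if they coincide).
        for i in sorted({lo, hi}):
            out_x.append(xs[i])
            out_y.append(ys[i])
    return out_x, out_y


def plot_downsampled(xs, ys, n_buckets=5000):
    """Plot a downsampled scatter with Bokeh's WebGL backend (assumed available)."""
    # Imported lazily so minmax_downsample can be used without Bokeh installed.
    from bokeh.plotting import figure, show

    sx, sy = minmax_downsample(xs, ys, n_buckets)
    p = figure(output_backend="webgl")
    p.scatter(sx, sy, marker="circle")
    show(p)
```

With the DataFrame from the question, this could be called as `plot_downsampled(data.index.tolist(), data['myDataset'].tolist())` after `output_notebook()`. For data far beyond a few million rows, the Datashader route is likely the better fit, since a pure-Python loop like this one will itself become slow.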

bigreddot answered Oct 31 '22 11:10