
Use plotly in Google Colab to plot a dataset with more than 6M rows

Let me give you some context first. I was able to import the whole data of a Kaggle competition (M5 Accuracy) and it works really well. The problem is that when I try to draw a histogram with plotly (without aggregations), it doesn't return anything. When I use a sample instead, it plots properly.

Some additional info:

  • I used plotly's renderer for Google Colab.
  • The dataframe has 56M rows and my sample has 10M.
  • I ran matplotlib and seaborn histograms, and both successfully display histograms based on all the data.
  • I tried to run the histogram with a smaller dataframe of 6M rows. The same thing happens, but I was able to plot a 2M sample.
  • I tried a histogram with the tips dataframe from seaborn and it plots properly.
  • Graphs based on aggregations work perfectly.

Here is the link to my code. https://colab.research.google.com/drive/1uMU3ctDzkGObYeCfxF36hURT9WIvnrl7?usp=sharing

I know this is not a limitation for doing a well-designed analysis, but I am wondering whether it is possible to use all the available data, and what is causing this problem. Thank you for reading.

asked Nov 07 '22 06:11 by Jose_Chavez


1 Answer

It's already in an issue here

The solution is to aggregate first, e.g. with collections.Counter(), and then plot a bar chart instead.
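As far as I understand it, a plotly histogram ships every raw value to the browser and bins it there, which is likely why millions of rows render nothing while a bar chart of counts stays small. Here is a minimal sketch of the aggregate-then-plot idea. The helper names `bin_counts` and `plot_counts` and the `bin_width` parameter are my own for illustration, not part of plotly or of the linked notebook.

```python
from collections import Counter
import math


def bin_counts(values, bin_width=1.0):
    """Count how many values fall into each fixed-width bin.

    Returns a dict mapping the left edge of each bin to its count,
    ordered by bin edge.
    """
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {k * bin_width: counts[k] for k in sorted(counts)}


def plot_counts(counts):
    """Draw pre-aggregated counts as a bar chart.

    plotly is imported here so the aggregation step above
    can be used without it.
    """
    import plotly.graph_objects as go

    fig = go.Figure(go.Bar(x=list(counts.keys()), y=list(counts.values())))
    # bargap=0 makes the bars touch, so the chart reads like a histogram.
    fig.update_layout(bargap=0, xaxis_title="value", yaxis_title="count")
    return fig
```

You would then call something like `plot_counts(bin_counts(df["some_column"], bin_width=0.5)).show()`, where `some_column` is a placeholder for whichever column you are plotting. Only one number per bin is sent to the browser. For tens of millions of rows a pure-Python Counter loop may be slow; a vectorized aggregation such as `numpy.histogram` should do the same job faster.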

answered Nov 14 '22 23:11 by korakot