Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scatter plot on large amount of data

Let's say i've got a large dataset(8500000X50). And i would like to scatter plot X(date) and Y(the measurement that was taken at a certain day).

I could get only this: enter image description here

data_X = data['date_local']
data_Y = data['arithmetic_mean']
data_Y = data_Y.round(1)
data_Y = data_Y.astype(int)
data_X = data_X.astype(int)
sns.regplot(data_X, data_Y, data=data)
plt.show()

According to somehow 'same' questions i've found at Stackoverflow, i can shuffle my data or take for example 1000 random values and plot them. But how to implement it in such a manner that every X(date when the certain measurement was taken) will correspond to actual(Y measurement).

like image 783
dodo4545 Avatar asked Jul 13 '17 22:07

dodo4545


People also ask

How do you graph a large amount of data?

Scatter plots are best for showing distribution in large data sets.

What is scatter plot in big data?

Scatter plots are the graphs that present the relationship between two variables in a data-set. It represents data points on a two-dimensional plane or on a Cartesian system. The independent variable or attribute is plotted on the X-axis, while the dependent variable is plotted on the Y-axis.

What are the limitations of a scatter plot?

Limitations of a Scatter Diagram Scatter diagrams cannot give you the exact extent of correlation. A scatter diagram does not show a quantitative measurement of the relationship between the variables. It only shows the quantitative expression of quantitative change.


1 Answers

First, answering your question:

You should use pandas.DataFrame.sample to get a sample from your dateframe, and then use regplot, below is a small example using random data:

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime
import numpy as np
import pandas as pd
import seaborn as sns

dates = pd.date_range('20080101', periods=10000, freq="D")
df = pd.DataFrame({"dates": dates, "data": np.random.randn(10000)})
   
dfSample = df.sample(1000) # This is the importante line
xdataSample, ydataSample = dfSample["dates"], dfSample["data"]

sns.regplot(x=mdates.date2num(xdataSample.astype(datetime)), y=ydataSample) 
plt.show()

On regplot I perform a convertion in my X data because of datetime's type, notice this definitely should not be necessary depending on your data.

So, instead of something like this:

You'll get something like this:


Now, a suggestion:

Use sns.jointplot, which has a kind parameter, from the docs:

kind : { “scatter” | “reg” | “resid” | “kde” | “hex” }, optional

Kind of plot to draw.

What we create here is a similar of what matplotlib's hist2d does, it creates something like a heatmap, using your entire dataset. An example using random data:

dates = pd.date_range('20080101', periods=10000, freq="D")
df = pd.DataFrame({"dates": dates, "data": np.random.randn(10000)})

xdata, ydata = df["dates"], df["data"]
sns.jointplot(x=mdates.date2num(xdata.astype(datetime)), y=ydata, kind="kde")

plt.show()

This results in this image, which is also good for seeing the distributions along your desired axis:

like image 120
Vinícius Figueiredo Avatar answered Oct 16 '22 11:10

Vinícius Figueiredo