Scatter plot on large amount of data

Let's say I've got a large dataset (8,500,000 x 50), and I would like to make a scatter plot of X (the date) against Y (the measurement that was taken on that day).

All I could get is this: [image]

data_X = data['date_local']
data_Y = data['arithmetic_mean']
data_Y = data_Y.round(1)
data_Y = data_Y.astype(int)
data_X = data_X.astype(int)
sns.regplot(x=data_X, y=data_Y, data=data)
plt.show()

According to similar questions I've found on Stack Overflow, I can shuffle my data or take, for example, 1000 random values and plot them. But how do I implement this in such a way that every X (the date when a measurement was taken) still corresponds to its actual Y (the measurement)?

asked Jul 13 '17 22:07

dodo4545

1 Answer

First, answering your question:

You should use pandas.DataFrame.sample to get a random sample of rows from your DataFrame, and then use regplot on that sample. Below is a small example using random data:

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime
import numpy as np
import pandas as pd
import seaborn as sns

dates = pd.date_range('20080101', periods=10000, freq="D")
df = pd.DataFrame({"dates": dates, "data": np.random.randn(10000)})

dfSample = df.sample(1000)  # This is the important line: draw 1000 random rows
xdataSample, ydataSample = dfSample["dates"], dfSample["data"]

# Convert the dates to matplotlib date numbers so regplot gets numeric x values
sns.regplot(x=mdates.date2num(xdataSample), y=ydataSample)
plt.show()

In the regplot call I convert my X data to numbers because it is stored as datetimes. Depending on how your dates are stored, this conversion may not be necessary.
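To address the pairing concern from the question directly: sample draws whole rows, so each sampled date stays attached to its own measurement. Here is a minimal sketch, assuming the column names from the question (date_local, arithmetic_mean) and made-up stand-in data; random_state is only there to make the draw reproducible:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data. The column names come from the
# question; the values are made up for illustration.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "date_local": pd.date_range("2008-01-01", periods=10_000, freq="D"),
    "arithmetic_mean": rng.normal(size=10_000),
})

# sample() picks whole rows, so x and y are never shuffled independently.
sample = data.sample(n=1000, random_state=0)
data_X = sample["date_local"]
data_Y = sample["arithmetic_mean"]

# Each sampled row is identical to the row with the same index label
# in the full frame, i.e. every date still has its own measurement.
pairs_intact = sample.equals(data.loc[sample.index])
print(pairs_intact)
```

Because data_X and data_Y share the same index, they can be passed to a plotting function together without any risk of mismatching dates and measurements.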

So, instead of a plot crowded with all 10000 points, you'll get a plot showing only the 1000 sampled points.

Now, a suggestion:

Use sns.jointplot, which has a kind parameter. From the docs:

kind : { “scatter” | “reg” | “resid” | “kde” | “hex” }, optional

Kind of plot to draw.

With a density-based kind such as "kde" or "hex", what we create is similar to what matplotlib's hist2d does: something like a heatmap, built from your entire dataset. An example using random data:

dates = pd.date_range('20080101', periods=10000, freq="D")
df = pd.DataFrame({"dates": dates, "data": np.random.randn(10000)})

xdata, ydata = df["dates"], df["data"]
# Convert the dates to matplotlib date numbers so the x values are numeric
sns.jointplot(x=mdates.date2num(xdata), y=ydata, kind="kde")

plt.show()

The resulting plot is also good for seeing the distributions along each axis.
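If the KDE turns out to be slow on millions of rows, the hist2d route mentioned above can be tried directly in matplotlib. This is only a sketch and not part of the original suggestion: the data is synthetic, the column names are borrowed from the question, and the bin count of 100 and the output file name are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data: many measurements spread over roughly ten years.
rng = np.random.default_rng(0)
n = 200_000
day_offsets = pd.to_timedelta(rng.integers(0, 3650, size=n), unit="D")
data = pd.DataFrame({
    "date_local": pd.Timestamp("2008-01-01") + day_offsets,
    "arithmetic_mean": rng.normal(size=n),
})

# hist2d needs numeric x values, so convert the dates to matplotlib date numbers.
x = mdates.date2num(data["date_local"].to_numpy())
y = data["arithmetic_mean"].to_numpy()

fig, ax = plt.subplots()
# Every point lands in exactly one bin, so the whole dataset is used.
counts, xedges, yedges, image = ax.hist2d(x, y, bins=100)
ax.xaxis_date()  # show the x axis ticks as dates again
fig.colorbar(image, ax=ax, label="points per bin")
fig.savefig("hist2d_density.png")
```

Binning costs roughly one pass over the data, so this keeps every row in the picture instead of relying on a sample.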

answered Oct 16 '22 11:10

Vinícius Figueiredo

Tags: python, pandas, matplotlib, seaborn