
Subsample pandas dataframe

I have a DataFrame loaded from a .tsv file. I wanted to generate some exploratory plots. The problem is that the data set is large (~1 million rows), so there are too many points on the plot to see a trend. Plus, it is taking a while to plot.

I want to sub-sample 10,000 randomly chosen rows. The result should be reproducible, so that the same sequence of random numbers is generated on each run.

The question "Sample two pandas dataframes the same way" seems to be on the right track, but with that approach I cannot guarantee the subsample size.

Nishant asked Sep 10 '13 08:09

1 Answer

You can select random elements from the index with np.random.choice. For example, to select 5 random rows:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10))

df.loc[np.random.choice(df.index, 5, replace=False)]
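The question also asks for reproducibility, which the snippet above does not address on its own. A minimal sketch of one way to get it, assuming a seeded np.random.RandomState and a DataFrame with a unique index; the helper name subsample, the seed value, and the example column "x" are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd


def subsample(df, n, seed=0):
    """Return n distinct rows of df, chosen the same way for a given seed.

    Hypothetical helper: assumes df has a unique index, otherwise
    df.loc may return more than n rows.
    """
    rng = np.random.RandomState(seed)  # private, seeded generator
    chosen = rng.choice(df.index.values, size=n, replace=False)
    return df.loc[chosen]


# Stand-in for the ~1 million row data set from the question.
df = pd.DataFrame({"x": np.arange(1000000)})

sample_a = subsample(df, 10000, seed=42)
sample_b = subsample(df, 10000, seed=42)  # same seed, same rows
```

Because replace=False is used, the subsample size is exactly n. Newer pandas versions also offer DataFrame.sample(n=..., random_state=...), which should serve the same purpose if it is available in your installation.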

np.random.choice is new in NumPy 1.7. If you need a solution that works with an older NumPy, you can shuffle the index and take the first elements of the result:

df.loc[np.random.permutation(df.index)[:5]]

Either way, the rows of the resulting DataFrame are no longer in their original order. If the order matters for plotting (for example, in a line plot), you can simply call .sort_index() afterwards (.sort() in old pandas versions).
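The shuffle-then-sort approach can be sketched as follows. This is only an illustration on a small frame; the seeded RandomState, the column name "value", and the variable names are assumptions added here for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # seeded so the example is repeatable
df = pd.DataFrame({"value": rng.rand(10)})

# Shuffle the index labels and keep the first 5: rows come back in random order.
shuffled = df.loc[rng.permutation(df.index.values)[:5]]

# Restore the original row order, e.g. before drawing a line plot.
restored = shuffled.sort_index()
```

Sorting by index only changes the order of the sampled rows; the set of selected rows stays the same.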

joris answered Sep 22 '22 01:09
