I have a DataFrame loaded from a .tsv file. I wanted to generate some exploratory plots. The problem is that the data set is large (~1 million rows), so there are too many points on the plot to see a trend. Plus, it is taking a while to plot.
I wanted to sub-sample 10,000 randomly selected rows. This should be reproducible, so that the same sequence of random numbers is generated in each run.
This question, Sample two pandas dataframes the same way, seems to be on the right track, but I cannot guarantee the subsample size.
pandas provides a sample() method for random sampling. The number of rows to extract can be specified in two ways: either as an exact number of random rows, or as a fraction of the rows.
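A minimal sketch, assuming pandas 0.16.1 or later (where DataFrame.sample was introduced); the file name data.tsv and the seed value are placeholders:

import pandas as pd

df = pd.read_csv("data.tsv", sep="\t")

# exact number of rows; random_state makes the draw reproducible across runs
subset = df.sample(n=10000, random_state=42)

# or a fraction of the rows (here 1%)
subset_frac = df.sample(frac=0.01, random_state=42)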
You can select random elements from the index with np.random.choice. E.g., to select 5 random rows:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10))
df.loc[np.random.choice(df.index, 5, replace=False)]  # 5 distinct rows, no replacement
This function is new in NumPy 1.7. If you want a solution for an older NumPy, you can shuffle the index and take the first elements of that:
df.loc[np.random.permutation(df.index)[:5]]  # shuffle the index labels, keep the first 5
In this way your DataFrame is not sorted anymore, but if that is needed for plotting (for example, a line plot), you can simply call .sort_index() afterwards.
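Since the question asks for a reproducible draw, here is a hedged sketch tying the pieces together: seeding NumPy's global RNG fixes the sequence of random numbers across runs. The file name data.tsv and the seed value are illustrative assumptions:

import numpy as np
import pandas as pd

np.random.seed(0)  # same sequence of random numbers on every run

df = pd.read_csv("data.tsv", sep="\t")
subset = df.loc[np.random.choice(df.index, 10000, replace=False)]
subset = subset.sort_index()  # restore the original row order before plotting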