Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.

I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.

I am able to load the data, and then select every 500th row.

My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?

Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.

Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):

VendorID    tpep_pickup_datetime    tpep_dropoff_datetime   passenger_count trip_distance   RatecodeID  store_and_fwd_flag  PULocationID    DOLocationID    payment_type    fare_amount extra   mta_tax tip_amount  tolls_amount    improvement_surcharge   total_amount
0   1   2017-01-09 11:13:28 2017-01-09 11:25:45 1   3.30    1   N   263 161 1   12.5    0.0 0.5 2.00    0.00    0.3 15.30
1   1   2017-01-09 11:32:27 2017-01-09 11:36:01 1   0.90    1   N   186 234 1   5.0 0.0 0.5 1.45    0.00    0.3 7.25
2   1   2017-01-09 11:38:20 2017-01-09 11:42:05 1   1.10    1   N   164 161 1   5.5 0.0 0.5 1.00    0.00    0.3 7.30
3   1   2017-01-09 11:52:13 2017-01-09 11:57:36 1   1.10    1   N   236 75  1   6.0 0.0 0.5 1.70    0.00    0.3 8.50
4   2   2017-01-01 00:00:00 2017-01-01 00:00:00 1   0.02    2   N   249 234 2   52.0    0.0 0.5 0.00    0.00    0.3 52.80

asked Dec 17 '18 09:12

Omar Hijazi

1 Answer

Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?

One option is the skiprows parameter of read_csv, which accepts a list-like of row numbers to skip; whatever is not skipped is what gets read. So you could create an np.arange covering every row number in the file and remove every 500th element from it using np.delete, so that only every 500th row is read. Note that n_rows must be an integer, since the arrays are used as indices:

import numpy as np
import pandas as pd

n_rows = 9_500_000  # approximate number of rows in the file; must be an int
skip = np.arange(n_rows)
# drop 0, 500, 1000, ... from the skip list so only those rows are read (row 0 is the header)
skip = np.delete(skip, np.arange(0, n_rows, 500))
df = pd.read_csv('my_file.csv', skiprows=skip)
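
As an alternative sketch, skiprows also accepts a callable that is evaluated against each 0-based row number and returns True for rows to skip, which avoids building a multi-million-element array and does not require knowing the row count up front. The helper name read_every_nth and the small in-memory CSV below are illustrative assumptions, not part of the original answer. Note that pandas still has to scan the file line by line either way; the skipped lines are simply not loaded into the DataFrame.

```python
import io

import pandas as pd


def read_every_nth(source, n):
    """Read the header plus every n-th line of a CSV (hypothetical helper)."""
    # skiprows is called with the 0-based line number; returning True skips it.
    # Line 0 is the header, so it is kept along with lines n, 2n, 3n, ...
    return pd.read_csv(source, skiprows=lambda i: i % n != 0)


# Small in-memory stand-in for the real file: a header plus 20 data rows
demo = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1, 21)))
sample = read_every_nth(demo, 5)
print(sample)
```

For the real file this would be called as read_every_nth('my_file.csv', 500).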

answered Sep 20 '22 00:09

yatu

Tags: pandas, dataframe, time-series
