I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using.pd.read_csv() or some other method), without having to read first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows) The first 4 rows are out of order, bu the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
To get the nth row in a Pandas DataFrame, we can use the iloc() method. For example, df. iloc[4] will return the 5th row because row numbers start from 0.
DataFrame. drop() method you can drop/remove/delete rows from DataFrame. axis param is used to specify what axis you would like to remove. By default axis = 0 meaning to remove rows.
pandas.DataFrame.head() In Python's Pandas module, the Dataframe class provides a head() function to fetch top rows from a Dataframe i.e. It returns the first n rows from a dataframe.
Can I immediately read every 500th element (using.pd.read_csv() or some other method), without having to read first and then filter my data?
Something you could do is to use the skiprows
parameter in read_csv
, which accepts a list-like argument to discard the rows of interest (and thus, also select). So you could create a np.arange
with a length equal to the amount of rows to read, and remove every 500th
element from it using np.delete
, so this way we'll only be reading every 500th row:
n_rows = 9.5e6
skip = np.arange(n_rows)
skip = np.delete(skip, np.arange(0, n_rows, 500))
df = pd.read_csv('my_file.csv', skiprows = skip)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With