I'm trying to replicate some of R4DS's dplyr exercises using Python's pandas, with the nycflights13.flights dataset. What I want to do is select, from that dataset:
the columns from year through day (inclusive);
all columns whose names end with "delay";
the distance and air_time columns.
In the book, Hadley uses the following syntax:
library("tidyverse")
library("nycflights13")
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
In pandas, I came up with the following "solution":
import pandas as pd
from nycflights13 import flights
flights_sml = pd.concat([
flights.loc[:, 'year':'day'],
flights.loc[:, flights.columns.str.endswith("delay")],
flights.distance,
flights.air_time,
], axis=1)
Another possible implementation:
flights_sml = flights.filter(regex='year|day|month|delay$|^distance$|^air_time$', axis=1)
But I'm sure this is not the idiomatic way to write such a DataFrame operation. I dug around, but haven't found anything in the pandas API that fits this situation.
You are correct that this is not idiomatic. The pd.concat version creates several intermediate DataFrames/Series and then concatenates them, which is a lot of extra work. Instead, you can build a single list of the column names you want and select them in one step.
For example (keeping the same column order):
cols = ['year', 'month', 'day'] + [col for col in flights.columns if col.endswith('delay')] + ['distance', 'air_time']
flights_sml = flights[cols]
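If you would rather not spell out the year-to-day range by hand, the same idea can be sketched with positional lookups on the column index. This is only an illustration, not an official pandas idiom: the helper name select_columns and its parameters are made up for this example, and a tiny toy frame stands in for nycflights13.flights so the snippet runs on its own. It assumes column names are unique, so that Index.get_loc returns an integer position.

```python
import pandas as pd

# Toy stand-in for nycflights13.flights; only the column names matter here.
flights = pd.DataFrame({
    "year": [2013, 2013],
    "month": [1, 1],
    "day": [1, 2],
    "dep_delay": [2.0, 4.0],
    "arr_delay": [11.0, 20.0],
    "carrier": ["UA", "AA"],
    "air_time": [227.0, 160.0],
    "distance": [1400, 1089],
})


def select_columns(df, first, last, suffix, extra):
    """Hypothetical helper: columns first..last (inclusive), then every
    column ending in `suffix`, then the names in `extra`, in that order."""
    columns = df.columns
    # get_loc returns an integer position when column names are unique.
    start = columns.get_loc(first)
    stop = columns.get_loc(last) + 1  # +1 makes the range inclusive
    span = list(columns[start:stop])
    suffixed = [name for name in columns if name.endswith(suffix)]
    # dict.fromkeys drops repeated names while keeping first-seen order.
    wanted = list(dict.fromkeys(span + suffixed + list(extra)))
    return df[wanted]


flights_sml = select_columns(flights, "year", "day", "delay", ["distance", "air_time"])
print(list(flights_sml.columns))
# ['year', 'month', 'day', 'dep_delay', 'arr_delay', 'distance', 'air_time']
```

Only column names are manipulated until the final df[wanted], so a single selection is performed on the data. Note that, unlike this approach, flights.filter(regex=...) returns the matching columns in the frame's original order rather than in the order the patterns are written.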