I have a DataFrame in which a column might have three kinds of values: integers (12331), integers as strings ('345'), or some other string ('text').
Is there a way to drop all rows with the last kind of string from the DataFrame, and convert the first kind of string into integers? Or at least, is there some way to ignore the rows that cause type errors when I'm summing the column?
This DataFrame comes from reading a pretty big CSV file (25 GB), so I'd like a solution that works when reading in chunks.
Pandas has some tools for converting these kinds of columns, but they may not suit your needs exactly. pd.to_numeric converts mixed columns like yours, but turns non-numeric strings into NaN. This means you'll get a float column, not an integer one, since only float columns can hold NaN values. That usually doesn't matter too much, but it's good to be aware of.
import pandas as pd

df = pd.DataFrame({'mixed_types': [12331, '345', 'text']})
pd.to_numeric(df['mixed_types'], errors='coerce')
Out[7]:
0    12331.0
1      345.0
2        NaN
Name: mixed_types, dtype: float64
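If you'd rather end up with integers instead of floats, a minimal sketch (assuming pandas 0.24+, which added the nullable 'Int64' dtype) is to cast the coerced result:
# Sketch: cast to pandas' nullable integer dtype ('Int64', capital I)
# so missing values are stored as <NA> instead of forcing the column to float.
# Assumes pandas >= 0.24.
pd.to_numeric(df['mixed_types'], errors='coerce').astype('Int64')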
If you want to then drop all the NaN rows:
# Replace the column with the converted values
df['mixed_types'] = pd.to_numeric(df['mixed_types'], errors='coerce')
# Drop NA values, listing the converted columns explicitly
# so NA values in other columns aren't dropped
df.dropna(subset=['mixed_types'])
Out[11]:
   mixed_types
0      12331.0
1        345.0
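Since the original data comes from a 25 GB CSV, the same idea also works chunk by chunk. Here's a minimal sketch (the file name 'big.csv', the column name 'mixed_types', and the chunk size are placeholder assumptions): coerce the column in each chunk and accumulate the sum, since sum() skips NaN by default.
import pandas as pd

# Sketch only: 'big.csv', 'mixed_types' and chunksize are placeholders.
total = 0
for chunk in pd.read_csv('big.csv', chunksize=1_000_000):
    # Coerce the mixed column; non-numeric strings become NaN
    values = pd.to_numeric(chunk['mixed_types'], errors='coerce')
    # Series.sum() skips NaN by default, so the 'text' rows are ignored
    total += values.sum()

print(total)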
Alternatively, you can use df._get_numeric_data() to select only the numeric-dtype columns directly, though note it is a private (underscore-prefixed) method and may change between pandas versions.