I have a dataframe like the following. Its 'id' column is unclean: it should be a numeric column, but it contains some non-numeric values.
    id, name
    1, A
    2, B
    3, C
    tt, D
    4, E
    5, F
    de, G
Is there a concise way to remove the rows whose id is not numeric (here tt and de), that is,
    tt, D
    de, G
to make the dataframe clean?
    id, name
    1, A
    2, B
    3, C
    4, E
    5, F
Use the pandas.DataFrame.drop() method to remove rows by index label; to remove rows that match a condition, pass it the index of the matching rows.
The dropna() method drops rows that contain NaN (Not a Number) or None values from a pandas DataFrame. By default it returns a copy of the DataFrame with those rows removed. If you want to modify the existing DataFrame instead, pass inplace=True.
The pandas drop() method removes the specified labels from rows or columns. It is widely used in data science and machine learning to clean datasets.
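As a minimal sketch of how drop() and dropna() could be applied to the data in the question: the names bad, dropped, coerced and cleaned are illustrative only, and pd.to_numeric with errors="coerce" is assumed here as the way to flag the non-numeric ids.

```python
from io import StringIO

import pandas as pd

data = """id,name
1,A
2,B
3,C
tt,D
4,E
5,F
de,G
"""
df = pd.read_csv(StringIO(data))

# True for rows whose id cannot be parsed as a number (illustrative name).
bad = pd.to_numeric(df["id"], errors="coerce").isna()

# Option 1: drop() takes index labels, so pass the labels of the bad rows.
dropped = df.drop(df[bad].index)

# Option 2: turn unparseable ids into NaN, then dropna() on that column
# and cast the surviving ids to integers.
coerced = df.assign(id=pd.to_numeric(df["id"], errors="coerce"))
cleaned = coerced.dropna(subset=["id"]).astype({"id": int})

print(dropped)
print(cleaned.dtypes)
```

Note that Option 1 leaves id as strings, while Option 2 ends up with an integer column.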
Using pd.to_numeric
    In [1079]: df[pd.to_numeric(df['id'], errors='coerce').notnull()]
    Out[1079]:
      id name
    0  1    A
    1  2    B
    2  3    C
    4  4    E
    5  5    F
You could use the standard string method isnumeric and apply it to each value in your id column:
    import pandas as pd
    from io import StringIO

    data = """
    id,name
    1,A
    2,B
    3,C
    tt,D
    4,E
    5,F
    de,G
    """

    df = pd.read_csv(StringIO(data))

    In [55]: df
    Out[55]:
       id name
    0   1    A
    1   2    B
    2   3    C
    3  tt    D
    4   4    E
    5   5    F
    6  de    G

    In [56]: df[df.id.apply(lambda x: x.isnumeric())]
    Out[56]:
      id name
    0  1    A
    1  2    B
    2  3    C
    4  4    E
    5  5    F
Or, if you want to use id as the index, you could do:
    In [61]: df[df.id.apply(lambda x: x.isnumeric())].set_index('id')
    Out[61]:
       name
    id
    1     A
    2     B
    3     C
    4     E
    5     F
Although the pd.to_numeric approach does not use the apply method, it is almost two times slower than applying str.isnumeric to string columns. I have also added an option using pandas str.isnumeric, which is less typing and still faster than pd.to_numeric. However, pd.to_numeric is more general because it works with any data type, not only strings.
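To illustrate the generality point, here is a small sketch on a hypothetical mixed column (the Series s and both mask names are made up for this example). It assumes the column may hold real integers, negative numbers and decimals alongside junk, which the question's data does not.

```python
import pandas as pd

# Hypothetical column: a digit string, a real int, a negative, a decimal, junk.
s = pd.Series(["1", 2, "-3", "4.5", "tt"], dtype=object)

# pd.to_numeric parses every numeric-looking value; the rest become NaN.
by_to_numeric = pd.to_numeric(s, errors="coerce").notnull()

# str.isnumeric only checks that every character is numeric, so signs and
# decimal points fail. A non-string value has no isnumeric method at all,
# hence the isinstance guard.
by_isnumeric = s.map(lambda x: isinstance(x, str) and x.isnumeric())

print(pd.DataFrame({"value": s,
                    "to_numeric": by_to_numeric,
                    "isnumeric": by_isnumeric}))
```

For ids that are always non-negative integer strings, as in the question, the two masks agree; they diverge only on inputs like the ones above.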
    In [3]: df_big = pd.concat([df]*10000)

    In [4]: df_big.shape
    Out[4]: (70000, 2)

    In [5]: %timeit df_big[df_big.id.apply(lambda x: x.isnumeric())]
    15.3 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [6]: %timeit df_big[df_big.id.str.isnumeric()]
    20.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [7]: %timeit df_big[pd.to_numeric(df_big['id'], errors='coerce').notnull()]
    29.9 ms ± 682 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
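The timings above rely on IPython's %timeit. For anyone who wants to rerun the comparison as a plain script, here is a rough sketch using the standard timeit module. The candidates, results and timings names are illustrative, ignore_index=True is added so the enlarged frame has a unique index, and the numbers will differ by machine and pandas version.

```python
import timeit
from io import StringIO

import pandas as pd

data = "id,name\n1,A\n2,B\n3,C\ntt,D\n4,E\n5,F\nde,G\n"
df = pd.read_csv(StringIO(data))
df_big = pd.concat([df] * 10000, ignore_index=True)

# The three filters compared above, wrapped as zero-argument callables.
candidates = {
    "apply + str.isnumeric": lambda: df_big[
        df_big["id"].apply(lambda x: x.isnumeric())
    ],
    ".str.isnumeric": lambda: df_big[df_big["id"].str.isnumeric()],
    "pd.to_numeric": lambda: df_big[
        pd.to_numeric(df_big["id"], errors="coerce").notnull()
    ],
}

# Run each filter once so the results can be compared for agreement.
results = {name: fn() for name, fn in candidates.items()}

# Best of 3 repeats, 5 calls per repeat, reported per call.
loops = 5
timings = {
    name: min(timeit.repeat(fn, number=loops, repeat=3)) / loops
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.1f} ms per loop")
```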