
Remove non-numeric rows in one column with pandas

Tags:

python

pandas

There is a dataframe like the following, and it has one unclean column 'id', which should be a numeric column:

id, name
1,  A
2,  B
3,  C
tt, D
4,  E
5,  F
de, G

Because tt and de are not numeric values, is there a concise way to remove the rows

tt,D
de,G

to make the dataframe clean?

id, name
1,  A
2,  B
3,  C
4,  E
5,  F

HungUnicorn asked Nov 27 '15 15:11



2 Answers

Using pd.to_numeric

In [1079]: df[pd.to_numeric(df['id'], errors='coerce').notnull()]
Out[1079]:
   id name
0   1    A
1   2    B
2   3    C
4   4    E
5   5    F
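
For readers who want to run this outside an IPython session, here is a minimal self-contained sketch of the same idea. It assumes the sample data from the question (without the extra spaces) and adds one step that is not in the answer above: converting the surviving ids to a numeric dtype. The names numeric_id and clean are only illustrative.

```python
from io import StringIO

import pandas as pd

# Sample data from the question; 'tt' and 'de' are the unclean ids.
data = """id,name
1,A
2,B
3,C
tt,D
4,E
5,F
de,G
"""
df = pd.read_csv(StringIO(data))

# errors='coerce' turns values that cannot be parsed into NaN instead of raising.
numeric_id = pd.to_numeric(df["id"], errors="coerce")

# Keep only the rows whose id parsed successfully.
clean = df[numeric_id.notna()].copy()

# The filtered ids are still strings at this point; convert them to numbers.
clean["id"] = pd.to_numeric(clean["id"])

print(clean)
print(clean.dtypes)
```

The filter alone leaves the id column with its original string dtype, so the final conversion is worth doing if the column is going to be used in arithmetic or joins.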

Zero answered Oct 11 '22 18:10


You could use the standard string method isnumeric and apply it to each value in your id column:

import pandas as pd
from io import StringIO

data = """
id,name
1,A
2,B
3,C
tt,D
4,E
5,F
de,G
"""

df = pd.read_csv(StringIO(data))

In [55]: df
Out[55]:
   id name
0   1    A
1   2    B
2   3    C
3  tt    D
4   4    E
5   5    F
6  de    G

In [56]: df[df.id.apply(lambda x: x.isnumeric())]
Out[56]:
  id name
0  1    A
1  2    B
2  3    C
4  4    E
5  5    F

Or, if you want to use id as the index, you could do:

In [61]: df[df.id.apply(lambda x: x.isnumeric())].set_index('id')
Out[61]:
   name
id
1     A
2     B
3     C
4     E
5     F
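
One caveat with the isnumeric approach is worth illustrating: str.isnumeric only checks that every character is a numeric character, so signed or decimal values are rejected, and calling it on a non-string value raises an error. The sketch below uses a made-up Series (the values are assumptions, not from the question) and adds an isinstance guard so the mixed types do not raise.

```python
import pandas as pd

# Hypothetical mixed column: strings plus one real integer.
ids = pd.Series(["7", "-3", "2.5", "abc", 10], dtype=object)

# str.isnumeric is False for "-3" and "2.5" because '-' and '.' are not
# numeric characters; the isinstance guard skips the integer 10, which
# has no isnumeric method at all.
mask_isnumeric = ids.map(lambda x: isinstance(x, str) and x.isnumeric())

# pd.to_numeric parses signed and decimal strings and accepts real numbers;
# only "abc" is coerced to NaN.
mask_to_numeric = pd.to_numeric(ids, errors="coerce").notna()

print(ids[mask_isnumeric].tolist())   # ['7']
print(ids[mask_to_numeric].tolist())  # ['7', '-3', '2.5', 10]
```

For ids that are guaranteed to be non-negative integers stored as strings, as in the question, both filters select the same rows.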

Edit. Add timings

Although the pd.to_numeric approach does not use apply, it is almost two times slower than applying str.isnumeric to a string column. I have also added a variant using pandas str.isnumeric, which is less typing and still faster than pd.to_numeric. However, pd.to_numeric is more general because it works with any data type, not only strings.

In [3]: df_big = pd.concat([df]*10000)

In [4]: df_big.shape
Out[4]: (70000, 2)

In [5]: %timeit df_big[df_big.id.apply(lambda x: x.isnumeric())]
15.3 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit df_big[df_big.id.str.isnumeric()]
20.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_big[pd.to_numeric(df_big['id'], errors='coerce').notnull()]
29.9 ms ± 682 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
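
The %timeit magic only works inside IPython. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The candidates dictionary and the use of ignore_index=True are assumptions made here for convenience, and the numbers will differ by machine and pandas version.

```python
import timeit
from io import StringIO

import pandas as pd

data = "id,name\n1,A\n2,B\n3,C\ntt,D\n4,E\n5,F\nde,G\n"
df = pd.read_csv(StringIO(data))

# 10000 copies of the 7-row frame; ignore_index gives a unique index
# (an assumption of this sketch, the original session kept repeated labels).
df_big = pd.concat([df] * 10000, ignore_index=True)

# The three filters being compared, wrapped as zero-argument callables.
candidates = {
    "apply + str.isnumeric": lambda: df_big[df_big["id"].apply(lambda x: x.isnumeric())],
    "Series.str.isnumeric": lambda: df_big[df_big["id"].str.isnumeric()],
    "pd.to_numeric": lambda: df_big[pd.to_numeric(df_big["id"], errors="coerce").notna()],
}

# Sanity check first: every approach should select exactly the same rows.
results = {name: fn() for name, fn in candidates.items()}
reference = results["pd.to_numeric"]
for name, frame in results.items():
    assert frame.equals(reference), name

# Average wall-clock time over 10 calls per approach.
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{name}: {seconds * 1e3:.1f} ms per call")
```

Checking that the outputs match before timing helps make sure the comparison is between equivalent filters and not between approaches that silently keep different rows.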

Anton Protopopov answered Oct 11 '22 17:10