Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Pandas filter string data based on its string length

I like to filter out data whose string length is not equal to 10.

If I try to filter out any row whose column A's or B's string length is not equal to 10, I tried this.

df=pd.read_csv('filex.csv') df.A=df.A.apply(lambda x: x if len(x)== 10 else np.nan) df.B=df.B.apply(lambda x: x if len(x)== 10 else np.nan) df=df.dropna(subset=['A','B'], how='any') 

This works slow, but is working.

However, it sometimes produce error when the data in A is not a string but a number (interpreted as a number when read_csv read the input file).

  File "<stdin>", line 1, in <lambda> TypeError: object of type 'float' has no len() 

I believe there should be more efficient and elegant code instead of this.


Based on the answers and comments below, the simplest solution I found are:

df=df[df.A.apply(lambda x: len(str(x))==10] df=df[df.B.apply(lambda x: len(str(x))==10] 

or

df=df[(df.A.apply(lambda x: len(str(x))==10) & (df.B.apply(lambda x: len(str(x))==10)] 

or

df=df[(df.A.astype(str).str.len()==10) & (df.B.astype(str).str.len()==10)] 
like image 232
notilas Avatar asked Nov 12 '13 18:11

notilas


People also ask

How do you filter the length of a string?

To filter strings of an Array based on length in JavaScript, call Array. filter() method on this String Array, and pass a function as argument that returns true for the specific condition on string length or false otherwise.

How do pandas use their length?

len() method is used to determine length of each string in a Pandas series. This method is only for series of strings. Since this is a string method, . str has to be prefixed everytime before calling this method.

How do you find the length of a string in a DataFrame in Python?

To find the length of strings in a data frame you have the len method on the dataframes str property. But to do this you need to call this method on the column that contains the string data.


1 Answers

import pandas as pd  df = pd.read_csv('filex.csv') df['A'] = df['A'].astype('str') df['B'] = df['B'].astype('str') mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10) df = df.loc[mask] print(df) 

Applied to filex.csv:

A,B 123,abc 1234,abcd 1234567890,abcdefghij 

the code above prints

            A           B 2  1234567890  abcdefghij 
like image 62
unutbu Avatar answered Oct 03 '22 20:10

unutbu