 

Python: Check if a dataframe column contains string type

I want to check whether the columns in a dataframe consist of strings so that I can label them with numbers for machine learning purposes. Some columns consist of numbers, and I don't want to change those. Example columns are shown below:

TRAIN FEATURES
  Age              Level  
  32.0              Silver      
  61.0              Silver  
  66.0              Silver      
  36.0              Gold      
  20.0              Silver     
  29.0              Silver     
  46.0              Silver  
  27.0              Silver      

Thank you=)

s900n asked Mar 27 '17 14:03




4 Answers

Note that answers which only exclude numeric dtypes will also include datetime, timestamp, category and other dtypes.

Filtering on object is more restrictive (although I am not sure whether other kinds of data also end up with object dtype):

First, create the dataframe:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'a': ['a','b','c','d'], 
        'b': [1, 'b', 'c', 2], 
        'c': [np.nan, 2, 3, 4], 
        'd': ['A', 'B', 'B', 'A'], 
        'e': pd.to_datetime('today')})
    df['d'] = df['d'].astype('category')
    

That will look like this:

       a  b    c  d          e
    0  a  1  NaN  A 2018-05-17
    1  b  b  2.0  B 2018-05-17
    2  c  c  3.0  B 2018-05-17
    3  d  2  4.0  A 2018-05-17

  1. You can check the types by calling dtypes:

    df.dtypes
    
    a            object
    b            object
    c           float64
    d          category
    e    datetime64[ns]
    dtype: object
    
  2. You can list the string columns using the items() method and filtering by object:

    >>> [col for col, dt in df.dtypes.items() if dt == object]
    ['a', 'b']
    
  3. Or you can use select_dtypes to display a dataframe with only the string columns:

    df.select_dtypes(include=[object])
       a  b
    0  a  1
    1  b  b
    2  c  c
    3  d  2
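
To make the difference between the two filters concrete, here is a minimal sketch on a similar frame. The names `only_object` and `non_numeric` are made up for illustration, a fixed timestamp replaces 'today', and dtype=object is set explicitly on the string column because I am assuming that how pandas infers string columns may vary between versions.

```python
import numpy as np
import pandas as pd

# Illustrative frame; dtype=object is explicit so the result does not depend
# on how a given pandas version infers string columns (an assumption).
df = pd.DataFrame({
    'a': pd.Series(['a', 'b', 'c', 'd'], dtype=object),
    'c': [np.nan, 2, 3, 4],
    'd': pd.Series(['A', 'B', 'B', 'A'], dtype='category'),
    'e': pd.Timestamp('2018-05-17'),
})

# Keep only the object columns.
only_object = list(df.select_dtypes(include=[object]).columns)

# Drop the numeric columns; category and datetime columns survive this filter.
non_numeric = list(df.select_dtypes(exclude=[np.number]).columns)

print(only_object)   # ['a']
print(non_numeric)   # ['a', 'd', 'e']
```

So excluding numbers is the broader filter, while including object is the narrower one.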
    
toto_tico answered Oct 17 '22 15:10


Yes, it's possible. Check the column's dtype:

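import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['a','b','c','d']})
# Comparing a dtype to np.number with != does not reliably detect numeric
# columns (e.g. integer ones); is_numeric_dtype does.
if not pd.api.types.is_numeric_dtype(df['a']):
    print('yes')  # the column is not numeric
else:
    print('no')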

You can also select your columns by dtype using select_dtypes:

df_subset = df.select_dtypes(exclude=[np.number])
# Now you can label encode df_subset
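
As a rough sketch of the label-encoding step, the snippet below uses pd.factorize rather than any particular encoder class; that choice is mine, not something the question prescribes. The sample data is modelled on the question's Age and Level columns.

```python
import pandas as pd

# Sample data modelled on the question's Age/Level columns.
df = pd.DataFrame({
    'Age': [32.0, 61.0, 66.0, 36.0],
    'Level': ['Silver', 'Silver', 'Silver', 'Gold'],
})

# Replace every non-numeric column with integer codes.
# pd.factorize numbers the distinct values in order of first appearance.
for col in df.select_dtypes(exclude=['number']).columns:
    codes, uniques = pd.factorize(df[col])
    df[col] = codes

print(df['Level'].tolist())  # [0, 0, 0, 1]
```

The numeric Age column is left untouched. Keep `uniques` around if you need to map the codes back to the original labels later.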
Scratch'N'Purr answered Oct 17 '22 16:10


I use a two-step approach: first I check whether dtype == object, and if so, I look at the first row of data to see whether that column's value is a string.

c = 'my_column_name'
# Check the dtype first, then inspect the first value of the column
if df[c].dtype == object and isinstance(df[c].iloc[0], str):
    pass  # do something
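
One caveat: looking only at the first row can be fooled by a mixed column. If that matters, a possible alternative is pd.api.types.infer_dtype, which looks at all values. The sketch below uses two made-up series, `mixed` and `strings`, to show the difference.

```python
import pandas as pd

# Two illustrative object columns: one mixed, one holding only strings.
mixed = pd.Series(['a', 1, 2, 'd'], dtype=object)
strings = pd.Series(['a', 'b', 'c', 'd'], dtype=object)

# The first-row check cannot tell the two apart.
first_row_mixed = isinstance(mixed.iloc[0], str)      # True
first_row_strings = isinstance(strings.iloc[0], str)  # True

# infer_dtype inspects all values; it reports 'string' only when every
# non-missing value is a str.
all_str_mixed = pd.api.types.infer_dtype(mixed, skipna=True) == 'string'
all_str_strings = pd.api.types.infer_dtype(strings, skipna=True) == 'string'

print(first_row_mixed, first_row_strings)  # True True
print(all_str_mixed, all_str_strings)      # False True
```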
hamx0r answered Oct 17 '22 14:10


Four years after this question was asked, I believe there is still no definitive answer.

I don't think strings were ever treated as first-class citizens in pandas (even in >= 1.0.0). As an example:

import pandas as pd
import datetime

df = pd.DataFrame({
    'str': ['a', 'b', 'c', None],
    'hete': [1, 2.0, datetime.datetime.utcnow(), None]
})

string_series = df['str']
print(string_series.dtype)
print(pd.api.types.is_string_dtype(string_series.dtype))

heterogenous_series = df['hete']
print(heterogenous_series.dtype)
print(pd.api.types.is_string_dtype(heterogenous_series.dtype))

prints

object
True
object
True

So although hete does not contain a single string, it is reported as a string series.

After reading the documentation, I think the only way to make sure a series contains only strings is:

def is_string_series(s: pd.Series) -> bool:
    if isinstance(s.dtype, pd.StringDtype):
        # The series was explicitly created as a string series (Pandas>=1.0.0)
        return True
    elif s.dtype == 'object':
        # Object series, check each value
        return all((v is None) or isinstance(v, str) for v in s)
    else:
        return False
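
For what it's worth, here is how a check like this might be applied across a whole dataframe. `is_string_column` and `_is_missing` are hypothetical names for a variant that also tolerates NaN and pd.NA, which the function above does not; treat it as a sketch rather than a drop-in replacement.

```python
import math

import numpy as np
import pandas as pd


def _is_missing(v) -> bool:
    # Hypothetical helper: None, pd.NA and float NaN count as missing.
    return v is None or v is pd.NA or (isinstance(v, float) and math.isnan(v))


def is_string_column(s: pd.Series) -> bool:
    # Variant of the per-value check that also accepts NaN/pd.NA (an assumption).
    if isinstance(s.dtype, pd.StringDtype):
        return True
    if s.dtype == object:
        return all(isinstance(v, str) or _is_missing(v) for v in s)
    return False


df = pd.DataFrame({
    'name': pd.Series(['a', 'b', None, np.nan], dtype=object),
    'mixed': pd.Series([1, 'b', 'c', 2], dtype=object),
    'age': [32.0, 61.0, 66.0, 36.0],
    'level': pd.Series(['Silver', 'Gold', 'Silver', 'Gold'], dtype='string'),
})

string_cols = [c for c in df.columns if is_string_column(df[c])]
print(string_cols)  # ['name', 'level']
```

Note that this walks every value of each object column, so it is slower than a dtype comparison on large frames.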
vc 74 answered Oct 17 '22 16:10