I want to check whether columns in a dataframe consist of strings so I can label them with numbers for machine learning purposes. Some columns consist of numbers, and I don't want to change those. Example columns are shown below:
TRAIN FEATURES
Age Level
32.0 Silver
61.0 Silver
66.0 Silver
36.0 Gold
20.0 Silver
29.0 Silver
46.0 Silver
27.0 Silver
Thank you=)
Python / October 2, 2020. You can use the following syntax to check the data type of all columns in a pandas DataFrame: df.dtypes. Alternatively, you can use the syntax below to check the data type of a particular column: df['DataFrame Column'].dtypes.
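As a minimal sketch of the two calls described above, applied to a few rows from the question's example table (the DataFrame here is rebuilt by hand for illustration, so the exact values are an assumption):

```python
import pandas as pd

# A few rows copied from the question's example table.
df = pd.DataFrame({
    "Age": [32.0, 61.0, 66.0, 36.0],
    "Level": ["Silver", "Silver", "Silver", "Gold"],
})

# Data type of every column at once.
print(df.dtypes)

# Data type of a single column.
age_dtype = df["Age"].dtype
print(age_dtype)

# A version-independent way to ask "is this column numeric?".
print(pd.api.types.is_numeric_dtype(df["Level"]))
```

How the Level column's dtype is reported depends on the pandas version, which is why the last line asks whether the column is numeric rather than comparing against a specific dtype.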
Check Whether a Column Contains a Value in a DataFrame. Use the in operator on a column's values to check whether a string value exists in a pandas DataFrame column (applied directly to a Series, in tests the index labels, not the values). df['Courses'] returns a Series object with all values from the column Courses, and pandas.Series.unique returns the unique values of that Series in order of appearance.
In this tutorial, we will look at how to search for a string (or a substring) in a pandas dataframe column with the help of some examples. How to check if a pandas series contains a string? You can use the pandas.Series.str.contains() function to search for the presence of a string in a pandas series (or column of a dataframe).
Note that the above answers will also include DateTime, TimeStamp, Category and other datatypes. Using object is more restrictive (although I am not sure if other types would also be of object dtype):
Create the dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ['a', 'b', 'c', 'd'],
    'b': [1, 'b', 'c', 2],
    'c': [np.nan, 2, 3, 4],
    'd': ['A', 'B', 'B', 'A'],
    'e': pd.to_datetime('today')})
df['d'] = df['d'].astype('category')
That will look like this:
a b c d e
0 a 1 NaN A 2018-05-17
1 b b 2.0 B 2018-05-17
2 c c 3.0 B 2018-05-17
3 d 2 4.0 A 2018-05-17
You can check the types by calling dtypes:
df.dtypes
a object
b object
c float64
d category
e datetime64[ns]
dtype: object
You can list the string columns using the items() method and filtering by object:
> [ col for col, dt in df.dtypes.items() if dt == object]
['a', 'b']
Or you can use select_dtypes to display a dataframe with only the object columns (note that column b is included even though it mixes strings and integers):
df.select_dtypes(include=[object])
a b
0 a 1
1 b b
2 c c
3 d 2
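Since an object column can hold mixed values (as column b does), one way to tell the two cases apart is to look at the values themselves. The following is only a sketch using pandas.api.types.infer_dtype on a reduced version of the dataframe above; the names inferred and string_cols are made up for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "b", "c", "d"],
    "b": [1, "b", "c", 2],
    "c": [np.nan, 2, 3, 4],
})

# infer_dtype inspects the actual values rather than only the declared dtype.
inferred = {
    col: pd.api.types.infer_dtype(df[col], skipna=True)
    for col in df.columns
}
print(inferred)

# Keep only the columns whose values are all strings.
string_cols = [col for col, kind in inferred.items() if kind == "string"]
print(string_cols)
```

Filtering on the inferred kind keeps column a and leaves out the mixed column b, which a plain dtype check cannot do.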
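Yes, it's possible. You can use dtype:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd']})
# Comparing a dtype to np.number with != does not reliably detect numeric
# columns, so use the pandas helper instead.
if not pd.api.types.is_numeric_dtype(df['a']):
    print('yes')
else:
    print('no')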
You can also select your columns by dtype using select_dtypes:
df_subset = df.select_dtypes(exclude=[np.number])
# Now you can label encode df_subset
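The label-encoding step itself is not shown in this answer. One possible sketch, assuming pandas.factorize is an acceptable encoder and using a few rows from the question's table (encoded and mappings are hypothetical names introduced here):

```python
import numpy as np
import pandas as pd

# A few rows copied from the question's example table.
df = pd.DataFrame({
    "Age": [32.0, 61.0, 66.0, 36.0, 20.0],
    "Level": ["Silver", "Silver", "Silver", "Gold", "Silver"],
})

encoded = df.copy()
mappings = {}  # hypothetical helper: column name -> labels in code order

# Only the non-numeric columns are touched; numeric columns stay as they are.
for col in df.select_dtypes(exclude=[np.number]).columns:
    codes, uniques = pd.factorize(df[col])
    encoded[col] = codes
    mappings[col] = list(uniques)

print(encoded)
print(mappings)
```

Keeping the mappings makes it possible to translate the codes back to the original labels later. Excluding numeric dtypes also selects datetime and category columns, so on a wider dataframe the column list may need further filtering.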
I use a two-step approach: first I check whether dtype == object, and if so, I look at the first row of data to see whether that column's value is a string or not.
c = 'my_column_name'
if df[c].dtype == object and isinstance(df.iloc[0][c], str):
    pass  # do something
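Checking only the first row can be misleading when a column mixes types. As a rough sketch (the two helper functions and the sample dataframe are made up for this example), the first-row check can be compared with a check over every non-missing value:

```python
import pandas as pd

df = pd.DataFrame({
    "mixed": ["x", 1, 2.5],
    "text": ["x", "y", None],
})

def first_value_is_str(frame, col):
    # The check from the answer above: look at the first row only.
    return isinstance(frame[col].iloc[0], str)

def all_values_are_str(frame, col):
    # Stricter variant: every non-missing value must be a str.
    values = frame[col].dropna()
    return bool(values.map(lambda v: isinstance(v, str)).all())

for col in df.columns:
    print(col, first_value_is_str(df, col), all_values_are_str(df, col))
```

The first-row check accepts the mixed column because its first value happens to be a string, while the stricter check rejects it at the cost of scanning the whole column.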
4 years since the creation of this question and I believe there's still not a definitive answer.
I don't think strings were ever considered as a first class citizen in Pandas (even >= 1.0.0). As an example:
import pandas as pd
import datetime
df = pd.DataFrame({
'str': ['a', 'b', 'c', None],
'hete': [1, 2.0, datetime.datetime.utcnow(), None]
})
string_series = df['str']
print(string_series.dtype)
print(pd.api.types.is_string_dtype(string_series.dtype))
heterogenous_series = df['hete']
print(heterogenous_series.dtype)
print(pd.api.types.is_string_dtype(heterogenous_series.dtype))
prints
object
True
object
True
so although hete does not contain any explicit strings, it is considered a string series.
After reading the documentation, I think the only way to make sure a series contains only strings is:
def is_string_series(s: pd.Series):
    if isinstance(s.dtype, pd.StringDtype):
        # The series was explicitly created as a string series (Pandas>=1.0.0)
        return True
    elif s.dtype == 'object':
        # Object series, check each value
        return all((v is None) or isinstance(v, str) for v in s)
    else:
        return False
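To tie this back to the original question, here is a short usage sketch. It repeats the function so the block runs on its own, and the num column, the fixed date and the string_columns name are additions made for illustration:

```python
import datetime
import pandas as pd

def is_string_series(s: pd.Series) -> bool:
    # Same logic as the function above, repeated so this block is self-contained.
    if isinstance(s.dtype, pd.StringDtype):
        return True
    if s.dtype == "object":
        return all((v is None) or isinstance(v, str) for v in s)
    return False

df = pd.DataFrame({
    "str": ["a", "b", "c", None],
    "hete": [1, 2.0, datetime.datetime(2020, 1, 1), None],
    "num": [1.0, 2.0, 3.0, 4.0],  # extra numeric column, added for illustration
})

# Collect the columns that contain only strings (or missing values).
string_columns = [col for col in df.columns if is_string_series(df[col])]
print(string_columns)
```

Only the columns collected this way would then be passed on to a label encoder, leaving numeric and heterogeneous columns unchanged.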