
pandas dataframe convert column type to string or categorical

How do I convert a single column of a pandas dataframe to type string? In the df of housing data below I need to convert zipcode to string so that when I run linear regression, zipcode is treated as categorical and not numeric. Thanks!

df = pd.DataFrame({
    'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
    'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
    'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603},
    'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4},
    'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180},
    'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0},
})
print(df)

       bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109
jklaus asked Aug 23 '16 03:08





2 Answers

You need astype:

df['zipcode'] = df.zipcode.astype(str)
# alternatively: df.zipcode = df.zipcode.astype(str)

For converting to categorical:

df['zipcode'] = df.zipcode.astype('category')
# alternatively: df.zipcode = df.zipcode.astype('category')

Another solution is Categorical:

df['zipcode'] = pd.Categorical(df.zipcode) 
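As a rough check that the two categorical conversions above are interchangeable, the sketch below applies both to the question's zipcodes and inspects the result. The variable names (`zipcodes`, `via_astype`, `via_constructor`) are made up for illustration.

```python
import pandas as pd

# The five zipcodes from the question, in their original row order.
zipcodes = pd.Series([98125, 98107, 98005, 98109, 98155])

# Route 1: cast the existing Series to the category dtype.
via_astype = zipcodes.astype("category")

# Route 2: build a Categorical and wrap it in a Series.
via_constructor = pd.Series(pd.Categorical(zipcodes))

# Both routes infer the same (sorted, unordered) set of categories.
print(via_astype.dtype == via_constructor.dtype)  # True

# Categories are the distinct values; codes are each row's position in them.
print(list(via_astype.cat.categories))  # [98005, 98107, 98109, 98125, 98155]
print(list(via_astype.cat.codes))       # [3, 1, 0, 2, 4]
```

The values are stored once as categories and each row only holds an integer code, which is why the printed DataFrame looks unchanged while the dtype differs.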

Sample with data:

import pandas as pd

df = pd.DataFrame({
    'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
    'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
    'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603},
    'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4},
    'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180},
    'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0},
})

print(df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109

print(df.dtypes)
bathrooms      float64
bedrooms         int64
floors         float64
sqft_living      int64
sqft_lot         int64
zipcode          int64
dtype: object

df['zipcode'] = df.zipcode.astype('category')

print(df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot zipcode
722         3.25         4     2.0         4670     51836   98005
2680        0.75         2     1.0         1440      3700   98107
14554       2.50         4     2.0         3180      9603   98155
17384       1.50         2     3.0         1430      1650   98125
18754       1.00         2     1.0         1130      2640   98109

print(df.dtypes)
bathrooms       float64
bedrooms          int64
floors          float64
sqft_living       int64
sqft_lot          int64
zipcode        category
dtype: object
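Since the goal in the question is a linear regression that treats zipcode as categorical, here is a minimal sketch of one common way to get there: expanding the column into indicator (dummy) columns with `pd.get_dummies`. This is an assumption about the downstream workflow, not something the question specifies; some regression libraries handle categorical columns themselves. The reduced two-column frame and the name `X` are for illustration only.

```python
import pandas as pd

# A reduced version of the question's data: one categorical, one numeric feature.
df = pd.DataFrame({
    "zipcode": [98125, 98107, 98005, 98109, 98155],
    "sqft_living": [1430, 1440, 4670, 1130, 3180],
})
df["zipcode"] = df["zipcode"].astype("category")

# One indicator column per zipcode. drop_first=True drops the first category
# (98005) so the indicators are not perfectly collinear with an intercept term.
X = pd.get_dummies(df, columns=["zipcode"], drop_first=True)

print(X.columns.tolist())
# ['sqft_living', 'zipcode_98107', 'zipcode_98109', 'zipcode_98125', 'zipcode_98155']
```

Each remaining zipcode gets its own coefficient in the fitted model, measured relative to the dropped baseline category.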
jezrael answered Sep 19 '22 04:09



With pandas >= 1.0 there is now a dedicated string datatype:

1) You can convert your column to this pandas string datatype using .astype('string'):

df['zipcode'] = df['zipcode'].astype('string') 

2) This is different from using str, which sets the pandas object datatype:

df['zipcode'] = df['zipcode'].astype(str) 

3) For changing into categorical datatype use:

df['zipcode'] = df['zipcode'].astype('category') 

You can see this difference in datatypes when you look at the info of the dataframe:

import pandas as pd

df = pd.DataFrame({
    'zipcode_str': [90210, 90211],
    'zipcode_string': [90210, 90211],
    'zipcode_category': [90210, 90211],
})

df['zipcode_str'] = df['zipcode_str'].astype(str)
df['zipcode_string'] = df['zipcode_string'].astype('string')
df['zipcode_category'] = df['zipcode_category'].astype('category')

df.info()

# you can see that the first column has dtype object
# while the second column has the new dtype string
# the third column has dtype category

 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   zipcode_str       2 non-null      object
 1   zipcode_string    2 non-null      string
 2   zipcode_category  2 non-null      category
dtypes: category(1), object(1), string(1)

From the docs:

The 'string' extension type solves several issues with object-dtype NumPy arrays:

  1. You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.

  2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text, but still object-dtype columns.

  3. When reading code, the contents of an object dtype array is less clear than string.
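Point 2 can be illustrated with a small sketch. It assumes a pandas version with the 'string' dtype (1.0 or later); the column names and the deliberately mixed `mixed` column are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    # Text column stored with the dedicated string dtype.
    "zipcode": pd.Series([90210, 90211]).astype("string"),
    # An object column that happens to hold a number and a string.
    "mixed": pd.Series([90210, "unknown"], dtype=object),
    # A plain numeric column.
    "bedrooms": [2, 4],
})

# Selecting on 'string' returns only the column that is declared as text.
text_only = df.select_dtypes(include="string")
print(text_only.columns.tolist())  # ['zipcode']

# Selecting on 'object' picks up the mixed column, whatever it contains.
object_cols = df.select_dtypes(include="object")
print("mixed" in object_cols.columns)  # True
```

With only object columns there would be no way to tell the text column apart from the mixed one by dtype alone.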

More info on working with the new string datatype can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

Sander van den Oord answered Sep 19 '22 04:09
