pandas dataframe convert column type to string or categorical

How do I convert a single column of a pandas DataFrame to string type? In the df of housing data below, I need to convert zipcode to string so that when I run a linear regression, zipcode is treated as categorical rather than numeric. Thanks!

df = pd.DataFrame({
    'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
    'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
    'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603},
    'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4},
    'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180},
    'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0},
})
print(df)

       bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109

562

asked Aug 23 '16 03:08

jklaus

2 Answers

You need astype:

df['zipcode'] = df.zipcode.astype(str)
# or: df.zipcode = df.zipcode.astype(str)

For converting to categorical:

df['zipcode'] = df.zipcode.astype('category')
# or: df.zipcode = df.zipcode.astype('category')

Another solution is Categorical:

df['zipcode'] = pd.Categorical(df.zipcode)
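To see what each conversion actually stores, here is a minimal sketch using three of the zipcodes from the question. The names `zips`, `as_str`, and `as_cat` are illustrative and not part of the original answer.

```python
import pandas as pd

# Three of the zipcodes from the question, stored as integers.
zips = pd.Series([98125, 98107, 98005], name="zipcode")

# astype(str) turns each value into a Python string.
as_str = zips.astype(str)

# astype('category') keeps each distinct value once (the categories)
# and stores a small integer code per row that points into them.
as_cat = zips.astype("category")

print(as_str.tolist())                 # ['98125', '98107', '98005']
print(as_cat.cat.categories.tolist())  # [98005, 98107, 98125]
print(as_cat.cat.codes.tolist())       # [2, 1, 0]
```

The categories are sorted, so the codes reflect position in that sorted list rather than the order of appearance.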

Sample with data:

import pandas as pd

df = pd.DataFrame({
    'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
    'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
    'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603},
    'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4},
    'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180},
    'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0},
})

print(df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109

print(df.dtypes)
bathrooms      float64
bedrooms         int64
floors         float64
sqft_living      int64
sqft_lot         int64
zipcode          int64
dtype: object

df['zipcode'] = df.zipcode.astype('category')

print(df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot zipcode
722         3.25         4     2.0         4670     51836   98005
2680        0.75         2     1.0         1440      3700   98107
14554       2.50         4     2.0         3180      9603   98155
17384       1.50         2     3.0         1430      1650   98125
18754       1.00         2     1.0         1130      2640   98109

print(df.dtypes)
bathrooms       float64
bedrooms          int64
floors          float64
sqft_living       int64
sqft_lot          int64
zipcode        category
dtype: object
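The question mentions linear regression. Many regression tools expect numeric inputs, so one common follow-up (an assumption here, not something the answer above prescribes) is to expand the categorical zipcode into indicator columns with `pd.get_dummies`. The sketch below uses only two columns of the sample data and a hypothetical variable name `X`.

```python
import pandas as pd

# Two columns from the question's sample data (original index omitted).
df = pd.DataFrame({
    "zipcode": [98125, 98107, 98005, 98109, 98155],
    "sqft_living": [1430, 1440, 4670, 1130, 3180],
})

df["zipcode"] = df["zipcode"].astype("category")

# One 0/1 indicator column per zipcode. drop_first=True drops the first
# category (98005) so the indicators are not perfectly collinear with a
# model intercept.
X = pd.get_dummies(df, columns=["zipcode"], drop_first=True, dtype=int)

print(X.columns.tolist())
# ['sqft_living', 'zipcode_98107', 'zipcode_98109', 'zipcode_98125', 'zipcode_98155']
```

Whether to drop a level depends on the model being fitted; with `drop_first=False` every zipcode gets its own column.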

157

answered Sep 19 '22 04:09

jezrael

With pandas >= 1.0 there is now a dedicated string datatype:

1) You can convert your column to this pandas string datatype using .astype('string'):

df['zipcode'] = df['zipcode'].astype('string')

2) This is different from using str, which sets the pandas object datatype:

df['zipcode'] = df['zipcode'].astype(str)

3) For changing into categorical datatype use:

df['zipcode'] = df['zipcode'].astype('category')

You can see this difference in datatypes when you look at the info of the dataframe:

import pandas as pd

df = pd.DataFrame({
    'zipcode_str': [90210, 90211],
    'zipcode_string': [90210, 90211],
    'zipcode_category': [90210, 90211],
})

df['zipcode_str'] = df['zipcode_str'].astype(str)
df['zipcode_string'] = df['zipcode_string'].astype('string')
df['zipcode_category'] = df['zipcode_category'].astype('category')

df.info()

# you can see that the first column has dtype object
# while the second column has the new dtype string
# the third column has dtype category

 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   zipcode_str       2 non-null      object
 1   zipcode_string    2 non-null      string
 2   zipcode_category  2 non-null      category
dtypes: category(1), object(1), string(1)

From the docs:

The 'string' extension type solves several issues with object-dtype NumPy arrays:

You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.

object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text, but still object-dtype columns.

When reading code, the contents of an object dtype array is less clear than string.
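The first point in the quoted list can be illustrated with a small sketch. The series names `mixed` and `as_string` are hypothetical, and the example assumes pandas >= 1.0 so that the 'string' dtype is available.

```python
import pandas as pd

# An object-dtype column silently accepts a mix of strings and non-strings.
mixed = pd.Series(["98125", 98107], dtype=object)
print(mixed.map(type).tolist())   # [<class 'str'>, <class 'int'>]

# Converting to the dedicated string dtype makes every value a string.
as_string = mixed.astype("string")
print(as_string.dtype)            # string
print(as_string.tolist())         # ['98125', '98107']
```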

More info on working with the new string datatype can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

answered Sep 19 '22 04:09

Sander van den Oord

Tags: type-conversion, pandas, dataframe, categorical-data
