How do I convert a single column of a pandas dataframe to type string? In the df of housing data below I need to convert zipcode to string so that when I run linear regression, zipcode is treated as categorical and not numeric. Thanks!
df = pd.DataFrame({'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
                   'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
                   'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603},
                   'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4},
                   'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180},
                   'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0}})
print(df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109
The astype() method casts a pandas object to a specified dtype. It can also convert any suitable existing column to a categorical type. DataFrame.astype() comes in handy when we want to cast a particular column to another data type.
If you want to change the data type of all columns in the DataFrame to string, you can use df.applymap(str) or df.
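As a minimal sketch of converting every column at once, the snippet below uses DataFrame.astype(str) rather than applymap, since applymap is deprecated in recent pandas releases. The two-row frame and the name df_str are made up for illustration.

```python
import pandas as pd

# Small illustrative frame (values are placeholders, not the question's full data)
df = pd.DataFrame({"zipcode": [98125, 98107], "bedrooms": [2, 4]})

# Cast every column to string in one call
df_str = df.astype(str)

# Check that every cell is now a Python str
all_values_are_str = all(isinstance(v, str) for v in df_str.to_numpy().ravel())
print(all_values_are_str)
```

The original df is left unchanged here; assign the result back to df if you want to overwrite it.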
You need astype:
df['zipcode'] = df.zipcode.astype(str)
#df.zipcode = df.zipcode.astype(str)
For converting to categorical:
df['zipcode'] = df.zipcode.astype('category')
#df.zipcode = df.zipcode.astype('category')
Another solution is Categorical:
df['zipcode'] = pd.Categorical(df.zipcode)
Sample with data:
import pandas as pd

df = pd.DataFrame({'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
                   'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
                   'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603},
                   'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4},
                   'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180},
                   'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0}})

print(df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109

print(df.dtypes)
bathrooms      float64
bedrooms         int64
floors         float64
sqft_living      int64
sqft_lot         int64
zipcode          int64
dtype: object

df['zipcode'] = df.zipcode.astype('category')

print(df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot zipcode
722         3.25         4     2.0         4670     51836   98005
2680        0.75         2     1.0         1440      3700   98107
14554       2.50         4     2.0         3180      9603   98155
17384       1.50         2     3.0         1430      1650   98125
18754       1.00         2     1.0         1130      2640   98109

print(df.dtypes)
bathrooms       float64
bedrooms          int64
floors          float64
sqft_living       int64
sqft_lot          int64
zipcode        category
dtype: object
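Since the goal in the question is a regression that treats zipcode as categorical, here is a hedged sketch of one common follow-up step: expanding the categorical column into dummy (indicator) columns with pd.get_dummies. Whether you need this step depends on the modelling library; some formula-based interfaces encode categoricals for you, while others expect a purely numeric matrix. The four-row frame and the name X are assumptions for illustration.

```python
import pandas as pd

# Illustrative data only; 98125 appears twice to show a repeated category
df = pd.DataFrame({"zipcode": [98125, 98107, 98005, 98125],
                   "bedrooms": [2, 2, 4, 3]})

# Mark zipcode as categorical, as in the answer above
df["zipcode"] = df["zipcode"].astype("category")

# One indicator column per zipcode; drop_first=True drops the first
# category (98005) so the dummies are not collinear with an intercept
X = pd.get_dummies(df, columns=["zipcode"], drop_first=True)

print(list(X.columns))
```

The resulting X keeps bedrooms as-is and adds one column per remaining zipcode, which can then be passed to a regression routine that expects numeric features.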
With pandas >= 1.0 there is now a dedicated string datatype:
1) You can convert your column to this pandas string datatype using .astype('string'):
df['zipcode'] = df['zipcode'].astype('string')
2) This is different from using str, which sets the pandas object datatype:
df['zipcode'] = df['zipcode'].astype(str)
3) For changing into categorical datatype use:
df['zipcode'] = df['zipcode'].astype('category')
You can see this difference in datatypes when you look at the info of the dataframe:
df = pd.DataFrame({
    'zipcode_str': [90210, 90211],
    'zipcode_string': [90210, 90211],
    'zipcode_category': [90210, 90211],
})

df['zipcode_str'] = df['zipcode_str'].astype(str)
df['zipcode_string'] = df['zipcode_string'].astype('string')
df['zipcode_category'] = df['zipcode_category'].astype('category')

df.info()

# you can see that the first column has dtype object
# while the second column has the new dtype string
# the third column has dtype category
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   zipcode_str       2 non-null      object
 1   zipcode_string    2 non-null      string
 2   zipcode_category  2 non-null      category
dtypes: category(1), object(1), string(1)
The 'string' extension type solves several issues with object-dtype NumPy arrays:
You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.
object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text, but still object-dtype columns.
When reading code, the contents of an object dtype array is less clear than string.
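To illustrate the select_dtypes point above, here is a small sketch, assuming pandas >= 1.0, that picks out only the columns with the dedicated string dtype. The frame and the name text_cols are made up for the example.

```python
import pandas as pd

# Illustrative frame: one column to be treated as text, one numeric column
df = pd.DataFrame({"zipcode": [90210, 90211], "sqft_living": [1430, 1440]})

# Convert zipcode to the dedicated pandas string dtype
df["zipcode"] = df["zipcode"].astype("string")

# Select only the columns that use the string dtype
text_cols = df.select_dtypes(include="string").columns.tolist()
print(text_cols)
```

With an object-dtype column the same selection would have to go through include="object", which would also pick up any non-text object columns.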
More info on working with the new string datatype can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html