
How to get Text from b'Text' in the pandas object type after using read_sas?

I'm trying to read the data from .sas7bdat format of SAS using pandas function read_sas:

import pandas as pd
df = pd.read_sas('D:/input/houses.sas7bdat', format = 'sas7bdat')
df.head()

The resulting df dataframe has two data types: float64 and object. I am completely satisfied with the float64 columns, since I can freely convert them to int, string, etc. The problem is with the object columns, whose values appear in the dataframe wrapped like this:

b'Text'

or like this:

b'12345'

instead of

Text

or

12345

I can't convert these values to string or int respectively, or to a "normal" object data type. I also can't eliminate the b'' using slicing or replace techniques, so I'm not able to use the columns with the object data type. Please tell me how I can get rid of the b''.

doktr asked Aug 13 '16 07:08



3 Answers

Add the argument encoding="utf-8" to the read_sas call, so the line becomes:

df = pd.read_sas('D:/input/houses.sas7bdat', format = 'sas7bdat', encoding="utf-8")
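To illustrate what the encoding argument changes, here is a minimal sketch using plain Python values rather than a real SAS file. The variable names are only for illustration. Without an encoding, character values arrive as bytes objects, and calling str() on them keeps the b'' wrapper; decoding is what yields ordinary text.

```python
# A bytes value, like the ones shown in the question.
raw = b"Text"

# str() gives the printable representation, b'' wrapper included.
as_str = str(raw)

# decode() interprets the bytes with a given encoding and returns a str.
text = raw.decode("utf-8")

# A numeric value stored as bytes can be decoded and then converted.
number = int(b"12345".decode("utf-8"))

print(as_str, text, number)
```
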
MAFiA303 answered Sep 19 '22 18:09


First, figure out the encoding of your SAS dataset. In SAS, run PROC CONTENTS on the dataset and check the "Encoding" field. In my case, the encoding was "latin1 Western (ISO)". Then pass the matching encoding to read_sas:

df = pd.read_sas('filename', format = 'sas7bdat', encoding = 'latin-1')
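The reason the encoding has to match can be seen in a small sketch with a plain Python string (the sample word is an assumption chosen only because it contains a non-ASCII character). Bytes written as latin-1 are not necessarily valid UTF-8, so decoding them with the wrong codec can fail.

```python
# "caf\u00e9" is "café"; in latin-1 the accented letter is the single byte 0xE9.
raw = "caf\u00e9".encode("latin-1")

# Decoding those bytes as UTF-8 fails, because 0xE9 starts a multi-byte
# sequence that is never completed.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the data was written in recovers the text.
latin1_text = raw.decode("latin-1")

print(utf8_ok, latin1_text)
```
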
Eric answered Sep 18 '22 18:09


In my case, using the encoding argument of pd.read_sas() produced very large dataframes, which led to memory-related errors.

Another way to deal with the problem is to decode the byte strings into regular strings after loading (e.g. with utf8).

Example dataframe:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3],
                   "B": [b"a", b"b", b"c"],
                   "C": ["a", "b", "c"]})

Transform byte strings to strings:

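for col in df:
    # Check the first value by position; this assumes it is representative
    # of the whole column and is not missing.
    if isinstance(df[col].iloc[0], bytes):
        print(col, "will be transformed from bytestring to string")
        df[col] = df[col].str.decode("utf8")  # or any other encoding
print(df)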

Output:

   A  B  C
0  1  a  a
1  2  b  b
2  3  c  c
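If the first value of a column can be missing, checking only that value would skip the column. A possible variation, sketched below, decodes value by value instead. The helper name decode_bytes_columns and the sample data are hypothetical, not part of pandas, and the last step assumes the column holds only digits, like the b'12345' values in the question.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df with bytes values in object columns decoded to str."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype != object:
            continue  # non-object columns (e.g. float64) hold no bytes values
        # Decode only actual bytes values; leave None/NaN and others untouched.
        out[col] = out[col].map(
            lambda v: v.decode(encoding) if isinstance(v, bytes) else v
        )
    return out


# Hypothetical sample data with a missing value and a numeric-looking column.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [b"a", None, b"c"],
    "D": [b"12345", b"67", b"8"],
})

clean = decode_bytes_columns(df)
# Once decoded, digit-only strings can be converted to numbers.
clean["D"] = pd.to_numeric(clean["D"])
print(clean)
```

Working on a copy keeps the original dataframe unchanged, at the cost of extra memory; on very large data it may be preferable to modify the columns in place.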

Useful links:

  1. Pandas Series.str.decode() page of GeeksforGeeks (where I found my solution)

  2. What is the difference between a string and a byte string?

Adrien Pacifico answered Sep 21 '22 18:09