How to get Text from b'Text' in the pandas object type after using read_sas?

Tags: python, object, pandas, dataframe

I'm trying to read data from SAS's .sas7bdat format using the pandas function read_sas:

import pandas as pd
df = pd.read_sas('D:/input/houses.sas7bdat', format = 'sas7bdat')
df.head()

I have two data types in the df dataframe: float64 and object. I am completely satisfied with the float64 data type, since I can freely convert it to int, string, etc. The problem is with the object data type, whose values appear in the df dataframe wrapped like this:

b'Text'

or like this:

b'12345'

instead of

Text

or

12345

I can't convert these values to string or int respectively, or to a "normal" object data type. I also can't eliminate the b'' using slicing or replace techniques, so I'm not able to use the columns with the object data type. Please tell me how I can get rid of the b''.

asked Aug 13 '16 07:08

doktr

3 Answers

Add encoding="utf-8" to the call, so the line becomes:

df = pd.read_sas('D:/input/houses.sas7bdat', format = 'sas7bdat', encoding="utf-8")

answered Sep 19 '22 18:09

MAFiA303

First, figure out the encoding of your SAS dataset. In SAS, run proc contents on the dataset and check the "Encoding" field. In my case, the encoding was "latin1 Western (ISO)". Then pass the matching encoding to read_sas:

df = pd.read_sas('filename', format = 'sas7bdat', encoding = 'latin-1')
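To illustrate why the encoding has to match the dataset, here is a minimal sketch in plain Python. The sample bytes are made up for illustration and are not taken from any SAS file; the point is only that the same bytes decode cleanly with one encoding and fail with another.

```python
# Hypothetical sample: "café" stored as Latin-1, where "é" is the single byte 0xE9.
raw = b"caf\xe9"

# Decoding with the encoding the data was written in gives the expected text.
text_latin1 = raw.decode("latin-1")
print(text_latin1)

# Decoding the same bytes as UTF-8 fails: 0xE9 announces a multi-byte
# sequence that never arrives.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

If the values contain only ASCII characters, both encodings give the same result, which is likely why encoding="utf-8" works for many files even when the dataset was written as Latin-1.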

answered Sep 18 '22 18:09

Eric

Passing the encoding argument to pd.read_sas() gave me very large dataframes, which led to memory-related errors.

Another way to deal with the problem is to decode the byte strings into regular strings after loading, using the appropriate encoding (e.g. utf8).

Example:

Example dataframe:


import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3],
                   "B": [b"a", b"b", b"c"],
                   "C": ["a", "b", "c"]})

Transform byte strings to strings:

for col in df:
    # Use .iloc[0] so the check works regardless of the index labels
    if isinstance(df[col].iloc[0], bytes):
        print(col, "will be transformed from bytestring to string")
        df[col] = df[col].str.decode("utf8")  # or any other encoding
print(df)

output:

B will be transformed from bytestring to string
   A  B  C
0  1  a  a
1  2  b  b
2  3  c  c
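The loop above checks only the first value of each column, so a column that starts with a missing value would be skipped. A slightly more defensive variant could look like the sketch below. The helper name decode_bytes_columns, the column names, and the sample data are all hypothetical and only mimic what read_sas might return; the last step shows how a column like b'12345' can then be turned into integers, as asked in the question.

```python
import pandas as pd


def decode_bytes_columns(frame, encoding="utf-8"):
    """Return a copy of `frame` with bytes values in object columns decoded to str."""
    result = frame.copy()
    for col in result.columns:
        if result[col].dtype != object:
            continue  # numeric columns cannot hold bytes values
        # Decode only real bytes objects; leave str and missing values untouched.
        result[col] = result[col].map(
            lambda value: value.decode(encoding) if isinstance(value, bytes) else value
        )
    return result


# Hypothetical data shaped like the question's dataframe.
df = pd.DataFrame({
    "name": [b"Text", None, b"More"],
    "zip": [b"12345", b"67890", b"10001"],
    "price": [1.0, 2.0, 3.0],
})

clean = decode_bytes_columns(df)
# Once decoded, digit-only strings can be converted to numbers.
clean["zip"] = pd.to_numeric(clean["zip"])
print(clean)
```

Mapping value by value is slower than .str.decode() on large frames, so this is a trade of speed for robustness against mixed or missing values.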

Useful links:

Pandas Series.str.decode() page of GeeksforGeeks (where I found my solution)
What is the difference between a string and a byte string?

answered Sep 21 '22 18:09

Adrien Pacifico
