
How do I get rid of the b-prefix in a string in Python?

Tags:

python

A bunch of the tweets I am importing are having this issue where they read

b'I posted a new photo to Facebook'

I gather the b indicates it is a byte. But this is proving problematic because in the CSV files that I end up writing, the b doesn't go away and is interfering with later code.

Is there a simple way to remove this b prefix from my lines of text?

Keep in mind, I seem to need to have the text encoded in utf-8 or tweepy has trouble pulling them from the web.

Here's the link content I'm analyzing:

https://www.dropbox.com/s/sjmsbuhrghj7abt/new_tweets.txt?dl=0

new_tweets = 'content in the link'

Code Attempt

outtweets = [[tweet.text.encode("utf-8").decode("utf-8")] for tweet in new_tweets]
print(outtweets)

Error

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-21-6019064596bf> in <module>()
      1 for screen_name in user_list:
----> 2     get_all_tweets(screen_name,"instance file")

<ipython-input-19-e473b4771186> in get_all_tweets(screen_name, mode)
     99             with open(os.path.join(save_location,'%s.instance' % screen_name), 'w') as f:
    100                 writer = csv.writer(f)
--> 101                 writer.writerows(outtweets)
    102         else:
    103             with open(os.path.join(save_location,'%s.csv' % screen_name), 'w') as f:

C:\Users\Stan Shunpike\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>

736

asked Jan 29 '17 08:01

Stan Shunpike

2 Answers

You need to decode the bytes if you want a string:

b = b'1234'
print(b.decode('utf-8'))  # prints 1234
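For the CSV case in the question, a minimal sketch along these lines might help. It assumes the tweet texts arrive either as UTF-8 bytes or already as `str`. The names `to_text` and `write_tweets` and the sample data are hypothetical, made up for illustration, and are not part of tweepy. Opening the file with an explicit `encoding="utf-8"` should also avoid the cp1252 `UnicodeEncodeError` shown in the traceback, because on Windows `open()` otherwise falls back to the locale encoding.

```python
import csv
import os
import tempfile


def to_text(value, encoding="utf-8"):
    """Return a str whether the input is bytes or already a str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def write_tweets(path, tweets):
    """Write one tweet per row, decoding bytes first.

    Passing bytes straight to csv.writer would store their repr,
    b prefix included, because the writer calls str() on non-string values.
    """
    # newline="" is what the csv module documentation recommends for files
    # handed to csv.writer; encoding="utf-8" overrides the platform default.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerows([to_text(t)] for t in tweets)


# Hypothetical sample data: one ASCII tweet, one with non-cp1252 characters.
sample = [
    b"I posted a new photo to Facebook",
    "caf\u00e9 \U0001F600".encode("utf-8"),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tweets.csv")
    write_tweets(path, sample)
    with open(path, encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))

print(rows[0])  # ['I posted a new photo to Facebook']
```

The decoding happens once, at the boundary where the data enters the program, so everything downstream works with `str` and the file handle takes care of encoding on the way out.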

149

answered Sep 22 '22 01:09

hiro protagonist

The b prefix is just letting you know that the object you are printing is not a string but a bytes object, displayed as a bytes literal. People explain this in incomplete ways, so here is my take.

Consider creating a bytes object by typing a bytes literal (that is, writing the value directly in source code, e.g. b'', rather than getting it from some other bytes object) and converting it into a string object by decoding it as UTF-8. (Note that converting here means decoding.)

byte_object = b"test"  # byte object by literally typing characters
print(byte_object)  # Prints b'test'
print(byte_object.decode('utf8'))  # Prints "test" without quotations

You see that we simply call the .decode('utf8') method.
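A short sketch of why the prefix shows up in the first place, assuming the bytes really are UTF-8: calling `str()` on a bytes object does not decode it, it only returns the printable representation, while `.decode()` produces an actual string. The variable names here are only for illustration.

```python
raw = "caf\u00e9".encode("utf-8")  # bytes object holding UTF-8 data

# str() on bytes does not decode; it returns the repr, b prefix included.
as_repr = str(raw)

# decode() interprets the bytes as UTF-8 and returns a real str.
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)  # bytes str
print(as_repr.startswith("b'"))                 # True
print(text.encode("utf-8") == raw)              # True
```

So if the prefix ends up in a file, some code most likely converted bytes with `str()` (explicitly or implicitly) instead of decoding them.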

Bytes in Python

https://docs.python.org/3.3/library/stdtypes.html#bytes

String literals are described by the following lexical definitions:

https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "R" | "U"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
stringescapeseq ::=  "\" <any source character>

bytesliteral   ::=  bytesprefix(shortbytes | longbytes)
bytesprefix    ::=  "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes     ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes      ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::=  shortbyteschar | bytesescapeseq
longbytesitem  ::=  longbyteschar | bytesescapeseq
shortbyteschar ::=  <any ASCII character except "\" or newline or the quote>
longbyteschar  ::=  <any ASCII character except "\">
bytesescapeseq ::=  "\" <any ASCII character>
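To make the grammar a little more concrete, here is a small sketch of a few of the bytes prefixes in action. The variable names are only for illustration, and the `compile()` call is just one way to show, without breaking the script itself, that a non-ASCII character inside a bytes literal is rejected.

```python
escaped = b"a\n"   # escape sequence processed: 2 bytes
raw = rb"a\n"      # raw prefix keeps the backslash: 3 bytes
upper = B'a\n'     # prefix case and quote style make no difference
multi = b"""x
y"""               # a "long" (triple-quoted) bytes literal may span lines

print(len(escaped), len(raw))  # 2 3
print(escaped == upper)        # True
print(multi == b"x\ny")        # True

# Only ASCII characters may appear directly in a bytes literal.
try:
    compile('b"\u00e9"', "<demo>", "eval")
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)  # True
```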

answered Sep 20 '22 01:09

Jonathan Komar
