
How do I get rid of the b-prefix in a string in Python?

Tags:

python

A bunch of the tweets I am importing are having this issue where they read

b'I posted a new photo to Facebook' 

I gather the b indicates that it is a bytes object. But this is proving problematic because the b doesn't go away in the CSV files I end up writing, and it interferes with later code.

Is there a simple way to remove this b prefix from my lines of text?

Keep in mind, I seem to need to have the text encoded in utf-8 or tweepy has trouble pulling them from the web.


Here's the link content I'm analyzing:

https://www.dropbox.com/s/sjmsbuhrghj7abt/new_tweets.txt?dl=0

new_tweets = 'content in the link' 

Code Attempt

outtweets = [[tweet.text.encode("utf-8").decode("utf-8")] for tweet in new_tweets]
print(outtweets)

Error

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-21-6019064596bf> in <module>()
      1 for screen_name in user_list:
----> 2     get_all_tweets(screen_name,"instance file")

<ipython-input-19-e473b4771186> in get_all_tweets(screen_name, mode)
     99             with open(os.path.join(save_location,'%s.instance' % screen_name), 'w') as f:
    100                 writer = csv.writer(f)
--> 101                 writer.writerows(outtweets)
    102         else:
    103             with open(os.path.join(save_location,'%s.csv' % screen_name), 'w') as f:

C:\Users\Stan Shunpike\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>
Stan Shunpike asked Jan 29 '17 08:01




2 Answers

You need to decode the bytes if you want a string:

b = b'1234'
print(b.decode('utf-8'))  # prints 1234
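Applied to the question's situation, the same idea might look like the sketch below. It assumes the tweets arrive as UTF-8 bytes; raw_tweets and path are hypothetical stand-ins, not names from the question's code. The traceback in the question comes from the cp1252 codec that open() picked by default on Windows, so the sketch also passes encoding='utf-8' when opening the CSV file.

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for tweet.text.encode("utf-8")
raw_tweets = [
    b'I posted a new photo to Facebook',
    'caf\u00e9 \U0001F600'.encode('utf-8'),
]

# Decode each bytes object back to str so no b'...' repr ends up in the file
outtweets = [[raw.decode('utf-8')] for raw in raw_tweets]

# Hypothetical output location; an explicit encoding avoids the platform
# default (cp1252 in the question's traceback)
path = os.path.join(tempfile.mkdtemp(), 'tweets.csv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows(outtweets)

# Read the file back to check that the text round-trips unchanged
with open(path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
```

Opening the file with newline='' follows the csv module's recommendation and avoids extra blank rows on Windows.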
hiro protagonist answered Sep 22 '22 01:09



The b prefix is just letting you know that the object you are printing is not a string but a bytes object, displayed in the form of a bytes literal. People explain this in incomplete ways, so here is my take.

Consider creating a bytes object by typing a bytes literal (that is, writing the bytes directly in the source code, e.g. b'', rather than building them with the bytes constructor) and converting it into a string object by interpreting its contents as UTF-8. (Note that converting here means decoding.)

byte_object = b"test"  # byte object by literally typing characters
print(byte_object)  # Prints b'test'
print(byte_object.decode('utf8'))  # Prints "test" without quotations

You see that we simply call the .decode('utf8') method on the bytes object.
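As a rough illustration of how the prefix ends up in a CSV file: csv.writer turns non-string values into text with str(), and str() on a bytes object returns its repr, prefix included. The variable names below are made up for this sketch.

```python
import csv
import io

data = "test".encode("utf8")      # a bytes object, same value as b"test"

as_repr = str(data)               # repr of the bytes object, b prefix included
as_text = data.decode("utf8")     # real text, no prefix
also_text = str(data, "utf8")     # str() with an encoding decodes as well

# csv.writer stringifies non-str values with str(), so bytes keep their prefix
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([data])           # written as b'test'
writer.writerow([as_text])        # written as test
lines = buf.getvalue().splitlines()
print(lines)
```

So the fix belongs before the write: decode first, then hand the resulting str to the writer.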

Bytes in Python

https://docs.python.org/3.3/library/stdtypes.html#bytes

String literals are described by the following lexical definitions:

https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "R" | "U"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
stringescapeseq ::=  "\" <any source character>

bytesliteral   ::=  bytesprefix(shortbytes | longbytes)
bytesprefix    ::=  "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes     ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes      ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::=  shortbyteschar | bytesescapeseq
longbytesitem  ::=  longbyteschar | bytesescapeseq
shortbyteschar ::=  <any ASCII character except "\" or newline or the quote>
longbyteschar  ::=  <any ASCII character except "\">
bytesescapeseq ::=  "\" <any ASCII character>
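A few bytes literals that follow the grammar above, as a small sketch (the variable names are illustrative only). Bytes literals may contain only ASCII characters, so non-ASCII bytes have to be written as escapes.

```python
ascii_bytes = b'test'        # bytesprefix "b"
upper_prefix = B'test'       # bytesprefix "B" produces an equal bytes object
raw_bytes = rb'\n'           # raw prefix: the backslash and the n stay as two bytes
escaped = b'caf\xc3\xa9'     # the two UTF-8 bytes of e-acute, written as escapes

same_value = ascii_bytes == upper_prefix
raw_length = len(raw_bytes)
decoded = escaped.decode('utf-8')   # back to a str with no prefix
```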
Jonathan Komar answered Sep 20 '22 01:09
