A bunch of the tweets I am importing are having this issue where they read
b'I posted a new photo to Facebook'
I gather the b indicates it is a byte. But this is proving problematic because in the CSV files that I end up writing, the b doesn't go away and interferes with later code.
Is there a simple way to remove this b prefix from my lines of text?
Keep in mind, I seem to need to have the text encoded in utf-8 or tweepy has trouble pulling them from the web.
Here's the link content I'm analyzing:
https://www.dropbox.com/s/sjmsbuhrghj7abt/new_tweets.txt?dl=0
new_tweets = 'content in the link'
outtweets = [[tweet.text.encode("utf-8").decode("utf-8")] for tweet in new_tweets]
print(outtweets)
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-21-6019064596bf> in <module>()
      1 for screen_name in user_list:
----> 2     get_all_tweets(screen_name,"instance file")

<ipython-input-19-e473b4771186> in get_all_tweets(screen_name, mode)
     99             with open(os.path.join(save_location,'%s.instance' % screen_name), 'w') as f:
    100                 writer = csv.writer(f)
--> 101                 writer.writerows(outtweets)
    102         else:
    103             with open(os.path.join(save_location,'%s.csv' % screen_name), 'w') as f:

C:\Users\Stan Shunpike\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>
Using the codecs module: decode the byte string back into a character string. Once the value is a normal string, the b prefix no longer appears.
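A minimal sketch of the codecs approach, assuming the data really is UTF-8 encoded bytes. The variable names are made up for illustration:

```python
import codecs

# Hypothetical sample value standing in for one imported tweet.
raw = b'I posted a new photo to Facebook'

# codecs.decode turns bytes into str; here it does the same job
# as raw.decode('utf-8').
text = codecs.decode(raw, 'utf-8')

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)                 # I posted a new photo to Facebook
```

For plain bytes objects, calling .decode() directly is usually the shorter spelling of the same thing.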
The b prefix signifies a bytes string literal. If you see it used in Python 3 source code, the expression creates a bytes object, not a regular Unicode str object.
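To make the difference concrete, here is a small sketch comparing a bytes literal with a str literal. The names data and text are illustrative only:

```python
data = b'abc'   # bytes literal: a sequence of integers 0-255
text = 'abc'    # str literal: a sequence of Unicode characters

print(type(data).__name__, type(text).__name__)  # bytes str
print(data[0], text[0])                          # 97 a

# The prefix belongs to the printed representation, not to the data.
print(len(data))   # 3
print(str(data))   # b'abc'

# Decoding produces a real str with the same characters.
print(data.decode('utf-8') == text)  # True
```

The str(data) line is the likely route by which the prefix ends up inside text output: anything that stringifies a bytes object gets its representation, b and quotes included.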
There are multiple ways to remove whitespace and other characters from a string in Python. The most commonly known methods are strip(), lstrip(), and rstrip(). Python 3.9 added two long-awaited methods for removing a prefix or suffix from a string: removeprefix() and removesuffix().
The most common way to remove a character from a string is with the replace() method, but we can also utilize the translate() method, and even replace one or more occurrences of a given character.
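A short sketch of the string methods mentioned above, using made-up sample values. It assumes Python 3.9 or later for removeprefix() and removesuffix(). Note that all of these act on characters that are actually in a str; the b shown when printing a bytes object is not such a character, so decoding is the better fit for that case.

```python
padded = "  hello  "
stripped = padded.strip()    # whitespace removed from both ends
left = padded.lstrip()       # leading whitespace removed only
right = padded.rstrip()      # trailing whitespace removed only

name = "prefix_value_suffix"
no_prefix = name.removeprefix("prefix_")   # Python 3.9+
no_suffix = name.removesuffix("_suffix")   # Python 3.9+

word = "banana"
no_a = word.replace("a", "")                          # remove every "a"
first_a = word.replace("a", "", 1)                    # remove only the first "a"
no_an = word.translate(str.maketrans("", "", "an"))   # remove all "a" and "n"

print(repr(stripped), repr(left), repr(right))  # 'hello' 'hello  ' '  hello'
print(no_prefix, no_suffix)                     # value_suffix prefix_value
print(no_a, first_a, no_an)                     # bnn bnana b
```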
You need to decode the bytes if you want a string:
b = b'1234'
print(b.decode('utf-8'))  # prints 1234
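Applied to the CSV case from the question, the same idea might look like the sketch below. The sample data and file name are invented, and this is not the asker's actual get_all_tweets function. It decodes each value before writing and opens the file with an explicit encoding, since the traceback shows the write going through the cp1252 codec.

```python
import csv
import os
import tempfile

# Hypothetical sample data: tweet texts that arrived as UTF-8 encoded bytes.
raw_tweets = [
    b'I posted a new photo to Facebook',
    'caf\u00e9 \u2615'.encode('utf-8'),
]

# Decode each value so csv.writer receives str. Given bytes, it would
# write str(value), i.e. the text "b'...'" including the prefix.
rows = [[raw.decode('utf-8')] for raw in raw_tweets]

path = os.path.join(tempfile.mkdtemp(), 'tweets.csv')  # made-up file name

# An explicit encoding avoids falling back to the platform default;
# newline='' is what the csv module documentation recommends.
with open(path, 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows(rows)

# Read the file back to check that no prefix was written.
with open(path, encoding='utf-8', newline='') as f:
    read_back = [row[0] for row in csv.reader(f)]

print(read_back[0])  # I posted a new photo to Facebook
```

Keeping the text as str inside the program and letting open() handle the encoding means the values never need to be bytes at the point where they are written.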
The b prefix is just letting you know that the object you are printing is not a string but a bytes object, displayed as a bytes literal. People explain this in incomplete ways, so here is my take.
Consider creating a bytes object by typing a bytes literal (that is, defining the bytes object directly in the source, e.g. by typing b'') and converting it into a string object by decoding it as UTF-8. (Note that converting here means decoding.)
byte_object = b"test"  # byte object by literally typing characters
print(byte_object)  # Prints b'test'
print(byte_object.decode('utf8'))  # Prints "test" without quotations
You see that we simply apply the .decode('utf8') method.
https://docs.python.org/3.3/library/stdtypes.html#bytes
https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals
stringliteral   ::= [stringprefix](shortstring | longstring)
stringprefix    ::= "r" | "u" | "R" | "U"
shortstring     ::= "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::= "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::= shortstringchar | stringescapeseq
longstringitem  ::= longstringchar | stringescapeseq
shortstringchar ::= <any source character except "\" or newline or the quote>
longstringchar  ::= <any source character except "\">
stringescapeseq ::= "\" <any source character>

bytesliteral    ::= bytesprefix(shortbytes | longbytes)
bytesprefix     ::= "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes      ::= "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes       ::= "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem  ::= shortbyteschar | bytesescapeseq
longbytesitem   ::= longbyteschar | bytesescapeseq
shortbyteschar  ::= <any ASCII character except "\" or newline or the quote>
longbyteschar   ::= <any ASCII character except "\">
bytesescapeseq  ::= "\" <any ASCII character>
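A few literals that exercise the bytesprefix rule above, as an illustrative sketch with made-up variable names. It shows that the prefix is case-insensitive, that adding r makes the literal raw, and that non-ASCII bytes have to be written with escapes:

```python
plain = b'\n'    # escape processed: a single newline byte
raw = rb'\n'     # raw bytes literal: a backslash followed by "n"
upper = B'abc'   # the prefix is case-insensitive

print(len(plain), len(raw))        # 1 2
print(raw == b'\\n')               # True
print(upper == b'abc')             # True
print(RB'\t' == bR'\t' == br'\t')  # True

# Only ASCII characters may appear directly in a bytes literal,
# so other byte values are written with \x escapes.
accented = b'\xc3\xa9'
print(accented.decode('utf-8'))    # é
```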