I am trying to extract English titles from a wiki titles dump stored in a text file, using regex in Python 3. The dump also contains titles in other languages and some symbols. Below is my code:
import re

with open('/Users/some/directory/title.txt', 'rb') as f:
    text = f.read()
letters_only = re.sub(b"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
But I am getting an error:
TypeError: sequence item 1: expected a bytes-like object, str found
at the line: letters_only = re.sub(b"[^a-zA-Z]", " ", text)
But I am using the b'' prefix to make the pattern a bytes object. Below is a sample of the text file:
Destroy-Oh-Boy!!
!!Que_Corra_La_Voz!!
!!_(chess)
!!_(disambiguation)
!'O!Kung
!'O!Kung_language
!'O-!khung_language
!337$P34K
!=
!?
!?!
!?Revolution!?
!?_(chess)
!A_Luchar!
!Action_Pact!
!Action_pact!
!Adios_Amigos!
!Alabadle!
!Alarma!
!Alarma!_(album)
!Alarma!_(disambiguation)
!Alarma!_(magazine)
!Alarma!_Records
!Alarma!_magazine
!Alfaro_Vive,_Carajo!
!All-Time_Quarterback!
!All-Time_Quarterback!_(EP)
!All-Time_Quarterback!_(album)
!Alla_tu!
!Amigos!
!Amigos!_(Arrested_Development_episode)
!Arriba!_La_Pachanga
!Ask_a_Mexican!
!Atame!
!Ay,_Carmela!_(film)
!Ay,_caramba!
!BANG!
!Bang!
!Bang!_TV
!Basta_Ya!
!Bastardos!
!Bastardos!_(album)
!Bastardos_en_Vivo!
!Bienvenido,_Mr._Marshall!
!Ciauetistico!
!Ciautistico!
!DOCTYPE
!Dame!_!Dame!_!Dame!
!Decapitacion!
!Dos!
!Explora!_Science_Center_and_Children's_Museum
!F
!Forward,_Russia!
!Forward_Russia!
!Ga!ne_language
!Ga!nge_language
!Gã!ne
!Gã!ne_language
!Gã!nge_language
!HERO
!Happy_Birthday_Guadaloupe!
!Happy_Birthday_Guadalupe!
!Hello_Friends
I have searched online but could not find a solution. Any help will be appreciated.
The problem is with the repl argument you supply: it isn't a bytes object:
letters_only = re.sub(b"[^a-zA-Z]", " ", b'Hello2World')
# TypeError: sequence item 1: expected a bytes-like object, str found
Instead, supply repl as a bytes instance, b" ":
letters_only = re.sub(b"[^a-zA-Z]", b" ", b'Hello2World')
print(letters_only)
b'Hello World'
Note: Don't prefix your literals with b and don't open the file with rb if you aren't looking for byte sequences.
You have to choose between binary and text mode.
Either you open your file as rb, and then you can use re.sub(b"[^a-zA-Z]", b" ", text) (text is a bytes object).
Or you open your file as r, and then you can use re.sub("[^a-zA-Z]", " ", text) (text is a str object).
The second solution is more "classical".
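To make the two options concrete, here is a minimal sketch that reads the same file both ways. The temporary file is a hypothetical stand-in for the titles dump, and the UTF-8 encoding is an assumption, not something stated in the question.

```python
import os
import re
import tempfile

# Hypothetical stand-in for the titles file, written as UTF-8 (an assumption).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("!Bang!_TV\n!Gã!ne_language\n")
    path = tmp.name

# Binary mode: the data is bytes, so the pattern and replacement are bytes too.
with open(path, "rb") as f:
    data = f.read()
words_bytes = re.sub(rb"[^a-zA-Z]", b" ", data).lower().split()

# Text mode: the data is str, so the pattern and replacement are str too.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
words_str = re.sub(r"[^a-zA-Z]", " ", text).lower().split()

os.remove(path)
print(words_bytes)
print(words_str)
```

Both runs split the text at the same places; the only difference is whether the resulting list holds bytes or str items.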
You can't use a bytes pattern for your regex match when the replacement string isn't bytes as well.
Essentially, you can't mix different types (bytes and str) when doing most tasks. In your code above, you are using a bytes search pattern and bytes text, but your replacement string is a regular str. All arguments need to be of the same type, so there are two possible solutions: make everything bytes, or make everything str.
Taking the above into account, your code could look like this (this will return regular str strings, not bytes objects):
import re

with open('/Users/some/directory/title.txt', 'r') as f:
    text = f.read()
letters_only = re.sub(r"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
Note that the code uses a special type of string for the regex: a raw string, prefixed with r. This means that Python won't interpret the backslash (\) as an escape character, which is very useful for regexes. See the docs for more details about raw strings.
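As a small illustration of why the raw prefix matters, the sketch below compares a plain and a raw literal. The sample sentence and variable names are made up for the example.

```python
import re

plain = "\n"   # a single newline character
raw = r"\n"    # two characters: a backslash followed by "n"
print(len(plain), len(raw))

sample = "cat concat cat"

# In a raw string, \b reaches the regex engine as a word-boundary marker.
with_raw = re.findall(r"\bcat\b", sample)

# In a plain string, Python turns \b into a backspace character first,
# so the pattern looks for backspaces that are not in the sample.
without_raw = re.findall("\bcat\b", sample)

print(with_raw)
print(without_raw)
```

For a pattern like [^a-zA-Z], which contains no backslashes, the r prefix changes nothing, but using it consistently avoids surprises once a pattern does include one.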