 

Regex to split words in Python


I was designing a regex to split all the actual words from a given text:


Input Example:

"John's mom went there, but he wasn't there. So she said: 'Where are you'" 


Expected Output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"] 



I came up with a regex like this:

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)" 

After splitting in Python, the result contains None items and empty spaces.

How can I get rid of the None items? And why didn't the spaces match?


Edit:
Splitting on spaces gives items like: ["there."]
Splitting on non-letters gives items like: ["John","s"]
Splitting on non-letters except ' gives items like: ["'Where","you'"]

Betamoo asked Oct 03 '12 09:10





1 Answer

Instead of regex, you can use string-functions:

to_be_removed = ".,:!"  # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

BUT in your example you do not want to remove the apostrophe in John's, while you do want to remove it in you!!'. Simple string operations fail at that point, and you need a finely adjusted regex.
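For comparison, a string-only variation gets closer on this particular input: split on whitespace, then strip punctuation from the edges of each token only. This is a minimal sketch, not part of the original approach; the helper name `split_words` is made up for illustration, and it treats every leading or trailing apostrophe as punctuation.

```python
import string

def split_words(text):
    """Split on whitespace, then strip punctuation from both ends of each token."""
    words = []
    for token in text.split():
        # str.strip only removes characters at the two ends,
        # so an apostrophe inside a word (John's, wasn't) survives.
        word = token.strip(string.punctuation)
        if word:  # skip tokens that consisted only of punctuation
            words.append(word)
    return words

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
print(split_words(s))
```

Because the stripping is purely positional, a possessive such as Moss' would lose its trailing apostrophe as well, so the same decision about trailing apostrophes still has to be made.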

EDIT: A simple regex can probably solve your problem:

(\w[\w']*) 

It matches a run that starts with a word character (\w: a letter, digit, or underscore) and keeps matching while the next character is a word character or an apostrophe.

(\w[\w']*\w) 

This second regex is for a very specific situation. The first regex can capture words like you', with a trailing apostrophe. The second one avoids this and only captures an apostrophe if it is within the word (not at the beginning or the end). The trade-off is that the second regex cannot capture the trailing apostrophe in Moss' mom. You must decide whether to capture a trailing apostrophe in possessive names ending in s.
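To make the trade-off between the two patterns concrete, here is a small sketch that runs both on the same text. The sample sentence and the names `loose` and `strict` are assumptions made for illustration only.

```python
import re

text = "Moss' mom said 'Where are you'"

# First pattern: a word may end in an apostrophe.
loose = re.compile(r"\w[\w']*")
# Second pattern: a word must start and end with a word character.
strict = re.compile(r"\w[\w']*\w")

loose_words = loose.findall(text)
strict_words = strict.findall(text)

print(loose_words)   # keeps the apostrophe in Moss' but also in you'
print(strict_words)  # drops both trailing apostrophes
```

Neither pattern can tell a possessive apostrophe from a closing quote, since both look the same at the character level.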

Example:

import re

rgx = re.compile(r"(\w[\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

UPDATE 2: I found a bug in my regex! Because it requires at least two word characters, it cannot capture single-letter words, such as the A in A'. Here is the fixed regex:

(\w[\w']*\w|\w)

import re

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
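As for the None items in the question: when a pattern contains capturing groups, re.split also returns the text of every group for each delimiter, and a group that did not take part in the match shows up as None. The spaces in the result are delimiters captured by the last group, so they did match. The sketch below reuses the question's pattern; the non-capturing rewrite and the variable names are my own illustration, not something the asker wrote.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# The pattern from the question, with its capturing groups.
capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
# The same alternatives written with non-capturing groups (?:...).
non_capturing = r"(?:[^a-zA-Z]+'|'[^a-zA-Z]+)|[^a-zA-Z']+"

with_groups = re.split(capturing, s)
without_groups = re.split(non_capturing, s)

print(with_groups[:6])   # first word, then one entry per group for the first delimiter
print(without_groups)    # only the pieces between delimiters
```

Even without the groups, this split pattern leaves the final apostrophe attached to the last word, which is one reason to prefer findall with a pattern that describes the words themselves.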
FallenAngel answered Sep 26 '22 03:09
