
Remove all punctuation from string, except if it's between digits

I have a text that contains words and numbers. I'll give a representative example of the text:

string = "This is a 1example of the text. But, it only is 2.5 percent of all data"

I'd like to convert it to something like:

"This is a  1 example of the text But it only is  2.5  percent of all data"

So I want to remove punctuation (a period, a comma, or anything else in string.punctuation) and also put a space between digits and words when they are concatenated, while keeping floats like the 2.5 in my example.

I used the following code:

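import re

item = "This is a 1example of the text. But, it only is 2.5 percent of all data"
# Put a space before every capital letter, then normalise the whitespace
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# This is a start but not there yet!
#item = ' '.join([x.strip(string.punctuation) for x in item.split() if x not in string.digits])
# Split on every run of digits and rejoin with spaces (this also breaks up 2.5)
item = ' '.join(re.split(r'(\d+)', item))
print(item)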

The result is:

 >> "This is a  1 example of the text. But, it only is  2 . 5  percent of all data"

I'm almost there but can't figure out that last piece.

deltascience asked Oct 18 '25 14:10


2 Answers

You can use regex lookarounds like this:

(?<!\d)[.,;:](?!\d)

The idea is to gather the punctuation you want to remove in a character class and use lookarounds so that it only matches punctuation with no digit immediately before or after it:

import re

regex = r"(?<!\d)[.,;:](?!\d)"

test_str = "This is a 1example of the text. But, it only is 2.5 percent of all data"

# Remove every match; the lookarounds protect the "." inside "2.5"
result = re.sub(regex, "", test_str)
print(result)

Result is:

This is a 1example of the text But it only is 2.5 percent of all data
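
The pattern above leaves "1example" joined and only covers the four characters in its character class. As a rough sketch of how both parts of the question could be handled together, the function below (the name `clean` and the constant `PUNCT` are made up for illustration) builds the class from string.punctuation and adds a digit/letter spacing step. Note that it is a variation, not the same rule: it drops punctuation unless there is a digit on both sides, so a sentence-ending "5." loses its period.

```python
import re
import string

# Character class covering everything in string.punctuation (escaped for regex use).
PUNCT = "[" + re.escape(string.punctuation) + "]"

def clean(text):
    # Replace punctuation with a space unless it has a digit on BOTH sides
    # (so "2.5" and "1,000" survive, but "text." and "5." lose the mark).
    text = re.sub(r"(?<!\d)" + PUNCT + "|" + PUNCT + r"(?!\d)", " ", text)
    # Put a space between a digit and a letter that touch, in either order.
    text = re.sub(r"(\d)([A-Za-z])", r"\1 \2", text)
    text = re.sub(r"([A-Za-z])(\d)", r"\1 \2", text)
    # Collapse the runs of whitespace left behind by the steps above.
    return " ".join(text.split())

print(clean("This is a 1example of the text. But, it only is 2.5 percent of all data"))
```

Replacing with a space rather than an empty string keeps "text.But" from collapsing into "textBut"; the final split/join then tidies up any doubled spaces.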
Federico Piazza answered Oct 20 '25 10:10


Okay folks, here is an answer (the best? I don't know, but it seems to work):

import re

item = "This is a 1example 2Ex of the text.But, it only is 2.5 percent of all data?"
# If two words are concatenated and the second starts with a capital letter
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# If a word starts with a digit, like "1example"
item = ' '.join(re.split(r'(\d+)([A-Za-z]+)', item))
# Magical line: strips non-word characters from both ends of each token,
# so punctuation goes but the "." inside a float like 2.5 stays
item = re.sub(r'\S+', lambda m: re.match(r'^\W*(.*\w)\W*$', m.group()).group(1), item)
item = item.replace("  ", " ")
print(item)
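
One caveat with the last substitution above: if a token contains no word character at all (a stray "?" or "-" surrounded by spaces), re.match returns None and the lambda raises AttributeError. A minimal sketch of a more forgiving alternative, assuming it is enough to trim punctuation from the edges of whitespace-separated tokens (the function name `strip_edge_punctuation` is hypothetical):

```python
import string

def strip_edge_punctuation(text):
    # Strip punctuation from both ends of every whitespace-separated token;
    # punctuation inside a token (the "." in "2.5") is left untouched.
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    # Tokens that were pure punctuation become empty strings and are dropped.
    return " ".join(tok for tok in tokens if tok)

print(strip_edge_punctuation("the text. But, it is 2.5 percent - of data ?"))
```

Like the regex it stands in for, this only touches token edges, so "text.But" still needs the capital-letter split to run first.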
deltascience answered Oct 20 '25 11:10


