
Tokenize text but keep compound hyphenated words together

Tags:

python

regex

nlp

I am trying to clean up text using a pre-processing function. I want to remove all non-alpha characters such as punctuation and digits, but I would like to retain compound words that use a dash without splitting them (e.g. pre-tender, pre-construction).

def preprocess(text):
  #remove punctuation
  text = re.sub('\b[A-Za-z]+(?:-+[A-Za-z]+)+\b', '-', text)
  text = re.sub('[^a-zA-Z]', ' ', text)
  text = text.split()
  text = " ".join(text)
  return text

For instance, the original text:

"Attended pre-tender meetings" 

should be split into

['attended', 'pre-tender', 'meeting'] 

rather than

['attended', 'pre', 'tender', 'meeting']

Any help would be appreciated!

asked Nov 19 '25 21:11 by Santiago Gomez


1 Answer

To remove all non-letter characters except a - that sits between two letters, you can use

[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))

ASCII only equivalent:

[^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))

Details:

  • [\W\d_] - any non-letter character (a non-word character, a digit, or an underscore)
  • (?<![^\W\d_]-(?=[^\W\d_])) - a negative lookbehind that fails the match if the character just matched is a - with a letter immediately before it and a letter immediately after it (the following letter is checked with the (?=[^\W\d_]) positive lookahead).
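To see which hyphens the lookbehind actually protects, here is a small sketch that runs both patterns over a few edge cases. The helper name clean and the sample strings are made up for illustration; only the two patterns come from the answer.

```python
import re

# Both patterns are copied from the answer; everything else is illustrative.
UNICODE_PATTERN = re.compile(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))')
ASCII_PATTERN = re.compile(r'[^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))')

def clean(text, pattern=UNICODE_PATTERN):
    """Replace every match with a space, then collapse runs of whitespace."""
    return " ".join(pattern.sub(" ", text).split())

# Hypothetical inputs: inner, leading, trailing and doubled hyphens,
# a digit next to a hyphen, and a non-ASCII letter.
samples = ["pre-tender", "co-op-ed", "-start", "end-", "a--b", "3-d", "caf\u00e9-bar"]
for s in samples:
    print(repr(s), "->", repr(clean(s)), "|", repr(clean(s, ASCII_PATTERN)))
```

Only a single hyphen with a letter on each side is kept, so "-start", "end-" and "a--b" come out as "start", "end" and "a b". The two patterns differ on non-ASCII letters: the Unicode-aware one keeps "café-bar" intact, while the ASCII-only one treats "é" as a non-letter and returns "caf bar".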

See the Python demo:

import re

def preprocess(text):
  #remove all non-alpha characters but - between letters
  text = re.sub(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))', r' ', text)
  return " ".join(text.split())

print(preprocess("Attended pre-tender, etc meetings."))
# => Attended pre-tender etc meetings
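Since the question asks for a list of lowercase tokens, a different approach that may be worth considering is to match the tokens directly with re.findall instead of deleting everything else. This is a sketch, not part of the answer above: the pattern and the function name tokenize are assumptions, and it does not attempt stemming, so "meetings" stays plural.

```python
import re

# Assumed pattern (not from the answer): a run of letters, optionally followed
# by further runs of letters, each joined by exactly one hyphen.
TOKEN = re.compile(r'[^\W\d_]+(?:-[^\W\d_]+)*')

def tokenize(text):
    """Return lowercased tokens, keeping hyphenated compounds in one piece."""
    return [token.lower() for token in TOKEN.findall(text)]

print(tokenize("Attended pre-tender, etc meetings."))
# => ['attended', 'pre-tender', 'etc', 'meetings']
```

Because the group is non-capturing, findall returns each full match, so no joining and re-splitting step is needed.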
answered Nov 21 '25 11:11 by Wiktor Stribiżew


