
True definition of an English word?

Tags:

regex

nlp

What would be the best definition of an English word?

What forms can an English word take beyond plain \w+? Some definitions include \w+-\w+ or \w+'\w+; some exclude cases like \b[0-9]+\b. But I haven't seen any general consensus on those cases. Is there a formal definition? Can any of you clarify?
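To make the difference between these candidate patterns concrete, here is a minimal sketch that runs three of them over one sentence. The sample sentence and the pattern names are made up for illustration, and none of the patterns is meant as an agreed-upon definition of a word.

```python
import re

# Hypothetical sample sentence, chosen to contain apostrophes, hyphens and a number.
text = "Don't over-think it: 42 well-known words aren't enough."

# Candidate "word" patterns; the dictionary keys are illustrative labels only.
PATTERNS = {
    "plain": r"\w+",
    "hyphen_or_apostrophe": r"\w+(?:[-']\w+)*",
    "letters_only": r"[A-Za-z]+(?:[-'][A-Za-z]+)*",
}

def tokenize(pattern, s):
    """Return every non-overlapping match of pattern in s."""
    return re.findall(pattern, s)

for name, pattern in PATTERNS.items():
    print(name, tokenize(pattern, text))
```

The plain pattern splits "Don't" into "Don" and "t", the second keeps contractions and hyphenated compounds whole, and the third additionally drops "42". Which behavior is right depends on the task.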

(Edit: broaden the question so it doesn't depend on regexp only.)

OTZ asked Dec 21 '22 23:12


2 Answers

I really don't think a regex is going to help you here. The problem with English text (or text in any language, for that matter) is context. Without it, you can't be sure whether what lies between the word boundaries is a word, a number, a random collection of characters, etc. For NLP, I think you are going to be selecting a subset of the language and looking for specific words rather than trying to extract all 'words' from a string.

Lazarus answered Jan 04 '23 12:01


The best way to check whether a word is English is to look it up in a dictionary. If it's in a dictionary of English words, then it is an English word. A word can be in an English dictionary and a French dictionary at the same time. For example, 'me' is both a French and an English word.

I'm sure you can find lots of downloadable dictionaries online. You can also make your own. For example, you could download the English version of Wikipedia and assume that all words found there are English words. You may or may not want to filter out numbers.
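A rough sketch of the build-your-own-dictionary idea, assuming the corpus is already available as a plain string. The tiny sample corpus, the function names, and the choice to drop numbers are all illustrative assumptions; a real corpus such as a Wikipedia dump would need its own extraction step.

```python
import re

# Lowercase letters, optionally joined by hyphens or apostrophes.
# Digits never match, so numbers are filtered out (an assumed choice).
WORD_RE = re.compile(r"[a-z]+(?:[-'][a-z]+)*")

def build_vocabulary(corpus_text):
    """Collect every distinct word-like token found in the corpus."""
    return set(WORD_RE.findall(corpus_text.lower()))

def is_known_word(word, vocabulary):
    """A word counts as English here only if the corpus contained it."""
    return word.lower() in vocabulary

# Stand-in for a real corpus.
sample_corpus = "The cat sat on the mat. It's a well-known fact that 2 cats sat there."
vocab = build_vocabulary(sample_corpus)

print(is_known_word("Cat", vocab))    # True
print(is_known_word("xyvfg", vocab))  # False
```

The lookup is only as good as the corpus: misspellings and foreign words that appear in it end up in the vocabulary too.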

A regular expression will not tell you whether a word is English. For instance, xyvfg matches your pattern \w+ but is certainly not an English word.
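A quick check of that claim, as a sketch: the helper name `looks_like_word` is made up for this example, and it only tests the shape of a token against \w+.

```python
import re

def looks_like_word(token):
    """True if the whole token consists of \\w characters; says nothing about meaning."""
    return re.fullmatch(r"\w+", token) is not None

print(looks_like_word("hello"))       # True
print(looks_like_word("xyvfg"))       # True, although it is not an English word
print(looks_like_word("12345"))       # True, digits are \w characters as well
print(looks_like_word("over-think"))  # False, the hyphen is not a \w character
```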

Edit: In theory, using English phonology, it could be possible to tell whether a phonetic transcription of a word is pronounceable by an English speaker. There are lots of words pronounceable by English speakers that are not actually English words, so this approach could account for words that may enter the English language in the future. However, translating between a phonetic transcription and text is quite a challenging problem, as the same phonetic transcription can have many different spellings. I don't know if anyone has done anything like this. It could be an interesting theoretical exercise, but I'm not sure it would be very useful in real-world NLP.

Jay Askren answered Jan 04 '23 13:01