Javascript - regex - word boundary (\b) issue

I'm having difficulty using \b with Greek characters in a regex.

In this example, [a-zA-ZΆΈ-ώἀ-ῼ]* successfully matches all the words I want (both Greek and English). Now suppose I want to find two-letter words. For English I use something like this: \b[a-zA-Z]{2}\b. Can you help me write a regex that matches two-letter words in Greek? (Why? My final goal is to remove them.)

Text used:

Greek MONOTONIC: Το γάρ ούν και παρ' υμίν λεγόμενον, ώς ποτε Φαέθων Ηλίου παίς το του πατρός άρμα ζεύξας δια το μή δυνατός είναι κατά την του πατρός οδόν ελαύνειν τα τ' επί της γής ξυνέκαυσε και αυτός κεραυνωθείς διεφθάρη, τούτο μύθου μέν σχήμα έχον λέγεται, το δέ αληθές εστι των περί γήν και κατ' ουρανόν ιόντων παράλλαξις και διά μακρόν χρόνον γιγνομένη των επί γής πυρί πολλώ φθορά.

Greek POLYTONIC: Τὸ γὰρ οὖν καὶ παρ' ὑμῖν λεγόμενον, ὥς ποτε Φαέθων Ἡλίου παῖς τὸ τοῦ πατρὸς ἅρμα ζεύξας διὰ τὸ μὴ δυνατὸς εἶναι κατὰ τὴν τοῦ πατρὸς ὁδὸν ἐλαύνειν τὰ τ' ἐπὶ τῆς γῆς ξυνέκαυσε καὶ αὐτὸς κεραυνωθεὶς διεφθάρη, τοῦτο μύθου μὲν σχῆμα ἔχον λέγεται, τὸ δὲ ἀληθές ἐστι τῶν περὶ γῆν καὶ κατ' οὐρανὸν ἰόντων παράλλαξις καὶ διὰ μακρὸν χρόνον γιγνομένη τῶν ἐπὶ τῆς γῆς πυρὶ πολλῷ φθορά.

ENGLISH: For in truth the story that is told in your country as well as ours, how once upon a time Phaethon, son of Helios, yoked his father's chariot, and, because he was unable to drive it along the course taken by his father, burnt up all that was upon the earth and himself perished by a thunderbolt,—that story, as it is told, has the fashion of a legend, but the truth of it lies in the occurrence of a shifting of the bodies in the heavens which move round the earth, and a destruction of the things on the earth by fierce fire, which recurs at long intervals.

What I've tried so far:

// 1
txt = txt.replace(/\b[a-zA-ZΆΈ-ώἀ-ῼ]{2}\b/g, '');

// 2
tokens = txt.split(/\s+/);
txt = tokens.filter(function(token){ return token.length > 2}).join(' ');

// 3
tokens = txt.split(' ');
txt = tokens.filter(function(token){ return token.length != 3}).join(' ');

Approaches 2 and 3 were suggested in answers to my earlier question: Javascript - regex - how to remove words with specified length

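EDIT: Read also "Why can't I use accented characters next to a word boundary?" and "Javascript + Unicode regexes".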

tgogos

2 Answers

Since JavaScript engines that predate ES2018 have no lookbehind, and since word boundaries only work with members of the \w character class (ASCII letters, digits and the underscore), a portable approach is to use groups (capturing groups if you want to make a replacement), together with the multiline flag m so that ^ also matches at the start of each line:

(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])

Example that removes two-letter words (in a JavaScript replacement string, group 1 is written $1, not \1):

txt = txt.replace(/(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])/gm, '$1');
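
As a minimal sketch of how the capture-group approach could be wrapped for reuse: the helper name removeTwoLetterWords and the LETTERS constant are made up for illustration, and the character ranges are simply the ones from the question.

```javascript
// Character ranges copied from the question (ASCII plus monotonic and polytonic Greek).
const LETTERS = 'a-zA-ZΆΈ-ώἀ-ῼ';

// Hypothetical helper: removes every word made of exactly two of the letters above.
function removeTwoLetterWords(text) {
  const pattern = new RegExp(
    '(^|[^' + LETTERS + '\\n])' +   // group 1: line start or a non-letter delimiter
    '([' + LETTERS + ']{2})' +      // group 2: the two-letter word itself
    '(?![' + LETTERS + '])',        // not followed by another letter
    'gm'
  );
  // Put the delimiter (group 1) back and drop the word (group 2).
  return text.replace(pattern, '$1');
}

const cleaned = removeTwoLetterWords('Το γάρ ούν και το μή δυνατός');
// Collapse the spaces left behind by the removed words.
console.log(cleaned.replace(/\s+/g, ' ').trim()); // γάρ ούν και δυνατός
```

Each removed word leaves its preceding delimiter in place, so runs of spaces can remain; collapsing them afterwards, as in the last line, is one option.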

answered Oct 23 '22 21:10

Casimir et Hippolyte

You can use \S

Rather than write a match for "word characters plus these characters" it may be appropriate to use a regex that matches not-whitespace:

\S

It's broader in scope, but simpler to write/use.

If that's too broad - use an exclusive list rather than an inclusive list:

[^\s\.]

That is - any character that is not whitespace and not a dot. In this way it's also easy to add to the exceptions.
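
To illustrate the exclusive-list idea, here is a rough sketch that treats any run of characters other than whitespace, dots and commas as a token, and then drops the two-character tokens. The sample string and variable names are only for illustration, and adding the comma to the excluded set is an assumption about what should count as punctuation.

```javascript
const sampleText = 'Το γάρ ούν. For in truth, as it is told';

// Every run of characters that are not whitespace, a dot or a comma is a token.
const tokens = sampleText.match(/[^\s.,]+/g) || [];

// Keep everything except tokens that are exactly two characters long.
const kept = tokens.filter(token => token.length !== 2);

console.log(kept.join(' ')); // γάρ ούν For truth told
```

Note that this throws the punctuation away along with the short words, which may or may not be acceptable depending on what the cleaned text is used for.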

Don't try to use \b

Word boundaries don't work with non-ASCII characters, which is easy to demonstrate:

> "yay".match(/\b.*\b/)
["yay"]
> "γaγ".match(/\b.*\b/)
["a"]

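Therefore it's not possible to use \b to detect words with Greek characters: Greek letters count as non-word characters, so \b only matches where an ASCII word character meets something else, and never between a Greek letter and whitespace or punctuation.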
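
If the target engines support ES2018 (the u flag with Unicode property escapes, plus lookbehind), a Unicode-aware boundary can be approximated with lookarounds around \p{L}. This is only a sketch under that assumption, not something the approaches here rely on; the sample strings and variable names are illustrative.

```javascript
// \b finds no boundary anywhere in a purely Greek string,
// because Greek letters are not part of \w.
console.log('το του'.match(/\b[a-zA-ZΆΈ-ώἀ-ῼ]{2}\b/g)); // null

// "Not preceded by a letter, two letters, not followed by a letter".
const twoLetterWord = /(?<!\p{L})\p{L}{2}(?!\p{L})/gu;

const mixed = 'το του πατρός, as it is told';
const found = mixed.match(twoLetterWord);
console.log(found); // [ 'το', 'as', 'it', 'is' ]

// Removing the matches and tidying up the leftover spaces.
const stripped = mixed.replace(twoLetterWord, '').replace(/\s+/g, ' ').trim();
console.log(stripped);
```

\p{L} matches precomposed accented letters; text stored with separate combining accents would likely need normalizing first, for example with normalize('NFC').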

Match 2 character words

The following pattern can be used to match two-character words:

pattern = /(^|[\s\.,])(\S{2})(?=$|[\s\.,])/g;

(More accurately: it matches sequences of exactly two non-whitespace characters that are delimited by whitespace, a dot or a comma.)

That is:

(^|[\s\.,])    - start of string or whitespace/punctuation (capture group 1)
(\S{2})        - two non-whitespace characters (capture group 2)
(?=$|[\s\.,])  - end of string or whitespace/punctuation (positive lookahead, not consumed)

That pattern can be used like so to remove matching words, keeping the delimiter captured in group 1:

"input string".replace(pattern, '$1');

Here's a jsfiddle demonstrating the pattern's use on the texts in the question.
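
A quick usage sketch for the pattern above, assuming the goal is to keep the delimiter captured in group 1 and drop the two-character token (the sample string and variable names are illustrative only):

```javascript
const twoCharToken = /(^|[\s\.,])(\S{2})(?=$|[\s\.,])/g;

const input = 'το του πατρός, as it is told.';

// '$1' puts the leading delimiter back; the two-character token is dropped.
const output = input.replace(twoCharToken, '$1');

// Collapse the whitespace left behind by the removed tokens.
console.log(output.replace(/\s+/g, ' ').trim()); // του πατρός, told.
```

Because \S accepts any non-whitespace character, two-character tokens such as numbers are removed as well; that is the trade-off for not having to list the Greek ranges.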

answered Oct 23 '22 20:10

AD7six
