
How to ban words with diacritics using a blacklist array and regex?

I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against words with diacritics from the array:

var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new RegExp('\\b' + bannedWords.join("\\b|\\b") + '\\b', 'i');

$(function () {
  $("input").on("change", function () {
    var valid = !regex.test(this.value);
    alert(valid);
  });
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check'>

For example, the word băţ now returns true instead of false.

Ionut asked Aug 25 '16 08:08


4 Answers

Chiu's comment is right: 'aaáaa'.match(/\b.+?\b/g) yields the quite counter-intuitive [ "aa", "á", "aa" ], because a "word character" (\w) in JavaScript regular expressions is just shorthand for [A-Za-z0-9_] ('case-insensitive-alpha-numeric-and-underscore'), so a word boundary (\b) matches any place between a chunk of alphanumerics and any other character. This makes extracting "Unicode words" quite hard.
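The ASCII-only behaviour described above can be checked in any JavaScript console. This is only a minimal sketch; the variable names `pieces` and `naive` are made up for illustration.

```javascript
// \w covers only [A-Za-z0-9_], so "á" counts as a non-word character
// and \b fires on both sides of it.
var pieces = 'aaáaa'.match(/\b.+?\b/g);
console.log(pieces.length); // 3 pieces: "aa", "á", "aa"

// The question's pattern for "băţ": the trailing \b can never match,
// because both "ţ" and the end of the string are non-word positions.
var naive = /\bbăţ\b/i;
console.log(naive.test('băţ')); // false

// ASCII-only words still behave as expected.
console.log(/\bbad\b/i.test('so bad!')); // true
```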

For non-unicase writing systems it is possible to identify a "word character" by its dual nature: ch.toUpperCase() != ch.toLowerCase(). Digits and _ have no case, so this test does not count them as word characters. Your altered snippet could look like this:

var bannedWords = ["bad", "mad", "testing", "băţ", "bať"];
var bannedWordsRegex = new RegExp('-' + bannedWords.join("-|-") + '-', 'i');

$(function() {
  $("input").on("input", function() {
    var invalid = bannedWordsRegex.test(dashPaddedWords(this.value));
    $('#log').html(invalid ? 'bad' : 'good');
  });
  $("input").trigger("input").focus();

  function dashPaddedWords(str) {
    return '-' + str.replace(/./g, wordCharOrDash) + '-';
  };

  function wordCharOrDash(ch) {
    return isWordChar(ch) ? ch : '-'
  };

  function isWordChar(ch) {
    return ch.toUpperCase() != ch.toLowerCase();
  };
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check' value="ba">
<p id="log"></p>
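To try the same idea outside a page, here is a minimal jQuery-free sketch of the snippet above. `isBanned` is a hypothetical helper introduced only for this example; the rest mirrors the functions in the answer.

```javascript
var bannedWords = ["bad", "mad", "testing", "băţ", "bať"];
var bannedWordsRegex = new RegExp('-' + bannedWords.join("-|-") + '-', 'i');

// A character counts as a letter when its upper- and lower-case forms differ.
function isWordChar(ch) {
  return ch.toUpperCase() != ch.toLowerCase();
}

// Replace every non-letter with "-" and pad both ends, so that each word
// ends up delimited by dashes: "a b!" becomes "-a-b--".
function dashPaddedWords(str) {
  return '-' + str.replace(/./g, function (ch) {
    return isWordChar(ch) ? ch : '-';
  }) + '-';
}

// Hypothetical helper, not part of the original snippet.
function isBanned(str) {
  return bannedWordsRegex.test(dashPaddedWords(str));
}

console.log(isBanned('a băţ!')); // true
console.log(isBanned('băţy'));   // false
console.log(isBanned('bad1'));   // true, because "1" has no case
```

The last call shows the trade-off of the case test: a digit next to a banned word acts as a separator, unlike with \b.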
myf answered Oct 23 '22 10:10


Let's see what's going on:

alert("băţ".match(/\w\b/));

This is [ "b" ] because the word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so a pair such as bă matches \w\b\W: it contains a word character, a word boundary, and a non-word character.

I think the best you can do is something like this:

var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var regex = new RegExp('(?:^|' + bound + ')(?:'
                       + bannedWords.join('|')
                       + ')(?=' + bound + '|$)', 'i');

where bound is a negated character class covering all ASCII word characters plus most Latin-esque letters; combined with the start/end-of-input anchors (^ and $), it approximates an internationalized \b. (The trailing boundary is a zero-width lookahead, which mimics \b more closely because it does not consume the delimiter, and therefore works well with the g regex flag.)

Given ["bad", "mad", "testing", "băţ"], this becomes:

/(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i

This doesn't need anything like ….join('\\b|\\b')… because there are parentheses around the list (and that would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b – which JavaScript interprets as merely \b).

You can also use var bound = '[\\s!-/:-@[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful about that order! The dashes indicate ranges between characters.
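As a rough check of the approach above, the following sketch builds the regex from the question's list and runs it against a few inputs. The `pattern` variable and the sample strings are assumptions added for illustration.

```javascript
var bannedWords = ["bad", "mad", "testing", "băţ"];

// Anything that is NOT an ASCII word character or a letter in the listed ranges.
var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var pattern = '(?:^|' + bound + ')(?:'
              + bannedWords.join('|')
              + ')(?=' + bound + '|$)';

var regex = new RegExp(pattern, 'i');
console.log(regex.test('băţ'));       // true
console.log(regex.test('a băţ, ok')); // true
console.log(regex.test('băţy'));      // false
console.log(regex.test('madă'));      // false, "ă" falls inside the listed ranges

// With the g flag, the lookahead leaves each trailing space available
// to serve as the leading delimiter of the next match.
var all = 'bad mad bad'.match(new RegExp(pattern, 'gi'));
console.log(all.length); // 3
```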

Adam Katz answered Oct 23 '22 10:10


You need a Unicode-aware word boundary. The easiest way is to use the XRegExp package.

Although its \b is still ASCII-based, it offers a \p{L} construct (or the shorter \pL form) that matches any Unicode letter from the BMP. Building a custom word boundary with this construct is easy:

\b                     word            \b
  ---------------------------------------
 |                       |               |
([^\pL0-9_]|^)         word       (?=[^\pL0-9_]|$)

The leading word boundary can be represented with a (non-)capturing group ([^\pL0-9_]|^) that matches (and consumes) either the start of the string or a character that is not a BMP Unicode letter, a digit, or _ before the word.

The trailing word boundary can be represented with a positive lookahead (?=[^\pL0-9_]|$) that requires either the end of the string or a character that is not a BMP Unicode letter, a digit, or _ after the word.

See the snippet below that will detect băţ as a banned word, and băţy as an allowed word.

var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new XRegExp('(?:^|[^\\pL0-9_])(?:' + bannedWords.join("|") + ')(?=$|[^\\pL0-9_])', 'i');

$(function () {
  $("input").on("change", function () {
    var valid = !regex.test(this.value);
    //alert(valid);
    console.log("The word is", valid ? "allowed" : "banned");
  });
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
<input type='text' name='word_to_check'>
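In engines that support ES2018 Unicode property escapes, a similar boundary can be sketched without XRegExp by using the native u flag. Note that this is a swapped-in alternative, not the XRegExp approach of this answer, and it assumes a reasonably recent browser or Node.js.

```javascript
var bannedWords = ["bad", "mad", "testing", "băţ"];

// Native RegExp with the "u" flag; \p{L} matches any Unicode letter.
var regex = new RegExp(
  '(?:^|[^\\p{L}0-9_])(?:' + bannedWords.join('|') + ')(?=$|[^\\p{L}0-9_])',
  'iu'
);

console.log(regex.test('băţ'));    // true  (banned)
console.log(regex.test('băţy'));   // false (allowed)
console.log(regex.test('x băţ.')); // true  (banned)
```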
Wiktor Stribiżew answered Oct 23 '22 10:10


Instead of using a word boundary, you could use

(?:[^\w\u0080-\u02af]+|^)

to check for start of word, and

(?=[^\w\u0080-\u02af]|$)

to check for the end of it.

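The [^\w\u0080-\u02af] class matches any character that is not (^) a basic Latin word character (\w) or a character from the Latin-1 Supplement, Latin Extended-A, Latin Extended-B and IPA Extensions blocks. These blocks include some punctuation, but a class that matched only letters would get very long. The snippet below starts its range at \u00c0 instead, which skips the control characters and most of the punctuation at the start of Latin-1 Supplement. The class may also have to be extended if other character sets have to be included. See for example Wikipedia.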

Since JavaScript didn't support lookbehinds when this was written (they were added in ES2018), the start-of-word test consumes any of the aforementioned non-word characters, but I don't think that should be a problem. The important thing is that the end-of-word test doesn't.

Also, putting these tests outside a non-capturing group that alternates the words makes the regex significantly more efficient.
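The consuming start-of-word test can be observed by inspecting a match. In this sketch the word list is wrapped in a capturing group, which is an addition made here (the answer uses a non-capturing group), so the bare word can be read back; `edge` is an illustrative name.

```javascript
var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"];
var edge = '[^\\w\\u00c0-\\u02af]';

// Capturing group around the word list is an addition for this sketch.
var regex = new RegExp(
  '(?:' + edge + '+|^)(' + bannedWords.join('|') + ')(?=' + edge + '|$)',
  'i'
);

var m = regex.exec('row your båt!');
console.log(JSON.stringify(m[0])); // " båt" : the leading space was consumed
console.log(m[1]);                 // båt   : the "!" was only looked at
```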

var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"],
    regex = new RegExp('(?:[^\\w\\u00c0-\\u02af]+|^)(?:' + bannedWords.join("|") + ')(?=[^\\w\\u00c0-\\u02af]|$)', 'i');

function myFunction() {
    document.getElementById('result').innerHTML = 'Banned = ' + regex.test(document.getElementById('word_to_check').value);
}
<!DOCTYPE html>
<html>
<body>

Enter word: <input type='text' id='word_to_check'>
<button onclick='myFunction()'>Test</button>

<p id='result'></p>

</body>
</html>
SamWhan answered Oct 23 '22 10:10