My HTML code with Devanagari words
<html>
<head>
<title>TODO</title>
<meta charset="UTF-8">
</head>
<body>
मंत्री मुख्यमंत्री
</body>
<script src="jquery-1.11.0.min.js"></script>
<script src="xregexp_20.js"></script>
<script src="addons/unicode/unicode-base.js"></script>
<script src="addons/unicode/unicode-scripts.js"></script>
<script src="my.js"></script>
</html>
My JavaScript code
var html = document.getElementsByTagName("html")[0];
var fullpage_content = html.innerHTML;
var regex = RegExp("मंत्री", "g");
var count = fullpage_content.match(regex);
console.log("count in page : " + count+ ", " + count.length);
// Using the \b word boundary, which does not work with Devanagari characters
regex = RegExp("\\bमंत्री\\b", "g");
count = fullpage_content.match(regex);
console.log("count in page : " + count);
regex = XRegExp("मंत्री");
var match = XRegExp.matchChain(fullpage_content, [regex]);
console.log("count in page : " + match + ", " + match.length);
// XRegExp's \b word boundary does not work with Devanagari characters either
regex = XRegExp("\\bमंत्री\\b");
match = XRegExp.matchChain(fullpage_content, [regex]);
console.log("count in page : " + match + ", " + match.length);
Output of the JavaScript (on Chrome)
count in page : मंत्री,मंत्री, 2
count in page : null
count in page : मंत्री,मंत्री, 2
count in page : , 0
A whole-word search should return one match, but both RegExp and XRegExp fail to do that. I need some help.
Using this regexp I can get a match on मंत्री but exclude मुख्यमंत्री:
var regex = XRegExp("(?:^|\\P{L})मंत्री(?=\\P{L}|$)");
What this does is match मंत्री if it:
Is at the beginning of the string or preceded by a character which Unicode considers a non-Letter, and
Is at the end of the string or followed by a character which Unicode considers a non-Letter.
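As a rough sketch of the two conditions above, here is the same pattern written with the native `u` flag instead of XRegExp. This assumes an ES2018+ engine, where `\p{L}` and `\P{L}` are available natively; `countWholeWord` is a hypothetical helper name, not part of XRegExp or the original code.

```javascript
// Hypothetical helper: counts whole-word occurrences of `word` in `text`.
// Assumes an ES2018+ engine (the `u` flag enables \p{L} / \P{L} natively).
function countWholeWord(text, word) {
  // Escape regex syntax characters so `word` is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Same shape as the XRegExp pattern: start-or-non-letter, the word,
  // then a lookahead for non-letter-or-end.
  const re = new RegExp("(?:^|\\P{L})" + escaped + "(?=\\P{L}|$)", "gu");
  return (text.match(re) || []).length;
}

console.log(countWholeWord("मंत्री मुख्यमंत्री", "मंत्री")); // 1
```

Because the trailing check is a lookahead, the separator after one match is still available as the leading non-letter of the next, so adjacent occurrences such as "मंत्री मंत्री" are both counted.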
Note that this is slightly different from what \b does, because \b treats digits as word characters. For instance, /\bmantri\b/ won't match mantri123 because 1, 2, and 3 are considered to be part of words and thus do not mark a word boundary. If you want something that emulates \b more closely, then this would do it:
var regex = XRegExp("(?:^|[^\\p{L}\\p{N}])मंत्री(?=[^\\p{L}\\p{N}]|$)");
The difference with the first regexp is that with this one मंत्री cannot be preceded or followed by a digit.
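To make the digit behaviour concrete, here is a small comparison using the Latin example from above. It is only a sketch and assumes an ES2018+ engine with native `\p{...}` support under the `u` flag; the variable names are mine.

```javascript
// Letters-only boundary: digits count as separators.
const lettersOnly = /(?:^|\P{L})mantri(?=\P{L}|$)/u;
// Letters-and-digits boundary: closer to what \b does for ASCII text.
const lettersAndDigits = /(?:^|[^\p{L}\p{N}])mantri(?=[^\p{L}\p{N}]|$)/u;

console.log(lettersOnly.test("mantri123"));       // true
console.log(lettersAndDigits.test("mantri123"));  // false
console.log(/\bmantri\b/.test("mantri123"));      // false
console.log(lettersAndDigits.test("mantri 123")); // true
```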
I've used a lookahead at the end of the regular expression so the character that follows your word is excluded from the results. There is no equivalent lookbehind available here, so if there is a character before मंत्री, it will appear in the results. You'll have to decide what you want to do with this character for your specific application.
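One possible way to deal with that leading character is to wrap the word in a capturing group and read the group instead of the whole match. The sketch below assumes an ES2018+ engine and uses a native `u`-flag regex rather than XRegExp; `findWholeWords` is a hypothetical helper, not something from the original code.

```javascript
// Hypothetical helper: returns each whole-word occurrence with its start index.
// Assumes an ES2018+ engine (native \p{L} / \p{N} under the `u` flag).
function findWholeWords(text, word) {
  if (!word) return []; // guard against zero-length matches
  // Escape regex syntax characters so `word` is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp(
    "(?:^|[^\\p{L}\\p{N}])(" + escaped + ")(?=[^\\p{L}\\p{N}]|$)",
    "gu"
  );
  const results = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    // m[0] may begin with the separator; m[1] is the word alone.
    const start = m.index + (m[0].length - m[1].length);
    results.push({ word: m[1], start: start });
  }
  return results;
}

console.log(findWholeWords("(मंत्री) मुख्यमंत्री", "मंत्री"));
// [ { word: 'मंत्री', start: 1 } ]
```

Here the full match is "(मंत्री" but the captured group is just the word, and the start index is shifted past the separator.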