My JavaScript is quite rusty, so any help with this would be great. I have a requirement to detect non-printable characters (control characters such as SOH, BS, etc.) as well as extended ASCII characters such as Ž in a string and remove them, but I am not sure how to write the code.
Can anyone point me in the right direction for how to go about this? This is what I have so far:
$(document).ready(function() {
    $('.jsTextArea').blur(function() {
        var pattern = /[^\000-\031]+/gi;
        var val = $(this).val();
        if (pattern.test(val)) {
            for (var i = 0; i < val.length; i++) {
                var res = val.charAt([i]);
                alert("Character " + [i] + " " + res);
            }
        }
        else {
            alert("It failed");
        }
    });
});
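In Java, replaceAll("\\p{Cntrl}", "?") replaces every control character with a question mark. A similar replaceAll call on a string such as my_string can remove all ASCII non-printable characters, including accented characters, by negating the printable class \p{Print} (shorthand for [\p{Graph}\x20]).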
Bring up the command palette with CTRL+SHIFT+P (Windows, Linux) or CMD+SHIFT+P on Mac. Type Remove Non ASCII Chars until you see the commands. Select Remove non Ascii characters (File) for removing in the entire file, or Remove non Ascii characters (Select) for removing only in the selected text.
Option #1 - Show All Characters: go to the menu and select View -> Show Symbol -> Show All Characters. All characters will become visible, but you will have to scroll through the whole file to see which character needs to be removed.
To target characters that are not part of the printable basic ASCII range, you can use this simple regex:
[^ -~]+
Explanation: in the first 128 characters of the ASCII table, the printable range starts with the space character and ends with a tilde. These are the characters you want to keep. That range is expressed with [ -~], and the characters not in that range are expressed with [^ -~]. These are the ones we want to replace. Therefore:
result = string.replace(/[^ -~]+/g, "");
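As a rough sketch of the replacement above in use, here it is wrapped in a small helper. The function name stripNonPrintableAscii and the sample string are made up for illustration and are not part of the original answer.

```javascript
// Hypothetical helper: keep only printable ASCII (space through tilde).
function stripNonPrintableAscii(input) {
  return input.replace(/[^ -~]+/g, "");
}

// \u0001 is SOH, \b is BS, and \u017D is Ž, which lies outside basic ASCII.
const sample = "A\u0001B\bC\u017D!";
console.log(stripNonPrintableAscii(sample)); // "ABC!"
```

Note that tabs and line breaks are also outside the space-to-tilde range, so this pattern removes them as well.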
No need to test; you can process the text box content directly:
textBoxContent = textBoxContent.replace(/[^\x20-\x7E]+/g, '');
where the range \x20-\x7E covers the printable part of the ASCII table.
Example with your code:
$('.jsTextArea').blur(function() {
    this.value = this.value.replace(/[^\x20-\x7E]+/g, '');
});
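Since a textarea usually contains line breaks, which fall outside \x20-\x7E, a variant that keeps them may be closer to what is wanted. The following is only a sketch under that assumption; the helper name stripKeepingLineBreaks is hypothetical.

```javascript
// Hypothetical variant: keep printable ASCII plus tab, line feed and carriage return.
function stripKeepingLineBreaks(input) {
  return input.replace(/[^\x20-\x7E\t\n\r]+/g, "");
}

// \u0008 is BS, \u017D is Ž, \u0001 is SOH; the \n between the lines survives.
const multiLine = "line 1\u0008\nline\u017D 2\u0001";
console.log(JSON.stringify(stripKeepingLineBreaks(multiLine))); // "line 1\nline 2"
```

Inside the blur handler this would be used as this.value = stripKeepingLineBreaks(this.value);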
For anyone looking for a solution that works beyond ASCII and does not strip out Unicode characters:
function stripNonPrintableAndNormalize(text) {
    // normalize newlines first, because \p{C} below also matches \r and \n
    text = text.replace(/\r\n?/g, '\n');
    text = text.replace(/\p{Zl}/gu, '\n');
    text = text.replace(/\p{Zp}/gu, '\n');
    // normalize space
    text = text.replace(/\p{Zs}/gu, ' ');
    // strip control chars and the rest of category C, keeping \n
    text = text.replace(/[^\P{C}\n]/gu, '');
    return text;
}
The various Unicode class identifiers (e.g. Zl for line separator) are defined at https://www.unicode.org/reports/tr44/ as also shown below:
Abbr | Long | Description |
---|---|---|
Lu | Uppercase_Letter | an uppercase letter |
Ll | Lowercase_Letter | a lowercase letter |
Lt | Titlecase_Letter | a digraphic character, with first part uppercase |
LC | Cased_Letter | Lu \| Ll \| Lt |
Lm | Modifier_Letter | a modifier letter |
Lo | Other_Letter | other letters, including syllables and ideographs |
L | Letter | Lu \| Ll \| Lt \| Lm \| Lo |
Mn | Nonspacing_Mark | a nonspacing combining mark (zero advance width) |
Mc | Spacing_Mark | a spacing combining mark (positive advance width) |
Me | Enclosing_Mark | an enclosing combining mark |
M | Mark | Mn \| Mc \| Me |
Nd | Decimal_Number | a decimal digit |
Nl | Letter_Number | a letterlike numeric character |
No | Other_Number | a numeric character of other type |
N | Number | Nd \| Nl \| No |
Pc | Connector_Punctuation | a connecting punctuation mark, like a tie |
Pd | Dash_Punctuation | a dash or hyphen punctuation mark |
Ps | Open_Punctuation | an opening punctuation mark (of a pair) |
Pe | Close_Punctuation | a closing punctuation mark (of a pair) |
Pi | Initial_Punctuation | an initial quotation mark |
Pf | Final_Punctuation | a final quotation mark |
Po | Other_Punctuation | a punctuation mark of other type |
P | Punctuation | Pc \| Pd \| Ps \| Pe \| Pi \| Pf \| Po |
Sm | Math_Symbol | a symbol of mathematical use |
Sc | Currency_Symbol | a currency sign |
Sk | Modifier_Symbol | a non-letterlike modifier symbol |
So | Other_Symbol | a symbol of other type |
S | Symbol | Sm \| Sc \| Sk \| So |
Zs | Space_Separator | a space character (of various non-zero widths) |
Zl | Line_Separator | U+2028 LINE SEPARATOR only |
Zp | Paragraph_Separator | U+2029 PARAGRAPH SEPARATOR only |
Z | Separator | Zs \| Zl \| Zp |
Cc | Control | a C0 or C1 control code |
Cf | Format | a format control character |
Cs | Surrogate | a surrogate code point |
Co | Private_Use | a private-use character |
Cn | Unassigned | a reserved unassigned code point or a noncharacter |
C | Other | Cc \| Cf \| Cs \| Co \| Cn |
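Because the original question also asks how to detect such characters, the same category-based pattern can be used to report them before removing anything. This is a minimal sketch, assuming a runtime that supports Unicode property escapes and String.prototype.matchAll; the helper name findCategoryC is hypothetical.

```javascript
// Hypothetical helper: list each character in Unicode category C (Other)
// together with its index and code point.
function findCategoryC(text) {
  const found = [];
  for (const match of text.matchAll(/\p{C}/gu)) {
    const codePoint = match[0].codePointAt(0);
    found.push({
      index: match.index,
      codePoint: "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0"),
    });
  }
  return found;
}

// \u0001 (SOH) is in Cc and \u200B (zero width space) is in Cf.
// Ž (\u017D) is an uppercase letter (Lu), so it is not reported.
console.log(findCategoryC("A\u0001\u017D\u200B"));
```

Swapping \p{C} for a narrower class such as \p{Cc} limits the report to C0 and C1 control codes.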
You have to assign a pattern (instead of a string) to the isNonAscii variable, then use test() to check whether it matches. test() returns true or false.
$(document).ready(function() {
    $('.jsTextArea').blur(function() {
        // matches any character outside printable ASCII (space through tilde)
        var pattern = /[^\x20-\x7E]/;
        var val = $(this).val();
        if (pattern.test(val)) {
            alert("It matched");
        }
        else {
            alert("It did NOT match");
        }
    });
});
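One thing to watch for with test(): a pattern that carries the g flag keeps its position in lastIndex between calls, so a stored pattern that is reused can give alternating results. The snippet below is a small illustration of that behavior; the variable names are made up for the example.

```javascript
// A shared regex with the g flag remembers where the last match ended.
const sharedPattern = /[^\x20-\x7E]/g;
const first = sharedPattern.test("abc\u0001");  // true, lastIndex is now 4
const second = sharedPattern.test("abc\u0001"); // false, the search started at index 4
console.log(first, second); // true false

// Without the g flag, test() always searches from the start of the string.
const statelessPattern = /[^\x20-\x7E]/;
const third = statelessPattern.test("abc\u0001");
const fourth = statelessPattern.test("abc\u0001");
console.log(third, fourth); // true true
```

Creating the pattern inside the handler, or leaving off the g flag when only a yes/no answer is needed, avoids the problem.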