Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Match non printable/non ascii characters and remove from text

My JavaScript is quite rusty so any help with this would be great. I have a requirement to detect non printable characters (control characters like SOH, BS etc) as well extended ascii characters such as Ž in a string and remove them but I am not sure how to write the code?

Can anyone point me in the right direction for how to go about this? This is what I have so far:

$(document).ready(function() {
    $('.jsTextArea').blur(function() {
        var pattern = /[^\000-\031]+/gi;
        var val = $(this).val();
        if (pattern.test(val)) {    
        for (var i = 0; i < val.length; i++) {
            var res = val.charAt([i]);
                alert("Character " + [i] + " " + res);              
        }          
    }
    else {
         alert("It failed");
     }

    });
});
like image 783
Grant Doole Avatar asked Jun 15 '14 11:06

Grant Doole


People also ask

How do I remove non-printable characters from a string?

replaceAll("\\p{Cntrl}", "?"); The following will replace all ASCII non-printable characters (shorthand for [\p{Graph}\x20] ), including accented characters: my_string.

How do I remove non ASCII characters from a text file?

Bring up the command palette with CTRL+SHIFT+P (Windows, Linux) or CMD+SHIFT+P on Mac. Type Remove Non ASCII Chars until you see the commands. Select Remove non Ascii characters (File) for removing in the entire file, or Remove non Ascii characters (Select) for removing only in the selected text.

How do I find non-printable characters in a text file?

Option #1 - Show All Characters Then, go to the menu and select View->Show Symbol->Show All Characters . All characters will become visible, but you will have to scroll through the whole file to see which character needs to be removed.


4 Answers

To target characters that are not part of the printable basic ASCII range, you can use this simple regex:

[^ -~]+

Explanation: in the first 128 characters of the ASCII table, the printable range starts with the space character and ends with a tilde. These are the characters you want to keep. That range is expressed with [ -~], and the characters not in that range are expressed with [^ -~]. These are the ones we want to replace. Therefore:

result = string.replace(/[^ -~]+/g, "");
like image 50
zx81 Avatar answered Oct 19 '22 04:10

zx81


No need to test, you can directly process the text box content:

textBoxContent = textBoxContent.replace(/[^\x20-\x7E]+/g, '');

where the range \x20-\x7E covers the printable part of the ascii table.

Example with your code:

$('.jsTextArea').blur(function() {
    this.value = this.value.replace(/[^\x20-\x7E]+/g, '');
});
like image 35
Casimir et Hippolyte Avatar answered Oct 19 '22 06:10

Casimir et Hippolyte


For anyone looking for a solution that works beyond ascii and does not strip out Unicode chars:

function stripNonPrintableAndNormalize(text) {
    // strip control chars
    text = text.replace(/\p{C}/gu, '');

    // other common tasks are to normalize newlines and other whitespace

    // normalize newline
    text = text.replace(/\n\r/g, '\n');
    text = text.replace(/\p{Zl}/gu, '\n');
    text = text.replace(/\p{Zp}/gu, '\n');

    // normalize space
    text = text.replace(/\p{Zs}/gu, ' ');

    return text;
}

The various unicode class identifiers (e.g. Zl for line separator) are defined at https://www.unicode.org/reports/tr44/ as also shown below:

Abbr Long Description
Lu Uppercase_Letter an uppercase letter
Ll Lowercase_Letter a lowercase letter
Lt Titlecase_Letter a digraphic character, with first part uppercase
LC Cased_Letter Lu | Ll | Lt
Lm Modifier_Letter a modifier letter
Lo Other_Letter other letters, including syllables and ideographs
L Letter Lu | Ll | Lt | Lm | Lo
Mn Nonspacing_Mark a nonspacing combining mark (zero advance width)
Mc Spacing_Mark a spacing combining mark (positive advance width)
Me Enclosing_Mark an enclosing combining mark
M Mark Mn | Mc | Me
Nd Decimal_Number a decimal digit
Nl Letter_Number a letterlike numeric character
No Other_Number a numeric character of other type
N Number Nd | Nl | No
Pc Connector_Punctuation a connecting punctuation mark, like a tie
Pd Dash_Punctuation a dash or hyphen punctuation mark
Ps Open_Punctuation an opening punctuation mark (of a pair)
Pe Close_Punctuation a closing punctuation mark (of a pair)
Pi Initial_Punctuation an initial quotation mark
Pf Final_Punctuation a final quotation mark
Po Other_Punctuation a punctuation mark of other type
P Punctuation Pc | Pd | Ps | Pe | Pi | Pf | Po
Sm Math_Symbol a symbol of mathematical use
Sc Currency_Symbol a currency sign
Sk Modifier_Symbol a non-letterlike modifier symbol
So Other_Symbol a symbol of other type
S Symbol Sm | Sc | Sk | So
Zs Space_Separator a space character (of various non-zero widths)
Zl Line_Separator U+2028 LINE SEPARATOR only
Zp Paragraph_Separator U+2029 PARAGRAPH SEPARATOR only
Z Separator Zs | Zl | Zp
Cc Control a C0 or C1 control code
Cf Format a format control character
Cs Surrogate a surrogate code point
Co Private_Use a private-use character
Cn Unassigned a reserved unassigned code point or a noncharacter
C Other Cc | Cf | Cs | Co | Cn
like image 2
mwag Avatar answered Oct 19 '22 04:10

mwag


You have to assign a pattern (instead of string) into isNonAscii variable, then use test() to check if it matches. test() returns true or false.

$(document).ready(function() {
    $('.jsTextArea').blur(function() {
        var pattern = /[^\000-\031]+/gi;
        var val = $(this).val();
        if (pattern.test(val)) {
            alert("It matched");
        }
        else {
            alert("It did NOT match");
        }
    });
});

Check jsFiddle

like image 1
kosmos Avatar answered Oct 19 '22 05:10

kosmos