In C#, you can use this: Regex.Replace("This is a test string, with lots of: punctuations; in it?!.", @"[^\w\s]", "");
Use regex to Strip Punctuation From a String in Python
The regex pattern [^\w\s] matches every character that is not a word character or whitespace (i.e. the punctuation), and the replacement substitutes an empty string for each match.
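As a quick check of what the pattern keeps and drops, here is a minimal JavaScript sketch of the same pattern (JavaScript being the language most of the snippets here use). The sample sentence is the one from the first snippet; the second input is made up. Note that \w includes the underscore, so _ survives this pattern.

```javascript
// Same pattern as above, written as a JavaScript regex literal with the global flag.
const notWordOrSpace = /[^\w\s]/g;

const input = "This is a test string, with lots of: punctuations; in it?!.";
const stripped = input.replace(notWordOrSpace, "");
console.log(stripped);
// -> "This is a test string with lots of punctuations in it"

// \w matches [A-Za-z0-9_], so the underscore is NOT treated as punctuation here.
const withUnderscore = "snake_case, kebab-case!".replace(notWordOrSpace, "");
console.log(withUnderscore);
// -> "snake_case kebabcase"
```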
What are the 14 Punctuation Marks in English?
There are 14 punctuation marks that are used in the English language. They are: the period, question mark, exclamation point, comma, colon, semicolon, dash, hyphen, brackets, braces, parentheses, apostrophe, quotation mark, and ellipsis.
If you want to remove specific punctuation from a string, it is probably best to remove exactly the characters you want, like this:
replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"")
Doing the above still doesn't return the string as you specified it, because the removed punctuation leaves extra spaces behind. To collapse those leftover spaces, do something like
replace(/\s{2,}/g," ");
My full example:
var s = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var punctuationless = s.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"");
var finalString = punctuationless.replace(/\s{2,}/g," ");
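To make the two steps reusable, they can be wrapped in a small helper. This is only a sketch: the name stripListedPunctuation is made up for illustration, and the final trim() is an addition on top of the original two calls. Because the character class is an explicit list, anything not listed (for example ' or ?) is left alone.

```javascript
// Hypothetical helper (not from the original answer): removes only the
// characters listed in the class, then collapses runs of whitespace.
function stripListedPunctuation(text) {
  return text
    .replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "") // drop the listed characters
    .replace(/\s{2,}/g, " ")                      // collapse 2+ whitespace chars
    .trim();                                      // assumption: trim the edges too
}

const s = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
console.log(stripListedPunctuation(s));
// -> "This is an example of a string with punctuation"

// Characters missing from the list survive:
console.log(stripListedPunctuation("Don't panic, okay?"));
// -> "Don't panic okay?"
```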
str = str.replace(/[^\w\s]|_/g, "")
.replace(/\s+/g, " ");
Removes everything except alphanumeric characters and whitespace, then collapses multiple adjacent whitespace to single spaces.
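The effect of each step is easier to see with the intermediate string printed. A small sketch with a made-up input (the variable names are illustrative only):

```javascript
const raw = "Hello,   world_wide... web!\tDone.";

// Step 1: drop anything that is not a word character or whitespace,
// and also drop underscores (which \w would otherwise keep).
const noPunct = raw.replace(/[^\w\s]|_/g, "");
console.log(JSON.stringify(noPunct));
// -> "Hello   worldwide web\tDone"

// Step 2: collapse every run of whitespace (spaces, tabs, newlines) to one space.
const collapsed = noPunct.replace(/\s+/g, " ");
console.log(collapsed);
// -> "Hello worldwide web Done"
```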
Detailed explanation:
1. \w is any digit, letter, or underscore.
2. \s is any whitespace.
3. [^\w\s] is anything that's not a digit, letter, whitespace, or underscore.
4. [^\w\s]|_ is the same as #3 except with the underscores added back in.
Here are the standard punctuation characters for US-ASCII: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
For Unicode punctuation (such as curly quotes, em-dashes, etc.), you can easily match on specific block ranges. The General Punctuation block is \u2000-\u206F, and the Supplemental Punctuation block is \u2E00-\u2E7F.
Put together, and properly escaped, you get the following RegExp:
/[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/
That should match pretty much any punctuation you encounter. So, to answer the original question:
var punctRE = /[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g;
var spaceRE = /\s+/g;
var str = "This, -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
str.replace(punctRE, '').replace(spaceRE, ' ');
>> "This is an example of a string with punctuation"
US-ASCII source: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#posix
Unicode source: http://kourge.net/projects/regexp-unicode-block
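To see the Unicode ranges at work, the sketch below runs the same two regular expressions over a made-up sentence containing curly quotes, an em dash and an ellipsis (written as \u escapes so the code points are unambiguous). One thing to be aware of: the General Punctuation block also contains several space characters (U+2000 to U+200A), so those are deleted outright rather than collapsed to a single space.

```javascript
const punctRE = /[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g;
const spaceRE = /\s+/g;

// Curly double quotes (U+201C/U+201D), em dash (U+2014),
// curly single quotes (U+2018/U+2019) and ellipsis (U+2026).
const fancy = "\u201CHello,\u201D she said \u2014 \u2018wait\u2019\u2026";

const cleaned = fancy.replace(punctRE, "").replace(spaceRE, " ");
console.log(cleaned);
// -> "Hello she said wait"

// U+2002 (EN SPACE) sits inside the General Punctuation block, so it is removed.
const enSpaceRemoved = "a\u2002b".replace(punctRE, "");
console.log(enSpaceRemoved);
// -> "ab"
```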
/[^A-Za-z0-9\s]/g should match all punctuation but keep the spaces.
So you can use .replace(/\s{2,}/g, " ") to replace extra spaces if you need to do so. You can test the regex at http://rubular.com/
.replace(/[^A-Za-z0-9\s]/g,"").replace(/\s{2,}/g, " ")
Update: This only works if the input is plain ASCII English; accented or non-Latin letters are removed as well.
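The limitation is easy to reproduce: the ASCII-only class treats accented letters as punctuation. A possible workaround, which is not part of the original answer, is to use Unicode property escapes (\p{L} for letters, \p{N} for numbers) together with the u flag; this assumes an engine with ES2018 support. A minimal sketch:

```javascript
// "Creme brulee, s'il vous plait!" with its accents, written with escapes
// so the accented letters are single precomposed code points.
const french = "Cr\u00E8me br\u00FBl\u00E9e, s'il vous pla\u00EEt!";

// ASCII-only version from the answer: accented letters are stripped too.
const asciiOnly = french.replace(/[^A-Za-z0-9\s]/g, "").replace(/\s{2,}/g, " ");
console.log(asciiOnly);
// -> "Crme brle sil vous plat"

// Assumed alternative (not from the source): keep any Unicode letter or number.
const unicodeAware = french.replace(/[^\p{L}\p{N}\s]/gu, "").replace(/\s{2,}/g, " ");
console.log(unicodeAware);
// -> the accented words intact, minus the comma, apostrophe and exclamation mark
```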
I ran across the same issue. This solution did the trick and was very readable:
var sentence = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var newSen = sentence.match(/[^_\W]+/g).join(' ');
console.log(newSen);
Result:
"This is an example of a string with punctuation"
The trick was to create a negated set, which matches anything that is not within the set, e.g. [^abc] matches anything that is not a, b or c.
\W is any non-word character, so [^\W]+ matches one or more characters that are not non-word characters, i.e. runs of word characters.
By adding the _ (underscore) to the set, you exclude it as well.
Add the /g flag to make it apply globally, then you can run any string through it and clear out the punctuation:
/[^_\W]+/g
Nice and clean ;)
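Two caveats are worth checking before relying on this. String.prototype.match returns null when nothing matches, so .join would throw on an input with no word characters; and joining with a space splits words at every punctuation mark. A hedged sketch with a fallback to an empty array (the helper name wordsOnly is made up):

```javascript
// Hypothetical helper: same regex as above, with a guard for the no-match case.
function wordsOnly(sentence) {
  const words = sentence.match(/[^_\W]+/g) || []; // match() yields null on no match
  return words.join(" ");
}

console.log(wordsOnly("This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation"));
// -> "This is an example of a string with punctuation"

console.log(JSON.stringify(wordsOnly("?!...")));
// -> ""   (empty string, no exception)

// Punctuation inside a word becomes a space, unlike the replace-based versions.
console.log(wordsOnly("don't stop"));
// -> "don t stop"
```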