What is the best way to remove punctuation marks, symbols, diacritics, special characters?

I use the following code to replace punctuation marks, symbols, and so on with spaces. The characters to replace are listed in the array:

$pattern_page = array("+",",",".","-","'","\"","&","!","?",":",";","#","~","=","/","$","£","^","(",")","_","<",">");

$pg_url = str_replace($pattern_page, ' ', strtolower($pg_url));

I want to make this simpler, though. Listing every character in the array looks silly, and there may be other special characters I want to remove later.

I thought of using this regular expression:

$pg_url = preg_replace("/\W+/", " ", $pg_url);

but it doesn't remove the underscore (_).

What is the best way to remove all of these characters? Can a regular expression do that?

Run asked Jan 21 '11 18:01


2 Answers

Depending on how greedy you'd like to be, you could do something like:

$pg_url = preg_replace("/[^a-zA-Z 0-9]+/", " ", $pg_url);

This replaces every run of characters that isn't an ASCII letter, a digit, or a space with a single space.
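As a minimal sketch of how this could be wrapped up, assuming PHP 7 or later: the helper name slug_words() and the sample string are made up for illustration and are not part of the original answer. It lowercases first (as the question does), applies the same kind of character class, then collapses the leftover spaces.

```php
<?php
// Hypothetical helper; the name and the extra cleanup steps are assumptions.
function slug_words(string $text): string
{
    // Lowercase first, as the question does with strtolower().
    $text = strtolower($text);
    // Replace each run of characters outside a-z, 0-9 and space with one space.
    $text = preg_replace('/[^a-z0-9 ]+/', ' ', $text);
    // Collapse repeated spaces and trim both ends.
    return trim(preg_replace('/ +/', ' ', $text));
}

echo slug_words("Hello, World! It's_2011?"), "\n"; // hello world it s 2011
```

Note that this class is ASCII-only, so accented letters are replaced with spaces as well.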

Nathan Taylor answered Oct 06 '22 13:10


Use POSIX character classes:

preg_replace('/[^[:alpha:]]/', '', $input);

This removes anything that the currently set locale does not consider a letter. Note that digits and whitespace are stripped too. If it's only punctuation you want to eliminate, the class to use is [:punct:], written as [[:punct:]] inside the pattern.
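A short sketch of the punctuation class in use, assuming PHP 7 or later and the default "C" locale. The function name strip_punct() and the sample input are hypothetical. In that setup [[:punct:]] covers ASCII punctuation only, underscore included, so a character such as £ from the question's list would probably need separate handling.

```php
<?php
// Hypothetical helper; assumes the default "C" locale.
function strip_punct(string $input): string
{
    // Replace each run of ASCII punctuation (underscore included) with a space.
    $spaced = preg_replace('/[[:punct:]]+/', ' ', $input);
    // Collapse the whitespace left behind and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $spaced));
}

echo strip_punct("What's up, Doc? (2011) ~ a_b"), "\n"; // What s up Doc 2011 a b
```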

\W means "any non-word character" and is the opposite of \w. Since \w includes the underscore (_), \W does not match it, which is why your pattern leaves underscores in place.
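Building on that, one possible fix is to add the underscore to the class explicitly. The sketch below assumes PHP 7 or later, and both function names are invented for illustration. The second function is a Unicode-aware variant, assuming UTF-8 input and a UTF-8 encoded source file: it keeps accented letters intact rather than stripping their diacritics, which may or may not be what you want.

```php
<?php
// Hypothetical helper: \W plus an explicit underscore, ASCII semantics.
function strip_non_word(string $input): string
{
    // [\W_] matches any non-word character or an underscore.
    return trim(preg_replace('/[\W_]+/', ' ', $input));
}

// Hypothetical helper: keeps Unicode letters and digits, assumes UTF-8 input.
function strip_non_alnum_utf8(string $input): string
{
    // \p{L} = any letter, \p{N} = any number; the /u flag enables UTF-8 mode.
    return trim(preg_replace('/[^\p{L}\p{N}]+/u', ' ', $input));
}

echo strip_non_word("foo_bar-baz.qux"), "\n";               // foo bar baz qux
echo strip_non_alnum_utf8("Crème brûlée: £5_only!"), "\n"; // Crème brûlée 5 only
```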

Linus Kleen answered Oct 06 '22 14:10