C# regex to strip non-letter characters from a string

Tags: c#, regex

My terminology may be a little off here, but I am trying to strip non-letter characters from a string in C#: remove dashes, ampersands and so on, but retain things like accented characters and Chinese characters. All the C# examples I have seen on SO use a regex like new Regex("[^a-zA-Z0-9 -]");, but my needs go beyond ASCII characters.

string input = "I- +AM. 相关 AZURÉE& /30%";

string output = "I AM 相关 AZURÉE 30";

asked Jul 18 '13 at 11:07 by PeteN


2 Answers

A good starting point would be to remove characters according to their Unicode general category. For example, this code removes everything that is classified as a symbol (\p{S}), punctuation (\p{P}), or a member of the "Other" category (\p{C}), which includes control characters:

// requires: using System.Text.RegularExpressions;
string input = "I- +AM. 相关 AZURÉE& /30%";
var output = Regex.Replace(input, "[\\p{S}\\p{C}\\p{P}]", "");

You could also try the whitelisting approach and allow only certain categories. For example, this keeps only characters that are letters (\p{L}), combining marks such as diacritics (\p{M}), numbers (\p{N}) and separators such as spaces (\p{Z}):

var output = Regex.Replace(input, "[^\\p{L}\\p{M}\\p{N}\\p{Z}]", "");
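To compare the two patterns above on the question's sample string, here is a minimal sketch. The class and method names (StripDemo, RemoveSymbols, KeepLetters) are made up for illustration and are not part of the answer; the patterns are the same ones, written as verbatim strings.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class StripDemo
{
    // Blacklist: drop symbols (S), "Other" incl. control characters (C), punctuation (P).
    public static string RemoveSymbols(string input) =>
        Regex.Replace(input, @"[\p{S}\p{C}\p{P}]", "");

    // Whitelist: keep only letters (L), marks (M), numbers (N) and separators (Z).
    public static string KeepLetters(string input) =>
        Regex.Replace(input, @"[^\p{L}\p{M}\p{N}\p{Z}]", "");

    public static void Main()
    {
        // Ask the console for UTF-8 so the non-ASCII characters can be displayed.
        Console.OutputEncoding = Encoding.UTF8;

        string input = "I- +AM. 相关 AZURÉE& /30%";
        Console.WriteLine(RemoveSymbols(input)); // string value: I AM 相关 AZURÉE 30
        Console.WriteLine(KeepLetters(input));   // string value: I AM 相关 AZURÉE 30
    }
}
```

Since L, M, N, P, S, Z and C are the seven top-level Unicode general categories and every character falls into exactly one of them, these two particular patterns should match the same set of characters. The approaches only diverge once either list is narrowed, for example by dropping \p{N} from the whitelist to remove digits as well.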


answered Sep 30 '22 at 00:09 by Jon


// requires: using System.Linq;
string result = string.Concat(input.Where(c => Char.IsLetterOrDigit(c)));
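Note that the one-liner above also drops spaces, because whitespace is neither a letter nor a digit, so the question's sample comes out as "IAM相关AZURÉE30". A possible variant that keeps whitespace is sketched below; the class and method names (LinqStripDemo, KeepLettersDigitsSpaces) are hypothetical, and adding the char.IsWhiteSpace test is an assumption about what the asker wants, not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class; the names are illustrative only.
public static class LinqStripDemo
{
    // Keeps letters, decimal digits and whitespace; every other char is dropped.
    public static string KeepLettersDigitsSpaces(string input) =>
        string.Concat(input.Where(c => char.IsLetterOrDigit(c) || char.IsWhiteSpace(c)));

    public static void Main()
    {
        // Ask the console for UTF-8 so the non-ASCII characters can be displayed.
        Console.OutputEncoding = Encoding.UTF8;

        string input = "I- +AM. 相关 AZURÉE& /30%";

        // Original one-liner, no whitespace test. String value: IAM相关AZURÉE30
        Console.WriteLine(string.Concat(input.Where(c => char.IsLetterOrDigit(c))));

        // Variant with whitespace retained. String value: I AM 相关 AZURÉE 30
        Console.WriteLine(KeepLettersDigitsSpaces(input));
    }
}
```

One difference from the regex whitelist: char.IsLetterOrDigit accepts only letter categories and decimal digits, so combining marks (category M) are removed. An accent stored as a separate combining character would therefore be lost, whereas the \p{M} pattern keeps it.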
answered Sep 30 '22 at 02:09 by Louis Ricci