Tidy up a string

Tags:

c#

regex

I'm looking for the best-performing way to rebuild a string by removing anything that is not a complete word. An acceptable word in this instance is a whole word that contains no numbers and doesn't start with a forward slash or a backslash. So letters only, although hyphens and apostrophes are allowed.

For example:

string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";

Using the above I'd need a new string that returns the following:

str = "this is a word, frank's place";

I've done some research on Regex, but I can't find anything that would do what I need.

Final Code Snippet

var resultSet = Regex.Matches(item.ToLower(), @"(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)")
                .Cast<Match>()
                .Select(m => m.Value).ToArray();

Thanks for all your input guys - proves what a great site this is

CSharpNewBee asked Jun 25 '13 21:06


2 Answers

Description

Based on your comments, a word in this instance is:

  • a whole word without numbers
  • not started by a forward slash or a backslash
  • made of letters only
  • allowed to include hyphens and apostrophes

The character class covering all the word characters by your definition would be [a-z'-]+, and that group can be surrounded by whitespace or by the start/end of the string. Your sample also shows a comma, so I'm presuming a word may be followed by a comma or dot, provided that punctuation is itself followed by whitespace or the end of the string.

This regex will:

  • collect all substrings defined as words [a-z'-]+
  • allow a comma or dot after a word, but not inside or at the start of a word
  • reject substrings made up entirely of hyphens
  • reject substrings made up entirely of apostrophes
  • prevent words from having 3 or more hyphens
  • prevent words from having 2 or more apostrophes

(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)

Expanded explanation

  • (?:^|\s) match the start of the string or a white space. This eliminates the need to test for word boundary which is problematic for strings like "abdc-egfh"
  • (?![\\\/]) prevent the word from starting with a \ or /; this is overkill, though, as the character class doesn't allow either character anyway
  • (?!-+(?:\s|$)) prevent strings which are all hyphens
  • (?!'+(?:\s|$)) prevent strings which are all apostrophes
  • (?!(?:[a-z'-]*?-){3,}) prevent strings which have 3 or more hyphens
  • (?!(?:[a-z'-]*?'){2,}) prevent strings which have 2 or more apostrophes
  • [a-z'-]+[,.]?(?=\s|$) match the word followed by some optional punctuation, and ensure this is followed by either a space or the end of a string
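
For anyone who wants the breakdown above inside their code, here is a minimal sketch of the same expression written with RegexOptions.IgnorePatternWhitespace so that each part carries its own comment. The class name WordPattern and the field name Word are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordPattern
{
    // Same expression as above, split across lines. IgnorePatternWhitespace
    // ignores the layout whitespace and treats '#' as a comment marker;
    // IgnoreCase lets [a-z] cover upper-case letters too.
    public static readonly Regex Word = new Regex(@"
        (?:^|\s)                  # start of string or one whitespace character
        (?![\\\/])                # must not start with a backslash or forward slash
        (?!-+(?:\s|$))            # must not be made up entirely of hyphens
        (?!'+(?:\s|$))            # must not be made up entirely of apostrophes
        (?!(?:[a-z'-]*?-){3,})    # must have fewer than 3 hyphens
        (?!(?:[a-z'-]*?'){2,})    # must have fewer than 2 apostrophes
        [a-z'-]+[,.]?             # the word, plus an optional trailing comma or dot
        (?=\s|$)                  # followed by whitespace or the end of the string
        ", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        Console.WriteLine(Word.IsMatch("two-hyphens-here"));      // True
        Console.WriteLine(Word.IsMatch("I-have-three-hyphens"));  // False
        Console.WriteLine(Word.IsMatch("---"));                   // False
    }
}
```

The behaviour should be the same as the one-line version; the only difference is readability.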

Examples

I'm not a C# programmer, but a code block that returns an array of matches, like the one covered in the question "Return a array/list using regex", combined with this regular expression will probably work for you. Note that this expression assumes you'll use the case-insensitive option.

Sample Text

\DR1234 - this is a word, 123456, frank's place DA123 SW1 :50:/  one-hyphen two-hyphens-here I-have-three-hyphens

Matches

[0] =>  this
[1] =>  is
[2] =>  a
[3] =>  word,
[4] =>  frank's
[5] =>  place
[6] =>  one-hyphen
[7] =>  two-hyphens-here
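
As the listing shows, each match carries the leading whitespace that (?:^|\s) consumed. A rough C# sketch of turning the matches back into a single string could therefore trim each value before joining. The names TidyExample and Tidy are hypothetical, and RegexOptions.Compiled is only a guess at what helps performance here, so measure before relying on it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TidyExample
{
    // Built once and reused; IgnoreCase is the option the expression assumes.
    private static readonly Regex Word = new Regex(
        @"(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Keeps only the accepted words and joins them with single spaces.
    public static string Tidy(string input)
    {
        return string.Join(" ",
            Word.Matches(input).Cast<Match>().Select(m => m.Value.Trim()));
    }

    public static void Main()
    {
        string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
        Console.WriteLine(Tidy(str)); // this is a word, frank's place
    }
}
```

Using IgnoreCase instead of calling ToLower() on the input keeps the original casing of the words in the result.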
Ro Yo Mi answered Oct 09 '22 15:10


The regex \b\w+\b will match words; if you're pickier, \b[a-zA-Z]+\b won't include numbers or _s.

http://rubular.com/r/uOVvPTb5nh


It looks like you want to allow 's and ,s, so the regex \b[a-zA-Z,']+\b will do an okay job at that, but it will also let through any number of things that you might not want, such as:

,','hello''',World

In C#:

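using System.Linq;
using System.Text.RegularExpressions;

string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
Regex r = new Regex(@"\b[a-zA-Z,']+\b");

// Join every match back into a single space-separated string.
string newStr = string.Join(" ", r.Matches(str).Cast<Match>().Select(m => m.Value).ToArray());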
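
To see what this simpler pattern actually keeps, here is a small self-contained sketch that runs it over the question's sample and over the odd input mentioned above. The names SimplePatternDemo and Words are invented for the example. One thing worth noticing, assuming I have traced the word boundaries correctly: a trailing comma is never part of a match, because \b cannot match between a comma and a space.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SimplePatternDemo
{
    private static readonly Regex Simple = new Regex(@"\b[a-zA-Z,']+\b");

    // Returns every substring the simple pattern accepts, in order.
    public static string[] Words(string input)
    {
        return Simple.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
        Console.WriteLine(string.Join(" ", Words(str)));
        // this is a word frank's place   (the comma after "word" is dropped)

        Console.WriteLine(string.Join(" | ", Words(",','hello''',World")));
        // hello''',World   (accepted as a single "word")
    }
}
```

If the comma after "word" has to survive, or inputs like the second one must be rejected, the stricter lookahead-based expression in the other answer is the closer fit.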
Sam I am says Reinstate Monica answered Oct 09 '22 15:10