I'm looking for the best solution, performance-wise, to rebuild a string by removing anything that is not a complete word. An acceptable word in this instance is a whole word that contains no numbers and doesn't start with a forward slash or a backslash. So letters only, though it can include hyphens and apostrophes.
For example:
string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
Using the above I'd need a new string that returns the following:
Str = "this is a word, frank's place"
I've done some research on Regex, but I can't find anything that would do what I need.
Final Code Snippet
var resultSet = Regex.Matches(item.ToLower(), @"(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)")
.Cast<Match>()
.Select(m => m.Value).ToArray();
Thanks for all your input guys - proves what a great site this is
Based on your comments: A word in this instance is:
a whole word without numbers
doesn't start with a forward slash, or a back slash
just letters only
can include hyphen and apostrophes
The character class to cover all the word characters by your definition would be [a-z'-]+ and that group could be surrounded by whitespace, or by the start/end of the string. Your sample also shows a comma, so I'm presuming a word followed by a comma or dot, which is in turn followed by whitespace, is OK too.
This regex will do that:
(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)
Piece by piece:
(?:^|\s)
match the start of the string or a white space. This eliminates the need to test for a word boundary, which is problematic for strings like "abdc-egfh"
(?![\\\/])
prevent the word from starting with a \ or /; however, this is overkill as the character class doesn't allow it either
(?!-+(?:\s|$))
prevent strings which are all hyphens
(?!'+(?:\s|$))
prevent strings which are all apostrophes
(?!(?:[a-z'-]*?-){3,})
prevent strings which have 3 or more hyphens
(?!(?:[a-z'-]*?'){2,})
prevent strings which have 2 or more apostrophes
[a-z'-]+[,.]?(?=\s|$)
match the word followed by some optional punctuation, and ensure this is followed by either a space or the end of the string
I'm not a C# programmer, but returning an array of matches from a code block like the one covered in the question "Return a array/list using regex", combined with this regular expression, will probably work for you. Note this expression does assume you'll use the case-insensitive option.
Sample Text
\DR1234 - this is a word, 123456, frank's place DA123 SW1 :50:/ one-hyphen two-hyphens-here I-have-three-hyphens
Matches
[0] => this
[1] => is
[2] => a
[3] => word,
[4] => frank's
[5] => place
[6] => one-hyphen
[7] => two-hyphens-here
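For what it's worth, here is a minimal C# sketch of how the expression above might be applied to rebuild the string. The names WordFilter and KeepWords are made up for illustration and are not from the question. It assumes RegexOptions.IgnoreCase as the case-insensitive option, and it trims each match because the leading (?:^|\s) consumes the whitespace in front of the word.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // The expression from this answer, built once and reused.
    private static readonly Regex WordPattern = new Regex(
        @"(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: keeps only the acceptable words, joined by single spaces.
    public static string KeepWords(string input)
    {
        var words = WordPattern.Matches(input)
            .Cast<Match>()
            .Select(m => m.Value.Trim()); // drop the whitespace matched by (?:^|\s)
        return string.Join(" ", words);
    }

    public static void Main()
    {
        string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
        Console.WriteLine(KeepWords(str)); // this is a word, frank's place
    }
}
```

Compiling the Regex once in a static field rather than on every call is just one reasonable choice given that the question asks about performance; it has not been benchmarked here.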
The regex \b\w+\b will match words or, if you're more picky, \b[a-zA-Z]+\b won't include numbers or _s.
http://rubular.com/r/uOVvPTb5nh
It looks like you want to allow 's and ,s, so the regex \b[a-zA-Z,']+\b will do an okay job at that, but it will also let slip through any number of things that you might not want (such as ,','hello''',World).
Or, in C#:
string str =@"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
Regex r = new Regex(@"\b[a-zA-Z,']+\b");
string newStr = string.Join(" ", r.Matches(str).Cast<Match>().Select(m => m.Value).ToArray());
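To make the caveat concrete, here is a small sketch that wraps the snippet above in a hypothetical helper (BoundaryDemo and JoinMatches are illustrative names only) and runs it on the sample string and on one piece of junk input. Note that \b cannot match between a trailing comma and a space, so the comma after "word" is dropped from the result.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Hypothetical helper: joins every match of the word-boundary regex with spaces.
    public static string JoinMatches(string input)
    {
        var r = new Regex(@"\b[a-zA-Z,']+\b");
        return string.Join(" ", r.Matches(input).Cast<Match>().Select(m => m.Value));
    }

    public static void Main()
    {
        // The sample from the question: the comma after "word" is not kept.
        Console.WriteLine(JoinMatches(@"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/"));
        // prints: this is a word frank's place

        // Junk made of allowed characters is accepted as a single "word".
        Console.WriteLine(JoinMatches("hello''',World"));
        // prints: hello''',World
    }
}
```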