
Regular expression, split string by capital letter but ignore TLA

Tags:

.net

regex

I'm using the regex

System.Text.RegularExpressions.Regex.Replace(stringToSplit, "([A-Z])", " $1").Trim() 

to split strings by capital letter, for example:

'MyNameIsSimon' becomes 'My Name Is Simon'

I find this incredibly useful when working with enumerations. What I would like to do is change it slightly so that strings are only split if the next letter is a lowercase letter, for example:

'USAToday' would become 'USA Today'

Can this be done?

EDIT: Thanks to all for responding. I may not have entirely thought this through: in some cases 'A' and 'I' would need to be ignored, but this is not possible (at least not in a meaningful way). In my case, though, the answers below do what I need. Thanks!

Simon asked Jul 08 '09 12:07



1 Answer

 ((?<=[a-z])[A-Z]|[A-Z](?=[a-z])) 

or its Unicode-aware cousin

 ((?<=\p{Ll})\p{Lu}|\p{Lu}(?=\p{Ll})) 

when replaced globally with

" $1" 

handles

 TodayILiveInTheUSAWithSimon
 USAToday
 IAmSOOOBored

yielding

 Today I Live In The USA With Simon
 USA Today
 I Am SOOO Bored

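A capital at the very start of the string that is followed by a lowercase letter also gets a space prepended, so in a second step you'd have to trim the string.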
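Putting the replace and the trim together might look like the following. This is only a minimal sketch: the class name `CapitalSplitter` and the method name `SplitOnCapitals` are made up for illustration, and it uses the Unicode-aware pattern from above with the same `Regex.Replace(...).Trim()` approach as in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from any library.
public static class CapitalSplitter
{
    // Matches an uppercase letter that either follows a lowercase letter
    // or is followed by a lowercase letter. Runs of capitals such as
    // "USA" therefore stay together.
    private static readonly Regex Boundary =
        new Regex(@"((?<=\p{Ll})\p{Lu}|\p{Lu}(?=\p{Ll}))");

    public static string SplitOnCapitals(string input)
    {
        // Step 1: put a space in front of every matched capital.
        // Step 2: trim the space that may have been added at the start.
        return Boundary.Replace(input, " $1").Trim();
    }
}
```

With this, `CapitalSplitter.SplitOnCapitals("USAToday")` should give `USA Today`, and `"MyNameIsSimon"` still becomes `My Name Is Simon`. A standalone 'A' or 'I' directly before a capitalised word (as in `IAmSOOOBored`) is split correctly only because the following capital starts a new word; inside a run of capitals it cannot be told apart from an acronym.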

Tomalak answered Sep 30 '22 09:09