Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex to find all substrings and longest substring

Tags:

java

c#

regex

I would usually do something like this using a string libray. But I wonder if it can be done using a regex.

I want to do the following: Given a search string:

Seattle is awesome

I want to find all substrings of it in a given sentence. So applying a regex to the following sentence

Seattle Seattle is awesome is awesome awesome is Seattle

Should give me

Seattle, Seattle is awesome, is awesome, awesome, is, Seattle

One constraint that might be helpful is that the sentence will always have only the words present in the search string and whitespaces in between.

Note If there's a match, it should be the longest possible string. So like in the above example the matches shouldn't be single words rather longest possible substrings. Order among words also needs to be maintained. That's why

awesome is Seattle

in the sentence above gives us

awesome, is and Seattle

I am not sure if something like this can be done with a regex though, since it is greedy. Would appreciate any insight into this! I'm familiar with both C# and Java and could use either of their regex libraries.

like image 989
John Avatar asked Jul 10 '11 23:07

John


2 Answers

I don't think you can do this with a regex. Wikipedia has a good article on the longest common subsequence problem.

like image 100
highlycaffeinated Avatar answered Oct 03 '22 23:10

highlycaffeinated


There is no good way to express such a pattern directly with a regex.

You'll need to list all allowed combinations:

Seattle is awesome|Seattle is|Seattle|is awesome|is|awesome

or more succinctly:

Seattle( is( awesome)?)?|is( awesome)?|awesome

You can programmatically transform your input string to this format.

like image 41
dtb Avatar answered Oct 04 '22 00:10

dtb