Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String pattern matching problem

Imagine we have a long string containing the substrings 'cat' and 'dog' as well as other random characters, eg.

cat x dog cat x cat x dog x dog x cat x dog x cat

Here 'x' represents any random sequence of characters (but not 'cat' or 'dog').

What I want to do is find every 'cat' that is followed by any characters except 'dog' and then by 'cat'. I want to remove that first instance of 'cat' in each case.

In this case, I would want to remove the bracketed [cat] because there is no 'dog' after it before the next 'cat':

cat x dog [cat] x cat x dog x dog x cat x dog x cat

To end up with:

cat x dog x cat x dog x dog x cat x dog x cat

How can this be done?

I thought of somehow using a regular expression like (n)(?=(n)) as VonC recommended here

(cat)(?=(.*cat))

to match all of the pairs of 'cat' in the string. But I am still not sure how I could use this to remove each cat that is not followed by 'dog' before 'cat'.


The real problem I am tackling is in Java. But I am really just looking for a general pseudocode/regex solution.

like image 865
nodmonkey Avatar asked Oct 28 '10 16:10

nodmonkey


People also ask

What is the problem of string matching?

A shift is valid if P occurs with shift s in T and invalid otherwise. The string-matching problem is the problem of finding all valid shifts for a given choice of P and T. P ≡ dada Valid shifts are two, twelve and fourteen.

What is matching string pattern?

A string enclosed within double quotes ('"') is used exclusively for pattern matching (patterns are a simplified form of regular expressions - used in most UNIX commands for string matching).

Which of the following algorithm are used for string and pattern matching problems?

Hashing-string matching algorithms: Rabin Karp Algorithm: It matches the hash value of the pattern with the hash value of current substring of text, and if the hash values match then only it starts matching individual characters.

Which algorithm is best for string matching?

The so-called naive or brute force algorithm is the most intuitive approach to the string pattern-matching problem. This algorithm attempts simply to match the pattern in the target at successive positions from left to right.


1 Answers

Is there any particular reason you want to do this with just one RE call? I'm not sure if that's actually possible in one RE.

If I had to do this, I'd probably go in two passes. First mark each instance of 'cat' and 'dog' in the string, then write some code to identify which cats need to be removed, and do that in another pass.

Pseudocode follows:

// Find all the cats and dogs
int[] catLocations = string.findIndex(/cat/);
int[] dogLocations = string.findIndex(/dog/);
int [] idsToRemove = doLogic(catLocations, dogLocations);

// Remove each identified cat, from the end to the front
for (int id : idsToRemove.reverse())
  string.removeSubstring(id, "cat".length());
like image 129
zigdon Avatar answered Nov 15 '22 09:11

zigdon