
GREP and RegEx - find pattern and look for it again

Tags: regex, grep

Here's what I want to do:

Search a document with a regular expression, then check whether the exact text it matched appears twice within the same line.

Content of file.xml:
(some code) "testen"  (more code)  >testete<
(some code) "bleiben" (more code)  >bleiben<
(some code) "stehen"  (more code)  >stand<
(some code) "hängen"  (more code)  >hängten<
... 

Now I want to check for .*en and check if the (exact) same word occurs twice in the line. So the outcome should be:

bleiben

Because testen != testete, stehen != stand, hängen != hängten

Is there a way to do this?

Mat Fluor asked Jan 16 '23 10:01


1 Answer

You can handle this search in a single grep call by using the pattern .*en.*en:

grep ".*en.*en" your_file

This prints only the lines in which en appears at least twice. The pattern is quoted so that the shell does not try to expand the * as a filename glob before grep sees it.

If you need to handle it with two back-to-back greps, you can use the same pattern in a piped version:

grep ".*en" your_file | grep ".*en.*en"

Also, if you ever want to increase the number of instances in the same line, you can take advantage of grep's -P option and use a Perl regex:

grep -P "(.*en){2}" your_file

With this, you can change the {2} to however many occurrences you want to require in a single line.
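To see the effect of the repetition count, here is a small sketch against a made-up sample file. The temporary file and its ASCII-only contents are assumptions modelled on the question, not the asker's real data. It uses -E, on the assumption that your grep accepts interval expressions such as {2} in extended regular expressions (GNU grep does); the -P form should behave the same way on this input.

```shell
# Hypothetical sample data modelled on the question (ASCII only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
(some code) "testen"  (more code)  >testete<
(some code) "bleiben" (more code)  >bleiben<
(some code) "stehen"  (more code)  >stand<
EOF

# Lines in which "en" occurs at least twice.
twice=$(grep -E '(.*en){2}' "$sample" || true)

# Lines in which "en" occurs at least three times (none in this sample).
thrice=$(grep -E '(.*en){3}' "$sample" || true)

printf 'twice:\n%s\n' "$twice"
printf 'thrice:\n%s\n' "$thrice"
rm -f "$sample"
```

Only the bleiben line contains en twice, so it is the only line reported for {2}, and nothing is reported for {3}. Note that this counts occurrences of en, not repeated words.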

EDIT (to find lines with exact same word twice)

This is difficult without a pattern that defines where a word begins and ends, and your example output doesn't settle that. To keep the example simple, we can assume a "word" is any run of letters a-z that ends in en. Note that [a-z] does not cover letters such as ä, so widen the character class if such words need to be captured in full. You can customize this boundary as needed:

grep -P "([a-z]+en).*\1" your_file

This prints any line that contains a word ending in en which occurs again later in the same line (the \1 is a backreference to whatever the parenthesized group matched).
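As a quick check of the backreference pattern, the following sketch runs it against a made-up sample file. The temporary file and its ASCII-only contents are assumptions based on the question, and -P is assumed to be available (GNU grep built with PCRE support).

```shell
# Hypothetical sample data modelled on the question (ASCII only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
(some code) "testen"  (more code)  >testete<
(some code) "bleiben" (more code)  >bleiben<
(some code) "stehen"  (more code)  >stand<
EOF

# \1 must repeat exactly the text captured by ([a-z]+en),
# so only a line with the same "...en" string twice is printed.
repeated=$(grep -P '([a-z]+en).*\1' "$sample" || true)

printf '%s\n' "$repeated"
rm -f "$sample"
```

On this sample only the bleiben line is printed: testen and stehen each end in en, but no en-terminated substring of either appears a second time in its line.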

One caveat, which relates to the word-boundary issue noted above: in the context of "bleiben" and "bleiben", the two are equal. However, in the context of "ben" and "bleiben", this pattern will also match, because it sees the ending "ben" of "bleiben" as the repeated text (thereby using "ben" = "ben"). If this is not acceptable, you will have to establish a stricter word boundary (for example, by not allowing letters directly before or after the match).
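One possible way to tighten the boundary is sketched below, assuming that PCRE's \b (a transition between a word character and a non-word character) is an acceptable definition of a word edge for your data. The sample file is hypothetical and chosen to show the "ben" versus "bleiben" case. The last command additionally uses -o with a lookahead to print just the repeated word instead of the whole line, which is closer to the output asked for in the question.

```shell
# Hypothetical sample: the second line is the "ben" vs "bleiben" caveat.
sample=$(mktemp)
cat > "$sample" <<'EOF'
(some code) "bleiben" (more code)  >bleiben<
(some code) "ben"     (more code)  >bleiben<
EOF

# Loose pattern: counts both lines, because "ben" is found inside "bleiben".
loose=$(grep -cP '([a-z]+en).*\1' "$sample" || true)

# Strict pattern: \b forces both occurrences to be whole words.
strict=$(grep -P '\b([a-z]+en)\b.*\b\1\b' "$sample" || true)

# Print only the repeated word: match it once and require, via a
# lookahead, that the same whole word appears again later in the line.
words=$(grep -oP '\b([a-z]+en)\b(?=.*\b\1\b)' "$sample" || true)

printf 'loose count: %s\n' "$loose"
printf 'strict: %s\n' "$strict"
printf 'words: %s\n' "$words"   # prints "words: bleiben"
rm -f "$sample"
```

The loose pattern matches both lines, while the strict one keeps only the line in which bleiben really occurs twice as a whole word.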

newfurniturey answered Jan 31 '23 00:01