I am fairly inexperienced with the behavior of grep. I have a bunch of XML files that contain lines like these:
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
I wanted to get the identifier part after the slash and constructed a regex using RegexPal:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
It highlights everything that I wanted. Perfect. But when I run grep on the very same file, I don't get any results. As I said, I really don't know much about grep, so I tried several different combinations:
grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
What am I doing wrong?
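Your regex matches the input in a Perl-style engine such as RegexPal, but not in grep. Let's break it down against the first line:
[a-z] matches g
\d{4} is meant to match 1234
[a-z]* matches zero letters here (and m in the second line), after which \. matches the dot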
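One way to check a breakdown like this is to grow the pattern one piece at a time and watch what grep reports. The following is only a sketch: it assumes a grep that supports -o (GNU and BSD grep both do, but it is not in POSIX), it writes \d as [0-9], and the leading / is added purely to pin the match to the part after the slash.

```shell
# Sample line copied from the question.
line='<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>'

# Grow the pattern one piece at a time. The leading "/" is an addition
# for this demo only; it anchors the match just after the slash.
for pat in \
    '/[a-z][0-9]{4}' \
    '/[a-z][0-9]{4}[a-z]*' \
    '/[a-z][0-9]{4}[a-z]*\.' \
    '/[a-z][0-9]{4}[a-z]*\.[a-z]*' \
    '/[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*'
do
    # -o prints only the matched text; -E enables extended syntax ({4}).
    matched=$(printf '%s\n' "$line" | grep -oE "$pat")
    printf '%s  ->  %s\n' "$pat" "$matched"
done
```

If a step suddenly prints nothing, the piece added in that step is the one grep disagrees with.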
I believe grep and family don't like the \d syntax. Try either [0-9] or [[:digit:]] (the POSIX class [:digit:] has to sit inside a bracket expression).
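As a rough illustration, the two spellings of "a digit" can be compared on the sample lines from the question. The scratch file and the variable names are made up for this sketch.

```shell
# Recreate the two sample lines from the question in a scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
EOF

# -c counts matching lines; both spellings should report the same count.
with_range=$(grep -cE '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' "$tmp")
with_class=$(grep -cE '[a-z][[:digit:]]{4}[a-z]*\.[a-z]*[[:digit:]]*' "$tmp")
echo "[0-9]: $with_range   [[:digit:]]: $with_class"

rm -f "$tmp"
```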
Finally, when using regular expressions, prefer egrep to grep. I don't remember the exact details, but egrep (the same as grep -E) supports more regex operators; for example, {4} works there without backslashes. Also, in many shells (including bash on OS X, as you mentioned), use single quotes around the pattern. Left unquoted, * may be expanded by the shell to a list of files in the current directory before grep sees it, and the backslashes are stripped. Double quotes stop the file-name expansion, but the shell still interprets $, backticks and some backslashes inside them. Bash won't touch anything in single quotes.
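A quick way to see what grep would actually receive is to hand the same pattern to printf with each quoting style. This sketch assumes bash with default options (no nullglob or failglob) and runs in an empty scratch directory so the unquoted pattern has no file names to expand to; the variable names are illustrative.

```shell
# Empty scratch directory: nothing for an unquoted glob to match.
scratch=$(mktemp -d)

# Each command substitution runs in a subshell, so the cd does not leak.
unquoted=$(cd "$scratch" && printf '%s' [a-z]\d{4}[a-z]*\.[a-z]*\d*)
double=$(cd "$scratch" && printf '%s' "[a-z]\d{4}[a-z]*\.[a-z]*\d*")
single=$(cd "$scratch" && printf '%s' '[a-z]\d{4}[a-z]*\.[a-z]*\d*')

printf 'unquoted: %s\n' "$unquoted"   # the backslashes are gone
printf 'double:   %s\n' "$double"
printf 'single:   %s\n' "$single"

rmdir "$scratch"
```

For this particular pattern the double-quoted and single-quoted forms come out the same, since it contains no $ or backticks. Single quotes remain the safer habit because they do not depend on that.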
grep doesn't support \d by default. To match a digit, use [0-9], or allow Perl-compatible regular expressions with -P (a GNU grep option):
$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
or:
$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml
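Since the goal in the question was to get only the identifier part after the slash, adding -o may help: it prints just the matched text instead of the whole line. This is a sketch that assumes a grep with -o (GNU and BSD grep have it; POSIX does not require it) and uses a made-up scratch file in place of test.xml.

```shell
# Stand-in for test.xml, holding the two sample lines from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
EOF

# -E: extended regex; -o: print only the part of each line that matched.
ids=$(grep -oE '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' "$tmp")
printf '%s\n' "$ids"

rm -f "$tmp"
```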