
Grep regular expression not working as expected

I have a simple grep command that should extract only the first column of a CSV file, including the trailing comma. It goes like this:

grep -Eo '^[^,]+,' some.csv

So in my head, that reads like "get me only the matching part of the line where each line starts with at least one character that is not a comma, followed by a single comma."

So on a file, some.csv, that looks like this:

column1,column2,column3,column4
column1,column2,column3,column4
column1,column2,column3,column4

I'm expecting this output:

column1,
column1,
column1,

But I get this output:

column1,
column2,
column3,
column1,
column2,
column3,
column1,
column2,
column3,

Why is that? What am I missing from my grep/regex? Is my expected output incorrect?

If I remove the requirement of the trailing comma in the regex, the command works as I expect.

grep -Eo '^[^,]+' some.csv

Gives me:

column1
column1
column1

NOTE: I'm on macOS High Sierra with grep version: grep (BSD grep) 2.5.1-FreeBSD

Craig Sketchley asked Jul 09 '18 07:07


1 Answer

The BSD grep shipped with macOS (2.5.1-FreeBSD) has a number of known bugs. See the following related posts:

  • Why does this BSD grep result differ from GNU grep?
  • grep strange behaviour with single letter words
  • How to make BSD grep respect start-of-line anchor

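The last post above covers your case: when the -o option is used, BSD grep does not respect the ^ anchor after the first match on a line. In effect, the pattern matches again wherever the previous match ended, which is consistent with your output: column2, and column3, are reported too, while column4 is not because it has no trailing comma. It also fits your second command, since after column1 the rest of the line starts with a comma and [^,]+ cannot match there. This issue is also described in a FreeBSD bug report: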

I've noticed some more issues with the same version of grep. I don't know whether they're related, but I'll append them here for now.

$ printf abc | grep -o '^[a-c]'

should just print 'a', but instead gives three hits, against each letter of the incoming text.
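The behaviour quoted above can be turned into a quick check for whether a particular grep binary is affected. This is only a sketch: the function name grep_anchor_ok is made up for illustration, and it assumes the binary to test is passed as the first argument (defaulting to grep).

```shell
# Succeeds (exit status 0) if the given grep honours ^ together with -o,
# i.e. prints exactly one match ("a") for the input "abc".
# grep_anchor_ok is a hypothetical helper name, not a standard tool.
grep_anchor_ok() {
    [ "$(printf 'abc\n' | "${1:-grep}" -o '^[a-c]')" = "a" ]
}

if grep_anchor_ok grep; then
    echo "grep handles ^ with -o correctly"
else
    echo "grep is affected by the -o anchor bug"
fi
```

The same function can be pointed at another binary, for example a separately installed GNU grep, by passing its name as the argument.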

As a workaround, it may be simplest to install GNU grep, which works as expected.

Or use sed with a POSIX BRE pattern:

sed -i '' 's/^\([^,]*,\).*/\1/' file

where the pattern matches

  • ^ - start of a line
  • \([^,]*,\) - Group 1 (later referred to with \1 backreference from the RHS):
    • [^,]* - zero or more chars other than ,
    • , - a , char
  • .* - the rest of the line.

Note that -i changes the file contents in place. Use -i.bak instead to keep a backup of the original file (in that case, the empty '' argument after -i is no longer needed).
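If the goal is only to print the first column rather than rewrite the file, one option is to drop -i and have sed print just the lines where the substitution succeeded. The following is a minimal sketch, assuming the sample data from the question; the function name first_column and the temporary file are illustrative and not part of the original answer. It uses [^,][^,]* (one or more characters) to mirror the [^,]+ of the original grep pattern.

```shell
# Print the first field plus its trailing comma for each line that has one.
# -n suppresses sed's default output; the trailing p prints only lines where
# the substitution matched, so lines without a leading field and comma are skipped.
# first_column is a hypothetical helper name.
first_column() {
    sed -n 's/^\([^,][^,]*,\).*/\1/p' "$1"
}

# Recreate the sample file from the question in a temporary location.
csv=$(mktemp)
printf '%s\n' \
    'column1,column2,column3,column4' \
    'column1,column2,column3,column4' \
    'column1,column2,column3,column4' > "$csv"

first_column "$csv"
rm -f "$csv"
```

On the sample data this prints column1, three times and leaves the input file untouched.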

Wiktor Stribiżew answered Sep 28 '22 18:09