
How does grep work?

Tags:

c

grep

shell

unix

gnu

I am trying to understand how grep works.

When I say grep "hello" *.*, does grep get 2 arguments — (1) string to be searched i.e. "hello" and (2) path *.*? Or does the shell convert *.* into something that grep can understand?

Where can I get the source code of grep? I came across this GNU grep link. One of the README files says it's different from Unix grep. How so?

I want to look at the source of the FreeBSD version of grep and also the Linux version (if they are different).

hari asked Aug 21 '11 07:08




2 Answers

The power of grep is the magic of automata theory. The name comes from the ed command g/re/p: globally search for a regular expression and print the matching lines. It works by constructing an automaton (a very simple "virtual machine", not Turing complete) from the pattern; it then "executes" the automaton against the input stream.

The automaton is a graph or network of nodes, called states. The transition between states is determined by the input character under scrutiny. Repetition operators like + and * work by having transitions that loop back to a state already visited. Character classes like [a-z] are represented by a fan: one start node with a branch for each character out to the "spokes"; usually the spokes have a special "epsilon transition" to a single final state, so the fan can be linked up with the next automaton to be built from the regular expression (the search string). Epsilon transitions allow a change of state without moving forward in the string being searched.
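To make the idea concrete, here is a minimal sketch of a hand-built automaton for the single regular expression ab*c. It is only an illustration: real grep builds its automaton from the pattern at run time, whereas this hard-codes one pattern, and the names step and line_matches are made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* States of a hand-built automaton for the regular expression ab*c. */
enum state { START, SEEN_A, ACCEPT, DEAD };

/* Transition function: next state for the current state and one input character. */
static enum state step(enum state s, char ch)
{
    switch (s) {
    case START:
        return ch == 'a' ? SEEN_A : DEAD;
    case SEEN_A:
        if (ch == 'b')
            return SEEN_A;      /* b* is a transition that loops back */
        if (ch == 'c')
            return ACCEPT;
        return DEAD;
    default:
        return DEAD;            /* nothing leaves ACCEPT or DEAD in this sketch */
    }
}

/* Return true if some substring of line matches ab*c (grep-style search). */
bool line_matches(const char *line)
{
    for (size_t start = 0; line[start] != '\0'; start++) {
        enum state s = START;
        for (size_t i = start; line[i] != '\0' && s != DEAD; i++) {
            s = step(s, line[i]);
            if (s == ACCEPT)
                return true;
        }
    }
    return false;
}
```

Calling line_matches on each input line and printing the lines for which it returns true gives the basic grep behaviour for this one pattern. Restarting the automaton at every offset is the simplest way to search within a line; it is not how an optimized grep does it.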

Edit: It appears I didn't read the question very closely.

When you type a command line, it is first preprocessed by the shell. The shell performs alias substitution (aliases are like macros), splits the command line into a list of whitespace-delimited arguments, and performs filename globbing, replacing an unquoted pattern such as *.* with the names of the files that match it. So when files match, grep never sees *.*: it receives "hello" followed by one argument per matching filename. This argument list is passed to the main() function of the executable as an integer count (usually called argc) and a pointer (usually called argv) to a NULL-terminated ((void *)0) array of nul-terminated ('\0') char arrays.
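The expansion step can be approximated in C with the POSIX glob() function. This is a rough sketch of what a shell does, not the code of any particular shell; expand_pattern and print_expanded_argv are hypothetical helper names. The GLOB_NOCHECK flag is used so that a pattern matching nothing is passed through unchanged, which is what Bourne-style shells typically do by default.

```c
#include <glob.h>
#include <stdio.h>

/*
 * Expand a wildcard pattern roughly the way a shell would.
 * With GLOB_NOCHECK, a pattern that matches nothing is returned as-is.
 * Returns the number of resulting words, or -1 on error.
 * On success the caller must release the result with globfree(out).
 */
int expand_pattern(const char *pattern, glob_t *out)
{
    if (glob(pattern, GLOB_NOCHECK, NULL, out) != 0)
        return -1;
    return (int)out->gl_pathc;
}

/* Print the argument vector a command would receive after expansion. */
void print_expanded_argv(const char *cmd, const char *search, const char *pattern)
{
    glob_t g;
    int n = expand_pattern(pattern, &g);

    if (n < 0) {
        fprintf(stderr, "glob failed for %s\n", pattern);
        return;
    }
    printf("argv[0] = %s\n", cmd);
    printf("argv[1] = %s\n", search);
    for (int i = 0; i < n; i++)
        printf("argv[%d] = %s\n", i + 2, g.gl_pathv[i]);
    globfree(&g);
}
```

For example, print_expanded_argv("grep", "hello", "*.*") lists the arguments grep would be started with in the current directory: the program name, the search string, and then one entry per matching file.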

Individual commands make use of their arguments however they wish. Many Unix programs print a friendly help message if given the -h argument (since it begins with a minus sign, it's called an option), but this is only a convention: in GNU grep, for example, -h suppresses the filename prefix on output rather than printing help. GNU software also accepts the "long-form" option --help.

Since there are a great many differences between versions of Unix programs, the most reliable way to discover the exact syntax a program requires is to ask the program itself. If that doesn't tell you what you need (or it's too cryptic to understand), you should next check the local manpage (man grep). And for GNU software you can often get even more information from info grep.

luser droog answered Sep 27 '22 18:09


The shell does the globbing (conversion from the * form to filenames). You can see this with a simple C program:

#include <stdio.h>

int main(int argc, char **argv) {
    for(int i=1; i<argc; i++) {
        printf("%s\n", argv[i]);
    }
    return 0;
}

And then run it like this:

./print_args *

You'll see it prints the filenames that matched, not a literal *. If you invoke it like this:

./print_args '*'

You'll see it gets a literal *.

icktoofay answered Sep 27 '22 20:09