Shell: Find Matching Lines Across Many Files

Tags: grep

I am trying to use a shell script (well, a "one-liner") to find any common lines between around 50 files. Edit: note that I am looking for a line (or lines) that appears in all the files.

So far I've tried grep -v -x -f file1.sp *, which just matches that file's contents across ALL the other files.

I've also tried grep -v -x -f file1.sp file2.sp | grep -v -x -f - file3.sp | grep -v -x -f - file4.sp | grep -v -x -f - file5.sp and so on, but I believe that feeds the files to be searched in as stdin rather than as the patterns to match on.

Does anyone know how to do this with grep or another tool?

I don't mind if it takes a while to run. I have to add a few lines of code to around 500 files and want to find a common line in each of them to insert 'after' (they were originally just copy-pasted from one file, so hopefully there are some common lines!)

Thanks for your time,

Pez Cuckow asked Sep 03 '12 11:09


2 Answers

When I first read this I thought you were trying to find 'any common lines'. I took this as meaning "find duplicate lines". If this is the case, the following should suffice:

sort *.sp | uniq -d

Upon re-reading your question, it seems that you are actually trying to find lines that 'appear in all the files'. If this is the case, you will need to know the number of files in your directory:

find . -type f -name "*.sp" | wc -l

If this returns the number 50, you can then use awk like this (the seen[] guard counts each line at most once per file, so a line repeated within a single file cannot inflate its count):

WHINY_USERS=1 awk '!seen[FILENAME, $0]++ { array[$0]++ } END { for (i in array) if (array[i] == 50) print i }' *.sp

You can consolidate this process and write a one-liner like this:

WHINY_USERS=1 awk -v find=$(find . -type f -name "*.sp" | wc -l) '!seen[FILENAME, $0]++ { array[$0]++ } END { for (i in array) if (array[i] == find) print i }' *.sp
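
For comparison, the same counting idea can be sketched in Python. This is only a minimal sketch, not a drop-in replacement for the awk command: the function name common_lines_by_count is hypothetical, and it assumes the files fit in memory and decode with the default text encoding. Each file contributes a given line at most once to the tally, so a line is common exactly when its tally equals the number of files.

```python
from collections import Counter


def common_lines_by_count(filenames):
    """Return the lines present in every file (hypothetical helper)."""
    filenames = list(filenames)
    tally = Counter()
    for name in filenames:
        with open(name) as f:
            # a set drops repeats, so each file adds at most 1 per line
            unique_lines = {line.rstrip("\n") for line in f}
        tally.update(unique_lines)
    # a line is common when every file contributed it once
    return sorted(line for line, n in tally.items() if n == len(filenames))
```

Calling common_lines_by_count(glob.glob("*.sp")) would cover the same files as the shell glob, without needing to count them separately.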
Steve answered Oct 05 '22 09:10


Old answer, bash (O(n); opens 2 * n files)

Building on @mjgpy3's answer, you just have to write a for loop and use comm, like this:

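#!/bin/bash

# comm expects sorted input, so every file is sorted before comparing
tmp1=$(mktemp)
tmp2=$(mktemp)

sort "$1" > "$tmp1"
shift
for file in "$@"
do
    # keep only the lines found in both the running result and this file
    comm -1 -2 "$tmp1" <(sort "$file") > "$tmp2"
    mv "$tmp2" "$tmp1"
done
cat "$tmp1"
rm -f "$tmp1" "$tmp2"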

Save it as comm.sh, make it executable, and call

./comm.sh *.sp 

assuming all your filenames end with .sp.
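
Since comm works on sorted input, its output does not reflect the order in which lines appear in the original files, which can matter when the goal is to pick a line to insert after. Below is a minimal Python sketch of the same narrowing loop that keeps the order of the first file. The function name common_lines_in_order is hypothetical, and the sketch assumes at least one filename is given and that the files fit in memory.

```python
def common_lines_in_order(filenames):
    """Return lines shared by all files, in first-file order (hypothetical helper)."""
    first, *rest = filenames  # assumes at least one filename
    with open(first) as f:
        # dict.fromkeys keeps first-seen order and drops repeated lines
        candidates = list(dict.fromkeys(line.rstrip("\n") for line in f))
    for name in rest:
        with open(name) as f:
            present = {line.rstrip("\n") for line in f}
        # narrow the running result, much as the loop does with its temp file
        candidates = [line for line in candidates if line in present]
    return candidates
```

The first element of the returned list is then the earliest line of the first file that also appears in every other file.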

Updated answer, python, opens each file only once

Looking at the other answers, I wanted to give one that opens each file only once, uses no temporary files, and tolerates duplicated lines. Additionally, let's process the files in parallel.

Here you go (in python3):

#!/usr/bin/env python3
import argparse
import sys
import multiprocessing
import os

EOLS = {'native': os.linesep.encode('ascii'), 'unix': b'\n', 'windows': b'\r\n'}

def extract_set(filename):
    with open(filename, 'rb') as f:
        return set(line.rstrip(b'\r\n') for line in f)

def find_common_lines(filenames):
    # read the files in parallel; the pool is shut down when the block exits
    with multiprocessing.Pool() as pool:
        line_sets = pool.map(extract_set, filenames)
    return set.intersection(*line_sets)

if __name__ == '__main__':
    # usage info and argument parsing
    parser = argparse.ArgumentParser()
    parser.add_argument("in_files", nargs='+', 
            help="find common lines in these files")
    parser.add_argument('--out', type=argparse.FileType('wb'),
            help="the output file (default stdout)")
    parser.add_argument('--eol-style', choices=EOLS.keys(), default='native',
            help="(default: native)")
    args = parser.parse_args()

    # actual stuff
    common_lines = find_common_lines(args.in_files)

    # write results to output
    to_print = EOLS[args.eol_style].join(common_lines)
    if args.out is None:
        # write raw bytes so the lines are emitted exactly as they were read
        sys.stdout.buffer.write(to_print)
    else:
        args.out.write(to_print)

Save it as find_common_lines.py, and call

python3 ./find_common_lines.py *.sp

More usage info with the --help option.

bernard paulus answered Oct 05 '22 10:10