Deleting lines from one file which are in another file

grep -v -x -f f2 f1 should do the trick.

Explanation:

-v to select non-matching lines
-x to match whole lines only
-f f2 to get patterns from f2

One can instead use grep -F (or the older fgrep) to match fixed strings from f2 rather than patterns, in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns.
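
To make the difference concrete, here is a small sketch with made-up sample data. The helper name remove_lines is hypothetical, and the example assumes f2 holds a line containing a regex metacharacter:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lines of $2 that do not appear verbatim in $1.
remove_lines() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from a file
    grep -F -x -v -f "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 'a.c' 'abc' 'a*' > "$tmp/f1"
printf '%s\n' 'a.c' > "$tmp/f2"

# Treated as a regex, "a.c" also matches "abc", so both lines are dropped:
grep -x -v -f "$tmp/f2" "$tmp/f1"     # prints: a*
# Treated as a fixed string, only the literal "a.c" is dropped:
remove_lines "$tmp/f2" "$tmp/f1"      # prints: abc, then a*
rm -r "$tmp"
```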

Try comm instead (assuming f1 and f2 are "already sorted")

comm -2 -3 f1 f2
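
If the files are not sorted yet, one option is to sort them on the fly. This is only a sketch: the function name lines_only_in_first and the sample data are made up, and it assumes bash (for process substitution). Setting LC_ALL=C for both sort and comm keeps the two tools in agreement about the ordering.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lines of $1 that are not in $2, for unsorted inputs.
# The output comes out in sorted order, not in the original order of $1.
lines_only_in_first() {
    # Same locale for sort and comm so both use the same collation.
    LC_ALL=C comm -23 <(LC_ALL=C sort "$1") <(LC_ALL=C sort "$2")
}

tmp=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmp/f1"
printf '%s\n' fig kiwi > "$tmp/f2"
lines_only_in_first "$tmp/f1" "$tmp/f2"   # prints: apple, then pear
rm -r "$tmp"
```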

For exclude files that aren't too large to hold in memory, you can use AWK's associative arrays.

awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt

The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.

The algorithmic complexity will probably be O(n) (exclude-these.txt size) + O(m) (from-this.txt size).
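
A quick way to see the order-preserving, case-insensitive behaviour is to run the same awk program on a few sample lines. The wrapper name exclude_lines and the data are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print lines of $2 that are not listed in $1, ignoring case.
exclude_lines() {
    awk '
        NR == FNR { list[tolower($0)] = 1; next }  # first file: remember each line
        !list[tolower($0)]                         # second file: print lines not remembered
    ' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 'Beta' 'delta' > "$tmp/exclude-these.txt"
printf '%s\n' 'gamma' 'beta' 'alpha' 'DELTA' > "$tmp/from-this.txt"
exclude_lines "$tmp/exclude-these.txt" "$tmp/from-this.txt"
# prints: gamma, then alpha (input order is kept; "beta" and "DELTA" match regardless of case)
rm -r "$tmp"
```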

Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):

awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt

Accessing r[$0] creates the entry for that line, no need to set a value.

Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
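
One practical difference between the two forms shows up when the exclude file is empty. The sketch below uses a made-up wrapper name, exclude_exact, and sample data. With the explicit file number, every line is printed as expected. With NR == FNR, an empty first file means NR == FNR is also true for the second file, so nothing is printed.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the f=1 / f=2 form: lines of $2 not present in $1.
exclude_exact() {
    awk '
        f == 1 { r[$0]; next }   # referencing r[$0] creates the key
        !($0 in r)               # "in" tests membership without creating a key
    ' f=1 "$1" f=2 "$2"
}

tmp=$(mktemp -d)
: > "$tmp/empty"                          # empty exclude file
printf '%s\n' one two > "$tmp/data"

exclude_exact "$tmp/empty" "$tmp/data"    # prints: one, then two
# The NR == FNR form prints nothing for the same input:
awk 'NR == FNR { r[$0]; next } !($0 in r)' "$tmp/empty" "$tmp/data"
rm -r "$tmp"
```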

If you have Ruby (1.9+):

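#!/usr/bin/env ruby
# Read the exclude list line by line (a bare split would break lines on spaces)
b = File.readlines("file2").map(&:chomp)
File.foreach("file1") do |x|
  x.chomp!
  puts x unless b.include?(x)
end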

This has O(N^2) complexity, because include? scans the whole array for every line of file1. If you care about performance, here's another version:

# Split on line boundaries so lines containing spaces stay intact
b = File.readlines("file2").map(&:chomp)
a = File.readlines("file1").map(&:chomp)
(a - b).each { |x| puts x }

This uses a hash to perform the subtraction, so the complexity is O(n) (size of a) + O(m) (size of b).

Here's a little benchmark of the above, courtesy of user576875, but with 100K lines:

$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test

real    0m0.639s
user    0m0.554s
sys     0m0.021s

$ time sort file1 file2 | uniq -u > sort.test

real    0m2.311s
user    0m1.959s
sys     0m0.040s

$ diff <(sort -n ruby.test) <(sort -n sort.test)
$

diff was used to show there are no differences between the 2 files generated.

Some timing comparisons between various other answers:

$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null

real    0m0.019s
user    0m0.023s
sys     0m0.012s
$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null

real    0m0.026s
user    0m0.018s
sys     0m0.007s
$ time grep -xvf f2 f1 > /dev/null

real    0m43.197s
user    0m43.155s
sys     0m0.040s

sort f1 f2 | uniq -u isn't even a symmetrical difference, because it removes lines that appear multiple times in either file.
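
A tiny example makes this visible. The function names via_uniq and via_comm and the sample data are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helpers wrapping the two approaches being compared.
via_uniq() { sort "$1" "$2" | uniq -u; }
via_comm() { comm -23 <(sort "$1") <(sort "$2"); }

tmp=$(mktemp -d)
printf '%s\n' a a b c > "$tmp/f1"   # "a" appears twice in f1 and never in f2
printf '%s\n' c d > "$tmp/f2"

via_uniq "$tmp/f1" "$tmp/f2"
# prints: b, then d ("a" is lost although f2 does not contain it, and "d" comes from f2)

via_comm "$tmp/f1" "$tmp/f2"
# prints: a, a, b
rm -r "$tmp"
```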

comm can also be used with stdin and here strings:

echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a

Seems to be a job suitable for the SQLite shell:

create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify “ .separator ××any_improbable_string×× ”
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file2 where line not in (select line from file1);
.q
