I am looking for a more specific version of the :sort u command that would allow removing all duplicate lines from a file. I am working with a CSV file, and would like to remove all the lines that have duplicates in their second-column entry. In other words, two lines are declared to be duplicates if they have the same value in the second column.
For example, for the following file:
a,1,b
g,1,f
c,1,x
i,2,l
m,1,k
o,2,p
u,1,z
the command in question should yield:
a,1,b
i,2,l
The choice of the specific rows to be kept is not important, as long as the second-column entries are all unique.
What Vim command will produce the output above?
Thanks!
Since it is not possible to achieve the transformation in question in
one run of the :sort command, let us approach it as a two-step process.
1. The first step is sorting lines by the values of the second column
(separated from the first one by a comma). In order to do that, we can
use the :sort command, passing a regular expression that matches the
first column and the following comma:
:sort/^[^,]*,/
As :sort compares the text starting just after the match of the
specified pattern on each line, it gives us the desired sorting
behavior. To compare the values numerically rather than
lexicographically, use the n flag:
:sort n/^[^,]*,/
2. The second step involves running through the sorted lines and removing
all lines but one in every block of consecutive lines with the same
value in the second column. It is convenient to build our implementation
upon the :global command, which executes a given Ex command on every
line matching a certain pattern. For our purposes, a line can be
deleted if it contains the same value in the second column as the
following line. This formalization, together with the initial
assumption that commas cannot occur within column values, gives us
the following pattern:
^[^,]*,\([^,]*\),.*\n[^,]*,\1,.*
If we run the :delete command on every line that satisfies this
pattern, going from top to bottom over them in sorted order, we will
have only a single line for every distinct value in the second column:
:g/^[^,]*,\([^,]*\),.*\n[^,]*,\1,.*/d_
3. Finally, both of the steps can be combined in a single Ex command:
:sort/^[^,]*,/|g/^[^,]*,\([^,]*\),.*\n[^,]*,\1,.*/d_
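To sanity-check what the two steps do, here is a minimal Python sketch that mimics them outside of Vim. It is only a model of the logic, not what Vim runs internally; the function name dedupe_by_second_column is made up for illustration, and it relies on the same assumption that commas never occur inside column values.

```python
import re

def dedupe_by_second_column(lines):
    """Model of the two-step approach (hypothetical helper, not a Vim API)."""
    # Step 1: like :sort /^[^,]*,/ -- order the lines by the text that
    # follows the first column and its comma.
    def sort_key(line):
        return re.sub(r'^[^,]*,', '', line, count=1)

    ordered = sorted(lines, key=sort_key)

    # Step 2: like :g/.../d_ -- drop a line when the line after it has
    # the same value in the second column, so the last line of each
    # block of equal values is the one that survives.
    def second_column(line):
        return line.split(',')[1]

    kept = []
    for i, line in enumerate(ordered):
        has_next = i + 1 < len(ordered)
        if has_next and second_column(ordered[i + 1]) == second_column(line):
            continue
        kept.append(line)
    return kept

sample = ["a,1,b", "g,1,f", "c,1,x", "i,2,l", "m,1,k", "o,2,p", "u,1,z"]
print(dedupe_by_second_column(sample))  # ['u,1,z', 'o,2,p']
```

For the sample file from the question, the sketch keeps u,1,z and o,2,p rather than a,1,b and i,2,l. That is still a valid answer, since the question only asks for the second-column entries to be unique.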
:sort /\([^,]*,\)\{1}/
:g/\%(\%([^,]*,\)\{1}\1.*\n\)\@<=\%([^,]*,\)\{1}\([^,]*\)/d
First, sort by the column with index 1 (counting from zero). Second, match any line whose column at index 1 equals that of the line before it, and delete it.
The column index is the 1 in \{1}; it appears three times across the two commands.