
How to run the ‘:sort u’ command in Vim on a CSV table, but only use the values in a particular column as sorting keys?

Tags:

vim

I am looking for a more specific version of the :sort u command that would allow removing all duplicate lines from a file. I am working with a CSV file, and would like to remove all the lines that have duplicates in their second-column entry. In other words, two lines are declared to be duplicates if they have the same value in the second column.

For example, for the following file:

a,1,b
g,1,f
c,1,x
i,2,l
m,1,k
o,2,p
u,1,z

the command in question should yield:

a,1,b
i,2,l

The choice of which specific rows to keep is not important, as long as the second-column entries are all unique.

What Vim command will produce the output above?

Thanks!

Jonah asked Feb 03 '23 06:02


2 Answers

Since it is not possible to achieve the transformation in question in one run of the :sort command, let us approach it as a two-step process.

1. The first step is sorting lines by the values of the second column (separated from the first one by a comma). In order to do that, we can use the :sort command, passing a regular expression that matches the first column and the following comma:

:sort/^[^,]*,/

As :sort compares the text starting just after the match of the specified pattern on each line, it gives us the desired sorting behavior. To compare the values numerically rather than lexicographically, use the n flag:

:sort n/^[^,]*,/
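
To see why skipping the first column gives the desired order, here is a rough Python model of that behavior. It is only a sketch, not how Vim implements :sort: the helper name `sort_after_match` is made up for illustration, and it assumes every line contains a match for the pattern.

```python
import re


def sort_after_match(lines, pattern, numeric=False):
    """Sort lines by the text that follows the first match of `pattern`.

    Rough model of `:sort /pattern/` (or `:sort n /pattern/` when
    numeric=True). Assumes the pattern matches on every line.
    """
    regex = re.compile(pattern)

    def key(line):
        m = regex.search(line)
        if m is None:
            raise ValueError("pattern does not match line: %r" % line)
        rest = line[m.end():]  # comparison starts just after the match
        if not numeric:
            return rest
        num = re.search(r"-?\d+", rest)  # first decimal number after the match
        # Lines without a number are placed before lines with one.
        return (1, int(num.group())) if num else (0, 0)

    return sorted(lines, key=key)  # sorted() is stable


rows = ["a,1,b", "g,1,f", "c,1,x", "i,2,l", "m,1,k", "o,2,p", "u,1,z"]
by_text = sort_after_match(rows, r"^[^,]*,")
by_number = sort_after_match(["x,10,a", "y,9,b"], r"^[^,]*,", numeric=True)
print(by_text)
print(by_number)
```

Without `numeric=True`, the two-row example keeps `x,10,a` first, because the text `10,a` compares lower than `9,b`; that is the case the n flag is meant to fix.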

2. The second step is to run through the sorted lines and remove all lines but one in every block of consecutive lines with the same value in the second column. It is convenient to build on the :global command, which executes a given Ex command on every line matching a certain pattern. For our purposes, a line can be deleted if it contains the same value in the second column as the following line. This formalization, together with the initial assumption that commas cannot occur within column values, gives us the following pattern:

^[^,]*,\([^,]*\),.*\n[^,]*,\1,.*

If we run the :delete command on every line that satisfies this pattern, going from top to bottom over them in sorted order, we will have only a single line for every distinct value in the second column:

:g/^[^,]*,\([^,]*\),.*\n[^,]*,\1,.*/d_
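
The effect of this step can be checked outside Vim with a small Python sketch. It relies on two assumptions: the input is the question's sample already sorted by the second column, and the function name `drop_marked_lines` is invented for illustration. The pattern is a direct translation of the one above into Python's `re` syntax, and the two passes mirror how :global first marks every matching line and only then runs the command on the marked lines.

```python
import re

# Translation of ^[^,]*,\([^,]*\),.*\n[^,]*,\1,.* into Python syntax.
# [^,\n] is used because, unlike in Vim, [^,] would also match a newline.
PAIR = re.compile(r"[^,\n]*,([^,\n]*),.*\n[^,\n]*,\1,.*")


def drop_marked_lines(lines):
    """Delete every line whose second column equals that of the next line."""
    # Pass 1: mark each line on which the two-line pattern matches.
    marked = {
        i
        for i in range(len(lines) - 1)
        if PAIR.match(lines[i] + "\n" + lines[i + 1])
    }
    # Pass 2: delete the marked lines.
    return [line for i, line in enumerate(lines) if i not in marked]


sorted_rows = ["a,1,b", "g,1,f", "m,1,k", "c,1,x", "u,1,z", "i,2,l", "o,2,p"]
result = drop_marked_lines(sorted_rows)
print(result)
```

In this model the survivor of each block is its last line (`u,1,z` and `o,2,p` here) rather than the first, which is fine for the question since any one row per value is acceptable.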

3. Finally, both of the steps can be combined in a single Ex command:

:sort/^[^,]*,/|g/^[^,]*,\([^,]*\),.*\n[^,]*,\1,.*/d_

ib. answered Feb 04 '23 20:02


:sort /\([^,]*,\)\{1}/
:g/\%(\%([^,]*,\)\{1}\1.*\n\)\@<=\%([^,]*,\)\{1}\([^,]*\)/d

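First, sort by the column with index 1 (counting from zero, so the second column). Second, match every line whose column at index 1 equals that of the preceding line, and delete it.

The column index is the 1 in \{1}. It appears three times across the two commands, so change all three to target a different column.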
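
If you need this for several columns, the substitution can be scripted. The following Python sketch only fills the index into the two commands above, exactly as written in this answer; the function name `commands_for_column` and the `<N>` placeholder are made up for illustration, and nothing here checks the commands against Vim itself.

```python
# The two Ex commands from this answer, with the column index replaced
# by a placeholder (<N> is a hypothetical marker, not Vim syntax).
SORT_TEMPLATE = r":sort /\([^,]*,\)\{<N>}/"
GLOBAL_TEMPLATE = (
    r":g/\%(\%([^,]*,\)\{<N>}\1.*\n\)\@<=\%([^,]*,\)\{<N>}\([^,]*\)/d"
)


def commands_for_column(index):
    """Return the two Ex commands with the zero-based column index filled in."""
    if index < 0:
        raise ValueError("column index must be non-negative")
    n = str(index)
    return [
        SORT_TEMPLATE.replace("<N>", n),
        GLOBAL_TEMPLATE.replace("<N>", n),
    ]


for command in commands_for_column(2):
    print(command)
```

The printed lines can then be typed or pasted at the Vim command line one after the other.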

Tom Whittock answered Feb 04 '23 18:02