Sort and remove duplicates based on column

Tags: bash, shell, sorting

I have a text file:

$ cat text
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10

I'd like to sort the file based on the first column and remove duplicates using sort, but things are not going as expected.

Approach 1

$ sort -t, -u -b -k1n text
542,8,1,418,1
542,9,1,418,1
199,7,1,419,10
301,34,1,689070,1

It is not sorting based on the first column.

Approach 2

$ sort -t, -u -b -k1n,1n text
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1

It removes the 542,9,1,418,1 line but I'd like to keep one copy.

It seems that the first approach removes duplicates but does not sort correctly, whereas the second one sorts correctly but removes more than I want. How can I get the correct result?


asked Jul 25 '13 02:07

Yang

1 Answer

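The problem is that when you give sort a key, -u compares only that key when deciding which lines are duplicates. In Approach 2 the key is just the first field, so once 542,8,1,418,1 has been output, sort treats every other line starting with 542 as a duplicate and filters it out. In Approach 1, -k1n has no end field, so the key runs from the first field to the end of the line; in a locale where the comma is a thousands separator, the numeric comparison can then read past the first column, which would explain the ordering you saw.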
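A small experiment makes the key-only comparison visible. This is only a sketch: it assumes a POSIX-style sort, the scratch file name sample.csv is made up for illustration, and LC_ALL=C is set here to keep locale effects out of the picture.

```shell
# Recreate the sample data in a scratch file (hypothetical file name).
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10
EOF

# With -u and a key restricted to field 1, two lines count as equal
# whenever their first fields match, so only one 542 line survives.
key_only=$(LC_ALL=C sort -t, -u -k1,1n "$tmp/sample.csv")
printf 'key-only uniqueness:\n%s\n' "$key_only"

# With no key at all, -u compares whole lines, so only the exact
# repeat of 542,9,1,418,1 is dropped (order is plain byte order).
whole_line=$(LC_ALL=C sort -u "$tmp/sample.csv")
printf 'whole-line uniqueness:\n%s\n' "$whole_line"

rm -rf "$tmp"
```

The first command should leave three lines and the second four, which is the difference between "unique by key" and "unique by line".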

Your best bet would be to either sort on all columns, so that -u only drops lines whose every field matches:

sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text

or use awk to filter out duplicate lines and pipe the result to sort:

awk '!_[$0]++' text | sort -t, -nk1,1
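To check these against the sample data, the sketch below runs both commands and, as an addition that is not part of the original answer, a sort-only variant that removes duplicate whole lines first and then orders by the first column. It assumes POSIX sort and awk; the variable names and the temporary directory are illustrative, the awk array is called seen instead of _ purely for readability, and LC_ALL=C is set so that ties between equal first fields are broken by plain byte order.

```shell
# Recreate the question's file in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/text" <<'EOF'
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10
EOF

# Option 1: every field is a numeric key, so -u only drops a line
# when all five of its fields match another line.
all_keys=$(LC_ALL=C sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u "$tmp/text")

# Option 2: awk keeps the first occurrence of each whole line,
# then sort orders the survivors numerically by field 1.
awk_then_sort=$(awk '!seen[$0]++' "$tmp/text" | LC_ALL=C sort -t, -nk1,1)

# Sort-only variant (an assumption, not from the answer): whole-line
# dedupe with sort -u, then a second sort on the first column.
sort_twice=$(LC_ALL=C sort -u "$tmp/text" | LC_ALL=C sort -t, -k1,1n)

printf '%s\n' "$all_keys"
rm -rf "$tmp"
```

On this input all three variants should produce the same four lines: 199,7,1,419,10, then 301,34,1,689070,1, then 542,8,1,418,1 and 542,9,1,418,1.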

answered Sep 30 '22 03:09

jaypal singh
