Reference file: http://snap.stanford.edu/data/wiki-Vote.txt.gz
(It is a gzip-compressed file; decompressing it yields Wiki-Vote.txt.)
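For reference, fetching and unpacking it could look like this (assuming wget and gunzip are available; curl -O would work just as well):
wget http://snap.stanford.edu/data/wiki-Vote.txt.gz
gunzip wiki-Vote.txt.gz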
The first few lines of the file look like this (output of head -n 10 Wiki-Vote.txt):
# Directed graph (each unordered pair of nodes is saved once): Wiki-Vote.txt
# Wikipedia voting on promotion to administratorship (till January 2008).
# Directed edge A->B means user A voted on B becoming Wikipedia administrator.
# Nodes: 7115 Edges: 103689
# FromNodeId ToNodeId
30 1412
30 3352
30 5254
30 5543
30 7478
3 28
I want to find the number of nodes in the graph (although it is already given in the # Nodes header line). I ran the following command,
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort | uniq | wc -l
Explanation:
/^#/ matches every line that starts with #, and !/^#/ matches every line that doesn't.
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt prints the first and second column of each matched line, each value on its own line.
| sort sorts that output so duplicate values become adjacent.
| uniq should then leave only the unique values, but it doesn't.
| wc -l counts the remaining lines, and the count is wrong. (A quick sanity check of the awk step alone is shown below.)
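As a sanity check on the awk step by itself, not part of the pipeline above: each of the 103689 edge lines from the header should produce exactly two output lines, so the following should print 2 × 103689 = 207378 if that part behaves:
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | wc -l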
The result of the above command is 8491, which is not 7115 (the count given in the # Nodes header). I don't know why uniq leaves repeated values. I can tell that it does, because
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort -i | uniq | tail
returns
992
993
993
994
994
995
996
998
999
999
which contains repeated values. Could someone please run the code, confirm that I am not the only one getting the wrong answer, and help me figure out why it happens?
Ordering and manipulating data in Linux text files is typically done with the sort and uniq utilities. sort orders a list of lines alphabetically or numerically, while uniq removes adjacent duplicate lines. Because uniq only removes identical adjacent lines, it is usually used after sort so that non-adjacent duplicates are collapsed as well.
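A quick illustration of that adjacency requirement, on a throwaway three-line input:
$ printf 'b\na\nb\n' | uniq      # duplicates are not adjacent, so nothing is removed
b
a
b
$ printf 'b\na\nb\n' | sort | uniq
a
b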
The file has DOS line endings: each line ends with a CR (\r) character before the newline.
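A quick way to see this on the source file itself is something like the following (cat -v renders a carriage return as ^M; line 6 is the first data line shown in the question's head output), which should end in ^M if the line really carries a CR:
$ sed -n '6p' ./wiki-Vote.txt | cat -v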
You can inspect your tail output, for example with hexdump -C (the lines starting with # were added by me):
$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | sort | uniq | tail | hexdump -C
00000000 39 39 32 0a 39 39 33 0a 39 39 33 0d 0a 39 39 34 |992.993.993..994|
# ^^ HERE
00000010 0a 39 39 34 0d 0a 39 39 35 0d 0a 39 39 36 0a 39 |.994..995..996.9|
# ^^ ^^
00000020 39 38 0a 39 39 39 0a 39 39 39 0d 0a |98.999.999..|
# ^^
0000002c
Because uniq sees two distinct lines, one with a CR and one without, it does not remove them. Remove the CR characters before piping. Note that sort -u is preferable to sort | uniq.
$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | tr -d '\r' | sort -u | wc -l
7115
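For what it's worth, the same count can also be done in a single awk pass by stripping the CR from the last field as you go; this is just a sketch of the same idea and should print the same 7115:
$ awk '!/^#/ {
    sub(/\r$/, "", $2)                       # drop the CR left over from the DOS line ending
    for (i = 1; i <= 2; i++)
        if (!($i in seen)) { seen[$i] = 1; n++ }
  } END { print n }' ./wiki-Vote.txt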