
Even after `sort`, `uniq` is still repeating some values

Reference file: http://snap.stanford.edu/data/wiki-Vote.txt.gz

(It is a gzip-compressed file that contains Wiki-Vote.txt)

Here are the first few lines of the file, from head -n 10 Wiki-Vote.txt:

# Directed graph (each unordered pair of nodes is saved once): Wiki-Vote.txt 
# Wikipedia voting on promotion to administratorship (till January 2008). Directed edge A->B means user A voted on B becoming Wikipedia administrator.
# Nodes: 7115 Edges: 103689
# FromNodeId    ToNodeId
     30          1412
     30          3352
     30          5254
     30          5543
     30          7478
     3            28

I want to find the number of nodes in the graph (although it's already given in line 3). I ran the following command:

awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort | uniq | wc -l

Explanation:

  • /^#/ matches all the lines that start with #, and !/^#/ matches the lines that don't.

  • awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt prints the first and second columns of each matched line, one value per output line.

  • | sort pipes the output to sort, which orders the values.

  • | uniq should display only the unique values, but it doesn't.

  • | wc -l counts the lines that uniq prints, and the count is wrong.
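For reference, the stages above can be sketched on a tiny input with plain LF line endings, where the pipeline behaves as expected. The three-edge sample file and the variable names are made up for illustration and are not part of the real data set.

```shell
# Hypothetical three-edge sample in the same layout as Wiki-Vote.txt (LF endings).
sample=$(mktemp)
printf '# FromNodeId\tToNodeId\n30\t1412\n30\t3352\n3\t30\n' > "$sample"

# Stage 1: one node ID per line, comment lines skipped.
ids=$(awk '!/^#/ { print $1; print $2; }' "$sample")

# Stages 2-4: sort, drop adjacent duplicates, count what is left.
count=$(printf '%s\n' "$ids" | sort | uniq | wc -l)

echo "$count"   # the distinct IDs here are 3, 30, 1412 and 3352
rm -f "$sample"
```

On this clean input the count matches the number of distinct IDs, so the pipeline itself is sound.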

The result of the above command is 8491, not 7115 (as stated in line 3). I don't know why uniq repeats values. I can tell that it does because awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort -i | uniq | tail returns:

992
993
993
994
994
995
996
998
999
999

This output contains repeated values. Could someone please run the commands and confirm that I am not the only one getting the wrong answer, and help me figure out why I'm getting this result?

SigSegV asked Jan 13 '20 13:01



1 Answer

The file has DOS line endings: each line ends with a CR (\r) character before the newline.

You can inspect your tail output with, for example, hexdump -C (the lines starting with # were added by me):
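A quick way to check a file for CR characters is to count the lines that contain one. The sketch below does this on a made-up two-line sample; the file contents and the variable names cr and crlf_lines are illustrative assumptions, not the real data.

```shell
# Hypothetical two-line sample with DOS (CRLF) line endings.
sample=$(mktemp)
printf '30\t1412\r\n3\t28\r\n' > "$sample"

# Portable way to get a literal CR into a variable
# (command substitution strips only trailing newlines).
cr=$(printf '\r')

# Count the lines that contain a CR; 0 would mean plain LF endings.
crlf_lines=$(grep -c "$cr" "$sample")
echo "$crlf_lines"
rm -f "$sample"
```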

$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | sort | uniq | tail | hexdump -C
00000000  39 39 32 0a 39 39 33 0a  39 39 33 0d 0a 39 39 34  |992.993.993..994|
#                                           ^^ HERE
00000010  0a 39 39 34 0d 0a 39 39  35 0d 0a 39 39 36 0a 39  |.994..995..996.9|
#                     ^^              ^^ 
00000020  39 38 0a 39 39 39 0a 39  39 39 0d 0a              |98.999.999..|
#                                        ^^
0000002c

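The CR stays attached to $2, the last field on each line, but not to $1, so the same node ID can appear both with and without a trailing CR. Because uniq sees these as distinct lines, one with a CR and one without, neither is removed. Remove the CR characters before piping to sort. Note that sort | uniq is better written as sort -u.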

$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | tr -d '\r' | sort -u | wc -l
7115
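As an alternative to tr, the CR can probably also be stripped inside awk itself. The sketch below compares the unfixed pipeline with that variant on a made-up CRLF sample; the sample contents and the variable names raw and fixed are assumptions for illustration.

```shell
# Hypothetical sample with CRLF endings; node 30 appears in both columns.
sample=$(mktemp)
printf '30\t1412\r\n3\t30\r\n' > "$sample"

# Unfixed pipeline: "30" and "30<CR>" are treated as different lines.
raw=$(awk '!/^#/ { print $1; print $2; }' "$sample" | sort -u | wc -l)

# Variant: strip the trailing CR in awk; modifying $0 re-splits the fields.
fixed=$(awk '!/^#/ { sub(/\r$/, ""); print $1; print $2; }' "$sample" | sort -u | wc -l)

echo "$raw $fixed"
rm -f "$sample"
```

The first count is inflated by the CR-suffixed copy of an ID, while the second matches the number of distinct IDs in the sample.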
KamilCuk answered Sep 25 '22 15:09