I have a topic in a forum where people can post their Top 10 lists of songs. I want to count how many times each song is listed. The comparison has to be case-insensitive.
Example of the file structure:
Join Date: Apr 2005
Location: bama via new orleans
Age: 48
Posts: 2,369
Re: Top 10 Songs Jethro Tull
oh dearrrr. the only way for all kaths to keep their last shred of sanity: fly through this list as quickly as possible, without stopping to think for a microsecond...
velvet green
dun ringill
skating away on the thin ice of a new day
sossity yer a woman
fat man
life's a long song
jack-a-lynn
teacher
mother goose
elegy
03-10-2010, 02:29 AM #5 (permalink)
Sox
Avoiding The Swan Song
Join Date: Jan 2010
Location: Derbyshire, England
Age: 43
Posts: 5,991
Re: Top 10 Songs Jethro Tull
Wow !!!! Where do I start ?
Dun Ringill
Aqualung
With You There To Help Me
Jack Frost And The Hooded Crow
We Used To Know
Witch's Promise
Pussy Willow
Heavy Horses
My Sunday Feeling
Locomotive Breath
Join Date: Nov 2009
Posts: 1,418
Re: Top 10 Songs Jethro Tull
Too bad they all can't make the list, but here's ten I never get tired of listening to:
Christmas Song
Witches Promise
Life's A Long Song
Living In The Past
Rainbow Blues
Sweet Dream
Minstral In The Gallery
Cup of Wonder
Rover
Something's On the Move
Example output:
life's a long song 3
aqualung 1
...
Your file's "structure" is a bit lacking in the structure department, so you'll have to deal with some errors in the process.
Assuming you have all of that in a file called input, try:
tr 'A-Z' 'a-z' < input | \
egrep -v "^ *(join date|age|posts|location|re):" | \
sort | \
uniq -c
The first line lowercases everything, the second strips out the lines that look like email headers in your sample, and then sort and uniq -c count the unique items.
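If you also want the output ordered by count and in the "song count" format from the question, the same pipeline can be extended. A sketch, where the here-document stands in for your real input file:

```shell
# Lowercase, drop the header-ish lines, count duplicates, sort by
# count (highest first), then move the count to the end of the line.
# The here-document below is stand-in sample data, not the real file.
tr 'A-Z' 'a-z' <<'EOF' |
Dun Ringill
dun ringill
Aqualung
Posts: 2,369
EOF
grep -Ev "^ *(join date|age|posts|location|re):" |
sort | uniq -c | sort -rn |
awk '{c=$1; $1=""; sub(/^ /,""); print $0, c}'
# prints:
#   dun ringill 2
#   aqualung 1
```

The final awk swaps uniq's "count line" output into the "line count" format shown in the question.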
This command lists each line together with the number of times it occurs:
sort nameFile | uniq -c
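A quick illustration with made-up sample lines: uniq only collapses adjacent duplicates, which is why the sort has to come first.

```shell
# uniq -c prefixes each distinct line with its count; sort is needed
# beforehand because uniq only merges consecutive identical lines.
printf 'aqualung\ndun ringill\naqualung\n' | sort | uniq -c
# prints (count padding varies by implementation):
#   2 aqualung
#   1 dun ringill
```

Note that on its own this counts "Aqualung" and "aqualung" separately, which is why the other answers lowercase or uppercase the lines first.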
How about using awk for this:
awk '
/:/ || /^$/ {next} {a[toupper($0)]++}
END {for (i in a) print i, a[i]}' INPUT_FILE
First we identify lines that have a : in them or are blank, and ignore them. All other lines are converted to upper case and used as keys in an array that counts their occurrences. In our END statement we print out everything in the array along with the number of times each line was found.
awk '
/:/ || /^$/ {next} {a[toupper($0)]++}
END {for (i in a) print i, a[i]}' file1
SOX 1
CHRISTMAS SONG 1
CUP OF WONDER 1
SOSSITY YER A WOMAN 1
FAT MAN 1
PUSSY WILLOW 1
VELVET GREEN 1
WITH YOU THERE TO HELP ME 1
ELEGY 1
WE USED TO KNOW 1
TEACHER 1
MY SUNDAY FEELING 1
SWEET DREAM 1
JACK-A-LYNN 1
SOMETHING'S ON THE MOVE 1
ROVER 1
DUN RINGILL 2
AVOIDING THE SWAN SONG 1
JACK FROST AND THE HOODED CROW 1
WITCHES PROMISE 1
LIFE'S A LONG SONG 2
LIVING IN THE PAST 1
WITCH'S PROMISE 1
WOW !!!! WHERE DO I START ? 1
SKATING AWAY ON THE THIN ICE OF A NEW DAY 1
MINSTRAL IN THE GALLERY 1
RAINBOW BLUES 1
MOTHER GOOSE 1
HEAVY HORSES 1
AQUALUNG 1
LOCOMOTIVE BREATH 1
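The example output in the question is lowercase and ordered by count, so here is a sketch of the same awk approach adapted to that: tolower instead of toupper, the count printed first so sort -rn can order by frequency, and a second awk pass to move the count back to the end. The filename input is a stand-in for your real file.

```shell
# Same filter as above, keyed on tolower() to match the example output;
# print "count song", sort numerically descending, then swap the count
# to the end so lines read "song count".
awk '/:/ || /^$/ {next} {a[tolower($0)]++}
     END {for (i in a) print a[i], i}' input |
sort -rn |
awk '{c=$1; $1=""; sub(/^ /,""); print $0, c}'
```

With the sample thread above, the two songs listed twice come out first, followed by the singletons.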