 

R: find column with the largest column sum

Tags: sorting, r, csv

I want to find the column with the largest column sum. I am thinking of something like:

threeLargest = colnames(sort(colSums(data[,2:length(data)]), 
                        decreasing = TRUE)[1:3])

but colnames just gives NULL with the sort(colSums... command.

I need the names so that I can refer to the values in those columns and plot them. I assume there is a more idiomatic R solution than looping over the columns and keeping track of the largest sums by hand.

I have example_csv_file.csv:

date,column1,column2,column3,column4
2013-12-09,0,0,0,2
2013-12-10,0,0,0,2
2013-12-11,0,0,0,2
2013-12-12,0,0,0,2
2013-12-13,0,0,0,2
2013-12-14,0,1,7,2
2013-12-15,2,15,36,2
2013-12-16,5,10,28,2
2013-12-17,1,2,39,2
2013-12-18,2,3,34,2

which I import this way:

data = read.csv(file = 'example_csv_file.csv', header = TRUE, sep = ",")

I can sort the columns by their column sum, and fetch the top three:

threeLargest = sort(colSums(data[,2:length(data)]), decreasing = TRUE)[1:3]

This gives:

> threeLargest
column3 column2 column4 
    144      31      20 

but I need to obtain the column names because I need to refer to the columns when I plot their values. E.g. this way:

plot(data[,'column3'])

and preferably have a list of the top ones which I could refer to in a loop, like this:

plot(data[,namesOfThreeLargest[1]], type = 'n')
color = 1
for (column in namesOfThreeLargest)
{
  lines(data[,column], col = color)
  color = color + 1
}
legend("topleft", inset=.05, lty = 1, namesOfThreeLargest, col = seq(color))

If I could obtain the number of the column in a neat way, I could get the name of it this way:

columnWithLargestColSum = colnames(data)[4]

I have tried importing the file in other ways, e.g. read.table(file =..., read.data.frame(file =... and as.matrix(read.csv(file =..., to see whether colnames works then, but it does not. In fact, colSums does not even work with the as.matrix variant, because the date column turns every entry of the resulting matrix into a string.
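
The as.matrix behaviour mentioned above can be reproduced in a few lines. This is only a sketch: it assumes a three-row excerpt of the CSV supplied inline through read.csv(text = ...) instead of the real file, and the names csv_text, m_all, m_num and col_sums are made up for the illustration.

```r
# Excerpt of the question's CSV, supplied inline (assumption: not read from disk)
csv_text <- "date,column1,column2,column3,column4
2013-12-14,0,1,7,2
2013-12-15,2,15,36,2
2013-12-16,5,10,28,2"
data <- read.csv(text = csv_text, header = TRUE)

# A matrix holds a single type, so keeping the date column
# coerces every cell to character and colSums() can no longer add them
m_all <- as.matrix(data)
is.character(m_all)

# Dropping the date column first keeps the matrix numeric
m_num <- as.matrix(data[-1])
is.numeric(m_num)
col_sums <- colSums(m_num)
```

Dropping the non-numeric column before converting is what makes colSums usable on the matrix.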

Thanks!


Edit:

This is the solution I went with:

I used order() from Joris Meys and I used names() from Ananda Mahto (see their solutions below):

colCount = colSums(data[-1])
topThreeIds = order(colCount,decreasing=TRUE)[1:3] + 1 # From Joris
topThreeCols = names(data[topThreeIds]) # From Ananda

Note the + 1 in the 2nd line, due to the fact that I'm skipping the date column in the 1st line. By adding one in the 2nd line I get an actual id of the columns I want.
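
To make the offset concrete, here is a small sketch using the example data. It assumes the CSV is supplied inline via read.csv(text = ...) rather than read from example_csv_file.csv; idsWithinSubset and sameCols are names introduced only for this illustration.

```r
# The question's CSV, supplied inline (assumption: not read from disk)
csv_text <- "date,column1,column2,column3,column4
2013-12-09,0,0,0,2
2013-12-10,0,0,0,2
2013-12-11,0,0,0,2
2013-12-12,0,0,0,2
2013-12-13,0,0,0,2
2013-12-14,0,1,7,2
2013-12-15,2,15,36,2
2013-12-16,5,10,28,2
2013-12-17,1,2,39,2
2013-12-18,2,3,34,2"
data <- read.csv(text = csv_text, header = TRUE)

colCount <- colSums(data[-1])                        # sums for column1..column4 only
idsWithinSubset <- order(colCount, decreasing = TRUE)[1:3]
topThreeIds <- idsWithinSubset + 1                   # shift back to positions in `data`
topThreeCols <- names(data)[topThreeIds]

# Taking the names from the summed subset itself needs no offset
sameCols <- names(colCount)[idsWithinSubset]
identical(topThreeCols, sameCols)
```

Either route yields the same three names; the offset is only needed when the positions are used against the full data frame.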

Thanks, guys!

asked Dec 19 '13 10:12 by stefaniabje


3 Answers

If you view the str of the output of your colSums step, you'll see it's just a named vector, not anything with "columns":

str(sort(colSums(data[,2:length(data)]), 
                 decreasing = TRUE)[1:3])
#  Named num [1:3] 144 31 20
#  - attr(*, "names")= chr [1:3] "column3" "column2" "column4"

As such, if you want the "names", you should wrap the command in names instead of colnames.

In other words:

namesOfThreeLargest <- names(threeLargest)
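
As a rough end-to-end check of the names() approach, the sketch below rebuilds the example data inline (an assumption; the question reads it from a file) and uses the names to pull out the ranked columns. topThree is a name invented for the illustration.

```r
# The question's CSV, supplied inline (assumption: not read from disk)
csv_text <- "date,column1,column2,column3,column4
2013-12-09,0,0,0,2
2013-12-10,0,0,0,2
2013-12-11,0,0,0,2
2013-12-12,0,0,0,2
2013-12-13,0,0,0,2
2013-12-14,0,1,7,2
2013-12-15,2,15,36,2
2013-12-16,5,10,28,2
2013-12-17,1,2,39,2
2013-12-18,2,3,34,2"
data <- read.csv(text = csv_text, header = TRUE)

threeLargest <- sort(colSums(data[-1]), decreasing = TRUE)[1:3]

# A named vector has no dimnames, so colnames() has nothing to return
is.null(colnames(threeLargest))

# names() reads the "names" attribute that str() displays
namesOfThreeLargest <- names(threeLargest)

# The names can index the data frame directly, in ranked order
topThree <- data[, namesOfThreeLargest]
```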

From there, now that I see you just want to do multiple line plots, you can look at matplot, for instance:

matplot(data[, namesOfThreeLargest], type="l")

answered Oct 21 '22 13:10 by A5C1D2H2I1M1N2O1R2T1


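I wouldn't insist on using sort(). order() returns the positions of the sorted elements rather than the values themselves, which can be faster and is more appropriate when you want to index columns. You could also use list-style indexing to make your code more readable.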

So

# + 1 because the positions refer to data[-1], which skips the date column
id <- order(colSums(data[-1]), decreasing=TRUE)[1:3] + 1
matplot(data[id], type='l')

would be a faster and more concise way of doing it.
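
The relationship between sort() and order() can be sketched as follows. The data is the question's CSV supplied inline (an assumption), and sums, sorted and top3 are illustrative names; plotting is left out so the snippet runs without a graphics device.

```r
# The question's CSV, supplied inline (assumption: not read from disk)
csv_text <- "date,column1,column2,column3,column4
2013-12-09,0,0,0,2
2013-12-10,0,0,0,2
2013-12-11,0,0,0,2
2013-12-12,0,0,0,2
2013-12-13,0,0,0,2
2013-12-14,0,1,7,2
2013-12-15,2,15,36,2
2013-12-16,5,10,28,2
2013-12-17,1,2,39,2
2013-12-18,2,3,34,2"
data <- read.csv(text = csv_text, header = TRUE)

sums <- colSums(data[-1])
sorted <- sort(sums, decreasing = TRUE)   # the sums themselves, names carried along
id <- order(sums, decreasing = TRUE)      # positions within `sums`

# Indexing by the positions reproduces the sorted vector
all.equal(sums[id], sorted)

# List-style indexing on the same subset that was summed, so no offset is needed
top3 <- data[-1][id[1:3]]
names(top3)
```

Indexing the subset data[-1] rather than data is one way to avoid having to remember the offset for the skipped date column.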

answered Oct 21 '22 13:10 by Joris Meys


An alternative solution is to use sort.list instead of sort. It returns the positions of the column sums rather than the sums themselves, ordered from largest to smallest when decreasing=TRUE (add 1 to the index since we're ignoring the first column):

colnames(data)[sort.list(colSums(data[,-1]), decreasing=TRUE)[1:3] + 1]

If you're feeling particularly lazy, you can also use rev() to reverse the order, instead of typing out decreasing=TRUE:

colnames(data)[rev(sort.list(colSums(data[,-1])))[1:3] + 1]
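
A quick sketch comparing the two forms on the example data, supplied inline here as an assumption. The vector tied is hypothetical and only there to show that, when sums are tied, the two forms may list the tied columns in a different order even though the values they select are the same.

```r
# The question's CSV, supplied inline (assumption: not read from disk)
csv_text <- "date,column1,column2,column3,column4
2013-12-09,0,0,0,2
2013-12-10,0,0,0,2
2013-12-11,0,0,0,2
2013-12-12,0,0,0,2
2013-12-13,0,0,0,2
2013-12-14,0,1,7,2
2013-12-15,2,15,36,2
2013-12-16,5,10,28,2
2013-12-17,1,2,39,2
2013-12-18,2,3,34,2"
data <- read.csv(text = csv_text, header = TRUE)
sums <- colSums(data[, -1])

viaDecreasing <- sort.list(sums, decreasing = TRUE)[1:3] + 1
viaRev <- rev(sort.list(sums))[1:3] + 1

# No sums are tied in this data, so both forms pick the same columns
colnames(data)[viaDecreasing]
colnames(data)[viaRev]

# Hypothetical vector with a tie: compare the two orderings yourself
tied <- c(a = 5, b = 9, c = 5)
sort.list(tied, decreasing = TRUE)
rev(sort.list(tied))
```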

answered Oct 21 '22 12:10 by Scott Ritchie