I want to find the column with the largest column sum. I am thinking of something like:
threeLargest = colnames(sort(colSums(data[,2:length(data)]),
decreasing = TRUE)[1:3])
but colnames just gives NULL with the sort(colSums... command.
The reason is that I want to be able to refer to the values in the column and plot it. I was thinking that there had to be a more R-oriented solution than looping over the columns and keeping count of the largest ones.
I have example_csv_file.csv:
date,column1,column2,column3,column4
2013-12-09,0,0,0,2
2013-12-10,0,0,0,2
2013-12-11,0,0,0,2
2013-12-12,0,0,0,2
2013-12-13,0,0,0,2
2013-12-14,0,1,7,2
2013-12-15,2,15,36,2
2013-12-16,5,10,28,2
2013-12-17,1,2,39,2
2013-12-18,2,3,34,2
which I import this way:
data = read.csv(file = 'example_csv_file.csv', header = TRUE, sep = ",")
I can sort the columns by their column sum, and fetch the top three:
threeLargest = sort(colSums(data[,2:length(data)]), decreasing = TRUE)[1:3]
This gives:
> threeLargest
column3 column2 column4
144 31 20
but I need to obtain the column names because I need to refer to the columns when I plot their values. E.g. this way:
plot(data[,'column3'])
and preferably have a list of the top ones which I could refer to in a loop, like this:
plot(data[,namesOfThreeLargest[1]], type = 'n')
color = 1
for (column in namesOfThreeLargest)
{
lines(data[,column], col = color)
color = color + 1
}
legend("topleft", inset=.05, lty = 1, namesOfThreeLargest, col = seq(color))
If I could obtain the number of the column in a neat way, I could get the name of it this way:
columnWithLargestColSum = colnames(data)[4]
I have tried importing the file differently, e.g. read.table(file =..., read.data.frame(file =... and as.matrix(read.csv(file =..., to see if colnames works then, but it does not. In fact colSums does not even work for the as.matrix one since the entries are strings for that method.
Thanks!
Edit:
This is the solution I went with:
I used order() from Joris Meys and I used names() from Ananda Mahto (see their solutions below):
colCount = colSums(data[-1])
topThreeIds = order(colCount,decreasing=TRUE)[1:3] + 1 # From Joris
topThreeCols = names(data[topThreeIds]) # From Ananda
Note the + 1 in the 2nd line, due to the fact that I'm skipping the date column in the 1st line. By adding one in the 2nd line I get the actual index of the columns I want.
Thanks, guys!
If you view the str of the output of your colSums step, you'll see it's just a named vector, not anything with "columns":
str(sort(colSums(data[,2:length(data)]),
decreasing = TRUE)[1:3])
# Named num [1:3] 144 31 20
# - attr(*, "names")= chr [1:3] "column3" "column2" "column4"
As such, if you want the "names", you should wrap the command in names instead of colnames.
In other words:
namesOfThreeLargest <- names(threeLargest)
From there, now that I see you just want to do multiple line plots, you can look at matplot, for instance:
matplot(data[, namesOfThreeLargest], type="l")
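To make the difference between names and colnames concrete, here is a minimal sketch that rebuilds the question's CSV inline with read.csv(text = ...) instead of reading it from disk. The variable csv_text is only an illustration device and is not part of the original question.

```r
# Same contents as example_csv_file.csv, embedded as text (illustrative only)
csv_text <- "date,column1,column2,column3,column4
2013-12-09,0,0,0,2
2013-12-10,0,0,0,2
2013-12-11,0,0,0,2
2013-12-12,0,0,0,2
2013-12-13,0,0,0,2
2013-12-14,0,1,7,2
2013-12-15,2,15,36,2
2013-12-16,5,10,28,2
2013-12-17,1,2,39,2
2013-12-18,2,3,34,2"
data <- read.csv(text = csv_text, header = TRUE)

# colSums() returns a named numeric vector; sort() keeps the names attached
threeLargest <- sort(colSums(data[, -1]), decreasing = TRUE)[1:3]

colnames(threeLargest)  # NULL: a plain vector has no dimnames
names(threeLargest)     # "column3" "column2" "column4"

namesOfThreeLargest <- names(threeLargest)
```

The resulting character vector can be used directly for indexing, as in data[, namesOfThreeLargest].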
I wouldn't insist on using sort(). Using order() can be faster and more appropriate. You could also use list indexing to make your code more readable.
So
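# + 1 because the positions from order() are relative to data[-1], which drops the date column
id <- order(colSums(data[-1]), decreasing=TRUE)[1:3] + 1
matplot(data[id], type='l')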
would be a faster and more concise way of doing it.
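One detail worth checking when combining order() with data[-1] is which data frame the returned positions refer to. The sketch below rebuilds the question's example values by hand (the data.frame() call and the name pos are illustrative assumptions, not from the original post) and shows that the positions are relative to the data frame without its first column.

```r
# Example values from the question, rebuilt by hand (illustrative only)
data <- data.frame(
  date    = sprintf("2013-12-%02d", 9:18),
  column1 = c(0, 0, 0, 0, 0, 0, 2, 5, 1, 2),
  column2 = c(0, 0, 0, 0, 0, 1, 15, 10, 2, 3),
  column3 = c(0, 0, 0, 0, 0, 7, 36, 28, 39, 34),
  column4 = rep(2, 10)
)

# order() returns positions, not values; here they index into data[-1]
pos <- order(colSums(data[-1]), decreasing = TRUE)[1:3]

names(data[-1])[pos]   # "column3" "column2" "column4"
names(data)[pos + 1]   # same columns, indexed against the full data frame
names(data)[pos]       # off by one: "column2" "column1" "column3"
```

Either data[-1][pos] or data[pos + 1] selects the intended columns; indexing the full data frame with the unshifted positions does not.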
An alternative solution is to use sort.list instead of sort. With decreasing=TRUE it returns the column indices ordered from largest to smallest sum (add 1 to the index since we're ignoring the first column):
colnames(data)[sort.list(colSums(data[,-1]), decreasing=TRUE)[1:3] + 1]
If you're feeling particularly lazy, you can also use rev() to reverse the order, instead of typing out decreasing=TRUE:
colnames(data)[rev(sort.list(colSums(data[,-1])))[1:3] + 1]
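As a quick check that the two forms agree on the question's data, here is a small sketch. The data frame is rebuilt by hand and the names colCount, byDecreasing and byRev are hypothetical. Note that if several columns had equal sums, the two forms may list the tied columns in a different order, since rev() also reverses the order of ties.

```r
# Example values from the question, rebuilt by hand (illustrative only)
data <- data.frame(
  date    = sprintf("2013-12-%02d", 9:18),
  column1 = c(0, 0, 0, 0, 0, 0, 2, 5, 1, 2),
  column2 = c(0, 0, 0, 0, 0, 1, 15, 10, 2, 3),
  column3 = c(0, 0, 0, 0, 0, 7, 36, 28, 39, 34),
  column4 = rep(2, 10)
)

colCount <- colSums(data[, -1])

# Explicit decreasing order
byDecreasing <- colnames(data)[sort.list(colCount, decreasing = TRUE)[1:3] + 1]

# Ascending order, then reversed
byRev <- colnames(data)[rev(sort.list(colCount))[1:3] + 1]

identical(byDecreasing, byRev)  # TRUE here, because no column sums are tied
```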