This is a really simple problem, but I cannot figure out how to script it. I cannot move forward until I figure it out. I'm really new to R and to using code, and I'm going through several introductory manuals, but haven't found anything for this specific problem yet.
Generally, here is the issue. Let's say I have a data frame called x that looks like:
a <- c(1995,1995,1995,1996,1997,1997,1997,1998)
b <- c(1,2,3,1,2,3,4,1)
c <- c(5,7,9,2,4,5,7,8)
(x <- data.frame(a,b,c))
     a b c
1 1995 1 5
2 1995 2 7
3 1995 3 9
4 1996 1 2
5 1997 2 4
6 1997 3 5
7 1997 4 7
8 1998 1 8
There are multiple entries for some of the years in column a (e.g. 1995 appears 3 times), when really I just want one entry for each year. If I try to plot column a against column c, I will end up with multiple points for each year, which is not helpful. I don't care about column b, but I want to sum the entries in column c for each year, so that I end up with a data frame with one entry per year. Given the above data, the resulting data frame would look like:
     a  c
1 1995 21
2 1996  2
3 1997 16
4 1998  8
Any ideas?
The plyr library is useful for aggregation tasks such as these. plyr also plays very well with ggplot2 graphics. In my opinion, the benefit of plyr is that you explicitly define the structure of the input and output. Here we are passing in a data.frame object and also want a data.frame after processing, so we will use ddply. The first letter corresponds to the input object, and the second to the output. So if we wanted to go from a list object to a data.frame, we'd use ldply, etc.
library(plyr)    #Provides ddply; load it explicitly rather than relying on ggplot2 to attach it
library(ggplot2) #Provides qplot
text <- "a b c
1995 1 5
1995 2 7
1995 3 9
1996 1 2
1997 2 4
1997 3 5
1997 4 7
1998 1 8
"
df <- read.table(textConnection(text), header = TRUE)
#Create plotData data.frame that groups by the "a" column and returns the sum of "c"
plotData <- ddply(df, "a", summarise, totalc = sum(c))
#plotting with ggplot
qplot(factor(a), totalc, data = plotData)
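The naming convention described above (first letter for the input type, second for the output type) can be illustrated with ldply. This is only a minimal sketch, assuming plyr is installed; the list by_year and the result name totals are made up here for illustration and do not come from the original answer.

```r
library(plyr)

# Hypothetical input: the values of c already split into a named list, one element per year
by_year <- list(
  "1995" = c(5, 7, 9),
  "1996" = 2,
  "1997" = c(4, 5, 7),
  "1998" = 8
)

# l = list in, d = data.frame out; .id names the column that holds the list names
totals <- ldply(by_year, function(v) data.frame(c = sum(v)), .id = "a")
print(totals)
```

The year column comes from the list names, so it is not numeric; convert it (for example with as.numeric(as.character(totals$a))) before treating it as a number.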
You need tapply. For example,
## Your data
c1 = c(1995, 1995, 1995, 1996, 1997, 1997, 1997, 1998)
c2 = c(5, 7, 9, 2, 4, 5, 7, 8)
x = data.frame(c1, c2)
y = tapply(x$c2, x$c1, sum)
names(y) ## For the years
as.vector(y)
## So to get a data frame
## names(y) are character strings, so convert the years back to numbers
data.frame(a=as.numeric(names(y)), c=as.vector(y))
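As a hedged alternative that neither answer above uses, base R's aggregate() can build the two-column data frame in one step without extra packages. The sketch below rebuilds the question's data frame x (with 9 as the third value of c, matching the printed table); the name totals is chosen here for illustration.

```r
# Rebuild the example data from the question (third value of c is 9, as in the printed table)
x <- data.frame(
  a = c(1995, 1995, 1995, 1996, 1997, 1997, 1997, 1998),
  b = c(1, 2, 3, 1, 2, 3, 4, 1),
  c = c(5, 7, 9, 2, 4, 5, 7, 8)
)

# Sum column c within each year of column a; column b is simply left out of the formula
totals <- aggregate(c ~ a, data = x, FUN = sum)
print(totals)

# One point per year
plot(totals$a, totals$c, xlab = "a (year)", ylab = "sum of c")
```

Because the formula interface keeps a numeric, the result can be passed straight to a plotting function without converting the years first.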