Creating a new data frame in R from an existing, inadequate data frame

Tags:

dataframe

r

This is a really simple problem, but I cannot figure out how to script it. I cannot move forward until I figure it out. I'm really new to R and to using code, and I'm going through several introductory manuals, but haven't found anything for this specific problem yet.

Generally, here is the issue. Let's say I have a data frame called x that looks like:

a <- c(1995,1995,1995,1996,1997,1997,1997,1998)
b <- c(1,2,3,1,2,3,4,1)
c <- c(5,7,9,2,4,5,7,8)
(x <- data.frame(a,b,c))
     a b c
1 1995 1 5
2 1995 2 7
3 1995 3 9
4 1996 1 2
5 1997 2 4
6 1997 3 5
7 1997 4 7
8 1998 1 8

There are multiple entries for some of the years in column a (e.g. 1995 appears 3 times), when really I just want one entry for each year. If I plot column a against column c, I end up with multiple points for each year, which is not helpful. I don't care about column b, but I want to sum the entries in column c for each year, so that I end up with a data frame with one row per year. Given the above data, the resulting data frame would look like:

     a  c
1 1995 21
2 1996  2
3 1997 16
4 1998  8

Any ideas?

Jota asked Mar 05 '11 17:03


2 Answers

The plyr package is useful for aggregation tasks such as this one. plyr also plays very well with ggplot2 graphics. In my opinion, the benefit of plyr is that you explicitly define the structure of the input and output. Here we are passing in a data.frame and also want a data.frame back, so we use ddply: the first letter corresponds to the input object and the second to the output. So to go from a list to a data.frame, we'd use ldply, and so on.

library(plyr)    # provides ddply; load it explicitly rather than relying on ggplot2
library(ggplot2) # provides qplot

text <- "a b c
1995 1 5
1995 2 7
1995 3 9
1996 1 2
1997 2 4
1997 3 5
1997 4 7
1998 1 8
"

df <- read.table(textConnection(text), header = TRUE)

#Create plotData data.frame that groups by the "a" column and returns the sum of "c"
plotData <- ddply(df, "a", summarise, totalc = sum(c))

#plotting with ggplot
qplot(factor(a), totalc, data = plotData)
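
The same naming convention can be tried on a list. The following is only a sketch, assuming plyr is installed; the list yearly and the column name totalc are made up for illustration and simply mirror the question's data:

```r
library(plyr)

# Hypothetical input: one numeric vector of "c" values per year
yearly <- list(
  "1995" = c(5, 7, 9),
  "1996" = 2,
  "1997" = c(4, 5, 7),
  "1998" = 8
)

# l = list in, d = data.frame out; each list element becomes one row,
# and the list names end up in an index column
totals <- ldply(yearly, function(v) data.frame(totalc = sum(v)))
totals
```

The only change from the ddply call is the first letter of the function name, which matches the type of the input.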
Chase answered Oct 07 '22 13:10


You need tapply. For example,

## Your data
c1 = c(1995, 1995, 1995, 1996, 1997, 1997, 1997, 1998)
c2 = c(5, 7, 9, 2, 4, 5, 7, 8)
x = data.frame(c1, c2)


y = tapply(x$c2, x$c1, sum)
names(y) ## For the years
as.vector(y)

## So to get a data frame (names(y) is character, so convert the years back to numbers)
data.frame(a = as.numeric(names(y)), c = as.vector(y))
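
Base R's aggregate can also return the data frame in one step. A minimal sketch, assuming the column names a and c from the question; it is offered as an alternative to tapply, not as the only way to do this:

```r
# Data from the question (column b omitted, as it is not needed)
x <- data.frame(
  a = c(1995, 1995, 1995, 1996, 1997, 1997, 1997, 1998),
  c = c(5, 7, 9, 2, 4, 5, 7, 8)
)

# Sum column c within each value of column a; the result is a data frame
yearly <- aggregate(c ~ a, data = x, FUN = sum)
yearly
```

Because a stays numeric here, plot(yearly$a, yearly$c) should give one point per year.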
csgillespie answered Oct 07 '22 12:10