Dataframes in a list; adding a new variable with name of dataframe

Tags:

list

dataframe

r

lapply

names

I have a list of dataframes which I eventually want to merge while keeping a record of each row's original dataframe name or list index. This will let me subset, and so on, across all the rows. To accomplish this I would like to add a new variable 'id' to every dataframe, containing the name or index of the dataframe it belongs to.

Edit: In my real code the dataframes are created by reading multiple files with the following code, so I don't have actual names, only those in the 'files.to.read' list, and I'm unsure whether they will align with the dataframe order:

mylist <- llply(files.to.read, read.csv)

A few methods have been highlighted in several posts: Working-with-dataframes-in-a-list-drop-variables-add-new-ones and Using-lapply-with-changing-arguments

I have tried two similar methods, the first using the index list:

df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1,df2)

# Adds a new column 'id' with a value of 5 to every row in every dataframe.
# I WANT to change the value based on the list index.
mylist1 <- lapply(mylist, 
    function(x){
        x$id <- 5
        return (x)
    }
)
#Example of what I WANT, instead of '5'.
#> mylist1
#[[1]]
  #x  y id
#1 1 11  1
#2 2 12  1
#3 3 13  1
#4 4 14  1
#5 5 15  1
#
#[[2]]
  #x  y id
#1 1 11  2
#2 2 12  2
#3 3 13  2
#4 4 14  2
#5 5 15  2

The second attempts to pass the names() of the list.

# I WANT it to add a new column 'id' with the name of the respective dataframe
# to every row in every dataframe.
mylist2 <- lapply(names(mylist), 
    function(x){
        portfolio.results[[x]]$id <- "dataframe name here"
        return (portfolio.results[[x]])
    }
)
#Example of what I WANT, instead of 'dataframe name here'.
# mylist2
#[[1]]
  #x  y id
#1 1 11  df1
#2 2 12  df1
#3 3 13  df1
#4 4 14  df1
#5 5 15  df1
#
#[[2]]
  #x  y id
#1 1 11  df2
#2 2 12  df2
#3 3 13  df2
#4 4 14  df2
#5 5 15  df2

But the names() function doesn't work on my list of dataframes; it returns NULL. Could I use seq_along(mylist) in the first example?

Any ideas, or a better way to handle the whole "merge with source id" problem?

Edit - Added Solution below: I've implemented a solution using Hadley's suggestion and Tommy's nudge, which looks something like this.

library(plyr)  # provides llply()
files.to.read <- list.files(datafolder, pattern="_D\\.csv$", full.names=FALSE)
mylist <- llply(files.to.read, read.csv)
all <- do.call("rbind", mylist)
all$id <- rep(files.to.read, sapply(mylist, nrow))

I used the files.to.read vector as the id for each dataframe.

I also switched away from merge_recurse(), as it was very slow for some reason.

 all <- merge_recurse(mylist)

Thanks everyone.

413

asked Aug 16 '11 05:08

Look Left

3 Answers

Personally, I think it's easier to add the names after collapsing the list into one dataframe:

df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)

all <- do.call("rbind", mylist)
all$id <- rep(names(mylist), sapply(mylist, nrow))
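
A minimal sketch of the same idea applied to files read from disk, as in the question's edit. The temporary folder, the file names a_D.csv and b_D.csv, and their contents are made up for illustration, and base lapply stands in for llply. Since lapply returns its results in the same order as its input, the file names should line up with the dataframes.

```r
# Hypothetical setup: write two small CSV files to a temporary folder.
datafolder <- file.path(tempdir(), "id_demo")
dir.create(datafolder, showWarnings = FALSE)
write.csv(data.frame(x = 1:3, y = 11:13),
          file.path(datafolder, "a_D.csv"), row.names = FALSE)
write.csv(data.frame(x = 1:2, y = 21:22),
          file.path(datafolder, "b_D.csv"), row.names = FALSE)

files.to.read <- list.files(datafolder, pattern = "_D\\.csv$", full.names = TRUE)

# lapply() preserves input order, so element i comes from files.to.read[i].
mylist <- lapply(files.to.read, read.csv)
names(mylist) <- basename(files.to.read)

# Collapse first, then label each row with the file it came from.
all <- do.call("rbind", mylist)
all$id <- rep(names(mylist), sapply(mylist, nrow))
rownames(all) <- NULL  # drop the row names that rbind derives from the list names
```

Once the id column exists, a subset such as all[all$id == "a_D.csv", ] picks out the rows from one source file.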

173

answered Sep 27 '22 18:09

hadley

Your first attempt was very close. It works if you iterate over indices instead of values. Your second attempt failed because you didn't name the elements in your list.

Both solutions below use the fact that lapply can pass extra parameters (mylist) to the function.

df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1=df1,df2=df2) # Name each data.frame!
# names(mylist) <- c("df1", "df2") # Alternative way of naming...

# Use indices - and pass in mylist
mylist1 <- lapply(seq_along(mylist), 
        function(i, x){
            x[[i]]$id <- i
            return (x[[i]])
        }, mylist
)

# Now the names work - but I pass in mylist instead of using portfolio.results.
mylist2 <- lapply(names(mylist), 
    function(n, x){
        x[[n]]$id <- n
        return (x[[n]])
    }, mylist
)
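
An alternative sketch, not part of the answer above: base R's Map walks the list and a second vector in parallel, so the function does not need to index back into mylist. The helper name add_id is made up for illustration.

```r
df1 <- data.frame(x = 1:5, y = 11:15)
df2 <- data.frame(x = 1:5, y = 11:15)
mylist <- list(df1 = df1, df2 = df2)

# Hypothetical helper: attach an id column to a single dataframe.
add_id <- function(df, id) {
  df$id <- id
  df
}

# Tag each dataframe with its name; Map() keeps the names of its first list argument.
by_name <- Map(add_id, mylist, names(mylist))

# Tag each dataframe with its position instead.
by_index <- Map(add_id, mylist, seq_along(mylist))
```

One difference from the lapply(names(mylist), ...) version is that the result here keeps the list names df1 and df2.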

answered Sep 27 '22 20:09

Tommy

names() could work if the list had names, but you didn't give it any. It's an unnamed list. You will need to use numeric indices:

> for(i in 1:length(mylist) ){ mylist[[i]] <- cbind(mylist[[i]], id=rep(i, nrow(mylist[[i]]) ) ) }
> mylist
[[1]]
  x  y id
1 1 11  1
2 2 12  1
3 3 13  1
4 4 14  1
5 5 15  1

[[2]]
  x  y id
1 1 11  2
2 2 12  2
3 3 13  2
4 4 14  2
5 5 15  2
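
A short sketch of why names() returned NULL in the question, and of one way to name the list after the fact. The variable names unnamed and named are chosen here for illustration only.

```r
df1 <- data.frame(x = 1:5, y = 11:15)
df2 <- data.frame(x = 1:5, y = 11:15)

# list() does not record the variable names of its arguments.
unnamed <- list(df1, df2)
is.null(names(unnamed))

# Supplying tag = value pairs gives a named list.
named <- list(df1 = df1, df2 = df2)
names(named)

# Names can also be assigned afterwards, for example from a vector of labels.
names(unnamed) <- c("df1", "df2")
identical(names(unnamed), names(named))
```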

answered Sep 27 '22 20:09

IRTFM
