Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merge rows in a dataframe where the rows are disjoint and contain NAs

I have a dataframe that has two rows:

| code | name  | v1 | v2 | v3 | v4 |
|------|-------|----|----|----|----|
| 345  | Yemen | NA | 2  | 3  | NA |
| 346  | Yemen | 4  | NA | NA | 5  |

Is there an easy way to merge these two rows? What if I rename "345" in "346", would that make things easier?

like image 629
LukasKawerau Avatar asked Jan 10 '13 22:01

LukasKawerau


People also ask

How do you join data frames with different number of rows?

Use the left_join Function to Merge Two R Data Frames With Different Number of Rows. left_join is another method from the dplyr package. It takes arguments similar to the full_join function, but left_join extracts all rows from the first data frame and all columns from both of them.

How do I merge rows in a DataFrame in R?

To merge two data frames (datasets) horizontally, use the merge() function in the R language. To bind or combine rows in R, use the rbind() function. The rbind() stands for row binding.

Which function is used to merge two data frames?

Key PointsPandas' merge and concat can be used to combine subsets of a DataFrame, or even data from different files. join function combines DataFrames based on index or column. Joining two DataFrames can be done in multiple ways (left, right, and inner) depending on what data must be in the final DataFrame.

How do I merge rows by column values?

First, select the rows you want to merge then open the Home tab and expand Merge & Centre. From these options select Merge Cells. After selecting Merge Cells it will pop up a message which values it is going to keep. Then click on OK.


2 Answers

You can use aggregate. Assuming that you want to merge rows with identical values in column name:

aggregate(x=DF[c("v1","v2","v3","v4")], by=list(name=DF$name), min, na.rm = TRUE)
   name v1 v2 v3 v4
1 Yemen  4  2  3  5

This is like the SQL SELECT name, min(v1) GROUP BY name. The min function is arbitrary, you could also use max or mean, all of them return the non-NA value from an NA and a non-NA value if na.rm = TRUE. (An SQL-like coalesce() function would sound better if existed in R.)

However, you should check first if all non-NA values for a given name is identical. For example, run the aggregate both with min and max and compare, or run it with range.

Finally, if you have many more variables than just v1-4, you could use DF[,!(names(DF) %in% c("code","name"))] to define the columns.

like image 82
Daniel Sparing Avatar answered Oct 01 '22 21:10

Daniel Sparing


Adding dplyr & data.table solutions for completeness

Using dplyr::coalesce()

library(dplyr)

sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

df %>% 
  group_by(name) %>% 
  summarise_all(sum_NA)
#> # A tibble: 1 x 6
#>   name   code    v1    v2    v3    v4
#>   <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Yemen   691     4     2     3     5

# Ref: https://stackoverflow.com/a/45515491
# Supply lists by splicing them into dots:
coalesce_by_column <- function(df) {
  return(dplyr::coalesce(!!! as.list(df)))
}

df %>% 
  group_by(name) %>% 
  summarise_all(coalesce_by_column)
#> # A tibble: 1 x 6
#>   name   code    v1    v2    v3    v4
#>   <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Yemen   345     4     2     3     5

Using data.table

# Ref: https://stackoverflow.com/q/28036294/
library(data.table)
setDT(df)[, lapply(.SD, na.omit), by = name]
#>     name code v1 v2 v3 v4
#> 1: Yemen  345  4  2  3  5
#> 2: Yemen  346  4  2  3  5

setDT(df)[, code := NULL][, lapply(.SD, na.omit), by = name]    
#>     name v1 v2 v3 v4
#> 1: Yemen  4  2  3  5

setDT(df)[, code := NULL][, lapply(.SD, sum_NA), by = name]
#>     name v1 v2 v3 v4
#> 1: Yemen  4  2  3  5
like image 45
Tung Avatar answered Oct 01 '22 20:10

Tung