I'm new to R and am working with collaboration data in the form of an edge list with 32 columns and around 200,000 rows. I want to create a (co-)occurrence matrix based on the interactions between countries. However, I want the counts to reflect how many times each country occurs in a row, not just whether it occurs.
If in one row "England" occurs three times and "China" only one time, the result should be the following matrix.
        England China
England       3     3
China         3     1
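A minimal sketch of the counting rule this small example seems to imply, for a single row. The rule is an assumption read off the matrix above, not something stated explicitly: off-diagonal cells hold the product of the two countries' counts in the row, and diagonal cells hold the plain count. The variable names are illustrative only.

```r
# One publication (row): England appears three times, China once.
row_countries <- c("England", "England", "England", "China")

# Count occurrences per country and keep them as a named vector.
tab <- table(row_countries)
counts <- setNames(as.vector(tab), names(tab))

# Assumed rule, off-diagonal cells: product of the two counts.
m <- outer(counts, counts)

# Assumed rule, diagonal cells: the plain count, not the count squared.
diag(m) <- counts

m
```

The countries come out in alphabetical order, so China is listed before England, but the cell values are the same as in the matrix above.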
df <- data.frame(ID = c(1,2,3,4),
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"))
Accordingly, an example data frame currently looks like this:
ID  V1       V2       ...  V32
1   England  Greece        USA
2   England  England       China
3   China    Greece        Greece
4   England  England       England
.
.
.
I want to count (co-)occurrences row-wise and independent of order to get a (co-)occurrence matrix that accounts for low frequencies of edge loops (e.g. England-England), which leads to the following result:
        China England Greece USA
China       2       2      2   0
England     2       6      1   1
Greece      2       1      3   1
USA         0       1      1   1
I've used igraph to get an adjacency matrix with co-occurrences. However, as it is supposed to, it counts no more than two interactions between the same two objects, which in some cases leaves me with values far below the actual frequency of objects per row/publication.
df <- data.frame(ID = c(1,2,3,4),
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"))
# remove ID column
df[1] <- list(NULL)
# calculate co-occurrences and return as dataframe
library(igraph)
library(Matrix)
countrydf <- graph.data.frame(df)
countrydf2 <- as_adjacency_matrix(countrydf, type = "both", edges = FALSE)
countrydf3 <- as.data.frame(as.matrix(forceSymmetric(countrydf2)))
        China England Greece USA
China       0       0      1   0
England     0       2      1   0
Greece      1       1      0   0
USA         0       0      0   0
I assume there has to be an easy solution using base and/or dplyr and/or table and/or reshape2, similar to [1], [2], [3], [4] or [5], but nothing has done the trick so far and I was not able to adjust the code to my needs. I've also tried to use [6] as a basis; however, the same issue applies there, too.
library(tidyr)
library(dplyr)
library(stringr)
# collapse observations into one column
df2 <- df %>% unite(concat, V1:V32, sep = ",")
# calculate weights
df3 <- df2$concat %>%
  str_split(",") %>%
  lapply(function(x){
    expand.grid(x,x,x,x, w = length(x), stringsAsFactors = FALSE)
  }) %>%
  bind_rows
df4 <- apply(df3[, -5], 1, sort) %>%
  t %>%
  data.frame(stringsAsFactors = FALSE) %>%
  mutate(w = df3$w)
I'd be glad if someone could point me in the right direction.
There may be better ways to do this, but try:
library(tidyverse)
df1 <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  xtabs(~ID + Country, data = ., sparse = FALSE) %>%
  crossprod(., .)
df_diag <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  mutate(Country2 = Country) %>%
  xtabs(~Country + Country2, data = ., sparse = FALSE) %>%
  diag()
diag(df1) <- df_diag
df1
Country   China England Greece USA
  China       2       2      2   0
  England     2       6      1   1
  Greece      2       1      3   1
  USA         0       1      1   1
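For comparison, here is a sketch of the same idea in base R only. It assumes, as the code above does, that df still contains the ID column. The names country_cols, long, counts and cooc are made up for illustration. It builds an ID-by-country count table, takes its cross-product for the off-diagonal cells, and then overwrites the diagonal with each country's total number of occurrences.

```r
df <- data.frame(ID = c(1, 2, 3, 4),
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"))

# Stack all country columns into one vector, repeating the IDs to match.
country_cols <- setdiff(names(df), "ID")
long <- data.frame(ID = rep(df$ID, times = length(country_cols)),
                   Country = unlist(df[country_cols], use.names = FALSE))

# ID x Country table: how often each country occurs in each row.
counts <- unclass(table(long$ID, long$Country))

# Off-diagonal cells: sum over rows of count_i * count_j.
cooc <- crossprod(counts)

# Diagonal cells: total occurrences of each country across all rows.
diag(cooc) <- colSums(counts)

cooc
```

On the example data this should reproduce the matrix above. Note that table() drops NA values by default, so empty cells in rows with fewer than 32 countries would not be counted.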