I have a data frame such as:
Cluster sequence_name
1 species1
1 species1
1 species2
1 species3
1 species3
1 gene1
1 gene2
2 species4
2 species5
2 species5
2 species3
2 gene3
2 gene4
and I would like to get a matrix from it such as:
gene1 gene2 gene3 gene4
species5 0 0 1 1
species4 0 0 1 1
species1 1 1 0 0
species2 1 1 0 0
species3 1 1 1 1
where 1 means that the gene is present for speciesX, and 0 means it is not present.
Present means that speciesX is in the same cluster as geneX. For example, gene1 is in cluster 1 together with species1, 2 and 3.
In contrast, species5 and 4 are not present in cluster 1.
As you can also see, there are several duplicates (a species can be represented several times in the same cluster). Thank you for your help.
The real data looks like:
cluster_names seq_names
1 AP_000401.1
1 NP_039001.1
1 Canis_lupus
1 Canis_familiaris
2 YP_0090909.1
2 Mustela_putorius
2 Mustela_furo
2 YP_0909200.1
....
...
Names starting with AP, NP or other two-letter (XX) prefixes are genes, and the Genus_species names are the species.
In response to Denis:
Here is a head of the real data:
cluster_names seq_names
1 scf7180005155889:2745-3053(-):Drosophia_melanogaster
1 IDBA_scaffold_72878:85-225:292707-293006(+):Orussu_sp
1 scaffold_3615:40850-41320(-):Canis_lupus
1 scaffold_8697:754-1209(-):homo_sapiens
1 scf7180005155889:72-1908(-):homo_sapiens
1 YP_003969716.1
1 NP_003986717.1
2 scaffold_17536:2745-3053(-):Drosophia_melanogaster
2 scf7180005155889:2000-8900(-):Drosophia_melanogaster
2 scaffold_8697:754-1209(-):homo_sapiens
2 YP_003956764.1
2 YP_004894416.1
2 YP_008958968.1
and the output I should get is :

In response to Denis:
> df <- read.table(text = "Cluster sequence_name
+ 1 :Drosophia_melanogaster
+ 1 scf7180005155889:2745-3053(-):Drosophila_melanogaster
+ 1 scf7180005155889:2745-3053(-):Orussu_sp
+ 1 scf7180005155889:2745-3053(-):Canis_lupus
+ 1 scf7180005155889:72-1908(-):Homo_sapiens
+ 1 scf7180005155889:2745-3053(-):Homo_sapiens
+ 1 YP_003970075.1
+ 1 YP_005070075.1
+ 2 scf7180005155889:72-1908(-):Drosophila_melanogaster
+ 2 scf7180005155889:72-1908(-):Drosophila_melanogaster
+ 2 scf7180005155889:72-1908(-):Homo_sapiens
+ 2 YP_039970075.1
+ 2 NP_003900075.1",header = T)
> df <- setDT(df)
> species <- df[grep("[0-9]+\\([+-]\\):[A-z ]+",sequence_name)]
> species[,sequence_name := str_extract(sequence_name,"(?<=[0-9]\\([+-]\\):)[A-z ]+")]
> genes <- df[grep("[0-9]+\\.1",sequence_name)]
> genes[,sequence_name :=sequence_name]
> plouf <- merge(genes,species,by = "Cluster",allow.cartesian=TRUE)
> result <- dcast(plouf,sequence_name.y~sequence_name.x,fun.aggregate = length)
Using 'sequence_name.y' as value column. Use 'value.var' to override
> row.names(result)<-result$sequence_name.y
> result$sequence_name.y<- NULL
> result
NP_003900075.1 YP_003970075.1 YP_005070075.1 YP_039970075.1
1: 0 1 1 0
2: 2 1 1 2
3: 1 2 2 1
4: 0 1 1 0
library(data.table)
library(stringr)
df <- setDT(df)
I will use data.table here. The idea is to create two data tables, one with the genes and one with the species:
species <- df[grep("species",sequence_name)]
species[,sequence_name := str_extract(sequence_name,"(?<=:)[a-z0-9]+$")]
genes <- df[grep("gene",sequence_name)]
> species
Cluster sequence_name
1: 1 species1
2: 1 species2
3: 1 species3
4: 2 species4
5: 2 species5
6: 2 species3
> genes
Cluster sequence_name
1: 1 gene1
2: 1 gene2
3: 2 gene3
4: 2 gene4
You want to merge them by cluster, with allow.cartesian=TRUE because the merge column is not a unique identifier in either table (each cluster has several genes and several species):
plouf <- merge(genes,species,by = "Cluster",allow.cartesian=TRUE)
Cluster sequence_name.x sequence_name.y
1: 1 gene1 species1
2: 1 gene1 species2
3: 1 gene1 species3
4: 1 gene2 species1
5: 1 gene2 species2
6: 1 gene2 species3
7: 2 gene3 species4
8: 2 gene3 species5
9: 2 gene3 species3
10: 2 gene4 species4
11: 2 gene4 species5
12: 2 gene4 species3
Then, obtaining your result is just a matter of going to wide format while counting the number of occurrences, which you can do with dcast here:
result <- dcast(plouf,sequence_name.y~sequence_name.x,fun.aggregate = length)
sequence_name.y gene1 gene2 gene3 gene4
1: species1 1 1 0 0
2: species2 1 1 0 0
3: species3 1 1 1 1
4: species4 0 0 1 1
5: species5 0 0 1 1
Et voilà. I will leave it to experienced dplyr users to propose an equivalent or improved dplyr solution.
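If an actual matrix with the species as row names is wanted (row names assigned to a data.table are not kept when it is printed), one option is data.table's as.matrix method with its rownames argument. This is only a minimal sketch, assuming the example data from the question and a reasonably recent data.table version; `mat` is an illustrative name, and the unique() call is an added assumption that duplicated species rows should simply be collapsed.

```r
library(data.table)

# Example data from the question (duplicates included)
df <- data.table(
  Cluster = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  sequence_name = c("species1", "species1", "species2", "species3", "species3",
                    "gene1", "gene2",
                    "species4", "species5", "species5", "species3",
                    "gene3", "gene4")
)

# Assumption: duplicated species rows within a cluster can be dropped
species <- unique(df[grep("species", sequence_name)])
genes   <- df[grep("gene", sequence_name)]

plouf  <- merge(genes, species, by = "Cluster", allow.cartesian = TRUE)
result <- dcast(plouf, sequence_name.y ~ sequence_name.x,
                fun.aggregate = length, value.var = "Cluster")

# Convert to a matrix, using the species column as row names
mat <- as.matrix(result, rownames = "sequence_name.y")
mat
```

Passing value.var explicitly also avoids the "Using ... as value column" message that dcast prints when it has to guess.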
The example data used for the steps above:
df <- read.table(text = "Cluster sequence_name
1 Scaffold_1:species1
1 Scaffold_2:species2
1 Scaffold_3:species3
1 gene1
1 gene2
2 Scaffold_4:species4
2 Scaffold_5:species5
2 Scaffold_6:species3
2 gene3
2 gene4",header = T)
With the real data you show:
df <- read.table(text ="cluster_names seq_names
1 scf7180005155889:2745-3053(-):Drosophia_melanogaster
1 scaffold_2484:292707-293006(+):Orussu_sp
1 scaffold_3615:40850-41320(-):Canis_lupus
1 scaffold_8697:754-1209(-):homo_sapiens
1 scf7180005155889:72-1908(-):homo_sapiens
1 YP_003969716.1
1 NP_003986717.1
2 scaffold_17536:2745-3053(-):Drosophia_melanogaster
2 scf7180005155889:2000-8900(-):Drosophia_melanogaster
2 scaffold_8697:754-1209(-):homo_sapiens
2 YP_003956764.1
2 YP_004894416.1
2 YP_008958968.1",header = T)
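You should replace the step that creates the two data tables with:
setDT(df)  # read.table returns a data.frame; convert it to a data.table first
# species rows: coordinates, then "(+):" or "(-):", then the species name
species <- df[grep("[0-9]+\\([+-]\\):[A-Za-z_ ]+",seq_names)]
species[,sequence_name := str_extract(seq_names,"(?<=[0-9]\\([+-]\\):)[A-Za-z_ ]+")]
# gene rows: accession numbers containing digits followed by ".1"
genes <- df[grep("[0-9]+\\.1",seq_names)]
genes[,sequence_name :=seq_names]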
Here "[0-9]+\\.1" assumes that every gene name ends in ".1" and that the species descriptions contain no dot. To extract the species name, I assume that it always follows "(+):" or "(-):" after the coordinates.
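To check these assumptions before running the whole pipeline, the two patterns can be tried on a few names first. The sketch below uses three names taken from the sample data; `x`, `is_gene` and `species_nm` are illustrative names, and the species pattern is written with the explicit character class `[A-Za-z_ ]` (the range `[A-z]` also matches a few punctuation characters that sit between "Z" and "a").

```r
library(stringr)

# Three names from the sample data: two species rows and one gene row
x <- c("scaffold_3615:40850-41320(-):Canis_lupus",
       "IDBA_scaffold_72878:85-225:292707-293006(+):Orussu_sp",
       "YP_003969716.1")

# Gene rows: digits followed by ".1"
is_gene <- grepl("[0-9]+\\.1", x)

# Species name: letters/underscores/spaces right after "<digit>(+):" or "<digit>(-):"
species_nm <- str_extract(x, "(?<=[0-9]\\([+-]\\):)[A-Za-z_ ]+")

is_gene
species_nm
```

Names that match neither pattern (here, a species row would give NA only if its format differs) are worth inspecting by hand before merging.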
But that is a regex problem, and it should be a separate question if you have trouble with it. Your question here was how to reshape the data to obtain your result. I answered with the steps that work on the example data: create the genes and species data tables using regex, merge them, and reshape the result.
The rest works:
plouf <- merge(genes,species,by = "cluster_names",allow.cartesian=TRUE)
result <- dcast(plouf,sequence_name.y~sequence_name.x,fun.aggregate = length)
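Note that with repeated species within a cluster, fun.aggregate = length returns counts greater than 1 rather than 0/1. One way to get presence/absence, sketched here under the assumption that duplicates should simply be collapsed, is to de-duplicate the merged pairs and cap the counts at 1. The two small tables below are hypothetical, already-parsed stand-ins for the `species` and `genes` tables; `gene_cols` is an illustrative name.

```r
library(data.table)

# Hypothetical, already-parsed rows: homo_sapiens appears twice in cluster 1
species <- data.table(cluster_names = c(1, 1, 1, 2),
                      sequence_name = c("homo_sapiens", "homo_sapiens",
                                        "Canis_lupus", "homo_sapiens"))
genes <- data.table(cluster_names = c(1, 2),
                    sequence_name = c("YP_003969716.1", "YP_003956764.1"))

plouf <- merge(genes, species, by = "cluster_names", allow.cartesian = TRUE)
plouf <- unique(plouf)  # collapse repeated gene/species pairs within a cluster

result <- dcast(plouf, sequence_name.y ~ sequence_name.x,
                fun.aggregate = length, value.var = "cluster_names")

# The same pair could still occur in several clusters, so cap the counts at 1
gene_cols <- setdiff(names(result), "sequence_name.y")
result[, (gene_cols) := lapply(.SD, function(v) as.integer(v > 0)),
       .SDcols = gene_cols]
result
```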
Using tidyverse:
# data
df1 <- read.table(text = "Cluster sequence_name
1 species1
1 species1
1 species2
1 species3
1 species3
1 gene1
1 gene2
2 species4
2 species5
2 species5
2 species3
2 gene3
2 gene4", header = TRUE, stringsAsFactors = FALSE)
# so that we know which row is species
species <- paste("species", 1:5, sep = "")
#[1] "species1" "species2" "species3" "species4" "species5"
library(tidyverse)
res <- reduce(split(df1, df1$sequence_name %in% species), left_join, by = "Cluster") %>%
unique() %>%
spread(key = "sequence_name.x", value = "Cluster") %>%
mutate_if(is.numeric, ~ as.numeric(!is.na(.)))  # formula lambda; funs() is deprecated in dplyr
res
# sequence_name.y gene1 gene2 gene3 gene4
# 1 species1 1 1 0 0
# 2 species2 1 1 0 0
# 3 species3 1 1 1 1
# 4 species4 0 0 1 1
# 5 species5 0 0 1 1