I have a huge data set with genotypic information from different populations. I would like to sort the data by population, but I don't know how.
I would like to sort by "pedigree_dhl". I was using the following code, but I kept getting error messages.
newdata <- project[pedigree_dhl == CCB133$*1, ]
Another problem is that 'pedigree_dhl' contains the names of the individual genotypes. Only the first 7 characters in the column 'pedigree_dhl' are the population name, in this example CCB133. How can I tell R that I want to extract all rows whose 'pedigree_dhl' value contains CCB133?
Allele1 Allele2 SNP_name gs_entry pedigree_dhl
1 T T ZM011407_0151 656 CCB133$*1
2 T T ZM009374_0354 656 CCB133$*1
3 C C ZM003499_0591 656 CCB133$*1
4 A A ZM003898_0594 656 CCB133$*1
5 C C ZM004887_0313 656 CCB133$*1
6 G G ZM000583_1096 656 CCB133$*1
Subsetting a data frame with the base R extract operator []: to specify a logical expression for the rows, use the standard R operators. If you subset by rows only or by columns only, leave the other argument blank. For example, to subset the data frame d by rows only, the general form reduces to d[rows, ].
How do you subset a data frame by column value or column name in R? With base R's df[] notation or with subset(), you can subset a data frame (data.frame) by column value or by column name.
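As a minimal sketch of both forms, assuming a small toy data frame d (the column names are borrowed from the question; the third row is made up for illustration):

```r
# Hypothetical toy data; "CCB200$*4" and gs_entry 700 are invented for the example
d <- data.frame(
  SNP_name     = c("ZM011407_0151", "ZM009374_0354", "ZM003499_0591"),
  gs_entry     = c(656, 656, 700),
  pedigree_dhl = c("CCB133$*1", "CCB133$*1", "CCB200$*4"),
  stringsAsFactors = FALSE
)

# Rows only: logical condition before the comma, column position left blank
rows_only <- d[d$gs_entry == 656, ]

# Columns only: row position left blank, columns selected by name
cols_only <- d[, c("SNP_name", "pedigree_dhl")]

# subset() equivalent: columns can be named without the d$ prefix
same_rows <- subset(d, gs_entry == 656)
```

Note that a string value such as CCB133$*1 has to be quoted when it is compared against a column, otherwise R tries to parse it as code.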
You may want to consider grep (or its logical counterpart, grepl), as in the answer to "Using regexp to select rows in R dataframe". Adapted to your data:
df <- read.table(text=" Allele1 Allele2 SNP_name gs_entry pedigree_dhl
1 T T ZM011407_0151 656 CCB133$*1
2 T T ZM009374_0354 656 CCB133$*1
3 C C ZM003499_0591 656 CCB133$*1
4 A A ZM003898_0594 656 CCB133$*1
5 C C ZM004887_0313 656 CCB133$*1
6 G G ZM000583_1096 656 CCB133$*1", header=TRUE)
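# put into df1 all rows where pedigree_dhl starts with CCB133$
# "$" is a regex metacharacter (end of string), so it must be escaped;
# "^" anchors the match to the start of the value
p1 <- '^CCB133\\$'
df1 <- subset(df, grepl(p1, pedigree_dhl) )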
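Because "$" has a special meaning in regular expressions (it marks the end of the string), a literal prefix match may be simpler than a regex here. The following is only a sketch on made-up toy data, assuming pedigree_dhl is a character column (with R versions before 4.0, read.table returns factors by default, so as.character() may be needed first):

```r
# Toy data; "CCB133$*2" and "XYZ001$*1" are hypothetical values for illustration
df <- data.frame(
  SNP_name     = c("ZM011407_0151", "ZM009374_0354", "ZM003499_0591"),
  pedigree_dhl = c("CCB133$*1", "CCB133$*2", "XYZ001$*1"),
  stringsAsFactors = FALSE
)

# startsWith() compares literal characters, so "$" needs no escaping
df1 <- df[startsWith(df$pedigree_dhl, "CCB133$"), ]

# grepl() with fixed = TRUE also treats the pattern literally,
# but matches it anywhere in the string, not only at the start
df2 <- df[grepl("CCB133$", df$pedigree_dhl, fixed = TRUE), ]
```

Both calls keep the two CCB133 rows and drop the other population.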
But your question implies that you may want to extract the seven-letter name, or just sort the rows by pedigree name, in which case it may be easier to keep all rows together in a sorted data frame. All three operations (subsetting, extracting a new column, and sorting) can be carried out independently.
# If you want to create a new column based
# on the first seven letters of SNP_name (or any other variable)
df$SNP_7 <- substr(df$SNP_name, start=1, stop=7)
# If you want to order by pedigree_dhl
# then you don't need to select out the rows into a new dataframe
df <- df[order(df$pedigree_dhl), ]
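The extraction and sorting steps can also be combined to work per population. This is a sketch on made-up toy data, under the assumption that the population name is everything before the first "$" in pedigree_dhl; the population column name is hypothetical:

```r
# Toy data; the XYZ001 population is hypothetical
df <- data.frame(
  SNP_name     = c("ZM011407_0151", "ZM009374_0354", "ZM003499_0591", "ZM003898_0594"),
  pedigree_dhl = c("XYZ001$*1", "CCB133$*2", "CCB133$*1", "XYZ001$*3"),
  stringsAsFactors = FALSE
)

# Drop the first "$" and everything after it to get the population name
# (assumption: "$" separates the population from the individual)
df$population <- sub("\\$.*$", "", df$pedigree_dhl)

# Sort by population, then by full pedigree name within each population
df_sorted <- df[order(df$population, df$pedigree_dhl), ]

# Or split into a named list with one data frame per population
by_pop <- split(df, df$population)
```

With the list form, by_pop$CCB133 or by_pop[["CCB133"]] returns all rows for that population.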
All this may be obvious -- I add it simply for completeness.