Subset data /extracting data based on first 7 letters

I have a huge data set with genotypic information from different populations. I would like to sort the data by population, but I don't know how.

I would like to sort by "pedigree_dhl". I was using the following code, but I kept getting error messages.

newdata <- project[pedigree_dhl == CCB133$*1,  ]

Another problem is that 'pedigree_dhl' contains the names of all the individual genotypes. Only the first 7 letters in the column 'pedigree_dhl' are the population name; in this example, CCB133. How can I tell R that I want to extract the data for all rows that contain CCB133?

  Allele1 Allele2      SNP_name gs_entry pedigree_dhl
1       T       T ZM011407_0151      656    CCB133$*1
2       T       T ZM009374_0354      656    CCB133$*1
3       C       C ZM003499_0591      656    CCB133$*1
4       A       A ZM003898_0594      656    CCB133$*1
5       C       C ZM004887_0313      656    CCB133$*1
6       G       G ZM000583_1096      656    CCB133$*1

asked Apr 25 '12 16:04

marie

1 Answer

You may want to consider grepl, as in the answer to "Using regexp to select rows in R dataframe". Adapted to your data:

df <- read.table(text="  Allele1 Allele2      SNP_name gs_entry pedigree_dhl
1       T       T ZM011407_0151      656    CCB133$*1
2       T       T ZM009374_0354      656    CCB133$*1
3       C       C ZM003499_0591      656    CCB133$*1
4       A       A ZM003898_0594      656    CCB133$*1
5       C       C ZM004887_0313      656    CCB133$*1
6       G       G ZM000583_1096      656    CCB133$*1", header=TRUE)

# put into df1 all rows where pedigree_dhl starts with CCB133$
# ("^" anchors the match at the start; "$" is a regex metacharacter, so it is escaped)
p1 <- '^CCB133\\$'
df1 <- subset(df, grepl(p1, pedigree_dhl))
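Because characters such as `$` and `*` have special meaning in regular expressions, a literal prefix test can sidestep escaping altogether. The following is only a minimal sketch, assuming R 3.3.0 or later (for `startsWith()`) and a small made-up data frame; the name `geno` and the `CCB140` rows are hypothetical and not part of the original data.

```r
# Toy data; the CCB140 row is hypothetical, added only for illustration
geno <- data.frame(
  SNP_name     = c("ZM011407_0151", "ZM009374_0354", "ZM003499_0591"),
  pedigree_dhl = c("CCB133$*1", "CCB140$*2", "CCB133$*3"),
  stringsAsFactors = FALSE
)

# startsWith() compares literal characters, so "$" needs no escaping
keep <- startsWith(geno$pedigree_dhl, "CCB133$")
geno_ccb133 <- geno[keep, ]

# Alternative without startsWith(): fixed = TRUE makes grepl() treat the
# pattern as a plain string; note it matches anywhere, not only at the start
keep_fixed <- grepl("CCB133$", geno$pedigree_dhl, fixed = TRUE)
```

If the column was read in as a factor, wrapping it in `as.character()` before calling `startsWith()` should be enough.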

But your question implies that you may want to extract the seven-letter population name, or just to sort the rows by pedigree name, in which case it may be easier to keep all rows together in a sorted data frame. These three operations (subsetting, extracting a new column, and sorting) can be carried out independently.

# If you want to create a new column based
# on the first seven letters of SNP_name (or any other variable)

df$SNP_7 <- substr(df$SNP_name, start=1, stop=7)

# If you want to order by pedigree_dhl
# then you don't need to select out the rows into a new dataframe

df <- df[order(df$pedigree_dhl), ]
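The same `substr()` idea can be applied to pedigree_dhl itself to label each row with its population and then group the rows. This is a rough sketch under the assumption that the population name is always exactly the first seven characters; the data frame `geno`, the column name `population`, and the `CCB140` rows are invented for illustration.

```r
# Toy data; the CCB140 rows are hypothetical, added only for illustration
geno <- data.frame(
  Allele1      = c("T", "C", "A", "G"),
  pedigree_dhl = c("CCB133$*1", "CCB140$*1", "CCB133$*2", "CCB140$*2"),
  stringsAsFactors = FALSE
)

# The first seven characters of the pedigree name identify the population
geno$population <- substr(geno$pedigree_dhl, 1, 7)

# Number of rows per population
pop_counts <- table(geno$population)

# One data frame per population, named by the population label
by_pop <- split(geno, geno$population)
ccb133 <- by_pop[["CCB133$"]]
```

Keeping a `population` column in the full data frame means subsetting, sorting, and per-population summaries can all work from one table instead of many separate copies.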

All of this may be obvious; I add it simply for completeness.

answered Oct 19 '22 15:10

daedalus

Tags: r, subset, names
