plotting the top 5 values from a table in R

Tags: plot, r

I'm very new to R so this may be a simple question. I have a table of data that contains frequency counts of species like this:

  Acidobacteria              47
  Actinobacteria            497
  Apicomplexa                 7
  Aquificae                  16
  Arthropoda                 26
  Ascomycota                101
  Bacillariophyta             1
  Bacteroidetes           50279
  ...

There are about 50 species in the table. As you can see some of the values are a lot larger than the others. I would like to have a stacked barplot with the top 5 species by percentage and one category of 'other' that has the sum of all the other percentages. So my barplot would have 6 categories total (top 5 and other).

I have 3 additional datasets (sample sites) that I would like to treat the same way, except that each of them should highlight the first dataset's top 5 species rather than its own. I would then like to put them all on the same graph. The final graph would have 4 stacked bars showing how the top species in the first dataset change in each additional dataset.

I made a sample plot by hand (tabulated the data outside of R and just fed in the final table of percentages) to give you an idea of what I'm looking for: http://dl.dropbox.com/u/1938620/phylumSum2.jpg

I would like to put these steps into an R script so I can create these plots for many datasets.

Thanks!

asked Sep 07 '11 18:09

helicase

2 Answers

Say your data is in the data.frame DF

DF <- read.table(textConnection(
"Acidobacteria              47
Actinobacteria            497
Apicomplexa                 7
Aquificae                  16
Arthropoda                 26
Ascomycota                101
Bacillariophyta             1
Bacteroidetes           50279"), stringsAsFactors=FALSE)
names(DF) <- c("Species","Count")

Then you can determine which species are in the top 5 by

top5Species <- DF[rev(order(DF$Count)),"Species"][1:5]

Each of the data sets can then be converted to these 5 and "Other" by

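library(plyr)  # provides ddply() and summarise()
# Label each species as one of the top 5 or as "Other"
DF$Group <- ifelse(DF$Species %in% top5Species, DF$Species, "Other")
DF$Group <- factor(DF$Group, levels=c(top5Species, "Other"))
# Total count per group, then each group's share of the overall total
DF.summary <- ddply(DF, .(Group), summarise, total=sum(Count))
DF.summary$prop <- DF.summary$total / sum(DF.summary$total)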

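Making Group a factor keeps the groups in the same order in DF.summary for every data set (largest to smallest according to the first data set).

Repeat these steps for each of the other data sets, reusing top5Species from the first one, then put the summaries together and plot them as you did in your example.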
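A minimal sketch of that last step in base R (no plyr needed), assuming each site is a data frame with Species and Count columns. The helper name summarise_site, the site names, and all of the Site2 counts are made up for illustration; only Site1 uses the counts from the question.

```r
# Hypothetical helper: proportions for the reference top species plus "Other",
# always returned in the same order (reference top species first, "Other" last).
summarise_site <- function(df, top_species) {
  group <- ifelse(df$Species %in% top_species, df$Species, "Other")
  group <- factor(group, levels = c(top_species, "Other"))
  # Sum the counts per level; a level missing from this site sums to 0
  totals <- vapply(levels(group),
                   function(g) sum(df$Count[group == g]),
                   numeric(1))
  totals / sum(totals)
}

# Site1: the eight rows shown in the question
site1 <- data.frame(
  Species = c("Acidobacteria", "Actinobacteria", "Apicomplexa", "Aquificae",
              "Arthropoda", "Ascomycota", "Bacillariophyta", "Bacteroidetes"),
  Count = c(47, 497, 7, 16, 26, 101, 1, 50279),
  stringsAsFactors = FALSE
)
# Site2: invented counts, only here so the example has a second bar
site2 <- data.frame(
  Species = c("Acidobacteria", "Actinobacteria", "Ascomycota",
              "Bacteroidetes", "Chlorophyta", "Firmicutes"),
  Count = c(300, 1200, 50, 20000, 400, 2500),
  stringsAsFactors = FALSE
)
sites <- list(Site1 = site1, Site2 = site2)

# The top 5 always come from the first data set
top5Species <- site1$Species[order(site1$Count, decreasing = TRUE)][1:5]

# One column per site, one row per group
props <- sapply(sites, summarise_site, top_species = top5Species)

# A matrix input gives one stacked bar per column
barplot(props * 100,
        col = rainbow(nrow(props)),
        legend.text = rownames(props),
        ylab = "Percent of total count")
```

With four real sites, the list would simply have four entries, each read with read.table as above.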

answered Oct 23 '22 03:10

Brian Diggs

We should make it a habit to use data.table wherever possible:

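library(data.table)
DT <- data.table(DF, key = "Count")
# Relabel every species outside the top 5 by Count as "Other" (assumes more than 5 rows)
DT[order(-rank(Count), Species)[6:nrow(DT)], Species := "Other"]
# Collapse to one row per species, with each group's share of the total count
DT <- DT[, list(Count = sum(Count), Pcnt = sum(Count) / DT[, sum(Count)]), by = "Species"]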
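The question also asks to reuse the first dataset's top 5 in the other datasets. One possible way to do that with data.table is sketched below. It assumes the data.table package is installed and that Species is a character column, not a factor; the helper name collapse_to_top is hypothetical and not part of data.table.

```r
library(data.table)

# Hypothetical helper: keep the species named in `top_species`,
# pool everything else into "Other", and add each group's share.
collapse_to_top <- function(df, top_species) {
  dt <- as.data.table(df)                 # a copy, so `df` is left untouched
  dt[!(Species %in% top_species), Species := "Other"]
  out <- dt[, .(Count = sum(Count)), by = Species]
  out[, Pcnt := Count / sum(Count)]
  out[]
}

# The eight rows shown in the question
site1 <- data.frame(
  Species = c("Acidobacteria", "Actinobacteria", "Apicomplexa", "Aquificae",
              "Arthropoda", "Ascomycota", "Bacillariophyta", "Bacteroidetes"),
  Count = c(47, 497, 7, 16, 26, 101, 1, 50279),
  stringsAsFactors = FALSE
)

# Top 5 species of the reference data set, largest first
top5 <- as.data.table(site1)[order(-Count), head(Species, 5)]

site1_summary <- collapse_to_top(site1, top5)
print(site1_summary)
```

Calling collapse_to_top(other_site, top5) on each further data set then uses the same five names, and selecting by name rather than by row position also works when a data set has fewer than six rows.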

answered Oct 23 '22 04:10

andrekos
