Scatter Plot with Varying Point Sizes

Tags:

r

I am a new R user, so please forgive me if my question seems simple. Despite researching the Cookbook and The Handbook of Statistical Analysis, I have been unable to construct a particular graph to my liking.

The two columns I am trying to graph are Age and Income. Age takes an integer value (40, 34, 50, ...) while income takes a binary value (<=50k, >50k). There are 32561 rows of data with varying ages. I would like to create a plot with age on the X-axis and the binary income variable on the Y-axis, plot(age, income). This of course leads to a plot with two parallel lines, since income is a binary variable, which is fine.

The information I am trying to get from the plot is the number of people of a given age who fall into each of the income buckets. The way I would like to do this is by having circle sizes proportional to the number of people at a certain age within each income class. For example, if there were 700 people at age 25 in the <=50k bracket and 150 in the other bracket, the 700 people would be represented by a large circle and the 150 by a much smaller one. I would like to do this for all ages.

I hope this makes sense. Please let me know if clarification is needed. Thanks! I am sure you will hear from me again in the not too distant future.

user2214069 asked Dec 11 '22 16:12


1 Answer

It's easier to answer these questions with example data, but in this case it was easy enough to come up with something that roughly reflected the problem:

age = rep(c(20, 30, 40, 50, 60), 20)
income = c(rep(">50k", 80), rep("<50k", 20))

df1 = data.frame(age=age, income=income)

First we generate a summary of the data, getting the count of people at each combination of age and income:

library(plyr)
df1_summary = ddply(
  df1,
  .(age, income),
  summarize,
  count=length(income)
  )
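If you would rather not depend on plyr, the same summary can probably be built with base R alone. The following is only a sketch under that assumption: the name df1_counts is made up for illustration, and it rebuilds the toy data from above so that it runs on its own.

```r
# Toy data, mirroring the example above
age <- rep(c(20, 30, 40, 50, 60), 20)
income <- c(rep(">50k", 80), rep("<50k", 20))
df1 <- data.frame(age = age, income = income)

# Cross-tabulate age by income, then flatten the table so that
# there is one row per (age, income) combination
df1_counts <- as.data.frame(
  table(age = df1$age, income = df1$income),
  responseName = "count",
  stringsAsFactors = FALSE
)

# table() stores the ages as labels, so convert them back to numbers
df1_counts$age <- as.numeric(df1_counts$age)

# Drop combinations that never occur; ddply would not list them either
df1_counts <- df1_counts[df1_counts$count > 0, ]
```

The resulting data frame has the same age, income and count columns as df1_summary, so it should be usable in the plotting step in the same way.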

Then it's easy to plot using ggplot2:

library(ggplot2)
ggplot(df1_summary, aes(age, income, size=count)) +
  geom_point()

[Figure: point size mapped to counts]
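Since the question started from plot(age, income), here is a rough base-graphics version of the same idea. Treat it as a sketch rather than the one right way: the counts are hard-coded from the toy data, and max_cex is an arbitrary value I picked for the largest point. The radius is scaled by the square root of the count so that the area of each circle, not its diameter, tracks the number of people.

```r
# Counts per (age, income) pair, hard-coded from the toy data
counts <- data.frame(
  age    = rep(c(20, 30, 40, 50, 60), times = 2),
  income = rep(c("<50k", ">50k"), each = 5),
  count  = rep(c(4, 16), each = 5)
)

# Put the two income classes at y = 1 and y = 2
income_f <- factor(counts$income)
y <- as.integer(income_f)

# Scale point size by sqrt(count) so circle area is proportional to count;
# max_cex (an arbitrary choice) is the size of the largest point
max_cex <- 5
cex <- max_cex * sqrt(counts$count / max(counts$count))

# Draw the points, suppressing the default numeric y-axis
plot(counts$age, y, cex = cex, pch = 16, yaxt = "n",
     ylim = c(0.5, 2.5), xlab = "age", ylab = "income")

# Label the y-axis with the income classes instead of 1 and 2
axis(2, at = seq_along(levels(income_f)), labels = levels(income_f))
```

With the real data you would compute the counts first and then adjust max_cex until the largest circles stop overlapping their neighbours.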

Marius answered Jan 03 '23 19:01