First, here is part of mydata (121315 rows x 4 columns):
LONGITUDE LATITUDE NUM_PICKUPS TOTAL_REVENUE
1 121.6177 38.9124 21 337.0
2 121.8069 39.0210 16 454.7
3 121.5723 38.9645 38 696.9
4 121.6423 38.9258 622 13609.7
5 121.5647 38.9129 116 2016.7
6 121.6429 38.8846 120 2417.3
7 121.5852 38.9279 117 1975.0
8 121.6616 38.9189 94 1712.4
9 121.5812 38.9828 50 981.6
10 121.6411 38.9255 225 4696.2
As you can see, the first and second columns are the longitude and latitude. mydata[1,3]=21 means that at the place (121.6177, 38.9124) there are 21 pickups.
Then I sort mydata by NUM_PICKUPS in descending order:
LONGITUDE LATITUDE NUM_PICKUPS TOTAL_REVENUE
121.6019 39.0181 14243 514716
121.5382 38.9609 13244 443754.7
121.5381 38.9609 9645 325056
121.5382 38.9608 8846 294345.6
121.602 39.0181 6556 232254.5
121.5383 38.9609 6152 208967.6
121.5383 38.9608 6014 207677.8
121.5381 38.9608 5544 185398.3
121.6018 39.018 4546 167662.1
121.5382 38.9607 4260 143088.9
121.5827 38.8948 4133 72202.8
121.6303 38.9183 3837 67683.6
121.5966 38.9665 3747 56378.7
And here is the summary of mydata:
summary(mydata)
LONGITUDE LATITUDE NUM_PICKUPS TOTAL_REVENUE
Min. :121.1 Min. :38.76 Min. : 10.00 Min. : 92.9
1st Qu.:121.6 1st Qu.:38.91 1st Qu.: 15.00 1st Qu.: 289.7
Median :121.6 Median :38.92 Median : 27.00 Median : 515.1
Mean :121.6 Mean :38.93 Mean : 57.03 Mean : 1067.6
3rd Qu.:121.6 3rd Qu.:38.96 3rd Qu.: 59.00 3rd Qu.: 1089.5
Max. :122.0 Max. :39.32 Max. :14243.00 Max. :514716.0
Now I want to draw a map colored by NUM_PICKUPS. Here is my code:
g1 <- ggplot() + geom_point(data = mydata,aes(x = LONGITUDE,y = LATITUDE,color=NUM_PICKUPS))
The code runs and the graph is drawn, but look at the colors: it's hard to identify which places have a high NUM_PICKUPS and which have a low one.
Second try: I modified my code with scale_colour_gradient():
g1 + scale_colour_gradient(low = "red",high = "white")
But looking at the picture, the colors are still hard to tell apart.
Third try: this time I added alpha=I(1/100) and the breaks argument:
g1 <- ggplot() + geom_point(data = mydata,aes(x = LONGITUDE,y = LATITUDE,color=NUM_PICKUPS),alpha=I(1/100))
g1 + scale_colour_gradient(low = "red",high = "white", breaks=c(0,2000,4000))
But it still doesn't help!
Fourth try:
ggplot(data = mydata, aes(x = LONGITUDE,y = LATITUDE, color = NUM_PICKUPS)) + geom_point() + scale_colour_gradient(limits = c(0, 60))
Fifth try: following a post from 3 years ago, ggplot2 Color Scale Over Affected by Outliers, I modified my code again:
mydata$NUM_PICKUPS1 <- "> 2000"
mydata$NUM_PICKUPS1[mydata$NUM_PICKUPS <= 2000] <- NA
g2 <- ggplot() +
  geom_point(data = subset(mydata, NUM_PICKUPS <= 2000),
             aes(x = LONGITUDE, y = LATITUDE, color = NUM_PICKUPS), size = 2) +
  geom_point(data = subset(mydata, NUM_PICKUPS > 2000),
             aes(x = LONGITUDE, y = LATITUDE, fill = NUM_PICKUPS1))
The outliers did change, but the color scale is still hard to read!
So my question is: how can I modify my code to make the colors for NUM_PICKUPS easy to distinguish?
As per my comment and your response, I think the problem is that you have some outliers that are forcing the scale to expand to accommodate them.
From your summary(), 75% of your values of NUM_PICKUPS are between 10 and 59. The remaining 25% then range up to 14243, more than two orders of magnitude above the third quartile!
To summarise, the range of your values of NUM_PICKUPS is too great to show variation at anything below about 1,000.
The solution you choose will depend on your data and what you want to do with it. One option is to simply show only the values up to the 75th percentile and exclude the highest 25% as outliers. Values outside the limits are treated as missing by the scale and drawn in its na.value colour (grey by default). You could do this without altering the data by manually setting the limits with, I think:
g1 + scale_colour_gradient(limits = c(0, 60))
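If you would rather keep the high values visible instead of having them greyed out, one variation is to clamp them to the top colour. The following is only a sketch: `demo` is simulated stand-in data (not your mydata), the 75th-percentile cut-off and the white-to-red palette are my assumptions, and `oob = scales::squish` is the scale argument that replaces out-of-range values with the nearest limit.

```r
library(ggplot2)

# Simulated stand-in for mydata (NOT the real data): a right-skewed count
set.seed(1)
n <- 2000
demo <- data.frame(
  LONGITUDE   = runif(n, 121.1, 122.0),
  LATITUDE    = runif(n, 38.76, 39.32),
  NUM_PICKUPS = round(10 + rlnorm(n, meanlog = 3, sdlog = 1.2))
)

# Take the upper limit from the data instead of hard-coding it
upper <- unname(quantile(demo$NUM_PICKUPS, 0.75))

p_squish <- ggplot(demo, aes(x = LONGITUDE, y = LATITUDE, color = NUM_PICKUPS)) +
  geom_point() +
  scale_colour_gradient(
    low = "white", high = "red",
    limits = c(0, upper),
    oob = scales::squish  # clamp out-of-range values to the nearest limit
  )
p_squish
```

With this, every point above the upper limit gets the same "high" colour, so you lose the differences among the outliers but keep the variation in the bulk of the data.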
Another option would be to transform your data (perhaps with log() or log10()). For example, mydata$LOG_PICKUPS <- log10(mydata$NUM_PICKUPS) might help reduce the range sufficiently to plot.
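A minimal sketch of the log10 idea, again on simulated stand-in data rather than your mydata: the palette and the legend labelling are my own choices. The colour is mapped to the transformed column, and the legend labels are converted back to the original counts so they stay readable.

```r
library(ggplot2)

# Simulated stand-in for mydata (NOT the real data): a right-skewed count
set.seed(1)
n <- 2000
demo <- data.frame(
  LONGITUDE   = runif(n, 121.1, 122.0),
  LATITUDE    = runif(n, 38.76, 39.32),
  NUM_PICKUPS = round(10 + rlnorm(n, meanlog = 3, sdlog = 1.2))
)

# Compress the range: each unit of LOG_PICKUPS is a factor of 10 in pickups
demo$LOG_PICKUPS <- log10(demo$NUM_PICKUPS)

p_log <- ggplot(demo, aes(x = LONGITUDE, y = LATITUDE, color = LOG_PICKUPS)) +
  geom_point() +
  scale_colour_gradient(
    low = "white", high = "red",
    name = "NUM_PICKUPS",
    labels = function(b) round(10^b)  # show legend in original units
  )
p_log
```

The scale functions can also apply the transformation for you through their transformation argument (`trans = "log10"`, named `transform` in recent ggplot2 releases, as far as I know), which avoids adding a column.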