Scale all values depending on group [duplicate]

Tags: r, tapply, scale

I have a dataframe similar to this one

ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
p1 <- c(21000, 23400, 26800, 2345, 23464, 34563, 456433, 56543, 34543,3524, 353, 3432, 4542, 6343, 4534 )
p2 <- c(234235, 2342342, 32, 23432, 23423, 2342342, 34, 2343, 23434, 23434, 34, 234, 2343, 34, 5)
my.df <- data.frame(ID, p1, p2)

Now I would like to scale the values in p1 and p2 separately for each ID. The whole column should not be scaled at once, as happens when I use the tapply() function; instead, scaling should be done once for all values with ID 1, then for all values with ID 2, and so on. The same applies to p2. The new dataframe should contain the scaled values.

I already tried

df_scaled <- ddply(my.df, my.df$ID, scale(my.df$p1))

but get the error message

.fun is not a function.

Thanks for your help!

GNee asked Jan 20 '17 10:01



1 Answer

dplyr makes this easy:

ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
p1 <- c(21000, 23400, 26800, 2345, 23464, 34563, 456433, 56543, 34543,3524, 353, 3432, 4542, 6343, 4534 )
p2 <- c(234235, 2342342, 32, 23432, 23423, 2342342, 34, 2343, 23434, 23434, 34, 234, 2343, 34, 5)
my.df <- data.frame(ID, p1, p2)

library(dplyr)
df_scaled <- my.df %>% group_by(ID) %>% mutate(p1 = scale(p1), p2=scale(p2))

Note that there is a bug in the stable version of dplyr when working with scale; you might need to update to the dev version (see comments).
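If updating dplyr is not an option, the same per-group scaling can probably be done in base R. The following is only a minimal sketch, not a tested replacement for the code above: it assumes the example data from the question, and the helper name scale_vec is made up for illustration. scale() returns a one-column matrix with extra attributes, which may be related to the issue mentioned above, so the helper converts the result back to a plain numeric vector before ave() writes it into the data frame.

```r
# Example data from the question
ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
p1 <- c(21000, 23400, 26800, 2345, 23464, 34563, 456433, 56543, 34543, 3524, 353, 3432, 4542, 6343, 4534)
p2 <- c(234235, 2342342, 32, 23432, 23423, 2342342, 34, 2343, 23434, 23434, 34, 234, 2343, 34, 5)
my.df <- data.frame(ID, p1, p2)

# Hypothetical helper: scale a vector and drop the matrix
# dimensions and attributes that scale() attaches to its result
scale_vec <- function(x) as.vector(scale(x))

# ave() applies FUN within each level of the grouping variable
# and returns the results in the original row order
df_scaled <- my.df
df_scaled$p1 <- ave(my.df$p1, my.df$ID, FUN = scale_vec)
df_scaled$p2 <- ave(my.df$p2, my.df$ID, FUN = scale_vec)

# Sanity check: per-ID means and standard deviations of the scaled column
tapply(df_scaled$p1, df_scaled$ID, mean)
tapply(df_scaled$p1, df_scaled$ID, sd)
```

After this, each ID group in p1 and p2 should have a mean of roughly 0 and a standard deviation of 1, which is what scaling by group is meant to achieve.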

mpjdem answered Oct 13 '22 17:10