This is an optimization question, this is my current (working) situation:
The client request is to perform a SQL query, and have the results displayed on a web page in some tables, every one representing a metric with a counter, like:
Age => 18-20: 15 users, 21-35: 42 users, ...
Country => USA: 22 users, GB: 0 users, ...
And so on. Counting all tables (together with some about users' session, which are generated based on activity, period and year) we have currently ~200 metrics.
The last released system in production uses (considering df as the DataFrame resulting from SQL query):
df.rdd.aggregate(metricsMap) (
(acc: MetricsMap, r:Row) => {
acc.analyzeRow(r)
acc
},
(acc1: MetricsMap, acc2: MetricsMap) => {
acc1.merge(acc2)
acc1
}
)
Where MetricsMap is an object used to extract and aggregate data from the row.
This operation is very CPU intensive, and on the server it takes ~ 20 seconds to extract the data from a query without parameters (so from all data in the parquet file).
I've decided to use aggregation because, for their statistical analysis, they wanted multiple prospectives: some metrics should be counted by user key, another by user name (for the site...) and the other by product key. Using this method I had to cycle through the result only one time, but I don't know if it's the better approach...
Is this the better approach, or it exists some other (faster) method to obtain the same result?
Regarding the question about calculate metrics beforehand, the queries that they can do on the dataset are not bound, so I don't know if this is possible or not... Could you please give me an example?
Answering to some questions
One path of the data in such cases is obviously better than have multiple cycles, in cases you want to answer multiple queries for the same request.
Might be more efficient though not to interop with spark core.
For example, if your DF schema is as follows:
root
-- age
-- country
Then you might try to do the following pseudo base query:
Select
CASE WHEN (age BETWEEN 18 AND 22) THEN '18-22'
WHEN (age BETWEEN 22 AND 30) THEN '22-30'
ELSE 'Other' as age_group,
country
from metrics_df
You could also consider using a UDF for the age group. As @assaf-mendelson mentioned, more info would be useful here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With