 

Spark, optimize metrics generation from DF

This is an optimization question; here is my current (working) situation:

  • Spark is running in standalone mode using spark-jobserver;
  • I have a Parquet file with ~3M rows cached in memory as a table;
  • The table comprises all the data from an e-commerce site; every row represents a user, but a user can have multiple rows;

The client request is to perform a SQL query and display the results on a web page in a set of tables, each one representing a metric with a counter, like:

Age => 18-20: 15 users, 21-35: 42 users, ...

Country => USA: 22 users, GB: 0 users, ...

And so on. Counting all the tables (together with some about user sessions, which are generated based on activity, period, and year), we currently have ~200 metrics.

The last system released to production uses the following (where df is the DataFrame resulting from the SQL query):

df.rdd.aggregate(metricsMap)(
  // seqOp: fold each Row of a partition into the accumulator
  (acc: MetricsMap, r: Row) => {
    acc.analyzeRow(r)
    acc
  },
  // combOp: merge the per-partition accumulators
  (acc1: MetricsMap, acc2: MetricsMap) => {
    acc1.merge(acc2)
    acc1
  }
)

Where MetricsMap is an object used to extract and aggregate data from the row.
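To give an idea, a stripped-down sketch of what such an accumulator could look like (this is not the actual implementation; only two of the ~200 metrics are shown, with illustrative field names and buckets):

import org.apache.spark.sql.Row
import scala.collection.mutable

// Illustrative sketch only -- the real MetricsMap extracts ~200 metrics
class MetricsMap extends Serializable {
  val ageCounts     = mutable.Map.empty[String, Long].withDefaultValue(0L)
  val countryCounts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  def analyzeRow(r: Row): Unit = {
    val age = r.getAs[Int]("age")
    val bucket = if (age <= 20) "18-20" else if (age <= 35) "21-35" else "Other"
    ageCounts(bucket) += 1
    countryCounts(r.getAs[String]("country")) += 1
  }

  def merge(other: MetricsMap): Unit = {
    other.ageCounts.foreach { case (k, v) => ageCounts(k) += v }
    other.countryCounts.foreach { case (k, v) => countryCounts(k) += v }
  }
}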

This operation is very CPU-intensive, and on the server it takes ~20 seconds to extract the data for a query without parameters (i.e., over all the data in the Parquet file).

I decided to use aggregation because, for their statistical analysis, they wanted multiple perspectives: some metrics should be counted by user key, others by user name (for the site...), and others by product key. Using this method I only had to cycle through the results once, but I don't know if it's the best approach...

Is this the best approach, or is there some other (faster) method to obtain the same result?

Answering some questions:

Regarding the question about calculating metrics beforehand: the queries they can run on the dataset are not fixed, so I don't know whether this is possible or not... Could you please give me an example?

asked Jan 14 '17 by Marco Fedele

1 Answer

One pass over the data in such cases is obviously better than multiple cycles when you want to answer multiple queries for the same request.

It might be more efficient, though, not to drop down to Spark core (the RDD API) and instead stay in the DataFrame/SQL API.

For example, if your DF schema is as follows:

root
 |-- age
 |-- country

Then you might try the following pseudo base query:

SELECT
  CASE WHEN age BETWEEN 18 AND 22 THEN '18-22'
       WHEN age BETWEEN 23 AND 30 THEN '23-30'
       ELSE 'Other'
  END AS age_group,
  country
FROM metrics_df
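The counts themselves can then be collected with the DataFrame API, letting Catalyst generate the aggregation code instead of a hand-rolled accumulator. A minimal sketch (column names follow the schema above; the bucket boundaries are illustrative):

import org.apache.spark.sql.functions.{col, when}

// Derive the age bucket as a Catalyst expression (same logic as the SQL above)
val withAgeGroup = df.withColumn("age_group",
  when(col("age").between(18, 22), "18-22")
    .when(col("age").between(23, 30), "23-30")
    .otherwise("Other"))

// Each metric becomes a groupBy/count over the cached table
val byAge     = withAgeGroup.groupBy("age_group").count()
val byCountry = withAgeGroup.groupBy("country").count()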

You could also consider using a UDF for the age group. As @assaf-mendelson mentioned, more info would be useful here.
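For example, a minimal UDF sketch, assuming an integer age column (the function name and bucket boundaries are illustrative):

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical bucketing UDF; note that Catalyst cannot see inside a UDF,
// so built-in expressions like when/between usually optimize better
val ageGroup = udf((age: Int) =>
  if (age >= 18 && age <= 22) "18-22"
  else if (age >= 23 && age <= 30) "23-30"
  else "Other")

val bucketed = df.withColumn("age_group", ageGroup(col("age")))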

answered Sep 17 '22 by Lior Chaga