Structuring a Spark Project methodology

Question

I'm currently in my first job that involves Spark. Code is written in Java.

My current task is to continue what was left off by a previous dev.

The task in itself isn't very hard but context:

There are 10+ datasets that come from different sources (files, company's endpoint, tables, etc...)
These datasets have to be joined at the end into a unique dataset.
There are hundreds of "Rules" on those datasets.
Some of these rules can only be applied by specific joins from DsA + DsB for example.

The previous Dev wrote the Rules, the joins in only one specific class (which are getting really big, like thousand liners) with a huge main class too.

I can easily follow the same process and make those classes even bigger but I find it extremely irritating to read and maintain for the future.

For now, I restructured the project in packages like:

read
preprocess
calculations
write

But I also doubt my own methodology and I have no Senior to rely on in the company.

My classes are just utility classes with many (but organized ?) static methods that contain the Rules and I call from a helper class to chain all the processes. I feel like it is kind off "dirty" code but I'm unable to come up with any other solution

I do have a few classes in which I inject a dataset as a parameter so that this dataset serves as a reference in all the methods inside the class. But all the methods are really "specific" for example:

public  class  CurrencyService {    
  private final Dataset<Row> currencyReference;

  // constructor
  public Dataset<Row> applyCurrencyConversion(Dataset<Row> targetDataset){
    // target joins reference and apply conversion
  }

  private Column conversion(Column targetColumn){
    // conversion rules
  }
}

How should I handle Join Keys? Every dataset has a different name for its ID. I was thinking about forcing a key String from a constant for each ID at preprocessing step. Because otherwise, I have to join every dataset manually by knowing their Column's name each time. With preprocessed join keys, I may do something like dsA.join(dsB, Sequence of (joinKeys)).

In the end, I feel like I'm spaghettying the code even more while the previous "single class" style was easier to understand but not to maintain...

I have a background in POO, so am I taking this project too Object-Oriented?

jgp · Accepted Answer

This is a really very broad and impressive question :) - I wish developers have this level of maturity when they start a project.

Pipeline

You are building a pipeline, I recommend you follow:

Ingest (read) the data
Apply data quality rules (make sure the data is ok)
Transform the data (shape it to the way you want)
Perform the join
Save the data as a checkpoint/final result

Big Data pipeline

If you follow this diagram, you should not apply rules after the Gold zone, unless you work on specific sub-projects.

Transformations

For organizing the transformations, static functions are perfectly fine, especially as you are inheriting an existing (working?) project. I would think a little differently for future rules or a project from scratch (you could build a small rules engines where you add the rules to a list and execute at once for example).

Joins

For joining, if you have complex joins you reused, you can imagine isolating them, but your idea of isolating the id columns is good: it is more metadata driven. In some projects, I created a "SuperDataframe" object where I could add some of the missing intelligence to the dataframe. You can also add custom metadata to any columns and have your rules use that: check out https://github.com/jgperrin/net.jgp.labs.spark/blob/master/src/main/java/net/jgp/labs/spark/l090_metadata/l000_add_metadata/AddMetadataApp.java. It looks like:

  Metadata metadata = new MetadataBuilder()
      .putString("x-source", filename)
      .putString("x-format", format)
      .putLong("x-order", i++)
      .build();
  df = df.withColumn(colName, col, metadata);

General advice

Avoid UDFs: they are costly on Catalyst (the optimizer) sees them as blackboxes so it cannot optimize them. Stick to Spark's functions as much as possible.
Stick to dataframes (Dataset<Row>) not Datasets/RDDs for performance optimization.

You're on a good track!

Structuring a Spark Project methodology

Tags:

java

code-cleanup

apache-spark

MisterYUE

1 Answers

Pipeline

Transformations

Joins

General advice

jgp

Recent Activity

Donate For Us

Structuring a Spark Project methodology

Tags:

java

code-cleanup

apache-spark

MisterYUE

1 Answers

Pipeline

Transformations

Joins

General advice

jgp

Related questions

Recent Activity

Donate For Us