Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Structuring a Spark Project methodology

I'm currently in my first job that involves Spark. Code is written in Java.

My current task is to continue what was left off by a previous dev.

The task in itself isn't very hard but context:

  • There are 10+ datasets that come from different sources (files, company's endpoint, tables, etc...)
  • These datasets have to be joined at the end into a unique dataset.
  • There are hundreds of "Rules" on those datasets.
  • Some of these rules can only be applied by specific joins from DsA + DsB for example.

The previous Dev wrote the Rules, the joins in only one specific class (which are getting really big, like thousand liners) with a huge main class too.

I can easily follow the same process and make those classes even bigger but I find it extremely irritating to read and maintain for the future.

For now, I restructured the project in packages like:

  • read
  • preprocess
  • calculations
  • write

But I also doubt my own methodology and I have no Senior to rely on in the company.

  • My classes are just utility classes with many (but organized ?) static methods that contain the Rules and I call from a helper class to chain all the processes. I feel like it is kind off "dirty" code but I'm unable to come up with any other solution

  • I do have a few classes in which I inject a dataset as a parameter so that this dataset serves as a reference in all the methods inside the class. But all the methods are really "specific" for example:

    public  class  CurrencyService {    
      private final Dataset<Row> currencyReference;
    
      // constructor
      public Dataset<Row> applyCurrencyConversion(Dataset<Row> targetDataset){
        // target joins reference and apply conversion
      }
    
      private Column conversion(Column targetColumn){
        // conversion rules
      }
    } 
    
  • How should I handle Join Keys? Every dataset has a different name for its ID. I was thinking about forcing a key String from a constant for each ID at preprocessing step. Because otherwise, I have to join every dataset manually by knowing their Column's name each time. With preprocessed join keys, I may do something like dsA.join(dsB, Sequence of (joinKeys)).

In the end, I feel like I'm spaghettying the code even more while the previous "single class" style was easier to understand but not to maintain...

I have a background in POO, so am I taking this project too Object-Oriented?

like image 258
MisterYUE Avatar asked May 18 '26 19:05

MisterYUE


1 Answers

This is a really very broad and impressive question :) - I wish developers have this level of maturity when they start a project.

Pipeline

You are building a pipeline, I recommend you follow:

  1. Ingest (read) the data
  2. Apply data quality rules (make sure the data is ok)
  3. Transform the data (shape it to the way you want)
  4. Perform the join
  5. Save the data as a checkpoint/final result

Big Data pipeline

If you follow this diagram, you should not apply rules after the Gold zone, unless you work on specific sub-projects.

Transformations

For organizing the transformations, static functions are perfectly fine, especially as you are inheriting an existing (working?) project. I would think a little differently for future rules or a project from scratch (you could build a small rules engines where you add the rules to a list and execute at once for example).

Joins

For joining, if you have complex joins you reused, you can imagine isolating them, but your idea of isolating the id columns is good: it is more metadata driven. In some projects, I created a "SuperDataframe" object where I could add some of the missing intelligence to the dataframe. You can also add custom metadata to any columns and have your rules use that: check out https://github.com/jgperrin/net.jgp.labs.spark/blob/master/src/main/java/net/jgp/labs/spark/l090_metadata/l000_add_metadata/AddMetadataApp.java. It looks like:

  Metadata metadata = new MetadataBuilder()
      .putString("x-source", filename)
      .putString("x-format", format)
      .putLong("x-order", i++)
      .build();
  df = df.withColumn(colName, col, metadata);

General advice

  1. Avoid UDFs: they are costly on Catalyst (the optimizer) sees them as blackboxes so it cannot optimize them. Stick to Spark's functions as much as possible.
  2. Stick to dataframes (Dataset<Row>) not Datasets/RDDs for performance optimization.

You're on a good track!

like image 69
jgp Avatar answered May 20 '26 08:05

jgp