
Spark Dataset: Filter if value is contained in other dataset

I want to get all links out of a dataset of edges, whose source is contained in a dataset of all existing nodes.

edges columns: | dst | src | type | (all strings)

nodes columns: | id | pageid | (all strings)

I did that by collecting a list from the nodes dataset and using the contains() method:

List<String> allNodeList = allNodes.javaRDD().map(r -> r.getString(0)).collect();
Dataset<Row> allLinks = dfEdges.filter("type = 'link'").filter(r -> allNodeList.contains(r.getString(1)));

But now I want to eliminate that additional piece of code and use a more native way. My approach was to filter with a nested count, but that doesn't work: referencing dfNodes inside the filter lambda causes a NotSerializableException.

Dataset<Row> allLinks = dfEdges.filter("type = 'link'").filter(r -> (dfNodes.filter("id="+r.getString(1)).count()>0));

Is there any simple way to solve that problem in Java? I've seen Scala examples using "is in" or similar, but I have no idea how to do it simply in Java.
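(Note: the list-based filter above can also be written without a lambda, using Column.isin from the Spark Java API. The following is a minimal runnable sketch; the Edge bean class and the sample values are made up for illustration.)

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class IsInSketch {

    // Bean standing in for the real edge data (illustrative only).
    public static class Edge implements Serializable {
        private String dst, src, type;
        public Edge() {}
        public Edge(String dst, String src, String type) {
            this.dst = dst; this.src = src; this.type = type;
        }
        public String getDst() { return dst; }
        public void setDst(String dst) { this.dst = dst; }
        public String getSrc() { return src; }
        public void setSrc(String src) { this.src = src; }
        public String getType() { return type; }
        public void setType(String type) { this.type = type; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("isin-sketch")
                .master("local[*]")
                .getOrCreate();

        // Stand-in for the node ids collected with collect().
        List<String> allNodeList = Arrays.asList("a", "b");

        Dataset<Row> dfEdges = spark.createDataFrame(Arrays.asList(
                new Edge("b", "a", "link"),   // src "a" is in the list
                new Edge("c", "x", "link")),  // src "x" is not
                Edge.class);

        // Column.isin takes varargs, so pass the list as an array.
        Dataset<Row> allLinks = dfEdges
                .filter(col("type").equalTo("link"))
                .filter(col("src").isin(allNodeList.toArray()));

        allLinks.show();
        spark.stop();
    }
}
```

This still requires collecting the node ids to the driver first, so it only suits small node sets; for large ones a join (as in the answer below) is the scalable route.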

asked Mar 08 '23 by tobias.
1 Answer

Yes, there is a simple way to solve the problem in Java, but only through a join. Like this:

Dataset<Row> allLinks = dfEdges.filter("type = 'link'")
        .join(dfNodes, dfEdges.col("src").equalTo(dfNodes.col("id")))
        .drop("dst", "src", "type");

It will give you the desired result.

I hope it helps!
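As a side note, if you want to keep the edge columns and avoid pulling the node columns into the result at all, Spark also supports a "left_semi" join type, which acts exactly like an "is in" filter. A minimal runnable sketch; the Edge/Node bean classes and sample values are made up for illustration:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class SemiJoinSketch {

    // Beans standing in for the real edge/node data (illustrative only).
    public static class Edge implements Serializable {
        private String dst, src, type;
        public Edge() {}
        public Edge(String dst, String src, String type) {
            this.dst = dst; this.src = src; this.type = type;
        }
        public String getDst() { return dst; }
        public void setDst(String dst) { this.dst = dst; }
        public String getSrc() { return src; }
        public void setSrc(String src) { this.src = src; }
        public String getType() { return type; }
        public void setType(String type) { this.type = type; }
    }

    public static class Node implements Serializable {
        private String id, pageid;
        public Node() {}
        public Node(String id, String pageid) { this.id = id; this.pageid = pageid; }
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getPageid() { return pageid; }
        public void setPageid(String pageid) { this.pageid = pageid; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("semi-join-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> dfEdges = spark.createDataFrame(Arrays.asList(
                new Edge("b", "a", "link"),       // src "a" exists in nodes
                new Edge("c", "x", "link"),       // src "x" does not
                new Edge("b", "a", "redirect")),  // wrong type
                Edge.class);
        Dataset<Row> dfNodes = spark.createDataFrame(Arrays.asList(
                new Node("a", "1"),
                new Node("b", "2")),
                Node.class);

        // left_semi keeps only edge rows whose src matches some node id,
        // and the result contains only the edge columns -- no drop() needed.
        Dataset<Row> allLinks = dfEdges
                .filter(col("type").equalTo("link"))
                .join(dfNodes, dfEdges.col("src").equalTo(dfNodes.col("id")), "left_semi");

        allLinks.show();
        spark.stop();
    }
}
```

A semi join also never duplicates edge rows when a node id matches more than once, which an inner join can do.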

answered Apr 30 '23 by himanshuIIITian