
Spark Dataset: Filter if value is contained in other dataset

I want to get all links out of a dataset of edges, whose source is contained in a dataset of all existing nodes.

edges columns: | dst | src | type | (all strings)

nodes columns: | id | pageid | (all strings)

I did that by collecting a list from the nodes dataset and using the contains() method:

List<String> allNodeList = allNodes.javaRDD().map(r -> r.getString(0)).collect();
Dataset<Row> allLinks = dfEdges.filter("type = 'link'").filter(r -> allNodeList.contains(r.getString(1)));

But now I want to eliminate that additional piece of code and use a more native way. My approach was to filter with a nested count, but that doesn't work: referencing dfNodes inside the filter lambda causes a NotSerializableException.

Dataset<Row> allLinks = dfEdges.filter("type = 'link'").filter(r -> (dfNodes.filter("id="+r.getString(1)).count()>0));

Is there any simple way to solve that problem in Java? I've seen Scala examples using "is in" or similar, but I have no idea how to do it simply in Java.
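(Note: the list-based filter above can also be written without a lambda, using Column.isin from the Spark Java API. The following is a minimal runnable sketch; the Edge bean class and the sample values are made up for illustration.)

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class IsInSketch {

    // Bean standing in for the real edge data (illustrative only).
    public static class Edge implements Serializable {
        private String dst, src, type;
        public Edge() {}
        public Edge(String dst, String src, String type) {
            this.dst = dst; this.src = src; this.type = type;
        }
        public String getDst() { return dst; }
        public void setDst(String dst) { this.dst = dst; }
        public String getSrc() { return src; }
        public void setSrc(String src) { this.src = src; }
        public String getType() { return type; }
        public void setType(String type) { this.type = type; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("isin-sketch")
                .master("local[*]")
                .getOrCreate();

        // Stand-in for the node ids collected with collect().
        List<String> allNodeList = Arrays.asList("a", "b");

        Dataset<Row> dfEdges = spark.createDataFrame(Arrays.asList(
                new Edge("b", "a", "link"),   // src "a" is in the list
                new Edge("c", "x", "link")),  // src "x" is not
                Edge.class);

        // Column.isin takes varargs, so pass the list as an array.
        Dataset<Row> allLinks = dfEdges
                .filter(col("type").equalTo("link"))
                .filter(col("src").isin(allNodeList.toArray()));

        allLinks.show();
        spark.stop();
    }
}
```

This still requires collecting the node ids to the driver first, so it only suits small node sets; for large ones a join (as in the answer below) is the scalable route.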

asked Mar 08 '23 by tobias.
1 Answer

Yes, there is a simple way to solve the problem in Java, but only through a join. Like this:

Dataset<Row> allLinks = dfEdges.filter("type = 'link'")
        .join(dfNodes, dfEdges.col("src").equalTo(dfNodes.col("id")))
        .drop("dst", "src", "type");

It will give you the desired result.

I hope it helps!
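As a side note, if you want to keep the edge columns and avoid pulling the node columns into the result at all, Spark also supports a "left_semi" join type, which acts exactly like an "is in" filter. A minimal runnable sketch; the Edge/Node bean classes and sample values are made up for illustration:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class SemiJoinSketch {

    // Beans standing in for the real edge/node data (illustrative only).
    public static class Edge implements Serializable {
        private String dst, src, type;
        public Edge() {}
        public Edge(String dst, String src, String type) {
            this.dst = dst; this.src = src; this.type = type;
        }
        public String getDst() { return dst; }
        public void setDst(String dst) { this.dst = dst; }
        public String getSrc() { return src; }
        public void setSrc(String src) { this.src = src; }
        public String getType() { return type; }
        public void setType(String type) { this.type = type; }
    }

    public static class Node implements Serializable {
        private String id, pageid;
        public Node() {}
        public Node(String id, String pageid) { this.id = id; this.pageid = pageid; }
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getPageid() { return pageid; }
        public void setPageid(String pageid) { this.pageid = pageid; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("semi-join-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> dfEdges = spark.createDataFrame(Arrays.asList(
                new Edge("b", "a", "link"),       // src "a" exists in nodes
                new Edge("c", "x", "link"),       // src "x" does not
                new Edge("b", "a", "redirect")),  // wrong type
                Edge.class);
        Dataset<Row> dfNodes = spark.createDataFrame(Arrays.asList(
                new Node("a", "1"),
                new Node("b", "2")),
                Node.class);

        // left_semi keeps only edge rows whose src matches some node id,
        // and the result contains only the edge columns -- no drop() needed.
        Dataset<Row> allLinks = dfEdges
                .filter(col("type").equalTo("link"))
                .join(dfNodes, dfEdges.col("src").equalTo(dfNodes.col("id")), "left_semi");

        allLinks.show();
        spark.stop();
    }
}
```

A semi join also never duplicates edge rows when a node id matches more than once, which an inner join can do.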

answered Apr 30 '23 by himanshuIIITian