From official documentation for Apache Spark:
http://spark.apache.org/docs/latest/rdd-programming-guide.html
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
Going by the bold words, is that a big difference? And is it really a difference at all?
It's really just a difference, from the end user's point of view, in how you use the API. map takes a record as input and returns a record that some function has been applied to, so the output has one element per input element. filter takes a record as input and returns a boolean, and only the records for which that boolean is true are kept, unchanged. Internally Spark will execute both with mapPartitions.
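To make the contrast concrete, here is a minimal plain-Python sketch, not Spark's actual code. It models a dataset as a list of partitions and expresses both operations through one per-partition helper. The names `map_partitions`, `rdd_map`, and `rdd_filter` are hypothetical stand-ins chosen for illustration, not Spark API calls.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-in for a distributed dataset: a list of partitions,
# each partition being a plain list of records.
Partitions = List[list]


def map_partitions(parts: Partitions,
                   func: Callable[[Iterator], Iterable]) -> Partitions:
    """Apply func to each partition's iterator and materialize the result."""
    return [list(func(iter(p))) for p in parts]


def rdd_map(parts: Partitions, f: Callable) -> Partitions:
    # map: exactly one output record per input record.
    return map_partitions(parts, lambda it: (f(x) for x in it))


def rdd_filter(parts: Partitions, pred: Callable) -> Partitions:
    # filter: keep only records where pred is true; records are not modified.
    return map_partitions(parts, lambda it: (x for x in it if pred(x)))


data = [[1, 2, 3], [4, 5, 6]]

doubled = rdd_map(data, lambda x: x * 2)          # same count, new values
evens = rdd_filter(data, lambda x: x % 2 == 0)    # fewer records, same values
flags = rdd_map(data, lambda x: x % 2 == 0)       # map with a predicate gives booleans
```

Note the last line: passing a boolean-returning function to the map stand-in yields a dataset of booleans, while passing the same function to the filter stand-in yields the original records that satisfy it. That is the practical difference between the two.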