
Are Spark DataFrames distributed?

I'm converting a batch operation to a Spark job that I intend to run on AWS EMR. The core of the job is a join between two reasonably large data sets:

table_1: loaded from json file_1
table_2: loaded from parquet file_2
joined_table = table_1.join(table_2)
   .map(some_data_transformations)

store_it_off(joined_table)

From the definitions I found on Google, a DataFrame is a tabular structure and an RDD is distributed; however, I've also seen notes saying that DataFrames are implemented on top of RDDs. Are DataFrames distributed? Or do they become distributed only after certain steps to parallelize them?

LizH asked Mar 18 '26 22:03



1 Answer

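Yes, Spark DataFrames are distributed.
From Spark: The Definitive Guide:

... Spark DataFrame can span thousands of computers.

This holds for every language API that Spark offers (Scala, Java, Python and R): a DataFrame created through PySpark or SparkR is just as distributed as one created in Scala. No explicit parallelize step is required, because Spark splits the input into partitions when it reads it. What is not distributed is the native DataFrame of plain Python (pandas) or R, which the same book contrasts with Spark's:

... Python/R DataFrames exist on one machine rather than multiple machines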
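
To make "distributed" a bit more concrete for the join in the question, here is a toy model in plain Python (deliberately not Spark code). It assumes rows are dicts and that both tables share a hypothetical `id` column; the helper names (`hash_partition`, `join_partition`, `distributed_join`) are made up for illustration. Spark's planner picks among several join strategies, so treat this only as a sketch of the general idea: rows are split into partitions by join key, and each partition pair can then be joined independently by a different worker.

```python
from collections import defaultdict
from zlib import crc32


def partition_of(key, num_partitions):
    """Deterministically map a join key to a partition index."""
    return crc32(str(key).encode("utf-8")) % num_partitions


def hash_partition(rows, key, num_partitions):
    """Split rows (dicts) into num_partitions lists by hashing row[key]."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_of(row[key], num_partitions)].append(row)
    return partitions


def join_partition(left_rows, right_rows, key):
    """Inner-join two co-located partitions: the work a single worker would do."""
    index = defaultdict(list)
    for right in right_rows:
        index[right[key]].append(right)
    return [{**left, **right}
            for left in left_rows
            for right in index.get(left[key], [])]


def distributed_join(left, right, key, num_partitions=4):
    """Partition both tables by key, then join matching partitions pairwise.

    The loop runs sequentially here; in a cluster each iteration could run
    on a different machine because the partitions do not depend on each other.
    """
    left_parts = hash_partition(left, key, num_partitions)
    right_parts = hash_partition(right, key, num_partitions)
    joined = []
    for left_rows, right_rows in zip(left_parts, right_parts):
        joined.extend(join_partition(left_rows, right_rows, key))
    return joined


# Hypothetical sample data standing in for table_1 and table_2.
table_1 = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
table_2 = [{"id": 1, "score": 10}, {"id": 3, "score": 30}, {"id": 4, "score": 40}]
joined_table = distributed_join(table_1, table_2, key="id")
```

Because rows with the same key always hash to the same partition index, no partition needs to see another partition's rows, which is what lets the join scale out across machines.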

aName answered Mar 21 '26 12:03
