
How can I select a stable subset of rows from a Spark DataFrame?

I've loaded a file into a DataFrame in Zeppelin notebooks like this:

val df = spark.read.format("com.databricks.spark.csv").load("some_file").toDF("c1", "c2", "c3")

This DataFrame has >10 million rows, and I would like to start working with just a subset of the rows, so I use limit:

val df_small = df.limit(1000)

However, now when I try to filter the DataFrame on the string value of one of the columns, I get different results every time I run the following:

df_small.filter($"c1" LIKE "something").show()

How can I take a subset of df that remains stable for every filter I run?

asked Aug 11 '17 by Karmen
1 Answer

Spark evaluates lazily, so the two statements above are only executed when an action such as .show is called; each action re-runs the limit, which can pick a different set of rows. To get a stable subset, either write df_small out to a file and read that file back each time, or call df_small.cache() so the same rows are reused.
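A minimal sketch of both options, assuming a Zeppelin/Spark setup like the one in the question; the output path "some_file_small" is a placeholder:

val df = spark.read.format("com.databricks.spark.csv").load("some_file").toDF("c1", "c2", "c3")

// Option 1: cache the limited DataFrame so subsequent actions reuse
// the same rows instead of recomputing the limit each time.
val df_small = df.limit(1000).cache()
df_small.count()  // force an action so the cache is actually populated

// Option 2: write the subset out once and read that file from then on,
// which pins the rows permanently, even across notebook sessions.
df.limit(1000).write.mode("overwrite").parquet("some_file_small")
val df_stable = spark.read.parquet("some_file_small")

df_stable.filter($"c1" like "something").show()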

answered Sep 26 '22 by toofrellik