I've loaded a file into a DataFrame in Zeppelin notebooks like this:
val df = spark.read.format("com.databricks.spark.csv").load("some_file").toDF("c1", "c2", "c3")
This DataFrame has more than 10 million rows, and I would like to start working with just a subset of them, so I use limit:
val df_small = df.limit(1000)
However, now when I try to filter the DataFrame on the string value of one of the columns, I get different results every time I run the following:
df_small.filter($"c1" LIKE "something").show()
How can I take a subset of df that remains stable for every filter I run?
Spark evaluates transformations lazily, so the two statements above are only executed when an action such as .show() is called. That means limit(1000) is re-evaluated on every run and may pick up a different set of rows each time. To get a subset that stays stable across filters, either write df_small out to a file and read that file back each time, or call df_small.cache() and materialize it so the same rows are reused.
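For illustration, here is a minimal sketch of both options, assuming the Zeppelin spark session and the df from the question; the output path /tmp/df_small is just a placeholder:

```scala
import spark.implicits._

// Option 1: cache the limited subset and materialize it once,
// so later filters reuse the same cached rows (as long as the cache is not evicted).
val df_small = df.limit(1000).cache()
df_small.count()  // action that forces materialization of the cache

// Option 2: persist the subset to disk and read it back,
// which stays stable even across sessions.
df_small.write.mode("overwrite").parquet("/tmp/df_small")
val df_stable = spark.read.parquet("/tmp/df_small")

df_stable.filter($"c1".like("something")).show()
```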