apache-spark tutorials and guides

How to reference a dataframe from within a UDF on another dataframe?

Aug 26, 2022

NullPointerException in org.apache.spark.ml.feature.Tokenizer

Mar 27, 2021

scala apache-spark machine-learning

How to use Scala UDF in PySpark?

Nov 16, 2022

python scala apache-spark pyspark apache-spark-sql

Scala/Spark dataframes: find the column name corresponding to the max

Nov 16, 2022

scala apache-spark dataframe apache-spark-sql argmax

Apache Spark: how to append a new column from a list/array to a Spark dataframe

Jun 14, 2022

scala apache-spark dataframe apache-spark-sql

Pyspark: Is there an equivalent method to pandas info()?

Jan 02, 2021

python pandas apache-spark pyspark

Getting last value of group in Spark

Nov 10, 2018

apache-spark pyspark spark-dataframe sparkr

How to read streaming data in XML format from Kafka?

Aug 24, 2022

apache-spark xml-parsing pyspark-sql spark-structured-streaming

How to flatten columns of type array of structs (as returned by Spark ML API)?

Aug 10, 2022

apache-spark apache-spark-sql apache-spark-ml

Splitting a column in pyspark

Nov 20, 2022

python apache-spark pyspark

Spark: Return empty column if column does not exist in dataframe

Nov 06, 2022

apache-spark pyspark apache-spark-sql pyspark-sql

Apache Spark startsWith in SQL expression

Sep 07, 2022

scala apache-spark apache-spark-sql

Spark AnalysisException when "flattening" DataFrame in Spark SQL

Aug 25, 2022

apache-spark apache-spark-sql

Pyspark - Cumulative sum with reset condition

Jun 24, 2022

python dataframe apache-spark pyspark cumulative-sum

How to find the max value of multiple columns?

Nov 07, 2022

scala apache-spark apache-spark-sql

How to set up Zeppelin to work with remote EMR Yarn cluster

Aug 29, 2022

apache-spark hadoop-yarn emr apache-zeppelin

Spark Convert Data Frame Column to dense Vector for StandardScaler() "Column must be of type org.apache.spark.ml.linalg.VectorUDT"

Mar 09, 2022

python apache-spark pyspark apache-spark-sql apache-spark-ml

Java Apache Spark: Long transformation chains result in quadratic time

May 15, 2019

java apache-spark

Pyspark Dataframe Join using UDF

Feb 07, 2022

python apache-spark pyspark apache-spark-sql user-defined-functions

Set spark.streaming.kafka.maxRatePerPartition for createDirectStream

Sep 16, 2022

apache-spark spark-streaming
