When working in the spark-shell, I frequently want to inspect RDDs (similar to using head in Unix).
For example:
scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> // how to inspect the readmeFile?
and ...
scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> // how to inspect linesContainingSpark?
To print the contents of an RDD, we can use the collect or foreach actions. collect() returns all elements of the dataset as an array to the driver program, and we can then loop over that array to print each element.
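As a minimal sketch in the spark-shell, using the readmeFile RDD from the question above (note that collect() pulls the whole dataset back to the driver, so this is only safe for small RDDs):
scala> // collect() returns an Array[String] on the driver; print each element
scala> readmeFile.collect().foreach(println)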
I found out how to do this (here) and thought it would be useful for other users, so I'm sharing it here. take(x)
returns the first x elements and foreach prints them:
scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> readmeFile.take(5).foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
and ...
scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> linesContainingSpark.take(5).foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming
The examples below are equivalent, but use pyspark:
>>> readmeFile = sc.textFile("input/tmp/README.md")
>>> for line in readmeFile.take(5): print(line)
...
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
and
>>> linesContainingSpark = readmeFile.filter(lambda line: "Spark" in line)
>>> for line in linesContainingSpark.take(5): print(line)
...
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming
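For completeness, collect() works the same way from pyspark; as above, this is just a sketch and it loads the entire RDD onto the driver, so reserve it for datasets you know are small:
>>> # collect() returns a Python list on the driver; fine for a small filtered RDD
>>> for line in linesContainingSpark.collect(): print(line)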