I am using Spark 1.3.0 with DataFrames and Spark SQL in Scala. In version 1.2.0 there was a method called "saveAsText"; in 1.3.0 DataFrames only expose a "save" method, and its default output format is Parquet.
How can I specify that the output should be text when using the save method?
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.save("/user/me/out")
In newer versions of Spark you can save (write) a DataFrame to a CSV file on disk with dataframeObj.write.csv("path"), and the same call also lets you write the DataFrame to AWS S3, Azure Blob, HDFS, or any other Spark-supported file system.
Similarly, text("file_name") reads a file or directory of text files into a Spark DataFrame, and dataframe.write.text("path") writes one out as text. When reading a text file, each line becomes a row with a single string column named "value" by default.
You can use this:
teenagers.rdd.saveAsTextFile("/user/me/out")
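Note that this writes each Row using its toString, so a one-column result typically comes out looking like [Michael]. If you want just the raw values, a small sketch (the column index and output path are only illustrative):

// Pull the string out of each Row before saving, so the file contains plain
// values instead of Row formatting like "[Michael]".
teenagers.rdd.map(row => row.getString(0)).saveAsTextFile("/user/me/out_names")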
First off, you should consider whether you really need to save the DataFrame as text. Because a DataFrame holds data by columns (not by rows like an RDD), the .rdd operation is costly: the data has to be reassembled into rows. Parquet is a columnar format and is much more efficient to work with.
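For reference, keeping the columnar format is just the plain save call from the question, since Parquet is the default source in 1.3; a minimal sketch with an illustrative output path:

// In Spark 1.3, save with no format argument writes Parquet by default.
teenagers.save("/user/me/out_parquet")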
That being said, sometimes you really do need to save as a text file.
As far as I know, DataFrame won't let you save as a text file out of the box.
If you look at the source code, you'll see that 4 formats are supported:
jdbc
json
parquet
orc
So your options are either to use df.rdd.saveAsTextFile as suggested above, or to use spark-csv, which lets you do something like:
Spark 1.4+:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv")
Spark 1.3:
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")
with the added value that it handles the annoying parts of quoting and escaping strings for you.
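If you also need control over the delimiter or quote character, spark-csv exposes write options for that; a sketch using the Spark 1.4+ writer API, where the option values and output path are only illustrative:

// spark-csv write options; the delimiter/quote values here are just examples.
df.select("year", "model")
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("quote", "\"")
  .save("newcars.tsv")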