 

How to set the number of partitions/nodes when importing data into Spark

Problem: I want to import data from S3 into Spark running on EMR using:

data = sqlContext.read.json("s3n://.....")

Is there a way I can set the number of nodes that Spark uses to load and process the data? This is an example of how I process the data:

data.registerTempTable("table")
SqlData = sqlContext.sql("SELECT * FROM table")

Context: The data is not that big, but it takes a long time to load into Spark and a long time to query. I think Spark splits the data across too many partitions/nodes, and I want to be able to set that number manually. I know that when dealing with RDDs and sc.parallelize I can pass the number of partitions as an input. I have also seen repartition(), but I am not sure whether it solves my problem. The variable data is a DataFrame in my example.

Let me define "partition" more precisely. Definition one: commonly referred to as a "partition key", where a column is selected and indexed to speed up queries (that is not what I want). Definition two (this is where my concern is): given a data set, Spark decides to distribute it across many nodes so it can run operations on the data in parallel. If the data size is too small, this can actually slow things down. How can I set that value?
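For reference, this is roughly what I have tried so far; the bucket path is just a placeholder, and I am not sure repartition() is the right tool here:

data = sqlContext.read.json("s3n://my-bucket/data/")  # placeholder path
print(data.rdd.getNumPartitions())                    # how many partitions did Spark choose?
data = data.repartition(10)                           # force a smaller number of partitions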

asked Jan 04 '16 by pemfir

People also ask

How do I change the number of partitions in a Spark data frame?

If you want to increase the number of partitions of your DataFrame, all you need to run is the repartition() function. It returns a new DataFrame partitioned by the given partitioning expressions; the resulting DataFrame is hash partitioned.
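For example (a minimal sketch; the DataFrame df and the column name "year" are only illustrative):

df = df.repartition(50)          # increase to 50 partitions (full shuffle)
df = df.repartition(50, "year")  # or hash-partition by a column's values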

How do I reduce the number of partitions in Spark?

Spark's coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition() in which the movement of data across partitions is minimized.
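Again as a sketch with a hypothetical df:

df = df.coalesce(5)  # merge down to 5 partitions; avoids a full shuffle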


2 Answers

By default, Spark SQL uses 200 shuffle partitions. You can change this with a set command on the SQL context: sqlContext.sql("set spark.sql.shuffle.partitions=10"). However, you should choose the value with caution based on your data characteristics.
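Note that this property only affects stages that shuffle data (joins, aggregations, etc.), not the partitioning of the initial read. A rough illustration, reusing the temp table from the question and a hypothetical column name:

sqlContext.sql("set spark.sql.shuffle.partitions=10")
result = sqlContext.sql("SELECT some_col, COUNT(*) FROM table GROUP BY some_col")  # GROUP BY forces a shuffle
print(result.rdd.getNumPartitions())  # should now report 10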

answered by Durga Viswanath Gadiraju


You can call repartition() on the DataFrame to set the number of partitions. You can also set the spark.sql.shuffle.partitions property after creating the Hive context, or pass it to the spark-submit jar:

spark-submit .... --conf spark.sql.shuffle.partitions=100

or

dataframe.repartition(100)
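To confirm the effect, or to set the property programmatically as mentioned above, something along these lines should work (a sketch; dataframe stands for the DataFrame from the question):

sqlContext.setConf("spark.sql.shuffle.partitions", "100")  # after creating the HiveContext/SQLContext
dataframe = dataframe.repartition(100)
print(dataframe.rdd.getNumPartitions())                    # verify: should print 100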
answered by Raju Bairishetti