Pyspark: spark-submit not working like CLI

Tags:

apache-spark

pyspark

I have a pyspark script that loads data from a TSV file, saves it as a parquet file, and also saves it as a persistent SQL table.

When I run it line by line in the pyspark CLI, it works exactly as expected. When I run it as an application using spark-submit, it runs without any errors but I get strange results:

1. The data is overwritten instead of appended.
2. When I run SQL queries against it, no data is returned, even though the parquet files are several gigabytes in size (which is what I expect).

Any suggestions?

Code:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext  # not pre-imported under spark-submit
from pyspark.sql.types import *
from pyspark.sql.functions import *

csv_file = '/srv/spark/data/input/ipfixminute2018-03-28.tsv'
parquet_dir = '/srv/spark/data/parquet/ipfixminute'

sc = SparkContext(appName='import-ipfixminute')
spark = SQLContext(sc)

fields = [StructField('time_stamp', TimestampType(), True),
                StructField('subscriberId', StringType(), True),
                StructField('sourceIPv4Address', StringType(), True),
                StructField('destinationIPv4Address', StringType(), True),
                StructField('service',StringType(), True),
                StructField('baseService',StringType(), True),
                StructField('serverHostname', StringType(), True),
                StructField('rat', StringType(), True),
                StructField('userAgent', StringType(), True),
                StructField('accessPoint', StringType(), True),
                StructField('station', StringType(), True),
                StructField('device', StringType(), True),
                StructField('contentCategories', StringType(), True),
                StructField('incomingOctets', LongType(), True),
                StructField('outgoingOctets', LongType(), True),
                StructField('incomingShapingDrops', IntegerType(), True),
                StructField('outgoingShapingDrops', IntegerType(), True),
                StructField('qoeIncomingInternal', DoubleType(), True),
                StructField('qoeIncomingExternal', DoubleType(), True),
                StructField('qoeOutgoingInternal', DoubleType(), True),
                StructField('qoeOutgoingExternal', DoubleType(), True),
                StructField('incomingShapingLatency', DoubleType(), True),
                StructField('outgoingShapingLatency', DoubleType(), True),
                StructField('internalRtt', DoubleType(), True),
                StructField('externalRtt', DoubleType(), True),
                StructField('HttpUrl',StringType(), True)]

schema = StructType(fields)
df = spark.read.load(csv_file, format='csv', sep='\t', header=True,
                     schema=schema, timestampFormat='yyyy-MM-dd HH:mm:ss')
df = df.drop('all')
df = df.withColumn('date', to_date('time_stamp'))
df.write.saveAsTable('test2', mode='append', partitionBy='date', path=parquet_dir)


asked May 22 '18 14:05

Mikhail Venkov

1 Answer

As @user8371915 suggested, this is similar to the following question:

Spark can access Hive table from pyspark but not from spark-submit

I needed to replace

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

with

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

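This resolved the issue. As far as I can tell, the pyspark shell starts with a Hive-enabled session, whereas a plain SQLContext created under spark-submit uses an in-memory catalog, so the table registered by saveAsTable is not visible to later runs or queries.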
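For anyone on Spark 2.0 or later, where HiveContext is deprecated, the same fix can probably be written with SparkSession. The following is only a minimal sketch under that assumption: the helper names build_session, catalog_is_persistent and main are made up for illustration, and the 'in-memory' fallback passed to spark.conf.get is my assumption about what a session without Hive support reports.

```python
def build_session(app_name='import-ipfixminute'):
    """Create a SparkSession that uses the Hive metastore as its catalog."""
    # Imported lazily so this file can be imported without starting Spark.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName(app_name)
            .enableHiveSupport()   # persistent catalog instead of in-memory
            .getOrCreate())


def catalog_is_persistent(catalog_impl):
    """Return True when the catalog setting names the Hive metastore."""
    return catalog_impl.strip().lower() == 'hive'


def main():
    spark = build_session()

    # Reports 'hive' when Hive support is active; the fallback value is an
    # assumption for sessions where the key is not set explicitly.
    impl = spark.conf.get('spark.sql.catalogImplementation', 'in-memory')
    print('catalog implementation:', impl)
    if not catalog_is_persistent(impl):
        print('warning: tables saved with saveAsTable may not persist')

    # List the tables this session can see; 'test2' should appear here
    # after a successful import run.
    for table in spark.catalog.listTables():
        print(table.name, table.tableType)

    spark.stop()
```

Calling main() at the end of a script passed to spark-submit prints the catalog type, which should show whether the submitted job sees the same tables as the interactive shell. The rest of the import script (spark.read.load and df.write.saveAsTable) should work unchanged with the session returned by build_session().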


answered Oct 04 '22 17:10

Mikhail Venkov
