I was just wondering what people's thoughts were on reading from Hive versus reading directly from a file (.csv, .txt, .ORC, or .parquet). Assuming the underlying Hive table is an external table over files of the same format, would you rather read from the Hive table or from the underlying files themselves, and why?
Mike
tl;dr: I would read it straight from the Parquet files.
I am using Spark 1.5.2 and Hive 1.2.1. For a 5-million-row by 100-column table, some timings I've recorded are:
// Read the Parquet files directly from the filesystem
val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
// Read the same data through the Hive metastore (the external table over those files)
val dfhive = sqlContext.table("db.table")
Operation           dffile     dfhive
count               0.38 s     8.99 s
sum(col)            0.98 s     8.10 s
substring(col)      2.63 s     7.77 s
where(col = value)  82.59 s    157.64 s
Note that these timings were taken with older versions of Spark and Hive, so I can't comment on whether newer releases have narrowed the gap between the two reading mechanisms.
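For reference, here is a minimal sketch of how timings like these could be reproduced. The Parquet path, the table name, and the some_col column are placeholders, and it assumes a HiveContext so that sqlContext.table can resolve the table registered in the metastore:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.sum

object ReadComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-vs-hive"))
    // HiveContext is needed so that .table() can resolve tables in the Hive metastore
    val sqlContext = new HiveContext(sc)

    // Placeholder path and table name; substitute your own
    val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
    val dfhive = sqlContext.table("db.table")

    // Wall-clock timing of a single action; a real benchmark should warm up
    // and average several runs
    def time[T](label: String)(action: => T): T = {
      val start = System.nanoTime()
      val result = action
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    time("dffile count")(dffile.count())
    time("dfhive count")(dfhive.count())

    // "some_col" is a placeholder numeric column
    time("dffile sum")(dffile.agg(sum("some_col")).collect())
    time("dfhive sum")(dfhive.agg(sum("some_col")).collect())

    sc.stop()
  }
}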