Why do I get "partition values: [empty row]" log messages when reading a file?

I am using Spark SQL to read a CSV file, and I get a lot of log messages like this:

...some.csv, range: 20971520-24311915, partition values: [empty row]

Why does it say "empty row"? Is the partition really empty?

zyxue asked Nov 29 '17 00:11


1 Answer

Neither the file nor the Spark partition with data read from the file is empty.

The log message may be a bit confusing because of two things:

  • The word partition in the message refers to a Hive-style partition, i.e. a named partition column that can have multiple values. Such partitions can be inferred from your directory structure, e.g. for /path/to/partition/a=1/b=hello/c=3.14 they would be a, b and c, and their values: 1, hello and 3.14. They can also come from the Hive Metastore in case of partitioned external tables.
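  • The partition values logged are wrapped in an InternalRow, not in a collection, so when there are no partition values the message prints [empty row] rather than something like an empty list.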

In your case, the directory structure is flat or it does not contain partition names (e.g. /path/to/partition/1/hello/3.14), so there are no Hive-style partitions and you see [empty row] in the message as a result.
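To make the distinction concrete, here is a minimal sketch in plain Python of how name=value directory segments map to partition values. It is not Spark's actual implementation: the helper name hive_partition_values is hypothetical, and it keeps every value as a string, whereas Spark can also infer column types. It only illustrates why the first path yields three partition values and the flat path yields none.

```python
def hive_partition_values(path):
    """Return {column: value} for each `name=value` directory segment in path.

    Hypothetical helper for illustration only; values are kept as strings.
    """
    values = {}
    for segment in path.strip("/").split("/"):
        name, sep, value = segment.partition("=")
        # Only segments of the form name=value count as Hive-style partitions.
        if sep and name:
            values[name] = value
    return values


# Directory names carry partition column names: three partition values.
partitioned = hive_partition_values("/path/to/partition/a=1/b=hello/c=3.14")

# Directory names carry no column names: no partition values at all.
flat = hive_partition_values("/path/to/partition/1/hello/3.14")

print(partitioned)
print(flat)
```

The second call returns an empty mapping, which corresponds to the situation in the question: the data is there, but there are no partition column values to report.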

Piotr Góralczyk answered Nov 16 '22 01:11