
can't resolve ... given input columns

I'm going through the Spark: The Definitive Guide book from O'Reilly and I'm running into an error when I try to do a simple DataFrame operation.

The data is like:

DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
...

I then read it in PySpark with:

flightData2015 = spark.read.option("inferSchema", "true").option("header","true").csv("./data/flight-data/csv/2015-summary.csv")

Then I try to run the following command:

flightData2015.select(max("count")).take(1)

I get the following error:

pyspark.sql.utils.AnalysisException: "cannot resolve '`u`' given input columns: [DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count];;
'Project ['u]
+- AnalysisBarrier
      +- Relation[DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] csv"

I don't know where "u" is even coming from, since it's not in my code and it isn't in the data file header either. I read another suggestion that this could be caused by spaces in the header, but that's not applicable here. Any idea what to try?

NOTE: The strange thing is, the same thing works when I use SQL instead of the DataFrame transformations. This works:

flightData2015.createOrReplaceTempView("flight_data_2015")
spark.sql("SELECT max(count) from flight_data_2015").take(1)

I can also do the following and it works fine:

flightData2015.show()
asked Aug 09 '18 00:08 by Stephen


1 Answer

Your issue is that you are calling the built-in max function, not pyspark.sql.functions.max.

When Python evaluates max("count") in your code, it treats the string as a sequence of characters and returns 'u', the character with the highest code point. Spark then receives the plain string 'u' and tries to resolve it as a column name.

print(max("count"))
#'u'
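
If you want to confirm this without a Spark session, here is a minimal sketch in plain Python. It is only an illustration: the variable name result is mine, and it assumes nothing in your session has shadowed the name max (for example via a star import).

```python
import builtins

# With no Spark import in scope, the bare name max is Python's built-in.
result = max("count")

print(max is builtins.max)    # True unless max has been shadowed
print(type(result).__name__)  # str, not a Spark Column
print(result)                 # u

# The characters of "count" ordered by code point; 'u' comes last.
print(sorted("count"))        # ['c', 'n', 'o', 't', 'u']
```

Since select() accepts plain strings as column names, the call in the question ends up asking for a column named u, which is exactly what the error message reports.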

Try this instead:

import pyspark.sql.functions as f
flightData2015.select(f.max("count")).show()
answered Nov 08 '22 08:11 by pault