I'm going through the Spark: The Definitive Guide book from O'Reilly and I'm running into an error when I try to do a simple DataFrame operation.
The data looks like this:
DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
...
I then read it in PySpark with:
flightData2015 = spark.read.option("inferSchema", "true").option("header","true").csv("./data/flight-data/csv/2015-summary.csv")
Then I try to run the following command:
flightData2015.select(max("count")).take(1)
I get the following error:
pyspark.sql.utils.AnalysisException: "cannot resolve '`u`' given input columns: [DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count];;
'Project ['u]
+- AnalysisBarrier
+- Relation[DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] csv"
I don't know where "u" is even coming from, since it's not in my code and it isn't in the data file header either. I read another suggestion that this could be caused by spaces in the header, but that's not applicable here. Any idea what to try?
NOTE: The strange thing is, the same thing works when I use SQL instead of the DataFrame transformations. This works:
flightData2015.createOrReplaceTempView("flight_data_2015")
spark.sql("SELECT max(count) from flight_data_2015").take(1)
I can also do the following and it works fine:
flightData2015.show()
Your issue is that you are calling the built-in max function, not pyspark.sql.functions.max.
When Python evaluates max("count") in your code, it returns the letter 'u', which is the maximum value in the collection of letters that make up the string.
print(max("count"))
#'u'
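To see why 'u' wins, here is a small pure-Python sketch (no Spark needed). It assumes nothing beyond the standard built-ins; the variable names are only for illustration. A string is an iterable of characters, and the built-in max compares them by Unicode code point:

```python
word = "count"

# Map each character to its Unicode code point to see how max() ranks them.
code_points = {ch: ord(ch) for ch in word}
print(code_points)  # {'c': 99, 'o': 111, 'u': 117, 'n': 110, 't': 116}

# The built-in max() iterates over the characters and returns the largest one.
largest = max(word)
print(largest)  # u
```

So Spark never sees the column name "count" at all; it only receives the string "u" and tries to resolve it as a column.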
Try this instead:
import pyspark.sql.functions as f
flightData2015.select(f.max("count")).show()
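If you are ever unsure which max is in scope, one quick check (a sketch, not something from the book) is to compare the name against the builtins module. The helper name is_builtin is hypothetical and only used here for illustration:

```python
import builtins

def is_builtin(func, name):
    """Return True if func is the Python built-in registered under name."""
    return func is getattr(builtins, name, None)

# In a fresh interpreter this prints True, meaning max("count") would
# compare characters rather than build a Spark column expression.
print(is_builtin(max, "max"))
```

Keeping the Spark functions behind a module alias such as f, as above, avoids the name clash in either direction.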