Python functions such as max() don't work in a PySpark application

Tags: python, pyspark

The Python function max(3, 6) works in the pyspark shell, but when the same call is put in an application and submitted, it throws an error: TypeError: _() takes exactly 1 argument (2 given)

asked Jan 07 '23 by user3610141


1 Answer

It looks like you have an import conflict in your application, most likely due to a wildcard import from pyspark.sql.functions:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.10 (default, Oct 19 2015 18:04:42)
SparkContext available as sc, HiveContext available as sqlContext.

In [1]: max(1, 2)
Out[1]: 2

In [2]: from pyspark.sql.functions import max

In [3]: max(1, 2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-bb133f5d83e9> in <module>()
----> 1 max(1, 2)

TypeError: _() takes exactly 1 argument (2 given)
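
The same shadowing reproduces the error in a submitted application. A minimal sketch (the script name and SparkContext setup are assumptions for illustration, not taken from the question):

# repro.py -- run with: spark-submit repro.py
from pyspark import SparkContext
from pyspark.sql.functions import *   # wildcard import rebinds max to the SQL function

sc = SparkContext(appName="max-shadowing-repro")

# max is now pyspark.sql.functions.max, which expects a single column argument,
# so this raises: TypeError: _() takes exactly 1 argument (2 given)
print(max(3, 6))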

Unless you work in a relatively limited scope, it is best to either prefix:

from pyspark.sql import functions as sqlf

max(1, 2)
## 2

sqlf.max("foo")
## Column<max(foo)>

or alias:

from pyspark.sql.functions import max as max_

max(1, 2)
## 2

max_("foo")
## Column<max(foo)>
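
If the wildcard import has to stay, another option is to call the built-in explicitly through its module. A sketch, assuming Python 2 as in the transcript above (where the built-ins live in __builtin__; Python 3 uses builtins instead):

from pyspark.sql.functions import *   # max is the SQL function here
import __builtin__                    # Python 2; on Python 3 use: import builtins

__builtin__.max(1, 2)
## 2

max("foo")
## Column<max(foo)>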
answered Feb 01 '23 by zero323