
Cannot find col function in pyspark

In PySpark 1.6.2, I can import the col function with:

from pyspark.sql.functions import col 

But when I try to look it up in the GitHub source code, I find no col function in the functions.py file. How can Python import a function that doesn't exist?

Bamqf asked Oct 20 '16 19:10

1 Answer

It exists. It just isn't explicitly defined. Functions exported from pyspark.sql.functions are thin wrappers around JVM code and, with a few exceptions which require special treatment, are generated automatically using helper methods.

If you check the source carefully, you'll find col listed in the _functions dictionary. The module iterates over this dictionary and uses _create_function to generate the wrappers. Each generated function is assigned directly to the corresponding name in globals.

Finally, __all__, which defines the list of items exported from the module, simply exports all globals except the ones contained in the blacklist.

If this mechanism is still not clear, you can create a toy example:

  • Create a Python module called foo.py with the following content:

    # Creates a function assigned to the name foo
    globals()["foo"] = lambda x: "foo {0}".format(x)

    # Exports all entries from globals which start with foo
    __all__ = [x for x in globals() if x.startswith("foo")]

  • Place it somewhere on the Python path (for example in the working directory).

  • Import foo:

    from foo import foo

    foo(1)

An undesired side effect of this metaprogramming approach is that functions defined this way might not be recognized by tools that rely purely on static code analysis. This is not a critical issue and can be safely ignored during development.

Depending on the IDE, installing type annotations might resolve the problem (see, for example, zero323/pyspark-stubs#172).
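To see why static tools miss such names, here is a rough sketch that inspects the toy foo.py source with the standard ast module. The statically_defined_names helper is hypothetical and far simpler than what a real IDE or linter does; it only looks for def, class, and plain `name = ...` statements at module level.

```python
import ast

# Source of the toy foo.py module from the example above.
source = '''
globals()["foo"] = lambda x: "foo {0}".format(x)
__all__ = [x for x in globals() if x.startswith("foo")]
'''


def statically_defined_names(code):
    """Collect names bound by def, class, or `name = ...` at module level."""
    names = set()
    for node in ast.parse(code).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                # globals()["foo"] is a subscript target, not a plain name,
                # so it is skipped here.
                if isinstance(target, ast.Name):
                    names.add(target.id)
    return names


static_names = statically_defined_names(source)

# Only executing the code reveals the dynamically created name.
namespace = {}
exec(source, namespace)

print("foo" in static_names)  # False
print("foo" in namespace)     # True
print(namespace["foo"](1))    # foo 1
```

The name foo exists only after the module body has run, so a tool that never executes the code has nothing to resolve it against. Type stubs sidestep this by declaring the names explicitly.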

zero323 answered Sep 23 '22 05:09