I have the following PySpark DataFrame:
age  state  name    income
21    DC    john    30-50K
NaN   VA    gerry   20-30K
I'm trying to achieve the equivalent of pandas' df.isnull().sum(), which produces:
age      1
state    0
name     0
income   0
At first I tried something along the lines of:
null_counter = [df[c].isNotNull().count() for c in df.columns]
but this produces the following error:
TypeError: Column is not iterable
Similarly, this is how I'm currently iterating over columns to get the minimum value:
class BaseAnalyzer:
    def __init__(self, report, struct):
        self.report = report
        self._struct = struct
        self.name = struct.name
        self.data_type = struct.dataType
        self.min = None
        self.max = None
    def __repr__(self):
        return '<Column: %s>' % self.name
class BaseReport:
    def __init__(self, df):
        self.df = df
        self.columns_list = df.columns
        self.columns = {f.name: BaseAnalyzer(self, f) for f in df.schema.fields}
    def calculate_stats(self):
        find_min = self.df.select([fn.min(self.df[c]).alias(c) for c in self.df.columns]).collect()
        min_row = find_min[0]
        for column, min_value in min_row.asDict().items():
            self[column].min = min_value
    def __getitem__(self, name):
        return self.columns[name]
    def __repr__(self):
        return '<Report>'
report = BaseReport(df)
calc = report.calculate_stats()
for column in report.columns.values():
    if hasattr(column, 'min'):
        print("{}: {}".format(column, column.min))
which allows me to 'iterate over the columns'
<Column: age>:1
<Column: name>: Alan
<Column: state>:ALASKA
<Column: income>:0-1k
I think this method has become way too complicated. How can I properly iterate over ALL columns to provide various summary statistics (min, max, isnull, notnull, etc.)? The distinction between pyspark.sql.Row and pyspark.sql.Column seems strange coming from pandas.
Have you tried something like this:
names = df.schema.names
for name in names:
    print(name + ': ' + str(df.where(df[name].isNull()).count()))
You can see how this could be modified to put the information into a dictionary or some other more useful format.