I have the following pyspark.sql.DataFrame:
age  state  name   income
21   DC     john   30-50K
NaN  VA     gerry  20-30K
I'm trying to achieve the equivalent of df.isnull().sum() (from pandas), which produces:
age 1
state 0
name 0
income 0
At first I tried something along the lines of:
null_counter = [df[c].isNotNull().count() for c in df.columns]
but this produces the following error:
TypeError: Column is not iterable
Similarly, this is how I'm currently iterating over columns to get the minimum value:
from pyspark.sql import functions as fn


class BaseAnalyzer:
    def __init__(self, report, struct):
        self.report = report
        self._struct = struct
        self.name = struct.name
        self.data_type = struct.dataType
        self.min = None
        self.max = None

    def __repr__(self):
        return '<Column: %s>' % self.name


class BaseReport:
    def __init__(self, df):
        self.df = df
        self.columns_list = df.columns
        self.columns = {f.name: BaseAnalyzer(self, f) for f in df.schema.fields}

    def calculate_stats(self):
        # One pass over the data: take the min of every column, then unpack the single result Row
        find_min = self.df.select([fn.min(self.df[c]).alias(c) for c in self.df.columns]).collect()
        min_row = find_min[0]
        for column, min_value in min_row.asDict().items():
            self[column].min = min_value

    def __getitem__(self, name):
        return self.columns[name]

    def __repr__(self):
        return '<Report>'


report = BaseReport(df)
report.calculate_stats()

for column in report.columns.values():
    if hasattr(column, 'min'):
        print("{}:{}".format(column, column.min))
which allows me to 'iterate over the columns':
<Column: age>:1
<Column: name>: Alan
<Column: state>:ALASKA
<Column: income>:0-1k
I think this method has become way too complicated. How can I properly iterate over ALL columns to produce various summary statistics (min, max, isnull, notnull, etc.)? The distinction between pyspark.sql.Row and pyspark.sql.Column seems strange coming from pandas.
Have you tried something like this:
names = df.schema.names
for name in names:
    # count() returns an int, so convert it before concatenating with the string
    print(name + ': ' + str(df.where(df[name].isNull()).count()))
You can see how this could be modified to put the information into a dictionary or some other more useful format.
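For example, here is a minimal sketch of that idea (assuming df is the DataFrame from the question and pyspark.sql.functions is imported as fn; the null_counts name is just illustrative). It collects all the null counts into a dict in a single pass, instead of running one count() job per column:

from pyspark.sql import functions as fn

# Build one aggregation per column: sum a 1/0 flag that marks null values.
# Spark evaluates all of these in a single job.
null_counts_row = df.select([
    fn.sum(fn.when(df[c].isNull(), 1).otherwise(0)).alias(c)
    for c in df.columns
]).collect()[0]

# Turn the single result Row into a plain dict, e.g. {'age': 1, 'state': 0, ...}
null_counts = null_counts_row.asDict()
print(null_counts)

The same pattern works for min, max, and similar statistics by swapping in the corresponding function from pyspark.sql.functions.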