
Select columns that satisfy a condition

I'm running the following notebook in Zeppelin:

%spark.pyspark
l = [('user1', 33, 1.0, 'chess'), ('user2', 34, 2.0, 'tenis'), ('user3', None, None, ''), ('user4', None, 4.0, '   '), ('user5', None, 5.0, 'ski')]
df = spark.createDataFrame(l, ['name', 'age', 'ratio', 'hobby'])
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- ratio: double (nullable = true)
 |-- hobby: string (nullable = true)
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1|  33|  1.0|chess|
|user2|  34|  2.0|tenis|
|user3|null| null|     |
|user4|null|  4.0|     |
|user5|null|  5.0|  ski|
+-----+----+-----+-----+

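from pyspark.sql.functions import count

# Fraction of null values in each column: count(c) skips nulls, count('*') counts every row
agg_df = df.select(*[(1.0 - (count(c) / count('*'))).alias(c) for c in df.columns])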
agg_df.show()

+----+---+-------------------+-----+
|name|age|              ratio|hobby|
+----+---+-------------------+-----+
| 0.0|0.6|0.19999999999999996|  0.0|
+----+---+-------------------+-----+

Now I want to select from agg_df only the columns whose value is < 0.35. In this case it should return ['name', 'ratio', 'hobby'].

I can't figure out how to do it. Any hint?

Sofiane Cherchalli asked May 22 '17 12:05


1 Answer

You mean values < 0.35? This should do it:

>>> [ key for (key,value) in agg_df.collect()[0].asDict().items() if value < 0.35  ]
['hobby', 'ratio', 'name']
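If the column order matters, one variation is to iterate over agg_df.columns instead of over the dictionary. The sketch below uses a plain dict standing in for agg_df.first().asDict(), so it runs without a Spark session. The helper name columns_below is hypothetical, and skipping None values is an assumption about how a missing aggregate should be treated.

```python
def columns_below(fractions, columns, threshold):
    """Return the names in `columns` whose value in `fractions` is below `threshold`.

    `fractions` stands in for agg_df.first().asDict(); `columns` stands in
    for agg_df.columns, which keeps the DataFrame's column order.
    """
    return [
        c for c in columns
        # Assumption: a missing (None) value never counts as below the threshold.
        if fractions.get(c) is not None and fractions[c] < threshold
    ]


# Values copied from the agg_df output in the question.
fractions = {'name': 0.0, 'age': 0.6, 'ratio': 0.19999999999999996, 'hobby': 0.0}
selected = columns_below(fractions, ['name', 'age', 'ratio', 'hobby'], 0.35)
print(selected)
```

With a real DataFrame the call would be columns_below(agg_df.first().asDict(), agg_df.columns, 0.35), and df.select(*selected) would then keep only those columns.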

To replace blank strings with null, use the following UDF:

from pyspark.sql.functions import udf
process = udf(lambda x: None if not x else (x if x.strip() else None))
df.withColumn('hobby', process(df.hobby)).show()
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1|  33|  1.0|chess|
|user2|  34|  2.0|tenis|
|user3|null| null| null|
|user4|null|  4.0| null|
|user5|null|  5.0|  ski|
+-----+----+-----+-----+
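The lambda above packs two checks into one expression. Written out as a named function (blank_to_none is a hypothetical name), the same logic can be checked in plain Python before it is wrapped in udf. This is only a sketch and assumes the column holds strings or nulls.

```python
def blank_to_none(value):
    """Map None, '' and whitespace-only strings to None; return other strings unchanged."""
    if value is None:
        return None
    # str.strip() returns '' for empty and whitespace-only input.
    return value if value.strip() else None


# The hobby values from the question, before and after cleaning.
hobbies = ['chess', 'tenis', '', '   ', 'ski']
cleaned = [blank_to_none(h) for h in hobbies]
print(cleaned)
```

In the notebook it would be registered the same way as the lambda: process = udf(blank_to_none), followed by df.withColumn('hobby', process(df.hobby)).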
rogue-one answered Oct 02 '22 03:10