Select columns that satisfy a condition

Tags: apache-spark, pyspark, pyspark-sql, spark-dataframe, apache-zeppelin

I'm running the following notebook in Zeppelin:

%spark.pyspark
l = [('user1', 33, 1.0, 'chess'), ('user2', 34, 2.0, 'tenis'), ('user3', None, None, ''), ('user4', None, 4.0, '   '), ('user5', None, 5.0, 'ski')]
df = spark.createDataFrame(l, ['name', 'age', 'ratio', 'hobby'])
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- ratio: double (nullable = true)
 |-- hobby: string (nullable = true)
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1|  33|  1.0|chess|
|user2|  34|  2.0|tenis|
|user3|null| null|     |
|user4|null|  4.0|     |
|user5|null|  5.0|  ski|
+-----+----+-----+-----+

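from pyspark.sql.functions import count

# Fraction of null values in each column: 1 - (non-null count / row count)
agg_df = df.select(*[(1.0 - (count(c) / count('*'))).alias(c) for c in df.columns])
agg_df.show()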

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- ratio: double (nullable = true)
 |-- hobby: string (nullable = true)
+----+---+-------------------+-----+
|name|age|              ratio|hobby|
+----+---+-------------------+-----+
| 0.0|0.6|0.19999999999999996|  0.0|
+----+---+-------------------+-----+

Now I want to select from agg_df only the columns whose value is < 0.35. In this case it should return ['name', 'ratio', 'hobby'].

I can't figure out how to do it. Any hint?

asked May 22 '17 12:05

Sofiane Cherchalli

1 Answer

You mean columns whose value is < 0.35? This should do it:

>>> [ key for (key,value) in agg_df.collect()[0].asDict().items() if value < 0.35  ]
['hobby', 'ratio', 'name']
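
The order of the names above depends on how the dict is iterated, so it may not match the column order of the DataFrame. A minimal sketch of the same filter that keeps an explicit order is below. The helper name columns_below and its order parameter are made up for illustration, and the dict is typed in by hand from the agg_df output in the question so that the snippet runs without a Spark session.

```python
def columns_below(ratios, threshold, order=None):
    """Return the names whose value is strictly below `threshold`.

    ratios: mapping of column name -> value, e.g. agg_df.first().asDict()
    order:  optional list of names (e.g. agg_df.columns) that fixes the
            order of the result; defaults to the dict's own order
    """
    names = order if order is not None else list(ratios)
    # Skip None so a missing aggregate does not raise on comparison.
    return [n for n in names if ratios[n] is not None and ratios[n] < threshold]


# Values copied by hand from the agg_df output shown in the question.
ratios = {'name': 0.0, 'age': 0.6, 'ratio': 0.19999999999999996, 'hobby': 0.0}
selected = columns_below(ratios, 0.35, order=['name', 'age', 'ratio', 'hobby'])
print(selected)  # ['name', 'ratio', 'hobby']

# With a live session the same helper would be fed from the DataFrame
# (not executed here):
#   selected = columns_below(agg_df.first().asDict(), 0.35, order=agg_df.columns)
#   df.select(*selected).show()
```

Passing agg_df.columns as the order means the result can go straight into df.select(*selected) with the columns in their original positions.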

To replace blank (empty or whitespace-only) strings with null, use the following UDF:

from pyspark.sql.functions import udf
process = udf(lambda x: None if not x else (x if x.strip() else None))
df.withColumn('hobby', process(df.hobby)).show()
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1|  33|  1.0|chess|
|user2|  34|  2.0|tenis|
|user3|null| null| null|
|user4|null|  4.0| null|
|user5|null|  5.0|  ski|
+-----+----+-----+-----+
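
To see what the lambda does to each value, the same logic can be checked outside Spark. This is only a sketch: the function name blank_to_none is made up here, and the sample values are the hobby column from the question plus a None.

```python
def blank_to_none(value):
    """Map None, '' and whitespace-only strings to None; keep other strings."""
    if value is None:
        return None
    return value if value.strip() else None


# hobby values from the question, plus an explicit None
samples = ['chess', 'tenis', '', '   ', 'ski', None]
cleaned = [blank_to_none(s) for s in samples]
print(cleaned)  # ['chess', 'tenis', None, None, 'ski', None]

# In the notebook the named function could replace the lambda
# (not executed here):
#   process = udf(blank_to_none)
#   df.withColumn('hobby', process(df.hobby)).show()
```

After this replacement, the null-ratio aggregate from the question counts the blank hobby values as nulls too.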

answered Oct 02 '22 03:10

rogue-one
