Distinct values of a column in PySpark are obtained by using the select() function along with the distinct() function. select() takes one or more column names as arguments, and the distinct() call that follows returns the distinct values of those columns combined.
There are two ways to get the count of distinct values in PySpark. We can chain the distinct() and count() functions of a DataFrame, or we can use the SQL countDistinct() function, which returns the distinct count over all the selected columns. distinct() eliminates duplicate records (rows matching on all columns) from the DataFrame, and count() returns the number of remaining rows.
Another option is the dropDuplicates() method, which removes rows that have the same values in the selected columns.
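As a rough, minimal sketch of those approaches (the column names col1 and col2 and the sample rows are assumptions for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()
# hypothetical sample data with one duplicated row
df = spark.createDataFrame([("foo", 1), ("bar", 2), ("foo", 1)], ("col1", "col2"))

# distinct combinations of the selected columns
df.select("col1", "col2").distinct().show()

# distinct() drops duplicate rows, count() counts what is left
print(df.distinct().count())  # 2

# SQL-style countDistinct() over the selected columns
df.select(countDistinct("col1", "col2")).show()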
This should help to get distinct values of a column:
df.select('column1').distinct().collect()
Note that .collect() doesn't have any built-in limit on how many values it can return, so this might be slow -- use .show() instead, or add .limit(20) before .collect() to manage this.
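For example, a capped collect might look like this (a minimal sketch, reusing the placeholder column name column1 from above):
# preview without pulling everything to the driver
df.select('column1').distinct().show(20)
# or collect at most 20 Row objects
rows = df.select('column1').distinct().limit(20).collect()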
Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two of them unique):
+---+---+
| k| v|
+---+---+
|foo| 1|
|bar| 2|
|foo| 3|
+---+---+
With a Pandas dataframe:
import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()
This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object).
You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:
s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))
If you want the same result from Spark, i.e. an ndarray, use toPandas():
s_df.toPandas()['k'].unique()
Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:
s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()
Finally, you can also use a list comprehension as follows:
[i.k for i in s_df.select('k').distinct().collect()]
You can use df.dropDuplicates(['col1','col2']) to get only the rows that are distinct with respect to the columns listed in the array.
collect_set can also help to get unique values from a given column of a pyspark.sql.DataFrame; collecting the resulting array column gives a Python list:
import pyspark.sql.functions as F
df.select(F.collect_set("column").alias("column")).first()["column"]
If you want to see the distinct values of a specific column in your dataframe, you would just need to write the following code. It shows up to 100 distinct values (if that many are available) for the colname column in the df dataframe.
df.select('colname').distinct().show(100, False)
If you want to do something fancy on the distinct values, you can save them in a DataFrame:
a = df.select('colname').distinct()
To turn the distinct values of a column into a Python list instead, you could do:
distinct_column = 'somecol'
distinct_column_vals = df.select(distinct_column).distinct().collect()
distinct_column_vals = [v[distinct_column] for v in distinct_column_vals]
In addition to the dropDuplicates option, there is also the method named as we know it in pandas, drop_duplicates:
drop_duplicates() is an alias for dropDuplicates().
Example
s_df = sqlContext.createDataFrame([("foo", 1),
("foo", 1),
("bar", 2),
("foo", 3)], ('k', 'v'))
s_df.show()
+---+---+
| k| v|
+---+---+
|foo| 1|
|foo| 1|
|bar| 2|
|foo| 3|
+---+---+
Drop by subset
s_df.drop_duplicates(subset = ['k']).show()
+---+---+
| k| v|
+---+---+
|bar| 2|
|foo| 1|
+---+---+
s_df.drop_duplicates().show()
+---+---+
| k| v|
+---+---+
|bar| 2|
|foo| 3|
|foo| 1|
+---+---+