R equivalent of SELECT DISTINCT on two or more fields/variables

Tags: sql, dataframe, r

Say I have a data frame df with two or more columns. Is there an easy way to use unique() or another R function to create a subset of the unique combinations of two or more columns?

I know I can use sqldf() and write an easy "SELECT DISTINCT var1, var2, ... varN" query, but I am looking for an R way of doing this.

I tried coercing an ftable to a data frame and using the field names, but that also returns cross-tabulations of combinations that don't exist in the dataset:

uniques <- as.data.frame(ftable(df$var1, df$var2))

asked May 24 '10 21:05

wahalulu

1 Answer

unique() works on data frames, so unique(df[c("var1", "var2")]) should be what you want.
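To make that concrete, here is a minimal sketch in base R. The data frame and its values are made up for illustration (only the column names var1 and var2 come from the question), and var3 is a hypothetical extra column included to show that it is ignored.

```r
# Hypothetical example data; var3 is an extra column we do not want
# to take into account when looking for unique combinations.
df <- data.frame(
  var1 = c("a", "a", "b", "b", "a"),
  var2 = c(1, 1, 2, 3, 1),
  var3 = c(10, 20, 30, 40, 50)
)

# Select the columns of interest first, then drop duplicated rows.
uniques <- unique(df[c("var1", "var2")])

# unique() keeps the row names of the first occurrence of each
# combination; reset them if a clean 1..n index is preferred.
rownames(uniques) <- NULL

print(uniques)
```

With this data the result has three rows, one per combination that actually occurs, so combinations that never appear together (such as "a" with 2) are not generated.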

Another option is distinct() from the dplyr package:

df %>% distinct(var1, var2) # or distinct(df, var1, var2)
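A small runnable sketch of the dplyr variant, assuming dplyr 0.5.0 or later is installed. The data is again made up for illustration, with var3 as a hypothetical extra column.

```r
library(dplyr)

# Hypothetical example data; only var1 and var2 matter here.
df <- data.frame(
  var1 = c("a", "a", "b", "b", "a"),
  var2 = c(1, 1, 2, 3, 1),
  var3 = c(10, 20, 30, 40, 50)
)

# Keep one row per distinct (var1, var2) combination.
# Only the named columns are returned; var3 is dropped.
uniques <- df %>% distinct(var1, var2)

print(uniques)
```

As with unique(), only combinations present in the data are returned, which is what SELECT DISTINCT var1, var2 would give.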

Note: In older versions of dplyr (before 0.5.0, released 2016-06-24), distinct() required an additional step:

df %>% select(var1, var2) %>% distinct()

(or, in the older nested style, distinct(select(df, var1, var2))).

answered Oct 22 '22 05:10

Marek
