
finding unique values from a list

Tags: list, r, unique

Suppose you have a list of values

x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))

I would like to find unique values from all list elements combined. So far, the following code did the trick

unique(unlist(x))

Does anyone know a more efficient way? I have a hefty list with a lot of values and would appreciate any speed-up.

201

asked Oct 07 '10 07:10

Roman Luštrik

1 Answer

This solution, suggested by Marek, is the best answer to the original question. See below for a discussion of other approaches and why Marek's is the most useful.

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6

Discussion

A faster alternative to the original is to compute unique() on each component of your x first and then do a final unique() on those results. This only works if the components of the list all have the same number of unique values, as they do in both examples below. E.g.:

First your version, then my double unique approach:

> unique(unlist(x))
[1] 1 2 3 4 5 6
> unique.default(sapply(x, unique))
[1] 1 2 3 4 5 6

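Here sapply returns a matrix, and unique has a matrix method that keeps one margin fixed (it drops duplicated rows or columns rather than duplicated values), so we have to call unique.default directly; this is fine as a matrix can be treated as a vector.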
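A minimal sketch of the difference between the two calls, using the x from the question (the variable name m is only for illustration):

```r
x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))

## Each component has three unique values, so sapply simplifies to a 3 x 3 matrix
m <- sapply(x, unique)
is.matrix(m)

## Dispatches to the matrix method: duplicated rows are dropped, and
## since all three rows differ the matrix comes back unchanged
unique(m)

## Treats the matrix as a plain vector and returns the distinct values
unique.default(m)
```

Calling unique(as.vector(m)) should be an equivalent way to sidestep the matrix method.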

Marek, in the comments to this answer, notes that the slow speed of the unlist approach is potentially due to the names on the list. Marek's solution is to set use.names = FALSE in the call to unlist, which results in a faster solution than the double unique version above. For the simple x of Roman's post we get

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6

Marek's solution will work even when the number of unique elements differs between components.
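As a rough illustration of both points, the sketch below first shows the names that unlist builds by default, then applies Marek's call to a list whose components have differing numbers of unique values. The list ragged is a made-up example, not data from the question.

```r
x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))

## By default unlist builds a name for every element of the result
names(unlist(x))

## With use.names = FALSE no names are built at all
names(unlist(x, use.names = FALSE))

## Hypothetical list whose components have 2, 3 and 1 unique values
ragged <- list(a = c(1, 1, 2), b = c(2, 3, 4), c = 5)
unique(unlist(ragged, use.names = FALSE))
```

Building those names is the extra work that use.names = FALSE avoids.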

Here is a larger example with some timings of all three methods:

## Create a large list (1000 components of length 1000 each)
DF <- as.list(data.frame(matrix(sample(1:10, 1000*1000, replace = TRUE),
                                ncol = 1000)))

Here are results for the three approaches using DF:

> ## Do the three approaches give the same result:
> all.equal(unique.default(sapply(DF, unique)), unique(unlist(DF)))
[1] TRUE
> all.equal(unique(unlist(DF, use.names = FALSE)), unique(unlist(DF)))
[1] TRUE
> ## Timing Roman's original:
> system.time(replicate(10, unique(unlist(DF))))
   user  system elapsed
 12.884   0.077  12.966
> ## Timing double unique version:
> system.time(replicate(10, unique.default(sapply(DF, unique))))
   user  system elapsed
  0.648   0.000   0.653
> ## timing of Marek's solution:
> system.time(replicate(10, unique(unlist(DF, use.names = FALSE))))
   user  system elapsed
  0.510   0.000   0.512

This shows that the double unique approach (applying unique() to the individual components and then unique() to those smaller sets of unique values) is a lot quicker than Roman's original, but this speed-up is purely due to the names on the list DF. If we tell unlist not to use the names, Marek's solution is marginally quicker than the double unique for this problem. As Marek's solution uses the correct tool properly, and is quicker than the work-around, it is the preferred solution.
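To repeat the comparison on your own machine, something along these lines may help. It is only a sketch: the smaller list small, its 200 x 200 size, the seed, and the names in approaches are all assumptions made so it runs quickly, and the elapsed times will differ from those above.

```r
## Smaller stand-in for DF (200 components of length 200; sizes are arbitrary)
set.seed(1)
small <- as.list(data.frame(matrix(sample(1:10, 200 * 200, replace = TRUE),
                                   ncol = 200)))

## The three approaches discussed above, wrapped as functions
approaches <- list(
  original      = function(l) unique(unlist(l)),
  double_unique = function(l) unique.default(sapply(l, unique)),
  no_names      = function(l) unique(unlist(l, use.names = FALSE))
)

## Check that every approach agrees with the original
results <- lapply(approaches, function(f) f(small))
same <- vapply(results,
               function(r) isTRUE(all.equal(r, results$original)),
               logical(1))
print(same)

## Elapsed seconds for 10 repetitions of each approach
elapsed <- vapply(approaches,
                  function(f) system.time(replicate(10, f(small)))[["elapsed"]],
                  numeric(1))
print(elapsed)
```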

The big gotcha with the double unique approach is that it will only work if, as in the two examples here, each component of the input list (DF or x) has the same number of unique values. In such cases sapply simplifies the result to a matrix, which allows us to apply unique.default. If the components of the input list have differing numbers of unique values, sapply returns a list instead of a matrix and the double unique solution will fail.
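A small sketch of that failure mode, using a made-up list ragged whose components have 2, 3 and 3 unique values. The last line is one possible way to keep the per-component unique() step while staying safe; it is a suggestion, not part of the original answer.

```r
## Hypothetical input with differing numbers of unique values per component
ragged <- list(a = c(1, 1, 2), b = c(2, 3, 4), c = c(4, 5, 6))

## sapply cannot simplify to a matrix here, so it returns a list
res <- sapply(ragged, unique)
is.list(res)

## unique.default now compares whole list elements, not the values in them,
## so the result is still a list rather than a vector of unique values
double_unique <- unique.default(res)
is.list(double_unique)

## Possible fallback: flatten the per-component results before the final unique()
unique(unlist(lapply(ragged, unique), use.names = FALSE))
```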

165

answered Sep 19 '22 12:09

Gavin Simpson
