Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sort a factor based on value in one or more other columns

Tags:

r

r-factor

I've looked through a number of posts about ordering factors, but haven't quite found a match for my problem. Unfortunately, my knowledge of R is still pretty rudimentary.

I have a subset of an archaeological artifact catalog that I'm working with. I'm trying to cross-tabulate diagnostic historical artifact types and site testing locations. Easy enough with ddply or tapply.

My problem is that I want to sort the artifact types (a factor) by their mean diagnostic date (number/year), and I keep getting them alphabetically. I know I need to make it an ordered factor, but can't figure out how to order it by the year value in the other column.

IDENTIFY                                      MIDDATE
engine-turned fine red stoneware              1769
white salt-glazed stoneware, scratch blue     1760
wrought nail, 'L' head                        1760
yellow lead-glazed buff earthenware           1732
...

Which needs to be ordered:

IDENTIFY                                      MIDDATE
yellow lead-glazed buff earthenware           1732
white salt-glazed stoneware, scratch blue     1760
wrought nail, 'L' head                        1760
engine-turned fine red stoneware              1769
...

The factor (IDENTIFY) needs to be ordered by the Date (MIDDATE). I thought I had it with

Catalog$IDENTIFY<-factor(Catalog$IDENTIFY,levels=Catalog$MIDDATE,ordered=TRUE)

But get the warning:

In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) 
else paste0(labels,: duplicated levels will not be allowed 
in factors anymore

IDENTIFY has ~130 factor levels and many have the same value for MIDDATE, so I need to order IDENTIFY by MIDDATE and another column TYPENAME.

A little more detail:

I have a dataframe Catalog, which breaks down (i.e. str(Catalog)) as:

> str(Catalog)
'data.frame':   2211 obs. of  15 variables:
 $ TRENCH  : Factor w/ 7 levels "DRT 1","DRT 2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ U_TYPE  : Factor w/ 3 levels "EU","INC","STP": 1 1 1 1 1 1 1 1 1 1 ...
 $ U_LBL   : Factor w/ 165 levels "001","005","007",..: 72 72 72 72 72 72 ...
 $ L_STRAT : Factor w/ 217 levels "#2-7/25","[3]",..: 4 4 4 4 4 4 89 89 89 89 ...
 $ START   : num  0 0 0 0 0 0 39.4 39.4 39.4 39.4 ...
 $ END     : num  39.4 39.4 39.4 39.4 39.4 39.4 43.2 43.2 43.2 43.2 ...
 $ Qty     : int  1 1 3 5 1 1 6 8 1 1 ...
 $ MATNAME : Factor w/ 6 levels "Ceramics","Chipped Stone",..: 1 1 1 5 5 6 ...
 $ TYPENAME: Factor w/ 9 levels "Architectural Hardware",..: 9 9 9 1 1 3 9 ...
 $ CATNAME : Factor w/ 32 levels "Biface","Bottle Glass",..: 24 29 29 6 24 ...
 $ IDENTIFY: Factor w/ 112 levels "amethyst bottle glass",..: 17 91 96 71 103 ...
 $ BEGDATE : int  1820 1820 1830 1835 1700 NA 1670 1762 1800 1720 ...
 $ ENDDATE : int  1900 1970 1860 1875 1820 NA 1795 1820 1820 1780 ...
 $ OCC_LBL : Ord.factor w/ 5 levels "Late 19th Century"<..: 2 1 2 2 4 5 4 3 ...
 $ MIDDATE : num  1860 1895 1845 1855 1760 ...

I need to make IDENTIFY an ordered factor, and reorder by MIDDATE -> TYPENAME -> alpha by IDENTIFY.

What I'm not really getting is how to reorder by combined orders from multiple columns.

I would just do this in the db, but a lot of what I'm running are weighted means within all sorts of cross-tabulations (e.g. weighted mean depths below ground surface for artifact classes by location)...

...doable in Access, but messy, and unpredictable. Far easier and cleaner to manage in R, but I'd rather not to have to manually re-sort the resulting tables.

What I'm trying to produce is a number of things along these lines:

>xtab.Catalog<-tapply(Catalog$Qty,list(Catalog$IDENTIFY,Catalog$TRENCH),sum)

IDENTIFY                        DRT1    DRT2    DRT3    DRT4    DRT5    DRT6
Staffordshire stoneware         4       NA      NA      NA      NA      NA  
undecorated delftware           6       4       NA      NA      NA      NA  
unidentified wrought nail       15      9       3       1       3       NA  
white salt-glazed stoneware     6       1       1       NA      2       1   
white salt-glazed scratch blue  1       NA      NA      NA      NA      NA  
white stoneware, slip-dipped    NA      NA      NA      NA      NA      NA  
wrought nail, 'L' head          2       NA      NA      NA      NA      NA  
wrought nail, 'rose' head       62      21      4       NA      1       1   
wrought nail, 'T' head          2       NA      1       NA      NA      1   
yellow lead-glazed              12      NA      NA      NA      1       3   
...

...but I need them to sort in the logical (i.e. chronological/type) order instead of alphabetical.

like image 981
Scard Avatar asked May 25 '12 16:05

Scard


2 Answers

Here's a reproducible sample, with solution:

set.seed(0)
a = sample(1:20,replace=F)
b = sample(1:20,replace=F)
f = as.factor(letters[1:20])

> a
 [1] 18  6  7 10 15  4 13 14  8 20  1  2  9  5  3 16 12 19 11 17
> b
 [1] 16 18  4 12  3  5  6  1 15 10 19 17  9 11  2  8 20  7 13 14
> f
 [1] a b c d e f g h i j k l m n o p q r s t
Levels: a b c d e f g h i j k l m n o p q r s t

Now for the new factor:

fn = factor(f, levels=unique(f[order(a,b,f)]), ordered=TRUE)

> fn
 [1] a b c d e f g h i j k l m n o p q r s t
20 Levels: k < l < o < f < n < b < c < i < m < d < s < q < g < h < e < ... < j

Sorted on 'a', next 'b' and finally 'f' itself (although in this example, 'a' has no repeated values).

like image 192
Matthew Lundberg Avatar answered Nov 03 '22 13:11

Matthew Lundberg


I recommend the following dplyr-based approach (h/t daattali) that can be extended to as many columns as you like:

library(dplyr)
Catalog <- Catalog %>%
  arrange(MIDDATE, TYPENAME) %>%               # sort your dataframe
  mutate(IDENTIFY = factor(IDENTIFY, unique(IDENTIFY))) # reset your factor-column based on that order
like image 32
zach Avatar answered Nov 03 '22 13:11

zach