
Cut function split by factor levels

Tags:

r

I have a problem with the cut function. I have this situation:

 codice
1 11GP2-0016
2 11GP2-0016
3 11GP2-0016
4  11OL2-074
5  11OL2-074    

and I would like to have a new variable "campione", split by the variable "codice", like this:

    codice campione
1 11GP2-0016    [1,3]
2 11GP2-0016    [1,3]
3 11GP2-0016    [1,3]
4  11OL2-074    (4,5]
5  11OL2-074    (4,5]

How can I use the cut function to split "codice", creating a variable that shows that rows 1 to 3 have the same code, rows 4 to 5 have the same code, and so on?

I need to solve another question. For the same issue I would like to obtain:

 codice campione
1 11GP2-0016    [11GP2-0016,11GP2-0016,11GP2-0016]
2 11GP2-0016    [11GP2-0016,11GP2-0016,11GP2-0016]
3 11GP2-0016    [11GP2-0016,11GP2-0016,11GP2-0016]
4  11OL2-074    (11OL2-074,11OL2-074]
5  11OL2-074    (11OL2-074,11OL2-074]

Is there any solution to do this?

Spigonico asked Oct 18 '12 15:10


2 Answers

This will do it. You can add brackets/parens, if you want.

dat <- read.table(text='codice
1 11GP2-0016
2 11GP2-0016
3 11GP2-0016
4  11OL2-074
5  11OL2-074', header=TRUE)

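within(dat, 
    campione <- with(rle(as.character(codice)), {
        # first row of each run; cumsum() stays correct even if a code reappears later
        starts <- cumsum(c(1, head(lengths, -1)))
        # last row of each run
        ends <- starts + lengths - 1
        # expand one "start,end" label per run back to one label per row
        inverse.rle(list(values=paste(starts, ends, sep=','), lengths=lengths))
    })
)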

#       codice campione
# 1 11GP2-0016      1,3
# 2 11GP2-0016      1,3
# 3 11GP2-0016      1,3
# 4  11OL2-074      4,5
# 5  11OL2-074      4,5       
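
Since the question asks about cut() specifically, here is a minimal sketch of that route, assuming rows with the same codice are contiguous. The cumulative run lengths serve as break points. The variable names r and breaks are just for illustration, and the labels follow cut()'s default half-open convention, so they read (0,3] and (3,5] rather than the [1,3] and (4,5] shown in the question.

```r
# Same data as above, built directly so the snippet is self-contained
dat <- data.frame(codice = c("11GP2-0016", "11GP2-0016", "11GP2-0016",
                             "11OL2-074", "11OL2-074"),
                  stringsAsFactors = FALSE)

# Run lengths of consecutive identical codes
r <- rle(dat$codice)

# Break points: 0, then the last row number of each run
breaks <- c(0, cumsum(r$lengths))

# Bin the row numbers; each bin covers exactly one run
dat$campione <- cut(seq_len(nrow(dat)), breaks = breaks)

as.character(dat$campione)
# "(0,3]" "(0,3]" "(0,3]" "(3,5]" "(3,5]"
```

Starting the breaks at 0 instead of 1 avoids duplicated break points when the first run has length 1.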
Matthew Plourde answered Sep 30 '22 10:09


Using your data:

d <- read.table(text = "1 11GP2-0016
2 11GP2-0016
3 11GP2-0016
4  11OL2-074
5  11OL2-074", row.names = 1, stringsAsFactors = FALSE)
names(d) <- "codice"

Here is a slightly convoluted example using rle():

drle <- with(d, rle(codice))

This gives us the run lengths of codice:

> drle
Run Length Encoding
  lengths: int [1:2] 3 2
  values : chr [1:2] "11GP2-0016" "11OL2-074"

and it is the $lengths component that I manipulate to create two indices, the start (ind1) and the end (ind2) location of each run:

# cumsum() accumulates the offsets so this also works with more than two runs
ind1 <- with(drle, rep(seq_along(lengths), times = lengths) +
                     rep(cumsum(c(0, head(lengths, -1) - 1)), times = lengths))
ind2 <- ind1 + with(drle, rep(lengths - 1, times = lengths))

Then I just paste these together:

d <- transform(d, campione = paste0("[", ind1, ",", ind2, "]"))

Giving

> head(d)
      codice campione
1 11GP2-0016    [1,3]
2 11GP2-0016    [1,3]
3 11GP2-0016    [1,3]
4  11OL2-074    [4,5]
5  11OL2-074    [4,5]
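
The start and end positions can also be read straight off the cumulative run lengths, which may be easier to follow. Below is a minimal sketch, again assuming equal codes sit in consecutive rows. run_ranges() and run_values() are hypothetical helper names, not part of base R. The second helper is one possible take on the other layout asked about in the question, where the label repeats the code itself; it uses square brackets throughout rather than the mixed brackets shown there.

```r
# Hypothetical helpers, written for illustration only.

# Label each element with the first and last position of its run.
run_ranges <- function(x) {
  r <- rle(as.character(x))
  ends <- cumsum(r$lengths)          # last position of each run
  starts <- ends - r$lengths + 1     # first position of each run
  rep(paste0("[", starts, ",", ends, "]"), times = r$lengths)
}

# Label each element with the codes of its run joined by commas.
run_values <- function(x) {
  r <- rle(as.character(x))
  labels <- vapply(seq_along(r$values), function(i) {
    paste0("[", paste(rep(r$values[i], r$lengths[i]), collapse = ","), "]")
  }, character(1))
  rep(labels, times = r$lengths)
}

codice <- c("11GP2-0016", "11GP2-0016", "11GP2-0016", "11OL2-074", "11OL2-074")

run_ranges(codice)
# "[1,3]" "[1,3]" "[1,3]" "[4,5]" "[4,5]"

run_values(codice)
# "[11GP2-0016,11GP2-0016,11GP2-0016]" (three times), then
# "[11OL2-074,11OL2-074]" (twice)
```

Either result can be attached to the data frame with something like d$campione <- run_ranges(d$codice).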
Gavin Simpson answered Sep 30 '22 10:09