
extracting first and last positions in a dataset

Tags: r, dplyr

I have this dataset that I'm trying to transform to get the "from" and "to" positions within a particular grouping of data points that pass a test.

Here's how the data looks:

pos <- seq(from = 10, to = 100, by = 10)
test <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0)
df <- data.frame(pos, test)

So you can see that positions 10, 20, and 30, as well as 70, 80, and 90 pass the test (b/c test = 1) but the rest of the points don't. The answer I'm looking for would be a data frame that looks something like the "answer" data frame in the code below:

peaknum <- c(1, 2)
from <- c(10, 70)
to <- c(30, 90)
answer <- data.frame(peaknum, from, to)

Any suggestions as to how I can transform the dataset? I'm stumped.

Thanks, Steve

asked Mar 17 '16 19:03 by Steven



2 Answers

We can use data.table. The rleid function creates run-length group ids ('peaknum'): each run of adjacent, identical 'test' values gets its own id. Using 'peaknum' as the grouping variable, we get the 'min' and 'max' of 'pos', while specifying 'i' as 'test==1' to subset the rows. Finally, the 'peaknum' values are changed to a sequence ('seq_len(.N)'), since rleid also numbers the runs where 'test' is 0.

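library(data.table)
# setDT() converts df to a data.table by reference; the trailing [] prints the result
setDT(df)[, peaknum := rleid(test)][test == 1,
   list(from = min(pos), to = max(pos)), by = peaknum][, peaknum := seq_len(.N)][]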
#   peaknum from to
#1:       1   10 30
#2:       2   70 90
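
To see why the final renumbering step is there, it can help to look at the intermediate run ids. The following is only a sketch on the question's data; the names `dt`, `run`, and `res` are illustrative and do not appear in the answer above.

```r
library(data.table)

pos  <- seq(from = 10, to = 100, by = 10)
test <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0)
dt   <- data.table(pos, test)   # 'dt' is an illustrative name

# rleid() numbers every run of identical adjacent values,
# passing or not: 1 1 1 2 2 2 3 3 3 4
dt[, run := rleid(test)]

# Keep only the passing rows, then summarise per run.
# Runs 1 and 3 survive the filter, so the ids have a gap.
res <- dt[test == 1, .(from = min(pos), to = max(pos)), by = run]
print(res)
```

Here the surviving ids are 1 and 3 rather than 1 and 2, which is what the `seq_len(.N)` step tidies up.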

answered Sep 21 '22 21:09 by akrun


We can do it with dplyr, though separating the runs of test values is a little ugly:

library(dplyr)
df %>% 
  group_by(peaknum = rep(seq_along(rle(test)[['lengths']]), rle(test)[['lengths']])) %>% 
  filter(test == 1) %>% 
  summarise(from = min(pos), 
            to = max(pos)) %>%
  mutate(peaknum = seq_along(peaknum))

# Source: local data frame [2 x 3]

#   peaknum  from    to
#     (int) (dbl) (dbl)
# 1       1    10    30
# 2       2    70    90

What it does:

  • group_by uses rle to add a column, peaknum, that numbers each run of repeated values in test, and groups by it for summarise later;
  • filter keeps only the rows where test is 1;
  • summarise collapses each group to one row holding the min and max of pos;
  • and lastly mutate renumbers peaknum so that it runs 1, 2, and so on.
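
The same four steps can be traced in base R, which makes the intermediate values easy to inspect. This is a sketch under the assumption that df is the data frame from the question; `runs`, `run_id`, `passing`, and `passing_id` are illustrative names, not part of the answer.

```r
pos  <- seq(from = 10, to = 100, by = 10)
test <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0)
df   <- data.frame(pos, test)

# Step 1: one id per run, repeated to the length of that run
runs   <- rle(df$test)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
# run_id is 1 1 1 2 2 2 3 3 3 4

# Step 2: keep only the rows (and their run ids) where test is 1
passing    <- df[df$test == 1, ]
passing_id <- run_id[df$test == 1]

# Step 3: min and max of pos within each surviving run
from <- tapply(passing$pos, passing_id, min)
to   <- tapply(passing$pos, passing_id, max)

# Step 4: renumber the surviving runs 1, 2, ...
answer <- data.frame(peaknum = seq_along(from),
                     from = as.vector(from),
                     to = as.vector(to))
print(answer)
```

Using seq_along() rather than seq() on the run lengths keeps the id vector correct even when test contains a single run.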

answered Sep 20 '22 21:09 by alistaire