I have this dataset that I'm trying to transform to get the "from" and "to" positions within a particular grouping of data points that pass a test.
Here's how the data looks:
pos <- seq(from = 10, to = 100, by = 10)
test <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0)
df <- data.frame(pos, test)
So you can see that positions 10, 20, and 30, as well as 70, 80, and 90, pass the test (because test = 1), but the rest of the points don't. The answer I'm looking for is a data frame that looks something like the "answer" data frame in the code below:
peaknum <- c(1, 2)
from <- c(10, 70)
to <- c(30, 90)
answer <- data.frame(peaknum, from, to)
Any suggestions as to how I can transform the dataset? I'm stumped.
Thanks, Steve
We can use data.table. The rleid function creates run-length group ids ('peaknum') from adjacent values of 'test' that are the same. Using 'peaknum' as the grouping variable, we get the 'min' and 'max' of 'pos', while specifying 'i' as 'test == 1' to subset the rows. If needed, the 'peaknum' values can be changed to a sequence ('seq_len(.N)').
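library(data.table)
# rleid() gives each run of identical adjacent 'test' values its own id;
# the trailing [] prints the result, since := returns invisibly
setDT(df)[, peaknum := rleid(test)][test == 1,
  list(from = min(pos), to = max(pos)), by = peaknum][, peaknum := seq_len(.N)][]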
# peaknum from to
#1: 1 10 30
#2: 2 70 90
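To see what the run-length ids look like before the subsetting step, here is a small base R sketch. It is only a stand-in for rleid (the cumsum/diff trick is my own illustration, not part of data.table), and run_id is a made-up name:

```r
# Same 'test' vector as in the question
test <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0)

# Start a new id whenever the value differs from the previous one;
# this mimics what rleid(test) is used for above
run_id <- cumsum(c(TRUE, diff(test) != 0))
print(run_id)
# [1] 1 1 1 2 2 2 3 3 3 4
```

The passing runs end up with ids 1 and 3, which is why the last step renumbers 'peaknum' with seq_len(.N).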
We can do it with dplyr, though separating the runs is a little ugly:
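library(dplyr)
runs <- rle(df$test)[["lengths"]]   # length of each run of identical test values
df %>%
  group_by(peaknum = rep(seq_along(runs), runs)) %>%   # one id per run; seq_along is safe for a single run
  filter(test == 1) %>%                                # keep passing rows only
  summarise(from = min(pos),
            to = max(pos)) %>%
  mutate(peaknum = seq_along(peaknum))                 # renumber 1, 2, ...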
# Source: local data frame [2 x 3]
# peaknum from to
# (int) (dbl) (dbl)
# 1 1 10 30
# 2 2 70 90
What it does:
group_by uses rle to add a column that is a sequence along the repeated numbers in test, and groups it for summarise later;
filter chops rows down to only those where test is 1;
summarise collapses the groups and adds max and min for each;
mutate cleans up the numbering of peaknum.
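If you would rather avoid extra packages, the same steps can be sketched in base R with rle alone. This is only a rough sketch: it assumes pos is sorted in increasing order, so the first and last row of each run give its min and max, and the helper names (r, starts, ends, keep) are made up for illustration:

```r
# Example data from the question
pos  <- seq(from = 10, to = 100, by = 10)
test <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0)

r      <- rle(test)              # run lengths and the value of each run
ends   <- cumsum(r$lengths)      # row index where each run ends
starts <- ends - r$lengths + 1   # row index where each run starts
keep   <- r$values == 1          # runs that pass the test

# Assumes pos is increasing, so first/last row of a run are its min/max
answer <- data.frame(
  peaknum = seq_len(sum(keep)),
  from    = pos[starts[keep]],
  to      = pos[ends[keep]]
)
print(answer)
```

For the example data this should give the same two peaks, 10 to 30 and 70 to 90.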