I have a text data file like this (with multiple rows for respective category [A,B,C]):
A=1,2,3,6,
7,9
10
B=3,4,5
C=5,7,8,10,11,
13,14
When I load to R it becomes a data frame with 6 obs. and 1 variable
df <- read.delim("file.text",header = FALSE)
v1
1 A=1,2,3,6,
2 7,9
3 10
4 B=3,4,5
5 C=5,7,8,10,11,
6 13,14
How can I change to this format?
A B C
1 A
2 A
3 A B
4 B
5 B C
6 A
7 A C
8 C
9 A
10 A C
11 C
13 C
14 C
Thanks!
The basic approach would be to read the data in as a string, split on = and ,, and then figure out the best way to identify which group each number belongs with.
In the following approach, I've intentionally used type conversion to create the groups. Since this will involve coercion from character to numeric, since there are characters in the split values, expect several warning messages.
# Load the data.table package.
library(data.table)
# Read in the data.
x <- fread("A=1,2,3,6,
7,9
10
B=3,4,5
C=5,7,8,10,11,
13,14", sep = "\n", header = FALSE)
x[, unlist(strsplit(V1, "=|,"), use.names = FALSE, recursive = FALSE), .I][
, list(ind = as.integer(V1), col = rep(V1[1], .N)), cumsum(is.na(as.integer(V1)))][
, dcast(na.omit(.SD), ind ~ col, value.var = "col", fill = "")]
# ind A B C
# 1: 1 A
# 2: 2 A
# 3: 3 A B
# 4: 4 B
# 5: 5 B C
# 6: 6 A
# 7: 7 A C
# 8: 8 C
# 9: 9 A
# 10: 10 A C
# 11: 11 C
# 12: 13 C
# 13: 14 C
Here's another alternative using cSplit from my "splitstackshape" package. "x" is the same data read in using fread.
library(splitstackshape)
cSplit(
cSplit(x[, toString(V1), cumsum(grepl("[A-Z]", V1))], "V1", "="), "V1_2", ",", "long")[
, dcast(.SD, V1_2 ~ V1_1, value.var = "V1_1", fill = "")]
Here is a tidyverse approach.
library(dplyr)
library(tidyr)
df %>%
#Remove commas at the end of the string
mutate(V1 = sub(",$", "", V1)) %>%
#Divide data into separate column based on "=" sign
separate(V1, into = c("let", "num"), sep = "=", fill = "left") %>%
#fill the NA values with it's respective group
fill(let) %>%
#Convert comma-separated value into different rows
separate_rows(num, convert = TRUE) %>%
#Convert data to wide format filling missing values with empty string
pivot_wider(names_from = let, values_from = let, values_fill = list(let = "")) %>%
#Arrange data according to numbers
arrange(num)
# # A tibble: 13 x 4
# num A B C
# <int> <chr> <chr> <chr>
# 1 1 A "" ""
# 2 2 A "" ""
# 3 3 A B ""
# 4 4 "" B ""
# 5 5 "" B C
# 6 6 A "" ""
# 7 7 A "" C
# 8 8 "" "" C
# 9 9 A "" ""
#10 10 A "" C
#11 11 "" "" C
#12 13 "" "" C
#13 14 "" "" C
If you want num column as rownames add %>% column_to_rownames('num').
data
df <- read.table(text = "A=1,2,3,6,
7,9
10
B=3,4,5
C=5,7,8,10,11,
13,14")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With