Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to tidy messy data [closed]

I have a messy data set, which generally resembles the output of the following

schools_messy <- tibble::tribble(
  ~data,
  "state:maryland",
  "location:bowie||name:bowie state university",
  "grade:freshman||count:100",
  "grade:sophomore||count:200",
  "grade:junior||count:300",
  "grade:senior||count:400",
  "location:baltimore||name:coppin state university",
  "grade:freshman||count:100",
  "grade:sophomore||count:200",
  "grade:junior||count:300",
  "grade:senior||count:400",
  
  "state:virginia",
  "location:williamsburg||name:college of william and mary",
  "grade:freshman||count:100",
  "grade:sophomore||count:200",
  "grade:junior||count:300",
  "grade:senior||count:400",
  "location:fairfax||name:george mason university",
  "grade:freshman||count:100",
  "grade:sophomore||count:200",
  "grade:junior||count:300",
  "grade:senior||count:400",
)

My desired end state is to have it resemble the output of the following

schools_tidy <- tribble(
  ~state, ~location, ~name, ~grade, ~count,
  "maryland", "bowie", "bowie state university", "freshman", 100,
  "maryland", "bowie", "bowie state university", "sophomore", 200,
  "maryland", "bowie", "bowie state university", "junior", 300,
  "maryland", "bowie", "bowie state university", "senior", 400,
  "maryland", "baltimore", "coppin state university", "freshman", 100,
  "maryland", "baltimore", "coppin state university", "sophomore", 200,
  "maryland", "baltimore", "coppin state university", "junior", 300,
  "maryland", "baltimore", "coppin state university", "senior", 400,
  "virginia", "williamsburg", "college of william and mary", "freshman", 100,
  "virginia", "williamsburg", "college of william and mary", "sophomore", 200,
  "virginia", "williamsburg", "college of william and mary", "junior", 300,
  "virginia", "williamsburg", "college of william and mary", "senior", 400,
  "virginia", "fairfax", "george mason university", "freshman", 100,
  "virginia", "fairfax", "george mason university", "sophomore", 200,
  "virginia", "fairfax", "george mason university", "junior", 300,
  "virginia", "fairfax", "george mason university", "senior", 400,
)

I'm afraid that I'm quite at a loss for how to go about cleaning up the data set. It almost resembles a mangled json data set. I tried to do some manipulation to turn it into a proper json data set, then convert it to the desired tibble end state from there, but I was unsuccessful.

like image 818
David Robie Avatar asked Sep 21 '25 12:09

David Robie


1 Answers

In base R you can do the following:

a <- sub("(count:\\d+$)", "\\1\n\n", schools_messy$data)
b <- textConnection(sub("||", "\n", a, fixed = TRUE))
d <- read.dcf(b, all = TRUE)
d[] <- lapply(d, \(x)x[cummax(seq_along(x) * !is.na(x))])
d
      state     location                        name     grade count
1  maryland        bowie      bowie state university  freshman   100
2  maryland        bowie      bowie state university sophomore   200
3  maryland        bowie      bowie state university    junior   300
4  maryland        bowie      bowie state university    senior   400
5  maryland    baltimore     coppin state university  freshman   100
6  maryland    baltimore     coppin state university sophomore   200
7  maryland    baltimore     coppin state university    junior   300
8  maryland    baltimore     coppin state university    senior   400
9  virginia williamsburg college of william and mary  freshman   100
10 virginia williamsburg college of william and mary sophomore   200
11 virginia williamsburg college of william and mary    junior   300
12 virginia williamsburg college of william and mary    senior   400
13 virginia      fairfax     george mason university  freshman   100
14 virginia      fairfax     george mason university sophomore   200
15 virginia      fairfax     george mason university    junior   300
16 virginia      fairfax     george mason university    senior   400
like image 117
KU99 Avatar answered Sep 23 '25 02:09

KU99