Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.
I will post the dataset example here just in case the question is deleted.
    +------------+------+------+----------+--------------------------+
    | Date       | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    | 2018-06-03 | A    | A4   |        1 |                        1 |
    | 2018-06-03 | A    | A5   |        2 |                        1 |
    | 2018-06-04 | A    | A6   |        0 |                        3 |
    | 2018-06-01 | B    | B1   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+
As you can see, this is not the right way to post data. As a user wrote in a comment:
It must've taken a bit of time to format the data the way you're showing it here. Unfortunately this is not a good format for us to copy & paste.
I believe this says it all. The asker is well intentioned and clearly spent time and effort trying to be helpful, but the result is not good.
What can R code do to make that table usable, if anything? Will it take a great deal of trouble?
Using data.table::fread:
    library(data.table)

    x = '
    +------------+------+------+----------+--------------------------+
    | Date       | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    | 2018-06-03 | A    | A4   |        1 |                        1 |
    | 2018-06-03 | A    | A5   |        2 |                        1 |
    | 2018-06-04 | A    | A6   |        0 |                        3 |
    | 2018-06-01 | B    | B1   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+
    '

    # Strip the "+----+" rule lines, then let fread split the rest on "|"
    fread(gsub('\\+.+\\n', '', x, perl = TRUE), drop = c(1, 7))
    #          Date Emp1 Case Priority PriorityCountinLast7days
    # 1: 2018-06-01    A   A1        0                        0
    # 2: 2018-06-03    A   A2        0                        1
    # 3: 2018-06-03    A   A3        0                        2
    # 4: 2018-06-03    A   A4        1                        1
    # 5: 2018-06-03    A   A5        2                        1
    # 6: 2018-06-04    A   A6        0                        3
    # 7: 2018-06-01    B   B1        0                        1
    # 8: 2018-06-02    B   B2        0                        2
    # 9: 2018-06-03    B   B3        0                        3
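The gsub part removes the horizontal rules (the lines that start with +). The drop argument removes the two empty extra columns, the first and the seventh, caused by the | delimiters at the start and end of each line.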
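If data.table is not available, the same two steps (discard the rule lines, then drop the empty edge columns) can be sketched in base R. This is only an illustration under stated assumptions: read_ascii_table is a hypothetical helper name, not part of the answer above, and it assumes every data line starts and ends with | and that no cell contains a | character. The sample uses a few rows from the table in the question.

```r
# Hypothetical helper (not from the original answer): parse an ASCII-art
# table with base R only.
read_ascii_table <- function(txt) {
  # Split the pasted text into lines and trim surrounding whitespace
  lines <- trimws(strsplit(txt, "\n", fixed = TRUE)[[1]])
  # Keep only header/data lines; rule lines start with "+", blanks are dropped
  lines <- lines[startsWith(lines, "|")]
  df <- read.table(text = lines, sep = "|", header = TRUE,
                   strip.white = TRUE, stringsAsFactors = FALSE)
  # The leading and trailing "|" create an empty first and last column
  df[, -c(1, ncol(df)), drop = FALSE]
}

x <- "
+------------+------+------+----------+--------------------------+
| Date       | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
+------------+------+------+----------+--------------------------+
"

df <- read_ascii_table(x)
str(df)
```

Dropping the first and last columns by position, rather than hard-coding 1 and 7, lets the sketch handle tables with a different number of columns. Note that read.table passes header names through make.names by default, so headers containing spaces would be altered.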