
What can R do about a messy data format?

Tags:

dataframe

r

Every so often, the data in a Stack Overflow question is posted as an ASCII-art table, as in this question. This is not the first time I have seen it, so I decided to ask about it and to answer with a way to make data posted like this usable.

I will post the dataset example here just in case the question is deleted.

+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
| 2018-06-03 | A    | A5   |        2 |                        1 |
| 2018-06-04 | A    | A6   |        0 |                        3 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+

As you can see, this is not the right way to post data. As one user wrote in a comment:

It must've taken a bit of time to format the data the way you're showing it here. Unfortunately this is not a good format for us to copy & paste.

I believe this says it all. The asker is well-intentioned and clearly spent time and effort trying to be helpful, but the result cannot be pasted straight into R.

What can R code do to make that table usable, if anything? Will it take a great deal of trouble?

Rui Barradas asked Aug 26 '18 06:08



1 Answer

Using data.table::fread:

library(data.table)

# The posted table, pasted verbatim into a string
x = '
+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
| 2018-06-03 | A    | A5   |        2 |                        1 |
| 2018-06-04 | A    | A6   |        0 |                        3 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+
'

# Delete every line that starts with "+", then let fread split on "|"
fread(gsub('\\+.+\\n', '', x, perl = TRUE), drop = c(1, 7))

#          Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01    A   A1        0                        0
# 2: 2018-06-03    A   A2        0                        1
# 3: 2018-06-03    A   A3        0                        2
# 4: 2018-06-03    A   A4        1                        1
# 5: 2018-06-03    A   A5        2                        1
# 6: 2018-06-04    A   A6        0                        3
# 7: 2018-06-01    B   B1        0                        1
# 8: 2018-06-02    B   B2        0                        2
# 9: 2018-06-03    B   B3        0                        3

The gsub call removes the horizontal rules, that is, the lines starting with +. drop = c(1, 7) discards the two empty columns that the leading and trailing | on each line would otherwise create.
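For readers without data.table, the same two steps (drop the rule lines, then get rid of the edge delimiters) can be sketched in base R. This is only an illustration, not part of the answer above: the helper name read_ascii_table is hypothetical, and the sample is a shortened copy of the posted table.

```r
# Hypothetical helper: parse a "+---+" / "| a | b |" style table with base R only.
read_ascii_table <- function(txt) {
  # Split the pasted string into individual lines.
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  # Keep only rows that start with a pipe; this drops the rule lines and blanks.
  lines <- grep("^[[:space:]]*\\|", lines, value = TRUE)
  # Strip the leading and trailing pipe so no empty edge columns appear.
  lines <- sub("^[[:space:]]*\\|", "", lines)
  lines <- sub("\\|[[:space:]]*$", "", lines)
  # Split the remaining text on "|" and trim the padding around each field.
  read.table(text = lines, sep = "|", header = TRUE,
             strip.white = TRUE, stringsAsFactors = FALSE)
}

txt <- "
+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
+------------+------+------+----------+--------------------------+
"

df <- read_ascii_table(txt)
df$Date <- as.Date(df$Date)  # dates are read as character, so convert explicitly
str(df)
```

Removing the edge pipes before parsing means there is nothing to drop afterwards, at the cost of two extra sub calls. The sketch assumes no cell contains a literal |, a quote character, or a #.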

dww answered Oct 01 '22 03:10