What can R do about a messy data format?

Tags: dataframe, r

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.

I will post the dataset example here just in case the question is deleted.

    +------------+------+------+----------+--------------------------+
    |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    | 2018-06-03 | A    | A4   |        1 |                        1 |
    | 2018-06-03 | A    | A5   |        2 |                        1 |
    | 2018-06-04 | A    | A6   |        0 |                        3 |
    | 2018-06-01 | B    | B1   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+

As you can see, this is not the right way to post data. As a user wrote in a comment:

"It must've taken a bit of time to format the data the way you're showing it here. Unfortunately this is not a good format for us to copy & paste."

I believe this says it all. The asker is well-intentioned and put time and work into presenting the data nicely, but the result cannot be read into R directly.

What can R code do to make that table usable, if anything? Will it take a great deal of trouble?

946

asked Aug 26 '18 06:08

Rui Barradas

1 Answer

Using data.table::fread:

    library(data.table)

    x = '
    +------------+------+------+----------+--------------------------+
    |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    | 2018-06-03 | A    | A4   |        1 |                        1 |
    | 2018-06-03 | A    | A5   |        2 |                        1 |
    | 2018-06-04 | A    | A6   |        0 |                        3 |
    | 2018-06-01 | B    | B1   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+
    '

    # Strip the +---+ rule lines, then let fread split the rest on "|".
    # Columns 1 and 7 are the empty ones produced by the outer "|" delimiters.
    fread(gsub('\\+.+\\n' ,'', x, perl = TRUE), drop = c(1, 7))

    #          Date Emp1 Case Priority PriorityCountinLast7days
    # 1: 2018-06-01    A   A1        0                        0
    # 2: 2018-06-03    A   A2        0                        1
    # 3: 2018-06-03    A   A3        0                        2
    # 4: 2018-06-03    A   A4        1                        1
    # 5: 2018-06-03    A   A5        2                        1
    # 6: 2018-06-04    A   A6        0                        3
    # 7: 2018-06-01    B   B1        0                        1
    # 8: 2018-06-02    B   B2        0                        2
    # 9: 2018-06-03    B   B3        0                        3

The gsub call removes the horizontal rules, i.e. every line that starts with +. drop = c(1, 7) then discards the two empty columns that the | delimiters at the start and end of each row would otherwise create.
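If data.table is not available, the same two steps (drop the rule lines, get rid of the outer delimiters) can be sketched in base R. This is only an illustration, not part of the answer above: the helper name read_ascii_table is made up here, the sample is a shortened copy of the question's table, and the sketch assumes no cell contains a | character or a quote.

```r
# Shortened copy of the table from the question (first three data rows only).
x <- '
+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
+------------+------+------+----------+--------------------------+
'

# Hypothetical helper: parse an ASCII-art table using base R only.
read_ascii_table <- function(txt) {
  # Split the single string into lines and trim surrounding whitespace.
  lines <- trimws(strsplit(txt, "\n", fixed = TRUE)[[1]])
  # Keep only rows that start with "|"; this drops the +---+ rules and blank lines.
  lines <- lines[startsWith(lines, "|")]
  # Remove the leading and trailing "|" so no empty edge columns appear.
  lines <- sub("\\|$", "", sub("^\\|", "", lines))
  # Split on the remaining "|" and strip the padding inside each cell.
  read.table(text = lines, sep = "|", header = TRUE,
             strip.white = TRUE, stringsAsFactors = FALSE)
}

df <- read_ascii_table(x)
str(df)
```

Note that read.table leaves the Date column as character; as.Date(df$Date) converts it if real dates are needed.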

130

answered Oct 01 '22 03:10

dww
