I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first column is a date and the second column has "Monthly return" as the header. In this example, the data starts in cell B5: <img src="https://i.stack.imgur.com/X3ndH.png" alt="sample spreadsheet"> How do I automate the search of Excel cells for my "Monthly return" string using R? At the moment, the best idea I can come up with is to upload everything in R starting at cell A1 and sort out the mess in the resulting (huge) matrices. I'm hoping for a more elegant solution

I haven't found a way to do this elegantly, but I'm very familiar with this problem (getting data from FactSet PA reports -> Excel -> R, right?). I understand different reports have different formats, and this can be a pain. For a slightly different version of annoyingly formatted spreadsheets, I do the following. It's not the most elegant (it requires two reads of the file) but it works. I like reading the file twice, to make sure the columns are of the correct type, and with good headers. It's easy to mess up column imports, so I'd rather have my code read the file twice than go through and clean up columns myself, and the read_excel defaults, if you start at the right row, are pretty good. Also, it's worth noting that as of today (2017-04-20), readxl had an update. I installed the new version to see if that would make this very easy, but I don't believe that's the case, although I could be mistaken. <pre class="prettyprint"><code>library(readxl) library(stringr) library(dplyr) f_path <- file.path("whatever.xlsx") if (!file.exists(f_path)) { f_path <- file.choose() } # I read this twice, temp_read to figure out where the data actually starts... # Maybe you need something like this - # excel_sheets <- readxl::excel_sheets(f_path) # desired_sheet <- which(stringr::str_detect(excel_sheets,"2 Factor Brinson Attribution")) desired_sheet <- 1 temp_read <- readxl::read_excel(f_path,sheet = desired_sheet) skip_rows <- NULL col_skip <- 0 search_string <- "Monthly Returns" max_cols_to_search <- 10 max_rows_to_search <- 10 # Note, for the - 0, you may need to add/subtract a row if you end up skipping too far later. while (length(skip_rows) == 0) { col_skip <- col_skip + 1 if (col_skip == max_cols_to_search) break skip_rows <- which(stringr::str_detect(temp_read[1:max_rows_to_search,col_skip][[1]],search_string)) - 0 } # ... now we re-read from the known good starting point. real_data <- readxl::read_excel( f_path, sheet = desired_sheet, skip = skip_rows ) # You likely don't need this if you start at the right row # But given that all weird spreadsheets are weird in their own way # You may want to operate on the col_skip, maybe like so: # real_data <- real_data %>% # select(-(1:col_skip)) </code></pre>

With a general purpose package like readxl, you'll have to read twice, if you want to enjoy automatic type conversion. I assume you have some sort of upper bound on the number of junk rows at the front? Here I assumed that was 10. I'm iterating over worksheets in one workbook, but the code would look pretty similar if iterating over workbooks. I'd write one function to handle a single worksheet or workbook then use <code>lapply()</code> or <code>purrr::map()</code>. This function will encapsulate the skip-learning read and the "real" read. <pre class="prettyprint lang-r prettyprint-override"><code>library(readxl) two_passes <- function(path, sheet = NULL, n_max = 10) { first_pass <- read_excel(path = path, sheet = sheet, n_max = n_max) skip <- which(first_pass[[2]] == "Monthly return") message("For sheet '", if (is.null(sheet)) 1 else sheet, "' we'll skip ", skip, " rows.") read_excel(path, sheet = sheet, skip = skip) } (sheets <- excel_sheets("so.xlsx")) #> [1] "sheet_one" "sheet_two" sheets <- setNames(sheets, sheets) lapply(sheets, two_passes, path = "so.xlsx") #> For sheet 'sheet_one' we'll skip 4 rows. #> For sheet 'sheet_two' we'll skip 6 rows. #> $sheet_one #> # A tibble: 6 × 2 #> X__1 `Monthly return` #> <dttm> <dbl> #> 1 2017-03-14 0.00907 #> 2 2017-04-14 0.00069 #> 3 2017-05-14 0.01890 #> 4 2017-06-14 0.00803 #> 5 2017-07-14 -0.01998 #> 6 2017-08-14 0.00697 #> #> $sheet_two #> # A tibble: 6 × 2 #> X__1 `Monthly return` #> <dttm> <dbl> #> 1 2017-03-14 0.00907 #> 2 2017-04-14 0.00069 #> 3 2017-05-14 0.01890 #> 4 2017-06-14 0.00803 #> 5 2017-07-14 -0.01998 #> 6 2017-08-14 0.00697 </code></pre>

In those cases it's important to know the possible conditions of your data. I'm gonna assume that you want only remove columns and rows that doesn't confrom your table. I have this Excel book: <img src="https://i.stack.imgur.com/ordVb.png" alt="enter image description here"> I added 3 blank columns at left becouse when I loaded in R with one column the program omits them. Thats for confirm that R omits empty cols at the left. First: load data <pre class="prettyprint"><code>library(xlsx) dat <- read.xlsx('book.xlsx', sheetIndex = 1) head(dat) MY.COMPANY.PTY.LTD NA. 1 MC Pension Fund <NA> 2 GROSS PERFORMANCE DETAILS <NA> 3 updated by IG on 20/04/2017 <NA> 4 <NA> Monthly return 5 Mar-14 0.0097 6 Apr-14 6e-04 </code></pre> Second: I added some cols with <code>NA</code> and <code>''</code> values in the case that your data contain some <pre class="prettyprint"><code>dat$x2 <- NA dat$x4 <- NA head(dat) MY.COMPANY.PTY.LTD NA. x2 x4 1 MC Pension Fund <NA> NA NA 2 GROSS PERFORMANCE DETAILS <NA> NA NA 3 updated by IG on 20/04/2017 <NA> NA NA 4 <NA> Monthly return NA NA 5 Mar-14 0.0097 NA NA 6 Apr-14 6e-04 NA NA </code></pre> Third: Remove columns when all values are <code>NA</code> and <code>''</code>. I have to deal with that kind of problems in past <pre class="prettyprint"><code>colSelect <- apply(dat, 2, function(x) !(length(x) == length(which(x == '' | is.na(x))))) dat2 <- dat[, colSelect] head(dat2) MY.COMPANY.PTY.LTD NA. 1 MC Pension Fund <NA> 2 GROSS PERFORMANCE DETAILS <NA> 3 updated by IG on 20/04/2017 <NA> 4 <NA> Monthly return 5 Mar-14 0.0097 6 Apr-14 6e-04 </code></pre> Fourth: Keep only rows with complete observations (it's what I supose from your example) <pre class="prettyprint"><code>rowSelect <- apply(dat2, 1, function(x) !any(is.na(x))) dat3 <- dat2[rowSelect, ] head(dat3) MY.COMPANY.PTY.LTD NA. 5 Mar-14 0.0097 6 Apr-14 6e-04 7 May-14 0.0189 8 Jun-14 0.008 9 Jul-14 -0.0199 10 Ago-14 0.00697 </code></pre> Finally if you want to keep the header you can make something like this: <pre class="prettyprint"><code>colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ]) </code></pre> or <pre class="prettyprint"><code>colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2])) dat3 Month Monthly return 5 Mar-14 0.0097 6 Apr-14 6e-04 7 May-14 0.0189 8 Jun-14 0.008 9 Jul-14 -0.0199 10 Ago-14 0.00697 </code></pre>

Reading Excel file: How to find the start cell in messy spreadsheets?

Tags:

r

excel

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first column is a date and the second column has "Monthly return" as the header. In this example, the data starts in cell B5:

sample spreadsheet

How do I automate the search of Excel cells for my "Monthly return" string using R?

At the moment, the best idea I can come up with is to upload everything in R starting at cell A1 and sort out the mess in the resulting (huge) matrices. I'm hoping for a more elegant solution

539

asked Apr 05 '17 22:04

lebelinoz

4 Answers

I haven't found a way to do this elegantly, but I'm very familiar with this problem (getting data from FactSet PA reports -> Excel -> R, right?). I understand different reports have different formats, and this can be a pain.

For a slightly different version of annoyingly formatted spreadsheets, I do the following. It's not the most elegant (it requires two reads of the file) but it works. I like reading the file twice, to make sure the columns are of the correct type, and with good headers. It's easy to mess up column imports, so I'd rather have my code read the file twice than go through and clean up columns myself, and the read_excel defaults, if you start at the right row, are pretty good.

Also, it's worth noting that as of today (2017-04-20), readxl had an update. I installed the new version to see if that would make this very easy, but I don't believe that's the case, although I could be mistaken.

library(readxl)
library(stringr)
library(dplyr)

f_path <- file.path("whatever.xlsx")

if (!file.exists(f_path)) {
  f_path <- file.choose()
}

# I read this twice, temp_read to figure out where the data actually starts...

# Maybe you need something like this - 
#   excel_sheets <- readxl::excel_sheets(f_path)
#   desired_sheet <- which(stringr::str_detect(excel_sheets,"2 Factor Brinson Attribution"))
desired_sheet <- 1
temp_read <- readxl::read_excel(f_path,sheet = desired_sheet)

skip_rows <- NULL
col_skip <- 0
search_string <- "Monthly Returns"
max_cols_to_search <- 10
max_rows_to_search <- 10

# Note, for the - 0, you may need to add/subtract a row if you end up skipping too far later.
while (length(skip_rows) == 0) {
  col_skip <- col_skip + 1
  if (col_skip == max_cols_to_search) break
  skip_rows <- which(stringr::str_detect(temp_read[1:max_rows_to_search,col_skip][[1]],search_string)) - 0

}

# ... now we re-read from the known good starting point.
real_data <- readxl::read_excel(
  f_path,
  sheet = desired_sheet,
  skip = skip_rows
)

# You likely don't need this if you start at the right row
# But given that all weird spreadsheets are weird in their own way
# You may want to operate on the col_skip, maybe like so:
# real_data <- real_data %>%
#   select(-(1:col_skip))

131

answered Oct 02 '22 15:10

Rafael Zayas

Okay, at the format was specified for xls, update from csv to the correctly suggested xls loading.

library(readxl)
data <- readxl::read_excel(".../sampleData.xls", col_types = FALSE)

You would get something similar to:

data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("", 
"Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund", 
"MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"), 
    V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("", 
    "0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1", 
"V2"), class = "data.frame", row.names = c(NA, -7L))

then you can dynamincally filter on the "Monthly return" cell and identify your matrix.

targetCell <- which(data == "Monthly return", arr.ind = T)
returns <- data[(targetCell[1] + 1):nrow(data), (targetCell[2] - 1):targetCell[2]]

answered Oct 02 '22 17:10

Tonio Liebrand

With a general purpose package like readxl, you'll have to read twice, if you want to enjoy automatic type conversion. I assume you have some sort of upper bound on the number of junk rows at the front? Here I assumed that was 10. I'm iterating over worksheets in one workbook, but the code would look pretty similar if iterating over workbooks. I'd write one function to handle a single worksheet or workbook then use lapply() or purrr::map(). This function will encapsulate the skip-learning read and the "real" read.

library(readxl)

two_passes <- function(path, sheet = NULL, n_max = 10) {
  first_pass <- read_excel(path = path, sheet = sheet, n_max = n_max)
  skip <- which(first_pass[[2]] == "Monthly return")
  message("For sheet '", if (is.null(sheet)) 1 else sheet,
          "' we'll skip ", skip, " rows.")  
  read_excel(path, sheet = sheet, skip = skip)
}

(sheets <- excel_sheets("so.xlsx"))
#> [1] "sheet_one" "sheet_two"
sheets <- setNames(sheets, sheets)
lapply(sheets, two_passes, path = "so.xlsx")
#> For sheet 'sheet_one' we'll skip 4 rows.
#> For sheet 'sheet_two' we'll skip 6 rows.
#> $sheet_one
#> # A tibble: 6 × 2
#>         X__1 `Monthly return`
#>       <dttm>            <dbl>
#> 1 2017-03-14          0.00907
#> 2 2017-04-14          0.00069
#> 3 2017-05-14          0.01890
#> 4 2017-06-14          0.00803
#> 5 2017-07-14         -0.01998
#> 6 2017-08-14          0.00697
#> 
#> $sheet_two
#> # A tibble: 6 × 2
#>         X__1 `Monthly return`
#>       <dttm>            <dbl>
#> 1 2017-03-14          0.00907
#> 2 2017-04-14          0.00069
#> 3 2017-05-14          0.01890
#> 4 2017-06-14          0.00803
#> 5 2017-07-14         -0.01998
#> 6 2017-08-14          0.00697

answered Oct 02 '22 15:10

jennybryan

In those cases it's important to know the possible conditions of your data. I'm gonna assume that you want only remove columns and rows that doesn't confrom your table.

I have this Excel book: enter image description here

I added 3 blank columns at left becouse when I loaded in R with one column the program omits them. Thats for confirm that R omits empty cols at the left.

First: load data

library(xlsx)
dat <- read.xlsx('book.xlsx', sheetIndex = 1)
head(dat)

            MY.COMPANY.PTY.LTD            NA.
1             MC  Pension Fund           <NA>
2    GROSS PERFORMANCE DETAILS           <NA>
3 updated  by IG on 20/04/2017           <NA>
4                         <NA> Monthly return
5                       Mar-14         0.0097
6                       Apr-14          6e-04

Second: I added some cols with NA and '' values in the case that your data contain some

dat$x2 <- NA
dat$x4 <- NA
head(dat)

            MY.COMPANY.PTY.LTD            NA. x2 x4
1             MC  Pension Fund           <NA> NA NA
2    GROSS PERFORMANCE DETAILS           <NA> NA NA
3 updated  by IG on 20/04/2017           <NA> NA NA
4                         <NA> Monthly return NA NA
5                       Mar-14         0.0097 NA NA
6                       Apr-14          6e-04 NA NA

Third: Remove columns when all values are NA and ''. I have to deal with that kind of problems in past

colSelect <- apply(dat, 2, function(x) !(length(x) == length(which(x == '' | is.na(x)))))
dat2 <- dat[, colSelect]
head(dat2)

            MY.COMPANY.PTY.LTD            NA.
1             MC  Pension Fund           <NA>
2    GROSS PERFORMANCE DETAILS           <NA>
3 updated  by IG on 20/04/2017           <NA>
4                         <NA> Monthly return
5                       Mar-14         0.0097
6                       Apr-14          6e-04

Fourth: Keep only rows with complete observations (it's what I supose from your example)

rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
dat3 <- dat2[rowSelect, ]
head(dat3)

   MY.COMPANY.PTY.LTD     NA.
5              Mar-14  0.0097
6              Apr-14   6e-04
7              May-14  0.0189
8              Jun-14   0.008
9              Jul-14 -0.0199
10             Ago-14 0.00697

Finally if you want to keep the header you can make something like this:

colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])

colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))
dat3

    Month Monthly return
5  Mar-14         0.0097
6  Apr-14          6e-04
7  May-14         0.0189
8  Jun-14          0.008
9  Jul-14        -0.0199
10 Ago-14        0.00697

answered Oct 02 '22 17:10

gonzalez.ivan90

Related questions
                            
                                SpatialPoints and SpatialPointsDataframe
                            
                                In ggplot2 and facet_wrap, how to remove all margins and padding yet keep strip.text?
                            
                                ggplot renaming facet labels in facet_wrap
                            
                                How to get the nth row from data frame in R
                            
                                is it better to use integer64, numeric, or character for large integer id numbers?
                            
                                How to type tilde in formulas in RMarkdown
                            
                                Replace all NA values for variable with one row equal to 0
                            
                                How do I best simulate an arbitrary univariate random variate using its probability function?
                            
                                How can I switch between R sessions in emacs-ess?
                            
                                Is there a command to export the current plot?
                            
                                Adding custom tick marks on R plot
                            
                                Building shared libraries in subdirectories
                            
                                GraphViz: Windows PATH not set with new installer, issue when calling from R
                            
                                Empty factors in "by" data.table
                            
                                Color axis text by variable
                            
                                Integrating error: maximum number of subdivisions reached
                            
                                The perils of aligning plots in ggplot
                            
                                Efficiently remove all NULL values in a list and all sublists
                            
                                Adding custom CSS tags to an RMarkdown html document
                            
                                Variables as default arguments of a function, using dplyr

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With