Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reading Excel file: How to find the start cell in messy spreadsheets?

Tags:

r

excel

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first column is a date and the second column has "Monthly return" as the header. In this example, the data starts in cell B5:

sample spreadsheet

How do I automate the search of Excel cells for my "Monthly return" string using R?

At the moment, the best idea I can come up with is to upload everything in R starting at cell A1 and sort out the mess in the resulting (huge) matrices. I'm hoping for a more elegant solution

like image 539
lebelinoz Avatar asked Apr 05 '17 22:04

lebelinoz


People also ask

How do I separate messy data in Excel?

You will use Excel's built-in function to remove duplicates, as shown below. The original dataset has two rows as duplicates. To eliminate the duplicate data, you need to select the data option in the toolbar, and in the Data Tools ribbon, select the "Remove Duplicates" option.


4 Answers

I haven't found a way to do this elegantly, but I'm very familiar with this problem (getting data from FactSet PA reports -> Excel -> R, right?). I understand different reports have different formats, and this can be a pain.

For a slightly different version of annoyingly formatted spreadsheets, I do the following. It's not the most elegant (it requires two reads of the file) but it works. I like reading the file twice, to make sure the columns are of the correct type, and with good headers. It's easy to mess up column imports, so I'd rather have my code read the file twice than go through and clean up columns myself, and the read_excel defaults, if you start at the right row, are pretty good.

Also, it's worth noting that as of today (2017-04-20), readxl had an update. I installed the new version to see if that would make this very easy, but I don't believe that's the case, although I could be mistaken.

library(readxl)
library(stringr)
library(dplyr)

f_path <- file.path("whatever.xlsx")

if (!file.exists(f_path)) {
  f_path <- file.choose()
}

# I read this twice, temp_read to figure out where the data actually starts...

# Maybe you need something like this - 
#   excel_sheets <- readxl::excel_sheets(f_path)
#   desired_sheet <- which(stringr::str_detect(excel_sheets,"2 Factor Brinson Attribution"))
desired_sheet <- 1
temp_read <- readxl::read_excel(f_path,sheet = desired_sheet)

skip_rows <- NULL
col_skip <- 0
search_string <- "Monthly Returns"
max_cols_to_search <- 10
max_rows_to_search <- 10

# Note, for the - 0, you may need to add/subtract a row if you end up skipping too far later.
while (length(skip_rows) == 0) {
  col_skip <- col_skip + 1
  if (col_skip == max_cols_to_search) break
  skip_rows <- which(stringr::str_detect(temp_read[1:max_rows_to_search,col_skip][[1]],search_string)) - 0

}

# ... now we re-read from the known good starting point.
real_data <- readxl::read_excel(
  f_path,
  sheet = desired_sheet,
  skip = skip_rows
)

# You likely don't need this if you start at the right row
# But given that all weird spreadsheets are weird in their own way
# You may want to operate on the col_skip, maybe like so:
# real_data <- real_data %>%
#   select(-(1:col_skip))
like image 131
Rafael Zayas Avatar answered Oct 02 '22 15:10

Rafael Zayas


Okay, at the format was specified for xls, update from csv to the correctly suggested xls loading.

library(readxl)
data <- readxl::read_excel(".../sampleData.xls", col_types = FALSE)

You would get something similar to:

data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("", 
"Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund", 
"MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"), 
    V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("", 
    "0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1", 
"V2"), class = "data.frame", row.names = c(NA, -7L))

then you can dynamincally filter on the "Monthly return" cell and identify your matrix.

targetCell <- which(data == "Monthly return", arr.ind = T)
returns <- data[(targetCell[1] + 1):nrow(data), (targetCell[2] - 1):targetCell[2]]
like image 25
Tonio Liebrand Avatar answered Oct 02 '22 17:10

Tonio Liebrand



With a general purpose package like readxl, you'll have to read twice, if you want to enjoy automatic type conversion. I assume you have some sort of upper bound on the number of junk rows at the front? Here I assumed that was 10. I'm iterating over worksheets in one workbook, but the code would look pretty similar if iterating over workbooks. I'd write one function to handle a single worksheet or workbook then use lapply() or purrr::map(). This function will encapsulate the skip-learning read and the "real" read.

library(readxl)

two_passes <- function(path, sheet = NULL, n_max = 10) {
  first_pass <- read_excel(path = path, sheet = sheet, n_max = n_max)
  skip <- which(first_pass[[2]] == "Monthly return")
  message("For sheet '", if (is.null(sheet)) 1 else sheet,
          "' we'll skip ", skip, " rows.")  
  read_excel(path, sheet = sheet, skip = skip)
}

(sheets <- excel_sheets("so.xlsx"))
#> [1] "sheet_one" "sheet_two"
sheets <- setNames(sheets, sheets)
lapply(sheets, two_passes, path = "so.xlsx")
#> For sheet 'sheet_one' we'll skip 4 rows.
#> For sheet 'sheet_two' we'll skip 6 rows.
#> $sheet_one
#> # A tibble: 6 × 2
#>         X__1 `Monthly return`
#>       <dttm>            <dbl>
#> 1 2017-03-14          0.00907
#> 2 2017-04-14          0.00069
#> 3 2017-05-14          0.01890
#> 4 2017-06-14          0.00803
#> 5 2017-07-14         -0.01998
#> 6 2017-08-14          0.00697
#> 
#> $sheet_two
#> # A tibble: 6 × 2
#>         X__1 `Monthly return`
#>       <dttm>            <dbl>
#> 1 2017-03-14          0.00907
#> 2 2017-04-14          0.00069
#> 3 2017-05-14          0.01890
#> 4 2017-06-14          0.00803
#> 5 2017-07-14         -0.01998
#> 6 2017-08-14          0.00697
like image 37
jennybryan Avatar answered Oct 02 '22 15:10

jennybryan


In those cases it's important to know the possible conditions of your data. I'm gonna assume that you want only remove columns and rows that doesn't confrom your table.

I have this Excel book: enter image description here

I added 3 blank columns at left becouse when I loaded in R with one column the program omits them. Thats for confirm that R omits empty cols at the left.

First: load data

library(xlsx)
dat <- read.xlsx('book.xlsx', sheetIndex = 1)
head(dat)

            MY.COMPANY.PTY.LTD            NA.
1             MC  Pension Fund           <NA>
2    GROSS PERFORMANCE DETAILS           <NA>
3 updated  by IG on 20/04/2017           <NA>
4                         <NA> Monthly return
5                       Mar-14         0.0097
6                       Apr-14          6e-04

Second: I added some cols with NA and '' values in the case that your data contain some

dat$x2 <- NA
dat$x4 <- NA
head(dat)

            MY.COMPANY.PTY.LTD            NA. x2 x4
1             MC  Pension Fund           <NA> NA NA
2    GROSS PERFORMANCE DETAILS           <NA> NA NA
3 updated  by IG on 20/04/2017           <NA> NA NA
4                         <NA> Monthly return NA NA
5                       Mar-14         0.0097 NA NA
6                       Apr-14          6e-04 NA NA

Third: Remove columns when all values are NA and ''. I have to deal with that kind of problems in past

colSelect <- apply(dat, 2, function(x) !(length(x) == length(which(x == '' | is.na(x)))))
dat2 <- dat[, colSelect]
head(dat2)

            MY.COMPANY.PTY.LTD            NA.
1             MC  Pension Fund           <NA>
2    GROSS PERFORMANCE DETAILS           <NA>
3 updated  by IG on 20/04/2017           <NA>
4                         <NA> Monthly return
5                       Mar-14         0.0097
6                       Apr-14          6e-04

Fourth: Keep only rows with complete observations (it's what I supose from your example)

rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
dat3 <- dat2[rowSelect, ]
head(dat3)

   MY.COMPANY.PTY.LTD     NA.
5              Mar-14  0.0097
6              Apr-14   6e-04
7              May-14  0.0189
8              Jun-14   0.008
9              Jul-14 -0.0199
10             Ago-14 0.00697

Finally if you want to keep the header you can make something like this:

colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])

or

colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))
dat3

    Month Monthly return
5  Mar-14         0.0097
6  Apr-14          6e-04
7  May-14         0.0189
8  Jun-14          0.008
9  Jul-14        -0.0199
10 Ago-14        0.00697
like image 30
gonzalez.ivan90 Avatar answered Oct 02 '22 17:10

gonzalez.ivan90