Skip rows while reading multiple excel worksheets in R

Tags: r
I am using the readxl library to read many worksheets from the same Excel workbook (called data.xlsx). Each worksheet has the following format, with the data (including the header row) starting in row 3:

  row1
  row2
 companyName   1980    1981    1982 ... 2016
 company1       5       6       7        8
 company2       10      20      30       40
 company3       20      40      60       80
 ....

The data range differs in length, by row and by column, from sheet to sheet. However, every sheet has companyName as a common key. The years start at either 1980 or 1990 and run until 2016. The worksheet name is the data name.

I want to create a single Excel file that combines the data extracted from all worksheets, in long format:

 companyName   Year   dataname     values
 company1      1980   sheetname1     5
 company1      1981   sheetname1     6
 company1      1982   sheetname1     7
 company1      ...    sheetname1     ...
 company1      2016   sheetname1     8
 company2      1980   sheetname1     10
 company2      1981   sheetname1     20
 company2      1982   sheetname1     30
 company2      ...    sheetname1     ...
 company2      2016   sheetname1     40
 ....          ....     ...           ...
 company1      2000    sheetname2     xxx
 company1      2001    sheetname2     yyy
  etc
  etc
  etc

This is how far I have managed to get:

  library(tidyverse)
  library(readxl)
  library(data.table)

   # Excel file to read (code adapted from the source cited below)
   file.list <- "data.xlsx"

   # read all sheets (and skip the first two rows)

   df.list <- lapply(file.list,function(x) {
     sheets <- excel_sheets(x)
     dfs <- lapply(sheets, function(y) {
       read_excel(x, sheet = y,skip=2)
       })
     names(dfs) <- sheets
     dfs
   })

I have the following issues:

  • the first two rows are not being skipped
  • how do I create one data frame from selected sheets only (i.e. sheet 5, sheet 10 and sheet 15)?

Thank you for your help.

Source: R: reading multiple excel files, extract first sheet names, and create new column

Beginner asked Mar 06 '18 09:03

2 Answers

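I just removed one level of nesting from df.list by taking the first element of the outer list (there is only one file), so df.list becomes a named list with one data frame per sheet.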

df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})[[1]]

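This works for me. I can't replicate your problem with skip. Also, if the first rows are just blank, read_excel() skips leading empty rows automatically, so skip acts as a lower bound on the number of rows skipped. (trim_ws = TRUE only trims whitespace inside cells; it does not skip rows.)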

I used the following list just to demonstrate what to do after the import.

df.list <- structure(list(sheetname1 = structure(list(companyName = c("company1", 
"company2", "company3"), `1980` = c(5, 10, 40), `1981` = c(6, 
20, 50), `1982` = c(7, 30, 60)), .Names = c("companyName", "1980", 
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame")), sheetname2 = structure(list(companyName = c("company1", 
"company2", "company3"), `1980` = c(6, 11, 42), `1981` = c(7, 
21, 52), `1982` = c(8, 31, 62)), .Names = c("companyName", "1980", 
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame")), sheetname3 = structure(list(companyName = c("company1", 
"company2", "company3"), `1990` = c(8, 12, 43), `1991` = c(9, 
22, 53), `1992` = c(10, 32, 63)), .Names = c("companyName", "1990", 
"1991", "1992"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))), .Names = c("sheetname1", "sheetname2", 
"sheetname3"))

The following works whether the years start at 1980 or 1990, because every column except companyName is gathered.

dat <- lapply(df.list, function(x) {
  # reshape each sheet from wide (one column per year) to long
  x %>% gather(year, values, -companyName)
}) %>% enframe() %>% unnest(cols = value)  # name = sheet name, value = sheet data

dat

# # A tibble: 27 x 4
#    name       companyName year  values
#    <chr>      <chr>       <chr>  <dbl>
#  1 sheetname1 company1    1980      5.
#  2 sheetname1 company2    1980     10.
#  3 sheetname1 company3    1980     40.
#  4 sheetname1 company1    1981      6.
#  5 sheetname1 company2    1981     20.
#  6 sheetname1 company3    1981     50.
#  7 sheetname1 company1    1982      7.
#  8 sheetname1 company2    1982     30.
#  9 sheetname1 company3    1982     60.
# 10 sheetname2 company1    1980      6.
# # ... with 17 more rows
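
The same reshaping can also be written with pivot_longer() and bind_rows(), which newer tidyr and dplyr releases favour over gather() and enframe()/unnest(). This is a minimal sketch rather than the code used for the output above: the small `sheets` list is made-up stand-in data, the names `sheets` and `long` are hypothetical, and the conversion of `year` to integer is an extra step that assumes every non-key column name is a year.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the list returned by the import step:
# one data frame per sheet, with different year ranges.
sheets <- list(
  sheetname1 = tibble(companyName = c("company1", "company2"),
                      `1980` = c(5, 10), `1981` = c(6, 20)),
  sheetname3 = tibble(companyName = c("company1", "company2"),
                      `1990` = c(8, 12), `1991` = c(9, 22))
)

long <- lapply(sheets, function(x) {
  # wide (one column per year) -> long (one row per company and year)
  pivot_longer(x, -companyName, names_to = "year", values_to = "values")
}) %>%
  bind_rows(.id = "dataname") %>%    # the list names become the dataname column
  mutate(year = as.integer(year))    # assumes all non-key column names are years
```

bind_rows(.id = ...) stacks the per-sheet results and records which sheet each row came from, so no separate enframe()/unnest() step is needed.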

You can now pick out the data of a specific sheet by filtering dat on the name column with dplyr::filter().

For example:

dat %>% filter(name == "sheetname1")

#   name       companyName year  values
#   <chr>      <chr>       <chr>  <dbl>
# 1 sheetname1 company1    1980      5.
# 2 sheetname1 company2    1980     10.
# 3 sheetname1 company3    1980     40.
# 4 sheetname1 company1    1981      6.
# 5 sheetname1 company2    1981     20.
# 6 sheetname1 company3    1981     50.
# 7 sheetname1 company1    1982      7.
# 8 sheetname1 company2    1982     30.
# 9 sheetname1 company3    1982     60.
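
If only certain sheets are wanted (the question mentions sheets 5, 10 and 15), another option is to subset the sheet names by position before reading, so the other sheets are never loaded. The following is a sketch under assumptions: `read_selected_sheets()` is a hypothetical helper, not part of readxl, and it assumes every selected sheet has a companyName column and that the requested positions exist in the workbook.

```r
library(readxl)
library(dplyr)
library(tidyr)

# Hypothetical helper: read only the sheets at the given positions and
# return one long data frame with the sheet name in a dataname column.
read_selected_sheets <- function(path, positions, skip = 2) {
  sheets <- excel_sheets(path)[positions]
  dfs <- lapply(setNames(sheets, sheets), function(s) {
    read_excel(path, sheet = s, skip = skip) %>%
      pivot_longer(-companyName, names_to = "year", values_to = "values")
  })
  bind_rows(dfs, .id = "dataname")
}

# Usage with the workbook from the question, if it is present
if (file.exists("data.xlsx")) {
  dat_selected <- read_selected_sheets("data.xlsx", c(5, 10, 15))
}
```

Filtering after the import gives the same rows; subsetting first mainly saves reading time on large workbooks.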
hpesoj626 answered Sep 28 '22 01:09



I would recommend the openxlsx package, which lets you specify startRow, together with melt from the reshape2 package, which converts a data frame to long format easily.

library(openxlsx)
library(reshape2)

first.Row <- 6 # supposing the data starts at row 6 (use 3 for the layout in the question)
sheets.2.read <- loadWorkbook(file.list)$sheet_names # retrieving the sheet names
df <- data.frame()
for(tmp.sheet in sheets.2.read){
  # read one sheet, taking the column names from the first row read
  tmp.dat <- read.xlsx(file.list, sheet = tmp.sheet, startRow = first.Row, colNames = TRUE)
  # reshape to long format and tag every row with its sheet name
  tmp.dat <- cbind(melt(tmp.dat, id.vars = "companyName"), tmp.sheet)
  df <- rbind(df, tmp.dat)
}

Here is my output with some dummy data (printing only 10 rows; note that the key column in my dummy data is called company.name):

> df[c(1:3, 50:53, 300:302),]
    company.name variable     value tmp.sheet
1          comp7     1968 0.3359298    Sheet1
2          comp8     1968 0.3359298    Sheet1
3          comp9     1968 0.3359298    Sheet1
50        comp16     1966 0.3359298    Sheet2
51        comp17     1966 0.3359298    Sheet2
52        comp18     1966 0.3359298    Sheet2
53        comp19     1966 0.3359298    Sheet2
300       comp16     2000 0.3359298    Sheet3
301       comp17     2000 0.3359298    Sheet3
302       comp18     2000 0.3359298    Sheet3
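
A variant of the same openxlsx approach collects the per-sheet pieces in a list and binds them once at the end, instead of growing df inside the loop. This is only a sketch: `read_sheets_long()` is a hypothetical helper, and the defaults (data starting at row 3, a key column named companyName, year-like column names) are assumptions taken from the layout in the question.

```r
library(openxlsx)
library(reshape2)

# Hypothetical helper: read every sheet from `first_row` onwards and
# return one long data frame with the sheet name in a dataname column.
read_sheets_long <- function(path, first_row = 3, id_col = "companyName") {
  sheets <- getSheetNames(path)
  pieces <- lapply(sheets, function(s) {
    wide <- read.xlsx(path, sheet = s, startRow = first_row, colNames = TRUE)
    long <- melt(wide, id.vars = id_col, variable.name = "year")
    # melt() returns the former column names as a factor
    long$year <- as.integer(as.character(long$year))
    long$dataname <- s
    long
  })
  do.call(rbind, pieces)  # bind all sheets in one step
}

# Usage with the workbook from the question, if it is present
if (file.exists("data.xlsx")) {
  df_long <- read_sheets_long("data.xlsx")
}
```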
niko answered Sep 28 '22 01:09