I am using the readxl library to read many worksheets from the same Excel workbook (called data.xlsx), each with the following format:
Data starts in row 3.
row1
row2
companyName 1980 1981 1982 ... 2016
company1 5 6 7 8
company2 10 20 30 40
company3 20 40 60 80
....
The data range varies in both the number of rows and the number of columns, but every sheet has companyName as a common key. The year range starts at 1980 or 1990 and runs until 2016. The worksheet name is the data name.
I want to create a single Excel file that contains the data extracted from all worksheets, in this format:
companyName Year dataname values
company1 1980 sheetname1 5
company1 1981 sheetname1 6
company1 1982 sheetname1 7
company1 ... sheetname1 ...
company1 2016 sheetname1 8
company2 1980 sheetname1 10
company2 1981 sheetname1 20
company2 1982 sheetname1 30
company2 ... sheetname1 ...
company2 2016 sheetname1 40
.... .... ... ...
company1 2000 sheetname2 xxx
company1 2001 sheetname2 yyy
etc
This is how far I have managed to get:
library(tidyverse)
library(readxl)
library(data.table)
#read excel file (from [here][1])
file.list<-"data.xlsx"
# read all sheets (and skip the first two rows)
df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})
I have the following issues:
Thank you for your help.
Source: R: reading multiple excel files, extract first sheet names, and create new column
Several external R packages can read XLSX files with multiple sheets. With readxl, the first step is to call excel_sheets() with the file path to get the names of all worksheets in the workbook.
To import multiple Excel sheets into R, first install the readxl package, then load it with the library() function.
With skip, you can tell R to ignore a specified number of rows at the top of the sheet you are reading. For example, read_excel("data.xlsx", skip = 15) ignores the first 15 rows of the first sheet of "data.xlsx".
To read an Excel file into R, pass its path as an argument to the read_excel() function from the readxl library. To select a specific column, you can use indexing.
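The steps described above can be sketched end to end. This is only a minimal sketch, not the asker's real workbook: it builds a small throwaway workbook with openxlsx (an assumption, any writer would do) that has two filler rows above the header, as in the question, then lists the sheets with excel_sheets() and reads each one with skip = 2. The sheet names and values are made up for illustration.

```r
library(readxl)
library(openxlsx)

# Hypothetical demo workbook: two filler rows, header in row 3
path <- tempfile(fileext = ".xlsx")
wb <- createWorkbook()
for (s in c("sheetname1", "sheetname2")) {
  addWorksheet(wb, s)
  writeData(wb, s, c("row1", "row2"), startRow = 1)  # filler cells in A1:A2
  writeData(wb, s,
            data.frame(companyName = c("company1", "company2"),
                       `1980` = c(5, 10), `1981` = c(6, 20),
                       check.names = FALSE),
            startRow = 3)
}
saveWorkbook(wb, path, overwrite = TRUE)

sheets <- excel_sheets(path)  # all worksheet names
dfs <- lapply(sheets, function(s) read_excel(path, sheet = s, skip = 2))
names(dfs) <- sheets          # keep the sheet name with each data frame
```

Naming the list by sheet keeps the data name attached to each data frame, which is what the later reshaping step relies on.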
I just removed one level of nesting from df.list.
df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})[[1]]
This works for me. I can't replicate your problem with skips. Also, if the rows are just blank rows, read_excel() skips leading empty rows by default, so skip acts as a lower bound. (trim_ws = TRUE only trims whitespace inside cells; it does not drop rows.)
I used the following list just to demonstrate what to do after the import.
df.list <- structure(list(sheetname1 = structure(list(companyName = c("company1",
"company2", "company3"), `1980` = c(5, 10, 40), `1981` = c(6,
20, 50), `1982` = c(7, 30, 60)), .Names = c("companyName", "1980",
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), sheetname2 = structure(list(companyName = c("company1",
"company2", "company3"), `1980` = c(6, 11, 42), `1981` = c(7,
21, 52), `1982` = c(8, 31, 62)), .Names = c("companyName", "1980",
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), sheetname3 = structure(list(companyName = c("company1",
"company2", "company3"), `1990` = c(8, 12, 43), `1991` = c(9,
22, 53), `1992` = c(10, 32, 63)), .Names = c("companyName", "1990",
"1991", "1992"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))), .Names = c("sheetname1", "sheetname2",
"sheetname3"))
The following works even if the years start at 1980 or 1990.
dat <- lapply(df.list, function(x) {
  # reshape each sheet from wide (one column per year) to long
  x %>% gather(year, values, -companyName)
}) %>% enframe() %>% unnest(cols = value)
dat
# # A tibble: 27 x 4
# name companyName year values
# <chr> <chr> <chr> <dbl>
# 1 sheetname1 company1 1980 5.
# 2 sheetname1 company2 1980 10.
# 3 sheetname1 company3 1980 40.
# 4 sheetname1 company1 1981 6.
# 5 sheetname1 company2 1981 20.
# 6 sheetname1 company3 1981 50.
# 7 sheetname1 company1 1982 7.
# 8 sheetname1 company2 1982 30.
# 9 sheetname1 company3 1982 60.
# 10 sheetname2 company1 1980 6.
# # ... with 17 more rows
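As an aside, gather() has since been superseded in tidyr by pivot_longer(). The sketch below is an alternative under that assumption (tidyr 1.1.0 or later, for names_transform), not the answerer's method. It reshapes each sheet, stacks the results with bind_rows(), and converts the year to an integer. The two small tibbles stand in for df.list, and their values are illustrative. Note that pivot_longer() orders rows by company rather than by year.

```r
library(dplyr)
library(tidyr)

# Stand-in for df.list: one tibble per sheet, with different year ranges
sheets <- list(
  sheetname1 = tibble(companyName = c("company1", "company2"),
                      `1980` = c(5, 10), `1981` = c(6, 20)),
  sheetname3 = tibble(companyName = c("company1", "company2"),
                      `1990` = c(8, 12), `1991` = c(9, 22))
)

long <- lapply(sheets, function(x) {
  pivot_longer(x, cols = -companyName,
               names_to = "year", values_to = "values",
               names_transform = list(year = as.integer))
}) %>%
  bind_rows(.id = "dataname")  # list names become the dataname column
```

Because each sheet is reshaped on its own before stacking, sheets that start in 1980 and sheets that start in 1990 can be combined without padding the missing years.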
You can now pull out the data for a specific sheet with dplyr::filter().
For example:
dat %>% filter(name == "sheetname1")
# name companyName year values
# <chr> <chr> <chr> <dbl>
# 1 sheetname1 company1 1980 5.
# 2 sheetname1 company2 1980 10.
# 3 sheetname1 company3 1980 40.
# 4 sheetname1 company1 1981 6.
# 5 sheetname1 company2 1981 20.
# 6 sheetname1 company3 1981 50.
# 7 sheetname1 company1 1982 7.
# 8 sheetname1 company2 1982 30.
# 9 sheetname1 company3 1982 60.
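The question asks for the combined data in a single Excel file. One way to do that, sketched here with openxlsx::write.xlsx() (an assumption; writexl or another writer would work as well), is to write the long table to one sheet. The small data frame stands in for dat, and the output path is a temporary file chosen for illustration.

```r
library(openxlsx)

# Stand-in for the long table `dat` built above (values are illustrative)
dat_demo <- data.frame(
  name        = c("sheetname1", "sheetname1", "sheetname2"),
  companyName = c("company1", "company2", "company1"),
  year        = c("1980", "1980", "1980"),
  values      = c(5, 10, 6)
)

out_path <- tempfile(fileext = ".xlsx")  # hypothetical output location
write.xlsx(dat_demo, out_path)

# Read the file back to confirm the round trip
check <- read.xlsx(out_path)
```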
I would recommend the package openxlsx, which allows you to specify startRow, and melt from the package reshape2, which makes it easy to change a data frame to long format.
library(openxlsx)
library(reshape2)
# file.list is the workbook path defined in the question ("data.xlsx")
first.Row <- 6 # supposing the data starts at row 6
sheets.2.read <- loadWorkbook(file.list)$sheet_names # retrieving the sheet names
df <- data.frame()
for (tmp.sheet in sheets.2.read) {
  tmp.dat <- read.xlsx(file.list, sheet = tmp.sheet, startRow = first.Row, colNames = TRUE)
  # reshape to long format and append the sheet name as a column
  tmp.dat <- cbind(melt(tmp.dat, id.vars = "companyName"), tmp.sheet)
  df <- rbind(df, tmp.dat)
}
Here is my output with some dummy data (printing only 10 rows):
> df[c(1:3, 50:53, 300:302),]
company.name variable value tmp.sheet
1 comp7 1968 0.3359298 Sheet1
2 comp8 1968 0.3359298 Sheet1
3 comp9 1968 0.3359298 Sheet1
50 comp16 1966 0.3359298 Sheet2
51 comp17 1966 0.3359298 Sheet2
52 comp18 1966 0.3359298 Sheet2
53 comp19 1966 0.3359298 Sheet2
300 comp16 2000 0.3359298 Sheet3
301 comp17 2000 0.3359298 Sheet3
302 comp18 2000 0.3359298 Sheet3
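One note on the loop above: calling rbind() on a growing data frame copies it on every iteration, which can get slow with many sheets. A common alternative, sketched here as an assumption rather than as part of the answer, is to build a list of melted data frames and bind them once. To keep the example runnable without a workbook, the in-memory list `sheets` stands in for the result of read.xlsx() on each sheet.

```r
library(reshape2)

# Stand-in for one data frame per worksheet (values are illustrative)
sheets <- list(
  Sheet1 = data.frame(companyName = c("comp1", "comp2"),
                      `1980` = c(1, 2), `1981` = c(3, 4), check.names = FALSE),
  Sheet2 = data.frame(companyName = c("comp1", "comp2"),
                      `1990` = c(5, 6), check.names = FALSE)
)

pieces <- lapply(names(sheets), function(s) {
  long <- melt(sheets[[s]], id.vars = "companyName")
  long$variable <- as.character(long$variable)  # factor -> character before binding
  long$dataname <- s
  long
})
df <- do.call(rbind, pieces)  # bind once instead of inside the loop
```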