I'm importing xlsx 2007 tables into R 3.2.1 patched using package readxl 0.1.0 under Windows 7 64-bit. The tables' size is on the order of 25,000 rows by 200 columns.
Function read_excel() works a treat. My only problem is with its assignment of column class (datatype) to sparsely populated columns. For example, a given column may be NA for 20,000 rows and then take a character value on row 20,001. read_excel() appears to default to column type numeric when it scans the first n rows of a column and finds only NAs. The data causing the problem are characters in a column assigned type numeric; when the error limit is reached, execution halts. I actually want the data in the sparse columns, so setting the error limit higher isn't a solution.
I can identify the troublesome columns by reviewing the warnings thrown. And read_excel() has an option for asserting a column's datatype via the col_types argument, which, according to the package docs, is:

Either NULL to guess from the spreadsheet, or a character vector containing "blank", "numeric", "date" or "text".
But does this mean I have to construct a vector of length 200, populated in almost every position with "blank" and with "text" in a handful of positions corresponding to the offending columns?
There's probably a way of doing this in a couple of lines of R code: create a vector of the required length and fill it with "blank"s; perhaps create another vector containing the numbers of the columns to be forced to "text", and then ... Or maybe it's possible to tell read_excel() just the columns for which its guesses aren't as desired.
I'd appreciate any suggestions.
Thanks in advance.
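For what it's worth, the vector-building approach described in the question can be sketched in a couple of lines of base R. The column count of 200 matches the question; the offending column indices 57 and 102 are made up purely for illustration:

```r
n_cols <- 200                 # total number of columns in the sheet
text_cols <- c(57, 102)       # hypothetical indices of the offending columns

# Fill the whole vector with "blank", then overwrite the few
# positions that must be forced to "text":
col_types <- rep("blank", n_cols)
col_types[text_cols] <- "text"

# The result can then be passed as the col_types argument of read_excel().
```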
New solution since readxl version 1.x:

The solution in the currently preferred answer no longer works with versions of readxl newer than 0.1.0, since the package-internal function readxl:::xlsx_col_types it relied on no longer exists.

The new solution is to use the newly introduced parameter guess_max to increase the number of rows used to "guess" the appropriate data type of the columns:
read_excel("My_Excel_file.xlsx", sheet = 1, guess_max = 1048576)
The value 1,048,576 is the maximum number of rows currently supported by Excel; see the Excel specifications: https://support.office.com/en-us/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3
PS: If you are worried about the performance cost of using all rows to guess the data types: read_excel seems to read the file only once, and the guessing is then done in memory, so the performance penalty is very small compared to the work saved.
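For readxl 1.x there is also a middle path between guess_max and a fully hand-built vector: col_types in readxl 1.x accepts the value "guess", so most columns can still be auto-detected while only the troublesome ones are pinned to "text". A minimal sketch (the column count and indices are hypothetical):

```r
# readxl >= 1.0 accepts "guess" as a col_types value, so most columns
# remain auto-detected while the troublesome ones are forced to text.
n_cols <- 200
col_types <- rep("guess", n_cols)
col_types[c(57, 102)] <- "text"   # hypothetical offending columns

# dat <- readxl::read_excel("My_Excel_file.xlsx", col_types = col_types)
```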
It depends on whether your data is sparse in different places in different columns, and on how sparse it is. I found that having more rows didn't improve the parsing: the majority were still blank and interpreted as text, even if later on they became dates, etc.
One work-around is to generate the first data row of your excel table to include representative data for every column, and use that to guess column types. I don't like this because I want to leave the original data intact.
Another workaround, if you have complete rows somewhere in the spreadsheet, is to use nskip instead of n. This gives the starting point for the column guessing. Say data row 117 has a full set of data:
readxl:::xlsx_col_types(path = "a.xlsx", nskip = 116, n = 1)
Note that you can call the function directly, without having to edit the function in the namespace.
You can then use the vector of spreadsheet types to call read_excel:
col_types <- readxl:::xlsx_col_types(path = "a.xlsx", nskip = 116, n = 1)
dat <- readxl::read_excel(path = "a.xlsx", col_types = col_types)
Then you can manually update any columns which it still gets wrong.
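That manual update can be as simple as overwriting entries of the col_types vector before the read_excel() call. A sketch with a simulated types vector (in practice it would come from the readxl:::xlsx_col_types() call above; the specific values here are invented for illustration):

```r
# Simulated result of the type-guessing step; in practice this comes
# from readxl:::xlsx_col_types(path = "a.xlsx", nskip = 116, n = 1).
col_types <- c("numeric", "text", "blank", "date", "blank")

# Force any column still reported as "blank" to "text":
col_types[col_types == "blank"] <- "text"

# dat <- readxl::read_excel(path = "a.xlsx", col_types = col_types)
```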