I am scraping data from wikipedia and putting it into a data.frame. However, I cannot subset the resulting data.frame.
If I use dput to reload the data into another variable then the subset works fine. I am unclear if I am doing something wrong, or if there is a bug somewhere in R or one of the packages I am using. Here is a reproducible example.
Step 1: Load the data into reps
library(rvest)
library(xml2)
url = "https://en.m.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives"
file = xml2::read_html(url)
tables = rvest::html_nodes(file, "table")
reps = rvest::html_table(tables[6])
reps = as.data.frame(reps)[1,1:3]
reps$District
# [1] "Alabama 1"
# I expected this line to return TRUE
reps$District == "Alabama 1"
# [1] FALSE
# Because the above line returns FALSE, this code returns an empty data.frame
reps[reps$District=="Alabama 1",]
# [1] District Member Party
# <0 rows> (or 0-length row.names)
What is particularly strange is that if I use dput to write and reload the data, the subset works fine:
dput(reps)
# structure(list(District = "Alabama 1", Member = "Bradley Byrne",
# Party = NA), row.names = 1L, class = "data.frame")
x=structure(list(District = "Alabama 1", Member = "Bradley Byrne",
Party = NA), row.names = 1L, class = "data.frame")
# now it's TRUE!
x$District=="Alabama 1"
# [1] TRUE
# and so the subset works
x[x$District == "Alabama 1", ]
# District Member Party
# 1 Alabama 1 Bradley Byrne NA
I believe that I am using the latest version of R and all the packages:
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] httr_1.4.2 compiler_4.0.2 selectr_0.4-2 magrittr_1.5 R6_2.4.1 tools_4.0.2
[7] curl_4.3 xml2_1.3.2 stringi_1.4.6 stringr_1.4.0 rvest_0.3.6
\u00A0 will match the , so you can replace all of those with gsub.
reps$District <- gsub("\u00A0", " ", reps$District, fixed = TRUE)
Full code:
library(rvest)
library(xml2)
url = "https://en.m.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives"
file = xml2::read_html(url)
tables = rvest::html_nodes(file, "table")
reps = rvest::html_table(tables[6])
reps = as.data.frame(reps)[1,1:3]
reps$District <- gsub("\u00A0", " ", reps$District, fixed = TRUE)
reps$District == "Alabama 1"
# [1] TRUE
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With