I'm trying to scrape an HTML table from a website using rvest. The only problem is that the table I'm trying to scrape doesn't have <tr> tags, except on the first row. It looks like this:
<tr>
<td>6/21/2015 9:38 PM</td>
<td>5311 Lake Park</td>
<td>UCPD</td>
<td>African American</td>
<td>Male</td>
<td>Subject was causing a disturbance in the area.</td>
<td>Name checked; no further action</td>
<td>No</td>
</tr>
<td>6/21/2015 10:37 PM</td>
<td>5200 S Blackstone</td>
<td>UCPD</td>
<td>African American</td>
<td>Male</td>
<td>Subject was observed fighting in the McDonald's parking lot</td>
<td>Warned; released</td>
<td>No</td>
</tr>
And so on. So, using the following code, I'm only able to get the first row into my data frame:
library(rvest)
mydata <- html_session("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015") %>%
  html_node("table") %>%
  html_table(header = TRUE, fill = TRUE)
How can I alter this to get html_table to understand that the rows are rows, even if they don't have an opening <tr> tag? Or is there a better way to go about this?
library(rvest)

url_parse <- read_html("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015")

# column headers from the <th> cells
col_name <- url_parse %>%
  html_nodes("th") %>%
  html_text()

# all <td> cells, flattened in row order
mydata <- url_parse %>%
  html_nodes("td") %>%
  html_text()

# the table has 7 columns, so reshape the flat vector row by row
finaldata <- data.frame(matrix(mydata, ncol = 7, byrow = TRUE))
names(finaldata) <- col_name
finaldata
Incident Location
Reported Occurred
1 Theft 1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM 5/31/15 to 6/1/15 8:00 PM to 12:00 PM
2 Information 5835 S. Kimbark 6/1/15 3:57 PM 6/1/15 3:55 PM
3 Information 1025 E. 58th St. (Swift) 6/2/15 2:18 AM 6/2/15 2:18 AM
4 Non-Criminal Damage to Property 850 E. 63rd St. (Car Wash) 6/2/15 8:48 AM 6/2/15 8:00 AM
5 Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure) 6/2/15 7:32 PM 6/2/15 6:45 PM to 7:30 PM
Comments / Nature of Fire Disposition
1 Bicycle secured to bike rack taken by unknown person Open
2 Unknown person used staff member's personal information to file a fraudulent claim with U.S. Social Security Admin. / CPD case CPD
3 Three unaffiliated individuals reported tampering with bicycles in bike rack / Subjects were given trespass warnings and sent on their way Closed
4 Rear wiper blade assembly damaged on UC owned vehicle during car wash Closed
5 Unknown person(s) spray painted graffiti on north concrete wall of the structure Open
UCPDI#
1 E00344
2 E00345
3 E00346
4 E00347
5 E00348
A slightly different approach from @user227710's, but generally the same idea: it similarly exploits the fact that the number of TDs per row is uniform.
However, this one also grabs all the incidents, rbinding each page into one incidents data frame.
The pblapply just gives you progress bars, since this takes a few seconds; it's totally unnecessary outside an interactive session.
library(rvest)
library(stringr)
library(dplyr)
library(pbapply)

url <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"

pg <- read_html(url)

# total page count from the "x / y" pager text
pg %>%
  html_nodes("li.page-count") %>%
  html_text() %>%
  str_trim() %>%
  str_split(" / ") %>%
  unlist() %>%
  as.numeric() %>%
  .[2] -> total_pages

pblapply(1:total_pages, function(j) {

  # get "column names"
  # NOTE that you get legit column names for use with "regular"
  # data frames this way
  pg %>%
    html_nodes("thead > tr > th") %>%
    html_text() %>%
    make.names() -> tcols

  # get all the TDs
  pg %>%
    html_nodes("td") %>%
    as_list() -> tds

  # how many rows do we have? (should be 5, but you never know)
  trows <- length(tds) / 7

  # the basic idea is to grab all the TDs for each row,
  # then cbind them together and then rbind the whole thing,
  # all while keeping decent column names
  bind_rows(lapply(1:trows, function(i) {
    setNames(cbind.data.frame(lapply(1:7, function(k) {
      html_text(tds[[(i - 1) * 7 + k]])
    }), stringsAsFactors = FALSE), tcols)
  })) -> curr_tbl

  # get the next page's URL
  pg %>%
    html_nodes("li.next > a") %>%
    html_attr("href") -> next_url

  if (j < total_pages) {
    pg <<- read_html(sprintf("https://incidentreports.uchicago.edu/%s", next_url))
  }

  curr_tbl

}) %>% bind_rows() -> incidents

incidents
## Source: local data frame [62 x 7]
##
## Incident Location Reported
## 1 Theft 1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM
## 2 Information 5835 S. Kimbark 6/1/15 3:57 PM
## 3 Information 1025 E. 58th St. (Swift) 6/2/15 2:18 AM
## 4 Non-Criminal Damage to Property 850 E. 63rd St. (Car Wash) 6/2/15 8:48 AM
## 5 Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure) 6/2/15 7:32 PM
## 6 Information / Aggravated Robbery 4701 S. Ellis (Public Way) 6/3/15 2:11 AM
## 7 Lost Property 5800 S. University (Main Quad) 6/3/15 8:30 AM
## 8 Criminal Damage to Property 5505 S. Ellis (Parking Structure) 5/29/15 5:00 PM
## 9 Information / Armed Robbery 6300 S. Cottage Grove (Public Way) 6/3/15 2:33 PM
## 10 Lost Property 1414 E. 59th St. (I-House) 6/3/15 2:28 PM
## .. ... ... ...
## Variables not shown: Occurred (chr), Comments...Nature.of.Fire (chr), Disposition (chr), UCPDI. (chr)
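The page-count parse at the top of that answer can be sketched on a toy string (the `" 1 / 13 "` value is hypothetical; on the live page it would be whatever text sits inside the `li.page-count` node):

```r
library(stringr)

# Hypothetical text content of the li.page-count pager node
txt <- "  1 / 13  "

# trim, split on " / ", and take the second piece as the page total
total_pages <- as.numeric(unlist(str_split(str_trim(txt), " / ")))[2]
total_pages
```

The same trim/split/index chain is what the piped version in the answer does; writing it flat just makes the three steps easier to see.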
Thanks everyone! I ended up getting some help from another R user offline who suggested the following solution. It downloads the HTML, saves it, adds in the missing <tr> tags (much like @Bram Vanroy suggested), and turns it back into an HTML object, which can then be scraped into a data frame.
library(rvest)

myurl <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"

# save the raw HTML, then read it back in as one string
download.file(myurl, destfile = "myfile.html", method = "curl")
myhtml <- readChar("myfile.html", file.info("myfile.html")$size)

# reinsert the missing opening <tr> after every closing </tr>
myhtml <- gsub("</tr>", "</tr><tr>", myhtml, fixed = TRUE)

# reparse and scrape as usual (html() is deprecated; use read_html())
mydata <- read_html(myhtml)
mydf <- mydata %>%
  html_node("table") %>%
  html_table(fill = TRUE)
mydf <- na.omit(mydf)
The last line omits some empty NA rows that this method produces.
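Those NA rows come from the gsub itself: inserting `<tr>` after *every* `</tr>` also tacks one on after the table's final row, leaving a dangling empty `<tr>` that `html_table()` renders as an all-NA row. A toy fragment (not the real page markup) makes this visible:

```r
# Toy fragment showing the side effect of the gsub fix
frag <- "<td>a</td></tr><td>b</td></tr>"
fixed <- gsub("</tr>", "</tr><tr>", frag, fixed = TRUE)
fixed
# the result ends in a dangling empty "<tr>" -- hence the na.omit() above
```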