I'm trying to scrape an HTML table from a website using rvest. The only problem is that the table I'm trying to scrape doesn't have opening <tr> tags, except on the first row. It looks like this:
<tr>
<td>6/21/2015 9:38 PM</td>
<td>5311 Lake Park</td>
<td>UCPD</td>
<td>African American</td>
<td>Male</td>
<td>Subject was causing a disturbance in the area.</td>
<td>Name checked; no further action</td>
<td>No</td>
</tr>
<td>6/21/2015 10:37 PM</td>
<td>5200 S Blackstone</td>
<td>UCPD</td>
<td>African American</td>
<td>Male</td>
<td>Subject was observed fighting in the McDonald's parking lot</td>
<td>Warned; released</td>
<td>No</td>
</tr>
And so on. So, using the following code, I'm only able to get the first row into my data frame:
library(rvest)
mydata <- html_session("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015") %>%
html_node("table") %>%
html_table(header = TRUE, fill=TRUE)
How can I alter this so that html_table understands that the rows are rows, even if they don't have an opening <tr> tag? Or is there a better way to go about this?
library(rvest)
url_parse<- read_html("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015")
col_name<- url_parse %>%
html_nodes("th") %>%
html_text()
mydata <- url_parse %>%
html_nodes("td") %>%
html_text()
finaldata <- data.frame(matrix(mydata, ncol=7, byrow=TRUE))
names(finaldata) <- col_name
finaldata
Incident Location Reported Occurred
1 Theft 1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM 5/31/15 to 6/1/15 8:00 PM to 12:00 PM
2 Information 5835 S. Kimbark 6/1/15 3:57 PM 6/1/15 3:55 PM
3 Information 1025 E. 58th St. (Swift) 6/2/15 2:18 AM 6/2/15 2:18 AM
4 Non-Criminal Damage to Property 850 E. 63rd St. (Car Wash) 6/2/15 8:48 AM 6/2/15 8:00 AM
5 Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure) 6/2/15 7:32 PM 6/2/15 6:45 PM to 7:30 PM
Comments / Nature of Fire Disposition
1 Bicycle secured to bike rack taken by unknown person Open
2 Unknown person used staff member's personal information to file a fraudulent claim with U.S. Social Security Admin. / CPD case CPD
3 Three unaffiliated individuals reported tampering with bicycles in bike rack / Subjects were given trespass warnings and sent on their way Closed
4 Rear wiper blade assembly damaged on UC owned vehicle during car wash Closed
5 Unknown person(s) spray painted graffiti on north concrete wall of the structure Open
UCPDI#
1 E00344
2 E00345
3 E00346
4 E00347
5 E00348
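The reshaping step above works only if every row has the same number of cells. Here is a minimal sketch of the same idea that can be run offline; the inline `sample_html` string and its three column names are made up for illustration, modelled on the markup in the question. It takes the column count from the header cells instead of hard-coding 7, and stops early if the cells don't divide evenly into rows.

```r
library(rvest)

# Hypothetical inline sample modelled on the markup in the question:
# only the first data row has an opening <tr>.
sample_html <- "<table>
<thead><tr><th>Date</th><th>Location</th><th>Agency</th></tr></thead>
<tbody>
<tr><td>6/21/2015 9:38 PM</td><td>5311 Lake Park</td><td>UCPD</td></tr>
<td>6/21/2015 10:37 PM</td><td>5200 S Blackstone</td><td>UCPD</td></tr>
</tbody></table>"

page <- read_html(sample_html)

# Header cells give the column names and the column count
headers <- page %>% html_nodes("th") %>% html_text()

# All data cells in document order, regardless of which <tr> they sit in
cells <- page %>% html_nodes("td") %>% html_text()

# Guard: the flat vector must split evenly into rows
stopifnot(length(cells) %% length(headers) == 0)

# Fill the matrix row by row, then name the columns
rows <- matrix(cells, ncol = length(headers), byrow = TRUE)
result <- setNames(as.data.frame(rows, stringsAsFactors = FALSE), headers)
result
```

Taking the column count from the `th` cells means the code doesn't silently misalign the data if the site adds or removes a column.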
Slightly different approach than @user227710's, but generally the same. It similarly exploits the fact that the number of TDs per row is uniform.
However, this also grabs all the incidents (it rbinds each page into one incidents data frame).
The pblapply just gives you progress bars, since this takes a few seconds. It's only useful in an interactive session.
library(rvest)
library(stringr)
library(dplyr)
library(pbapply)
url <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"
pg <- read_html(url)
pg %>%
html_nodes("li.page-count") %>%
html_text() %>%
str_trim() %>%
str_split(" / ") %>%
unlist %>%
as.numeric %>%
.[2] -> total_pages
pblapply(1:(total_pages), function(j) {
# get "column names"
# NOTE that you get legit column names for use with "regular"
# data frames this way
pg %>%
html_nodes("thead > tr > th") %>%
html_text() %>%
make.names -> tcols
# get all the TDs
pg %>%
html_nodes("td") %>%
as_list() -> tds
# how many rows do we have? (should be 5, but you never know)
trows <- length(tds) / 7
# the basic idea is to grab all the TDs for each row
# then cbind them together and then rbind the whole thing
# while keeping decent column names
bind_rows(lapply(1:trows, function(i) {
setNames(cbind.data.frame(lapply(1:7, function(j) {
html_text(tds[[(i-1)*7 + j]])
}), stringsAsFactors=FALSE), tcols)
})) -> curr_tbl
# get next url
pg %>%
html_nodes("li.next > a") %>%
html_attr("href") -> next_url
if (j < total_pages) {
pg <<- read_html(sprintf("https://incidentreports.uchicago.edu/%s", next_url))
}
curr_tbl
}) %>% bind_rows -> incidents
incidents
## Source: local data frame [62 x 7]
##
## Incident Location Reported
## 1 Theft 1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM
## 2 Information 5835 S. Kimbark 6/1/15 3:57 PM
## 3 Information 1025 E. 58th St. (Swift) 6/2/15 2:18 AM
## 4 Non-Criminal Damage to Property 850 E. 63rd St. (Car Wash) 6/2/15 8:48 AM
## 5 Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure) 6/2/15 7:32 PM
## 6 Information / Aggravated Robbery 4701 S. Ellis (Public Way) 6/3/15 2:11 AM
## 7 Lost Property 5800 S. University (Main Quad) 6/3/15 8:30 AM
## 8 Criminal Damage to Property 5505 S. Ellis (Parking Structure) 5/29/15 5:00 PM
## 9 Information / Armed Robbery 6300 S. Cottage Grove (Public Way) 6/3/15 2:33 PM
## 10 Lost Property 1414 E. 59th St. (I-House) 6/3/15 2:28 PM
## .. ... ... ...
## Variables not shown: Occurred (chr), Comments...Nature.of.Fire (chr), Disposition (chr), UCPDI. (chr)
Thanks everyone! I ended up getting some help offline from another R user, who suggested the following solution. It downloads the HTML, saves it, adds in the missing <tr> tags (much like @Bram Vanroy suggested), and turns it back into an HTML object, which can then be scraped into a data frame.
library(rvest)
myurl <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"
download.file(myurl, destfile="myfile.html", method="curl")
myhtml <- readChar("myfile.html", file.info("myfile.html")$size)
myhtml <- gsub("</tr>", "</tr><tr>", myhtml, fixed = TRUE)
mydata <- read_html(myhtml)  # html() was deprecated in favor of read_html()
mydf <- mydata %>%
html_node("table") %>%
html_table(fill = TRUE)
mydf <- na.omit(mydf)
The last line is to omit some weird NA rows that show up with this method.
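For anyone who wants to try the tag-repair idea without hitting the site, here is a rough sketch on a made-up two-row string (`broken` is hypothetical, not the real page). It also shows one possible reason for the odd rows: the replacement adds an opening <tr> after the last </tr> too, which leaves an empty row at the end. Instead of na.omit, this sketch keeps only the rows that actually contain cells; that filtering step is my own assumption, not part of the solution above.

```r
library(rvest)

# Hypothetical two-row sample: the second row lacks its opening <tr>
broken <- "<table><tr><td>a</td><td>b</td></tr><td>c</td><td>d</td></tr></table>"

# Insert an opening <tr> after every closing </tr>
repaired <- gsub("</tr>", "</tr><tr>", broken, fixed = TRUE)

rows <- read_html(repaired) %>% html_nodes("tr")

# The <tr> inserted after the final </tr> has no cells, so drop empty rows
has_cells <- vapply(rows, function(r) length(html_nodes(r, "td")) > 0, logical(1))

# One character vector of cell text per remaining row
cells_by_row <- lapply(rows[has_cells], function(r) html_text(html_nodes(r, "td")))
cells_by_row
```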