 

R - Scraping an HTML table with rvest when there are missing <tr> tags

I'm trying to scrape an HTML table from a website using rvest. The only problem is that the table I'm trying to scrape doesn't have <tr> tags, except on the first row. It looks like this:

<tr> 
  <td>6/21/2015 9:38 PM</td>
  <td>5311 Lake Park</td>
  <td>UCPD</td>
  <td>African American</td>
  <td>Male</td>
  <td>Subject was causing a disturbance in the area.</td>
  <td>Name checked; no further action</td>
  <td>No</td>
</tr>

  <td>6/21/2015 10:37 PM</td>
  <td>5200 S Blackstone</td>
  <td>UCPD</td>
  <td>African American</td>
  <td>Male</td>
  <td>Subject was observed fighting in the McDonald's parking lot</td>
  <td>Warned; released</td>
  <td>No</td>
</tr>

And so on. So, using the following code, I'm only able to get the first row into my data frame:

library(rvest)
mydata <- html_session("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015") %>%
    html_node("table") %>%
    html_table(header = TRUE, fill=TRUE)

How can I alter this to get html_table to understand that the rows are rows, even if they don't have an opening <tr> tag? Or is there a better way to go about this?

jonahshai asked Jun 22 '15 20:06



3 Answers

library(rvest)

url_parse<- read_html("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015") 

col_name<- url_parse %>%
  html_nodes("th") %>%
  html_text()

mydata <- url_parse %>%
  html_nodes("td") %>%
  html_text()

finaldata <- data.frame(matrix(mydata, ncol=7, byrow=TRUE))

names(finaldata) <- col_name

finaldata

                         Incident                                  Location        Reported                              Occurred
1                           Theft       1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM 5/31/15 to 6/1/15 8:00 PM to 12:00 PM
2                     Information                          5835 S. Kimbark   6/1/15 3:57 PM                        6/1/15 3:55 PM
3                     Information                  1025 E. 58th St. (Swift)  6/2/15 2:18 AM                        6/2/15 2:18 AM
4 Non-Criminal Damage to Property                850 E. 63rd St. (Car Wash)  6/2/15 8:48 AM                        6/2/15 8:00 AM
5     Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure)  6/2/15 7:32 PM             6/2/15 6:45 PM to 7:30 PM
                                                                                                                   Comments / Nature of Fire Disposition
1                                                                                       Bicycle secured to bike rack taken by unknown person        Open
2             Unknown person used staff member's personal information to file a fraudulent claim with U.S. Social Security Admin. / CPD case         CPD
3 Three unaffiliated individuals reported tampering with bicycles in bike rack / Subjects were given trespass warnings and sent on their way      Closed
4                                                                      Rear wiper blade assembly damaged on UC owned vehicle during car wash      Closed
5                                                           Unknown person(s) spray painted graffiti on north concrete wall of the structure        Open
  UCPDI#
1 E00344
2 E00345
3 E00346
4 E00347
5 E00348
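
The same idea can be tried offline on a small inline table. This is only a sketch: the HTML string below is made up to mimic the missing <tr> problem, and the names html_txt, cells and demo are placeholders. It also takes the column count from the number of <th> cells instead of hard-coding 7, which assumes every row has exactly one cell per header.

```r
library(rvest)

# Hypothetical two-row table with the same defect as in the question:
# only the first data row has an opening <tr>.
html_txt <- paste0(
  "<table><thead><tr><th>Incident</th><th>Disposition</th></tr></thead>",
  "<tbody><tr><td>Theft</td><td>Open</td></tr>",
  "<td>Information</td><td>CPD</td></tr></tbody></table>"
)
doc <- read_html(html_txt)

col_name <- doc %>% html_nodes("th") %>% html_text()
cells    <- doc %>% html_nodes("td") %>% html_text()

# Fail early if the cells do not divide evenly into rows
stopifnot(length(cells) %% length(col_name) == 0)

# Fill the matrix row by row, one column per header cell
demo <- data.frame(
  matrix(cells, ncol = length(col_name), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(demo) <- col_name
demo
```

If the page ever mixes in cells that span several columns, the divisibility check should catch it before the rows get silently misaligned.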
user227710 answered Nov 15 '22 03:11



A slightly different approach from @user227710's, but the same general idea. It likewise exploits the fact that every row has the same number of TDs.

However, this also grabs all the incidents (rbinds each page into one incidents data frame).

The pblapply call just gives you a progress bar, since this takes a few seconds. It is only useful in an interactive session; plain lapply works just as well otherwise.

library(rvest)
library(stringr)
library(dplyr)
library(pbapply)

url <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"
pg <- read_html(url)

pg %>% 
  html_nodes("li.page-count") %>% 
  html_text() %>% 
  str_trim() %>% 
  str_split(" / ") %>%
  unlist %>% 
  as.numeric %>% 
  .[2] -> total_pages

pblapply(1:(total_pages), function(j) {

  # get "column names"
  # NOTE that you get legit column names for use with "regular" 
  # data frames this way

  pg %>% 
    html_nodes("thead > tr > th") %>% 
    html_text() %>% 
    make.names -> tcols

  # get all the TDs

  # (kept as a node set, so html_text() can be called on each cell later)
  pg %>% 
    html_nodes("td") -> tds

  # how many rows do we have? (should be 5, but you never know)

  trows <- length(tds) / 7

  # the basic idea is to grab all the TDs for each row
  # then cbind them together and then rbind the whole thing
  # while keeping decent column names

  bind_rows(lapply(seq_len(trows), function(i) {
    # k is the column index within row i (named k so it does not shadow the page index j)
    setNames(cbind.data.frame(lapply(1:7, function(k) { 
      html_text(tds[[(i-1)*7 + k]])
    }), stringsAsFactors=FALSE), tcols)
  })) -> curr_tbl

  # get next url

  pg %>% 
    html_nodes("li.next > a") %>% 
    html_attr("href") -> next_url

  if (j < total_pages) {
    pg <<- read_html(sprintf("https://incidentreports.uchicago.edu/%s", next_url))
  }

  curr_tbl

}) %>% bind_rows -> incidents

incidents

## Source: local data frame [62 x 7]
## 
##                            Incident                                  Location        Reported
## 1                             Theft       1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM
## 2                       Information                          5835 S. Kimbark   6/1/15 3:57 PM
## 3                       Information                  1025 E. 58th St. (Swift)  6/2/15 2:18 AM
## 4   Non-Criminal Damage to Property                850 E. 63rd St. (Car Wash)  6/2/15 8:48 AM
## 5       Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure)  6/2/15 7:32 PM
## 6  Information / Aggravated Robbery                4701 S. Ellis (Public Way)  6/3/15 2:11 AM
## 7                     Lost Property           5800 S. University  (Main Quad)  6/3/15 8:30 AM
## 8       Criminal Damage to Property         5505 S. Ellis (Parking Structure) 5/29/15 5:00 PM
## 9       Information / Armed Robbery        6300 S. Cottage Grove (Public Way)  6/3/15 2:33 PM
## 10                    Lost Property                1414 E. 59th St. (I-House)  6/3/15 2:28 PM
## ..                              ...                                       ...             ...
## Variables not shown: Occurred (chr), Comments...Nature.of.Fire (chr), Disposition (chr), UCPDI. (chr)
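
The page-count step can be checked on its own without hitting the site. This is a rough sketch that assumes the pager text looks like "1 / 13", which is what the str_split(" / ") call above implies; the HTML fragment and the names pager, count_txt and nums are invented for illustration.

```r
library(rvest)

# Hypothetical pager fragment; the real page's markup may differ
pager <- read_html('<ul><li class="page-count"> 1 / 13 </li></ul>')

count_txt <- pager %>%
  html_node("li.page-count") %>%
  html_text()

# Pull out every run of digits; the last one is taken as the page total
nums <- as.integer(regmatches(count_txt, gregexpr("[0-9]+", count_txt))[[1]])
total_pages <- nums[length(nums)]
total_pages
```

Extracting the digits directly avoids depending on the exact spacing around the slash.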
hrbrmstr answered Nov 15 '22 05:11



Thanks everyone! I ended up getting some help offline from another R user, who suggested the following solution. It downloads the HTML, saves it, adds in the missing <tr> tags (much like @Bram Vanroy suggested), and parses the result back into an HTML object, which can then be scraped into a data frame.

library(rvest)
myurl <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"
download.file(myurl, destfile="myfile.html", method="curl")
myhtml <- readChar("myfile.html", file.info("myfile.html")$size)
myhtml <- gsub("</tr>", "</tr><tr>", myhtml, fixed = TRUE)
mydata <- read_html(myhtml)  # html() is deprecated in newer rvest releases

mydf <- mydata %>%
  html_node("table") %>%
  html_table(fill = TRUE)

mydf <- na.omit(mydf)

The last line is to omit some weird NA rows that show up with this method.
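
A possible variant, sketched on a made-up fragment rather than the real page: the NA rows presumably come from the replacement also adding a <tr> after rows that already have one and after the final </tr>. Adding the tag only where a </tr> is followed directly by a <td> should avoid that. The names myhtml, fixed_html and fixed_df below are placeholders, and the pattern assumes only whitespace sits between the closing tag and the next cell.

```r
library(rvest)

# Hypothetical fragment: the second data row lacks its opening <tr>
myhtml <- paste0(
  "<table><tr><th>Incident</th><th>Disposition</th></tr>",
  "<tr><td>Theft</td><td>Open</td></tr>\n",
  "<td>Information</td><td>CPD</td></tr></table>"
)

# Add <tr> only where a </tr> is followed (after optional whitespace) by a <td>
fixed_html <- gsub("</tr>\\s*<td", "</tr><tr><td", myhtml)

fixed_df <- read_html(fixed_html) %>%
  html_node("table") %>%
  html_table()
fixed_df
```

With no extra empty rows inserted, the na.omit() step should no longer be needed for a table shaped like this one.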

jonahshai answered Nov 15 '22 05:11
