Parse HTML data using R

I have an HTML file, shown below, that I want to parse and convert into a tabular format I can work with.

<!DOCTYPE html>
<html>

<head>
    <title>Page Title</title>
</head>

<body>
    <div class="brewery" id="brewery">
        <ul class="vcard simple">
            <li class="name"> Bradley Farm / RB Brew, LLC</li>
            <li class="address">317 Springtown Rd </li>
            <li class="address_2">New Paltz, NY 12561-3020 | <a href='http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States' target='_blank'>Map</a> </li>
            <li class="telephone">Phone: (845) 255-8769</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
    <div class="brewery">
        <ul class="vcard simple">
            <li class="name">(405) Brewing Co</li>
            <li class="address">1716 Topeka St </li>
            <li class="address_2">Norman, OK 73069-8224 | <a href='http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States' target='_blank'>Map</a> </li>
            <li class="telephone">Phone: (405) 816-0490</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
</body>

Below is the code I have used. The problem is that rvest turns each block into a single text string, and I can't get it into any useful format.

library(dplyr)
library(rvest)

url <- read_html("beer.html")
selector_name<-".brewery"
fnames<-html_nodes(x = url, css = selector_name) %>%
html_text()
head(fnames)
fnames

Is this the correct approach, or should I use some other package to go through each div and its inner elements?

The output I would like to see is:

No.  Name  Address Type Website

Thank You.

SNT asked Dec 09 '25 12:12


2 Answers

library(rvest)
library(dplyr)

html_file <- '<!DOCTYPE html>
<html>

<head>
    <title>Page Title</title>
</head>

<body>
    <div class="brewery" id="brewery">
        <ul class="vcard simple">
            <li class="name"> Bradley Farm / RB Brew, LLC</li>
            <li class="address">317 Springtown Rd </li>
            <li class="address_2">New Paltz, NY 12561-3020 | <a href="http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States" target="_blank">Map</a> </li>
            <li class="telephone">Phone: (845) 255-8769</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
    <div class="brewery">
        <ul class="vcard simple">
            <li class="name">(405) Brewing Co</li>
            <li class="address">1716 Topeka St </li>
            <li class="address_2">Norman, OK 73069-8224 | <a href="http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States" target="_blank">Map</a> </li>
            <li class="telephone">Phone: (405) 816-0490</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
</body>'

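# Parse the HTML string into a document
page <- read_html(html_file)

# Build one column per field; each selector is expected to match exactly
# once per brewery so that the columns line up
tibble(
  name = page %>% html_nodes(".vcard .name") %>% html_text(),
  address = page %>% html_nodes(".vcard .address") %>% html_text(),
  # Strip the "Type: " label from the brewery type
  type = page %>% html_nodes(".vcard .brewery_type") %>% html_text() %>% stringr::str_replace_all("^Type: ", ""),
  # Take the link target rather than the link text
  website = page %>% html_nodes(".vcard .url a") %>% html_attr("href")
)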

#> # A tibble: 2 x 4
#>                           name            address  type                       website
#>                          <chr>              <chr> <chr>                         <chr>
#> 1  Bradley Farm / RB Brew, LLC 317 Springtown Rd  Micro http://www.raybradleyfarm.com
#> 2             (405) Brewing Co    1716 Topeka St  Micro     http://www.405brewing.com
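
One caveat with building each column from a page-wide selector: if any brewery lacks a field, the columns end up with different lengths, and values can be misaligned or the call can fail. A minimal sketch of a per-brewery variant follows, which also touches on the question about going through each div. It relies on html_node() (singular) returning one result per input node, with NA where nothing matches. The sample HTML is shortened from the question, and the second brewery's url item is removed on purpose for illustration; the helper name `field` and the result name `breweries_df` are made up here.

```r
library(rvest)

# Shortened sample; the second brewery deliberately has no url item
sample_html <- '<html><body>
  <div class="brewery"><ul class="vcard simple">
    <li class="name"> Bradley Farm / RB Brew, LLC</li>
    <li class="address">317 Springtown Rd </li>
    <li class="brewery_type">Type: Micro</li>
    <li class="url"><a href="http://www.raybradleyfarm.com">www.raybradleyfarm.com</a></li>
  </ul></div>
  <div class="brewery"><ul class="vcard simple">
    <li class="name">(405) Brewing Co</li>
    <li class="address">1716 Topeka St </li>
    <li class="brewery_type">Type: Micro</li>
  </ul></div>
</body></html>'

page <- read_html(sample_html)

# One node per brewery block
breweries <- html_nodes(page, "div.brewery")

# Hypothetical helper: trimmed text of the first match inside each brewery,
# NA where the brewery has no such element
field <- function(css) html_text(html_node(breweries, css), trim = TRUE)

breweries_df <- data.frame(
  No      = seq_along(breweries),
  Name    = field("li.name"),
  Address = field("li.address"),
  Type    = sub("^Type:\\s*", "", field("li.brewery_type")),
  Website = html_attr(html_node(breweries, "li.url a"), "href"),
  stringsAsFactors = FALSE
)

breweries_df
```

Because every column is derived from the same `breweries` node set, each column has one entry per brewery, and a missing website shows up as NA instead of shifting later rows.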
austensen answered Dec 12 '25 03:12


The problem is that the data is not an HTML table, so it is not straightforward to parse. It is just two lists, which the code below concatenates into one long listing of class/value pairs. Also, FYI, look into the xml2 package for parsing HTML/XML.

library(dplyr)
library(rvest)
library(xml2)

vcard <- 
  '<!DOCTYPE html>
  <html>

  <head>
  <title>Page Title</title>
  </head>

  <body>
  <div class="brewery" id="brewery">
  <ul class="vcard simple">
  <li class="name"> Bradley Farm / RB Brew, LLC</li>
  <li class="address">317 Springtown Rd </li>
  <li class="address_2">New Paltz, NY 12561-3020 | <a href=\'http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States\' target=\'_blank\'>Map</a> </li>
  <li class="telephone">Phone: (845) 255-8769</li>
  <li class="brewery_type">Type: Micro</li>
  <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
  </ul>
  <ul class="vcard simple col2"></ul>
  </div>
  <div class="brewery">
  <ul class="vcard simple">
  <li class="name">(405) Brewing Co</li>
  <li class="address">1716 Topeka St </li>
  <li class="address_2">Norman, OK 73069-8224 | <a href=\'http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States\' target=\'_blank\'>Map</a> </li>
  <li class="telephone">Phone: (405) 816-0490</li>
  <li class="brewery_type">Type: Micro</li>
  <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
  </ul>
  <ul class="vcard simple col2"></ul>
  </div>
  </body>' %>% 
    read_html() %>% 
    xml_find_all("//ul[@class = 'vcard simple']")

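# xml_children() on a node set returns the children of every <ul>
# as a single node set, i.e. all twelve <li> items in document order
two_children <- xml2::xml_children(vcard)

# One row per <li>: its class attribute and its text
data.frame(
  class = xml2::xml_attr(two_children, "class"),
  value = xml2::xml_text(two_children),
  stringsAsFactors = FALSE
)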
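
If one row per brewery is preferred over the long class/value listing, the same xml2 calls can be applied list by list. The following is only a sketch under assumptions: the sample HTML is shortened from the question, and the `fields` vector is an illustrative choice of which li classes to keep, not something taken from the original code.

```r
library(xml2)

# Shortened sample in the same shape as the question's HTML
sample_html <- '<html><body>
  <div class="brewery"><ul class="vcard simple">
    <li class="name"> Bradley Farm / RB Brew, LLC</li>
    <li class="address">317 Springtown Rd </li>
    <li class="brewery_type">Type: Micro</li>
    <li class="url"><a href="http://www.raybradleyfarm.com">www.raybradleyfarm.com</a> </li>
  </ul><ul class="vcard simple col2"></ul></div>
  <div class="brewery"><ul class="vcard simple">
    <li class="name">(405) Brewing Co</li>
    <li class="address">1716 Topeka St </li>
    <li class="brewery_type">Type: Micro</li>
    <li class="url"><a href="http://www.405brewing.com">www.405brewing.com</a> </li>
  </ul><ul class="vcard simple col2"></ul></div>
</body></html>'

lists <- xml_find_all(read_html(sample_html), "//ul[@class = 'vcard simple']")

# Assumed set of li classes to keep, in output column order
fields <- c("name", "address", "brewery_type", "url")

# For each <ul>, name every item's text by its class, then pick the
# wanted fields; a field missing from a list comes back as NA
rows <- lapply(lists, function(ul) {
  items <- xml_children(ul)
  values <- setNames(trimws(xml_text(items)), xml_attr(items, "class"))
  unname(values[fields])
})

# Stack the per-brewery vectors into one row each
wide <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(wide) <- fields

wide
```

Selecting by name with `values[fields]` keeps the columns aligned even when a list omits an item or orders its items differently.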
Chris answered Dec 12 '25 01:12


