
Webscraping an image in a column with R

Tags:

r

web-scraping

To clarify, this data shows information for active and historical fires in British Columbia. So far, I've been able to pull all of the text data out of the HTML table using the following code:

 Interface_html <- html_nodes(webpage,'td:nth-child(1)')
 Interface_data <- html_text(Interface_html)
 head(Interface_data)

 (...)


Geocoding_df <- data.frame(
  Fire_no = Fire_no_data, Geographic = Geographic_data,
  Discovery = Discovery_Date_data, Status = Status_data,
  Hectares = Hectares_data, Interface = Interface_data,
  Updatetime = Updatetime_data, Updatetime_stg = Updatetime_data_stg
)

However, in the first column some rows contain an image of a small house. This image acts as an indicator that the fire is an 'interface' fire, meaning that it is threatening structures.

Basically, I need a way to detect whether or not the image is present in each row (ideally by pulling the image alt text "Interface", but even a yes/no indicator would be fine for my purposes).

Is there a way to pull the image properties from this table by modifying the code that I've already got?

The main purpose is to pull the entire table into SQL for some data visualization using Power BI.


The website: http://bcfireinfo.for.gov.bc.ca/hprScripts/WildfireNews/Fires.asp?Mode=normal&AllFires=1&FC=0

nimavancouver asked Nov 19 '25 14:11



1 Answer

The variable "Interface_html" is a list of nodes: the first cell (td) of every row in the table. So one method is to look at each node to see if it contains an img tag. Unlike html_nodes, html_node (without the s) always returns exactly one result per input node, whether or not a match is found. In this case html_node(Interface_html, "img") will return NA for a cell where the image does not exist; otherwise it will return the img node.

library(rvest)

url<-"http://bcfireinfo.for.gov.bc.ca/hprScripts/WildfireNews/Fires.asp?Mode=normal&AllFires=1&FC=0"
webpage<-read_html(url)

#list of all nodes
Interface_html <- html_nodes(webpage,'td:nth-child(1)')

#search each node in the list for an image tag and return the positions of the cells that contain one
withimage<- which(!is.na(html_node(Interface_html, "img")))

withimage
#[1] 109 145


#to add the column of TRUE/FALSE values to your data frame, use this in place of Interface = Interface_data:
Interface = !is.na(html_node(Interface_html, "img"))
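
The question also asked for the image alt text rather than just a yes/no flag. A minimal sketch of that variation, assuming the icon's img tag carries alt="Interface" as described in the question: passing the result of html_node to html_attr gives the alt text for cells with an image and NA for the rest. The small inline table (sample_page) and the names first_cells, interface_alt and interface_flag are made up here so the example runs without the live site.

```r
library(rvest)

# Hypothetical stand-in for the fire table: only the second row has the house icon.
sample_page <- read_html('
<table>
  <tr><td></td><td>C10001</td></tr>
  <tr><td><img src="house.gif" alt="Interface"></td><td>C10002</td></tr>
  <tr><td></td><td>C10003</td></tr>
</table>')

# First cell of every row, same selector as in the answer above.
first_cells <- html_nodes(sample_page, "td:nth-child(1)")

# html_node() gives one result per cell, so html_attr() returns the alt text
# where an <img> exists and NA where it does not.
interface_alt <- html_attr(html_node(first_cells, "img"), "alt")

# Yes/no indicator derived from the same vector.
interface_flag <- !is.na(interface_alt)

print(interface_alt)
print(interface_flag)
```

On the real page, the same two lines should work with Interface_html in place of first_cells, and either vector can be passed as the Interface column in the data.frame call.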
Dave2e answered Nov 21 '25 11:11



