 

Web scraping the make/model/year of VIN numbers in RStudio

I am currently working on a project where I need to find the manufacturer, model, and model year for a list of 300 different VINs. Looking up each VIN individually and manually entering the manufacturer, model, and year into Excel is inefficient and tedious.

I tried using the rvest package with SelectorGadget to write a few lines of R code to scrape this site for the information, but I was not successful: http://www.vindecoder.net/?vin=1G2HX54K724118697&submit=Decode

Here is my code:

library("rvest")
Vnum = "1G2HX54K724118697"
site <- paste("http://www.vindecoder.net/?vin=", Vnum,"&submit=Decode",sep="")
htmlpage <- html(site)
VINhtml <- html_nodes(htmlpage, ".odd:nth-child(6) , .even:nth-child(5) , .even:nth-child(7)")
VIN <- html_text(VINhtml)
paste(VIN, collapse =" ")

When I print VINhtml, I get an empty node set instead of the fields I selected: list() attr(,"class") [1] "XMLNodeSet"

I do not know what I am doing wrong. I think it is not working because it is a dynamic webpage but I could be wrong. Does anyone have any suggestions on the best way to approach this problem?

I am also open to using other websites or alternative approaches to figuring this out. I just want to find the model, manufacturer, and model year of these VINs. Can anyone please help me in finding an efficient way of doing this?

Here are some sample VINs: YV4SZ592561226129 YV4SZ592371288470 YV4SZ592371257784 YV4CZ982871331598 YV4CZ982581428985 YV4CZ982481423003 YV4CZ982381423543 YV4CZ982171380593 YV4CZ982081460887 YV4CZ852361288222 YV4CZ852281454409 YV4CZ852281454409 YV4CZ852281454409 YV4CZ592861304665 YV4CZ592861267682 YV4CZ592561266859

radaley1906 asked Jun 11 '15 11:06





1 Answer

Here is the solution using RSelenium and rvest.

To run RSelenium, you first have to download the Selenium standalone server (mine is version 2.45). Let's say the downloaded file is in the My Documents directory. Then run the following two steps in cmd before using RSelenium in your IDE:
a) cd My Documents # I have the Selenium server jar saved in the My Documents folder
b) java -jar selenium-server-standalone-2.45.0.jar

library(RSelenium)
library(rvest) 
startServer() 
remDr <- remoteDriver(browserName = 'firefox')
remDr$open()
Vnum<- c("YV4SZ592371288470","1G2HX54K724118697","YV4SZ592371288470")

kk<-lapply(Vnum,function(j){

  remDr$navigate(paste("http://www.vindecoder.net/?vin=",j,"&submit=Decode",sep=""))
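  Sys.sleep(30) # this is critical: wait for the page to finish loading the decoded fields
  # getPageSource() comes from RSelenium; after this we can use rvest functions until we close the session
  # (in rvest 0.3.0 and later, html() is replaced by read_html())
  test.html <- html(remDr$getPageSource()[[1]])
  test.text <- test.html %>%
    html_nodes(".odd:nth-child(6) , .even:nth-child(5) , .even:nth-child(7)") %>%
    html_text()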
})
kk
[[1]]
[1] "Model: XC70"                          "Type: Multipurpose Passenger Vehicle" "Make: Volvo"                         

[[2]]
[1] "Model: Bonneville"            "Make (Manufacturer): Pontiac" "Model year: 2002"            

[[3]]
[1] "Model: XC70"                          "Type: Multipurpose Passenger Vehicle" "Make: Volvo"   

remDr$close()
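
If you want the result as a table instead of a list of strings, the "Label: value" text can be split and matched by label. This is only a minimal sketch: parse_fields() and get_field() are hypothetical helper names (not part of rvest or RSelenium), and example_kk simply copies two of the results printed above so the snippet runs without a browser.

```r
# Hypothetical helper: split "Label: value" strings into a named character vector.
parse_fields <- function(x) {
  label <- sub(":.*$", "", x)              # text before the first colon
  value <- trimws(sub("^[^:]*:", "", x))   # text after the first colon
  setNames(value, label)
}

# Hypothetical helper: return the first field whose label matches a pattern, or NA.
get_field <- function(fields, pattern) {
  hit <- grep(pattern, names(fields), ignore.case = TRUE)
  if (length(hit) == 0) NA_character_ else unname(fields[hit[1]])
}

# Example input copied from the output above (stands in for kk).
example_kk <- list(
  c("Model: XC70", "Type: Multipurpose Passenger Vehicle", "Make: Volvo"),
  c("Model: Bonneville", "Make (Manufacturer): Pontiac", "Model year: 2002")
)

parsed <- lapply(example_kk, parse_fields)

# One row per VIN; a field the selector did not return becomes NA.
result <- data.frame(
  make  = sapply(parsed, get_field, pattern = "^make"),
  model = sapply(parsed, get_field, pattern = "^model$"),
  year  = sapply(parsed, get_field, pattern = "^model year"),
  stringsAsFactors = FALSE
)
print(result)
```

Matching on the label text rather than on the position in the vector means the two slightly different labels ("Make" and "Make (Manufacturer)") end up in the same column.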

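P.S. You can see that the same CSS path does not apply to all VINs: the first and third results return Model, Type, and Make, while the second returns Model, Make, and Model year. You have to work out the right path in advance (I just used the path that you provided in the question). You can also wrap the scraping step in tryCatch so that one failing VIN does not stop the whole run.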
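
A rough sketch of the tryCatch idea, under some assumptions: safe_decode() and fake_fetch() are hypothetical names made up for illustration, and fake_fetch() merely stands in for the navigate / getPageSource / html_nodes steps so that the snippet runs without Selenium. In a real run you would pass a function that performs those steps for one VIN.

```r
# Hypothetical wrapper: run fetch(vin) and return NA instead of stopping on failure.
safe_decode <- function(vin, fetch) {
  tryCatch({
    fields <- fetch(vin)
    # An empty result means the selector matched nothing for this VIN.
    if (length(fields) == 0) stop("no fields found")
    fields
  }, error = function(e) {
    message("Could not decode ", vin, ": ", conditionMessage(e))
    NA_character_
  })
}

# Stand-in for the real scraping step (assumption, for demonstration only).
fake_fetch <- function(vin) {
  if (vin == "BADVIN") stop("page did not load")
  if (vin == "EMPTY") return(character(0))
  c("Model: XC70", "Make: Volvo")
}

out <- lapply(c("YV4SZ592371288470", "BADVIN", "EMPTY"),
              safe_decode, fetch = fake_fetch)
str(out)
```

Note that html_nodes() returns an empty set rather than raising an error when the selector matches nothing, which is why the sketch checks the length explicitly before returning.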

Metrics answered Nov 15 '22 13:11
