I would like to scrape drug information offered by the Swiss government for a university research project from:
http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=
The page does offer a robots.txt file; however, its content is freely available to the public, and I assume that scraping this data is not prohibited.
This is an update of this question, since I made some progress.
What I achieved so far
# opens the first results page
# opens the first link as a table at the end of the page
library("rvest")
library("dplyr")
url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]
page <- rvest:::request_POST(pgsession, url,
  body = list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = 1,
    `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = "10",
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  ),
  encode = "form")
next: get the basic data
# makes a table of all results of the first page
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()
next: get the additional data
# gives the desired information (= additional data) of the first drug (not yet very structured)
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
html_text
My Problem:
# if I open the second search page
page <- rvest:::request_POST(pgsession, url,
  body = list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = 2,
    `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = "10",
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  ),
  encode = "form")
next: get the new basic data
# I easily get a table with the new results
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()
But if I try to get the new additional data, I get the results from page 1 again:
# does not give the desired output:
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
html_text
What I am looking for: the detailed data of the first drug of page 2
Questions:
Do I need to handle the __VIEWSTATE, which might change during the new request_POST?
I think you are simply overthinking the problem. The issue lies in the xpath. Essentially, the xpath that you are using for data extraction is the same for all pages, namely //*[@id="ctl00_cphContent_gvwPreparations"]. The only component that changes in your code is txtPageNumber. In the code below, I've changed txtPageNumber to 3, i.e. txtPageNumber=3.
I suggest you focus on something like "How to automate page numbering for data extraction?". This way, you will not have to change txtPageNumber manually in
page <- rvest:::request_POST(pgsession, url,
  body = list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = 3,
    `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = "10",
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  ),
  encode = "form")
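As a minimal sketch of automating the page number, the request can be wrapped in a function that takes the page as an argument. This assumes the same rvest version as the code in this post (one that still provides html_session and the internal rvest:::request_POST). The helper names build_page_body and scrape_pages are hypothetical, the one-second pause is an arbitrary choice to keep the load on the server low, and the hidden form values are reused from the initial form exactly as in the code above:

```r
# Hypothetical helper: build the POST body for one results page.
# `fields` is a named list of hidden form values taken from the initial form.
build_page_body <- function(fields, page_number, page_size = "10") {
  pager <- "ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$"
  body <- list()
  body[[paste0(pager, "txtPageNumber")]] <- page_number
  # Copy the ASP.NET hidden fields; a field missing from `fields` is skipped.
  for (name in c("__VIEWSTATE", "__VIEWSTATEGENERATOR",
                 "__VIEWSTATEENCRYPTED", "__EVENTVALIDATION")) {
    body[[name]] <- fields[[name]]
  }
  body[[paste0(pager, "ddlPageSize")]] <- page_size
  body[["__EVENTTARGET"]] <- "ctl00$cphContent$gvwPreparations$ctl02$ctl00"
  body[["__EVENTARGUMENT"]] <- ""
  body
}

# Hypothetical helper: request each page in `pages` and return one
# results table per page (a list of data frames).
scrape_pages <- function(url, pages, pause = 1) {
  pgsession <- rvest::html_session(url)
  pgform <- rvest::html_form(pgsession)[[1]]
  fields <- lapply(pgform$fields, function(f) f$value)
  lapply(pages, function(p) {
    page <- rvest:::request_POST(pgsession, url,
                                 body = build_page_body(fields, p),
                                 encode = "form")
    Sys.sleep(pause)  # be polite to the server between requests
    nodes <- rvest::html_nodes(
      xml2::read_html(page),
      xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]'
    )
    dplyr::bind_rows(rvest::html_table(nodes, fill = TRUE))
  })
}

# Usage (not run here, since it contacts the site):
# tables <- scrape_pages(url, 1:3)
```

Returning a list keeps each page separate; whether the pages can be combined directly with bind_rows depends on the column types being the same on every page.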
The following code worked for me:
library(rvest)
library(dplyr)
url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]
page <- rvest:::request_POST(pgsession, url,
  body = list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = 3,
    `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = "10",
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  ),
  encode = "form")
# makes a table of all results of the requested page (here: page 3)
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()
# A tibble: 11 x 1
.$`` $Präparat $`Galen. Form /~ $Packung $FAP $PP $SB $`Lim-Pkt` $Lim
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 21. Accolate Tabl 20 mg 60 Stk 29.75 50.55 "" "" ""
2 22. Accupaque Inj Lös 300 mg Plast F~ 32.00 53.10 "" "" ""
3 23. Accupaque Inj Lös 300 mg Plast F~ 61.15 86.60 "" "" ""
4 24. Accupaque Inj Lös 300 mg Plast F~ 120.~ 154.~ "" "" ""
5 25. Accupaque Inj Lös 350 mg Plast F~ 33.97 55.35 "" "" ""
6 26. Accupaque Inj Lös 350 mg Plast F~ 66.88 93.20 "" "" ""
7 27. Accupaque Inj Lös 350 mg Plast F~ 129.~ 164.~ "" "" ""
8 28. Accupro ~ Filmtabl 10 mg 30 Stk 8.56 18.00 "" "" ""
9 29. Accupro ~ Filmtabl 10 mg 100 Stk 26.60 46.90 "" "" ""
10 30. Accupro ~ Filmtabl 20 mg 30 Stk 14.02 28.35 "" "" ""
11 "Ein~ "Einträg~ "Einträge pro S~ "Einträ~ "Ein~ "Ein~ "Ein~ "Einträge~ "Ein~
# ... with 9 more variables: $`Swissmedic-Code` <chr>, $Zulassungsinhaberin <chr>,
# $Wirkstoff <chr>, $`BAG-Dossier` <chr>, $Aufnahme <chr>, $`Befr. AufnahmeBefr.
# Limitation` <chr>, $`O/G` <chr>, $`IT-Code` <chr>, $`ATC-Code` <chr>
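Row 11 of the printed table is not a drug entry but the pager footer of the grid (its cells all start with "Einträ..."), and the price columns arrive as character. A minimal sketch for tidying this, assuming it is applied to the plain data frame returned by bind_rows (before the tibble() call) and that real entries always have a running number such as "21." in the first column; the helper name clean_results is hypothetical:

```r
# Hypothetical helper: keep only real result rows and parse the price columns.
# Assumes the first column holds the running number ("21.", "22.", ...).
clean_results <- function(df, price_cols = c("FAP", "PP")) {
  first_col <- trimws(as.character(df[[1]]))
  is_entry <- grepl("^[0-9]+\\.$", first_col)
  out <- df[is_entry, , drop = FALSE]
  # Convert prices to numeric; columns not present in `df` are skipped.
  for (col in intersect(price_cols, names(out))) {
    out[[col]] <- as.numeric(out[[col]])
  }
  out
}
```

Prices that are not plain decimals (for example with a thousands separator) would become NA with a warning, so the conversion is worth checking on the full data.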
# gives the text of the results table on that page (not yet very structured)
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_text %>%
head(10)
[1] " PräparatGalen. Form / DosierungPackungFAPPPSBLim-PktLimSwissmedic-CodeZulassungsinhaberinWirkstoffBAG-DossierAufnahmeBefr. AufnahmeBefr. LimitationO/GIT-CodeATC-Code\r\n\t\t\t\t\r\n 21.\r\n \r\n Accolate\r\n \r\n Tabl 20 mg \r\n \r\n 60 Stk\r\n \r\n 29.75\r\n \r\n 50.55\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n 53750036\r\n \r\n AstraZeneca AG\r\n \r\n Zafirlukastum\r\n \r\n 17053\r\n \r\n 15.03.1998\r\n \r\n \r\n \r\n \r\n \r\n \r\n 03.04.50.\r\n \r\n R03DC01\r\n \r\n\t\t\t\t\r\n 22.\r\n \r\n Accupaque\r\n \r\n
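On the __VIEWSTATE question: ASP.NET WebForms pages send fresh hidden fields such as __VIEWSTATE and __EVENTVALIDATION with every response, so the values from the initial form can become stale after a postback. A minimal sketch of reading them from the most recently returned page instead; the helper name extract_hidden_fields is hypothetical, and whether refreshing these values changes what the detail panel (fvwPreparation) shows on this site has not been tested here:

```r
# Hypothetical helper: collect name/value pairs of all hidden <input> fields
# in a parsed HTML document and return them as a named list.
extract_hidden_fields <- function(doc) {
  inputs <- rvest::html_nodes(doc, xpath = "//input[@type='hidden']")
  field_names <- rvest::html_attr(inputs, "name")
  field_values <- rvest::html_attr(inputs, "value")
  field_values[is.na(field_values)] <- ""  # inputs without a value attribute
  keep <- !is.na(field_names)              # ignore inputs without a name
  stats::setNames(as.list(field_values[keep]), field_names[keep])
}

# Usage (not run here): `page` is the object returned by request_POST above.
# fresh <- extract_hidden_fields(xml2::read_html(page))
# fresh[["__VIEWSTATE"]]  # use these values in the body of the next POST
```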