Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: scraping additional data after POST only works for first page

I would like to scrape drug informations offered by the Swiss government for an University research project from:

http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=

The page does offer a robotx.txt file, however, it's content is freely available to the public and I assume that scraping this data is unprohibited.

This is an update of this question, since I made some progress.

What I achieved so far

# opens the first results page 
# opens the first link as a table at the end of the page

library("rvest")
library("dplyr")


url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=1,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                             ),
                           encode="form")

next: get the basic data

# makes a table of all results of the first page

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

next: get the additional data

# gives the desired informations (=additional data) of the first drug (not yet very structured)

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text 

My Problem:

# if I open the second  search page

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=2,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                             ),
                           encode="form")

next: get the new basic data

# I get easily a table with the new results

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

But if I try to get the new additional data, I get the results from page 1 again:

# does not give the desired output:

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text 

What I am looking for: the detailed data of the first drug of page 2 enter image description here

Questions:

  • Why do I get duplicate results? Is it because of the __VIEWSTATE that might change during the new request_POST ?
  • Is there any way to solve this problem?
  • Is there any better way how to get the basic and additional data? If yes, how?
like image 769
captcoma Avatar asked May 09 '19 22:05

captcoma


1 Answers

I think you are simply overthinking the problem. The issue lies in the xpath. Essentially the xpath that you are using for data extraction is the same for all pages. And it is, //*[@id="ctl00_cphContent_gvwPreparations"] The only component that is changing in your code is the txtPageNumber. In the below code, I've changed the txtPageNumber to 3, like, txtPageNumber=3 I suggest your focus should be on something like, How to automate page numbering for data extraction?. This way, you'll not have to manually change the txtPageNumber in

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                           ),
                           encode="form")

The following code worked for me;

library(rvest)
library(dplyr)


url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                           ),
                           encode="form")
# makes a table of all results of the first page

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

# A tibble: 11 x 1
   .$``  $Präparat $`Galen. Form /~ $Packung $FAP  $PP   $SB   $`Lim-Pkt` $Lim 
   <chr> <chr>     <chr>            <chr>    <chr> <chr> <chr> <chr>      <chr>
 1 21.   Accolate  Tabl 20 mg       60 Stk   29.75 50.55 ""    ""         ""   
 2 22.   Accupaque Inj Lös 300 mg   Plast F~ 32.00 53.10 ""    ""         ""   
 3 23.   Accupaque Inj Lös 300 mg   Plast F~ 61.15 86.60 ""    ""         ""   
 4 24.   Accupaque Inj Lös 300 mg   Plast F~ 120.~ 154.~ ""    ""         ""   
 5 25.   Accupaque Inj Lös 350 mg   Plast F~ 33.97 55.35 ""    ""         ""   
 6 26.   Accupaque Inj Lös 350 mg   Plast F~ 66.88 93.20 ""    ""         ""   
 7 27.   Accupaque Inj Lös 350 mg   Plast F~ 129.~ 164.~ ""    ""         ""   
 8 28.   Accupro ~ Filmtabl 10 mg   30 Stk   8.56  18.00 ""    ""         ""   
 9 29.   Accupro ~ Filmtabl 10 mg   100 Stk  26.60 46.90 ""    ""         ""   
10 30.   Accupro ~ Filmtabl 20 mg   30 Stk   14.02 28.35 ""    ""         ""   
11 "Ein~ "Einträg~ "Einträge pro S~ "Einträ~ "Ein~ "Ein~ "Ein~ "Einträge~ "Ein~
# ... with 9 more variables: $`Swissmedic-Code` <chr>, $Zulassungsinhaberin <chr>,
#   $Wirkstoff <chr>, $`BAG-Dossier` <chr>, $Aufnahme <chr>, $`Befr. AufnahmeBefr.
#   Limitation` <chr>, $`O/G` <chr>, $`IT-Code` <chr>, $`ATC-Code` <chr>

# gives the desired informations of the first drug (not yet very structured)

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_text %>%
  head(10)


[1] " PräparatGalen. Form / DosierungPackungFAPPPSBLim-PktLimSwissmedic-CodeZulassungsinhaberinWirkstoffBAG-DossierAufnahmeBefr. AufnahmeBefr. LimitationO/GIT-CodeATC-Code\r\n\t\t\t\t\r\n                        21.\r\n                    \r\n                        Accolate\r\n                    \r\n                        Tabl 20 mg \r\n                    \r\n                        60 Stk\r\n                    \r\n                        29.75\r\n                    \r\n                        50.55\r\n                    \r\n                                                \r\n                    \r\n                        \r\n                    \r\n                      \r\n                    \r\n                        53750036\r\n                    \r\n                        AstraZeneca AG\r\n                    \r\n                        Zafirlukastum\r\n                    \r\n                        17053\r\n                    \r\n                        15.03.1998\r\n                    \r\n                        \r\n                        \r\n                    \r\n                        \r\n                    \r\n                        03.04.50.\r\n                    \r\n                        R03DC01\r\n                    \r\n\t\t\t\t\r\n                        22.\r\n                    \r\n                        Accupaque\r\n                    \r\n 
like image 145
mnm Avatar answered Nov 14 '22 20:11

mnm