How can I use R (RCurl/XML packages?!) to scrape this webpage?

Tags:

r

web-scraping

I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (to whatever level you feel like sharing). Here goes:

I would like to go through all the "species pages" listed at this link:

http://gtrnadb.ucsc.edu/

So for each of them I will go to:

  1. The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
  2. And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)

Inside that page I wish to scrape the data so that I end up with a long list containing this data (for example):

chr.trna3 (1-77)    Length: 77 bp
Type: Ala   Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....

Each line will have its own list (inside the list for each "trna", inside the list for each animal).

I remember coming across the packages RCurl and XML (in R) that allow for such a task, but I don't know how to use them. So what I would love to have is:

  1. Some suggestion on how to build such code.
  2. A recommendation on how to learn the knowledge needed to perform such a task.

Thanks for any help,

Tal

asked Mar 14 '10 by Tal Galili



2 Answers

Tal,

You could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the readHTMLTable() function, which is covered in this previous thread.
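For reference, here is a minimal readHTMLTable() sketch, assuming a page whose data sits in well-formed <table> markup (the URL below is a placeholder, not the GtRNAdb site):

library(XML)
# readHTMLTable() parses every <table> element on a page into a data.frame
# and returns them as a named list
tables <- readHTMLTable("http://example.com/page-with-tables.html")
str(tables)  # inspect which table holds the data you want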

Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:

  1. Get all of the genome URLs from the base webpage (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurl package and some regex magic :-)
  2. Then take that list of URLs, scrape the data you are looking for, and stick it into a data.frame.

So, here goes...

library(RCurl)

### 1) First task is to get all of the web links we will need ##
base_url<-"http://gtrnadb.ucsc.edu/"
base_html<-getURLContent(base_url)[[1]]
links<-strsplit(base_html,"a href=")[[1]]

get_data_url<-function(s) {
    u_split1<-strsplit(s,"/")[[1]][1]
    u_split2<-strsplit(u_split1,'\\"')[[1]][2]
    ifelse(grep("[[:upper:]]",u_split2)==1 & length(strsplit(u_split2,"#")[[1]])<2,return(u_split2),return(NA))
}

# Extract only those elements that are relevant
genomes<-unlist(lapply(links,get_data_url))
genomes<-genomes[which(is.na(genomes)==FALSE)]
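# Optional sanity check (illustrative; the genome names returned depend on
# the current state of the site)
head(genomes)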

### 2) Now, scrape the genome data from all of those URLS ###

# This requires two complementary functions that are designed specifically
# for the UCSC website. The first parses the data from a -structs.html page
# and the second collects that data into a multi-dimensional list
parse_genomes<-function(g) {
    g_split1<-strsplit(g,"\n")[[1]]
    g_split1<-g_split1[2:5]
    # Pull all of the data and stick it in a list
    g_split2<-strsplit(g_split1[1],"\t")[[1]]
    ID<-g_split2[1]                             # Sequence ID
    LEN<-strsplit(g_split2[2],": ")[[1]][2]     # Length
    g_split3<-strsplit(g_split1[2],"\t")[[1]]
    TYPE<-strsplit(g_split3[1],": ")[[1]][2]    # Type
    AC<-strsplit(g_split3[2],": ")[[1]][2]      # Anticodon
    SEQ<-strsplit(g_split1[3],": ")[[1]][2]     # Sequence
    STR<-strsplit(g_split1[4],": ")[[1]][2]     # Secondary structure string
    return(c(ID,LEN,TYPE,AC,SEQ,STR))
}

# This will be a high dimensional list with all of the data, you can then manipulate as you like
get_structs<-function(u) {
    struct_url<-paste(base_url,u,"/",u,"-structs.html",sep="")
    raw_data<-getURLContent(struct_url)
    s_split1<-strsplit(raw_data,"<PRE>")[[1]]
    all_data<-s_split1[seq(3,length(s_split1))]
    data_list<-lapply(all_data,parse_genomes)
    for (d in 1:length(data_list)) {data_list[[d]]<-append(data_list[[d]],u)}
    return(data_list)
}
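
# Illustrative single-genome call, using the species page from the question
# (uncomment to test one genome before looping over all of them):
# one_genome <- get_structs("Aero_pern")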

# Collect data, manipulate, and create data frame (with slight cleaning)
genomes_list<-lapply(genomes[1:2],get_structs) # Limit to the first two genomes (Bdist & Spurp); a full scrape will take a LONG time
genomes_rows<-unlist(genomes_list,recursive=FALSE) # The recursive=FALSE saves a lot of work; now we can just do a straightforward manipulation
genome_data<-t(sapply(genomes_rows,rbind))
colnames(genome_data)<-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
genome_data<-as.data.frame(genome_data)
genome_data<-subset(genome_data,ID!="</PRE>")   # Some malformed web pages produce bad rows, but we can remove them

head(genome_data)

The resulting data frame contains seven columns related to each genome entry: ID, length, type, anticodon, sequence, structure string, and name. The name column contains the base genome, which was my best guess for data organization. Here is what it looks like:

head(genome_data)
                                   ID   LEN TYPE                           AC                                                                       SEQ
1     Scaffold17302.trna1 (1426-1498) 73 bp  Ala     AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
2   Scaffold20851.trna5 (43038-43110) 73 bp  Ala   AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
3   Scaffold20851.trna8 (45975-46047) 73 bp  Ala   AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
4     Scaffold17302.trna2 (2514-2586) 73 bp  Ala     AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
5 Scaffold51754.trna5 (253637-253565) 73 bp  Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
6     Scaffold17302.trna4 (6027-6099) 73 bp  Ala     AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
                                                                        STR  NAME
1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp
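
If you would rather have the nested list structure described in the question (one list per genome, holding one element per tRNA), one way to get there from genome_data is a split() along the NAME column. A sketch, using the column names defined above (genome_chr is just a character-column copy of genome_data):

# Convert factor columns to character first (relevant for older R versions
# where data.frame() defaults to stringsAsFactors = TRUE)
genome_chr <- data.frame(lapply(genome_data, as.character), stringsAsFactors = FALSE)

# One list per genome (NAME), each holding one named character vector per tRNA row
genome_list <- lapply(split(genome_chr, genome_chr$NAME),
                      function(df) lapply(seq_len(nrow(df)), function(i) unlist(df[i, ])))

# e.g. genome_list[["Spurp"]][[1]] is the first tRNA entry for Spurp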

I hope this helps, and thanks for the fun little Sunday afternoon R challenge!

answered by DrewConway


Just tried it using Mozenda (http://www.mozenda.com). After roughly 10 minutes I had an agent that could scrape the data as you describe. You may be able to get all of this data just using their free trial. Coding is fun if you have time, but it looks like you may already have a solution coded for you. Nice job, Drew.

answered by Chris