
Read complex html file into R with rvest

Tags: html, r, rvest

I am new to R and Stack Overflow, so please be gentle; I will try to keep this post as correct as possible. I am working on a project to compare whole exome sequencing (WES) results to proteome data. Our WES facility provides the data only as an HTML file, so I need to read it into R to continue my work.

I tried to follow the DataCamp tutorial for rvest, but I think the HTML files may be too complex: what I get is a mess of \t and \n characters with some text in between. I suppose the problem is that I am selecting the wrong node with html_nodes()?

Here is my R code, followed by a shortened HTML file with modified variants.

What I would like to get is a data frame with the same columns as in the HTML. As the example shows, some variants affect multiple transcripts; in those cases one row per transcript would be perfect, but it's not a must by any means.

Thank you very much for your help!

Sebastian

library(tidyverse)
library(rvest)

htmlALL <- read_html("Example_html")

# Select the whole table container and return its text, trimmed
getDATA <- function(html) {
  html %>%
    html_nodes(".table") %>%
    html_text() %>%
    str_trim() %>%
    unlist()
}

df_html <- getDATA(htmlALL)

<!DOCTYPE html
	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
	 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
  <!-- add title in the browser tab bar -->
  <title>Homozygous variants of sample XXX </title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>


<!-- change style to look nice -->
<style type="text/css">


html { 
  text-align: center;
  vertical-align: middle;
  height: 100%;
  width: 100%;
}
body { 
  background: #eee url('http://i.imgur.com/eeQeRmk.png'); /* http://subtlepatterns.com/weave/ */
  font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;
  font-size: 62.5%;
  line-height: 1;
  color: #585858;
  padding: 22px 10px;
  padding-bottom: 55px;

}

::selection { background: #5f74a0; color: #fff; }
::-moz-selection { background: #5f74a0; color: #fff; }
::-webkit-selection { background: #5f74a0; color: #fff; }

br { display: block; line-height: 1.6em; }

input, textarea { 
  -webkit-font-smoothing: antialiased;
  -webkit-text-size-adjust: 100%;
  -ms-text-size-adjust: 100%;
  -webkit-box-sizing: border-box;
  -moz-box-sizing: border-box;
  box-sizing: border-box;
  outline: none;
}

blockquote, q { quotes: none; }
blockquote:before, blockquote:after, q:before, q:after { content: ''; content: none; }
strong, b { font-weight: bold; } 


h1 {
  font-weight: bold;
  font-size: 3.6em;
  line-height: 1.7em;
  margin-bottom: 10px;
  text-align: center;
}

h2 {
  font-weight: bold;
  font-size: 2.6em;
  line-height: 1.7em;
  margin-bottom: 10px;
  text-align: center;
}

/** big white sheet everything is on **/
.wrapper {
  display: block;
  width: 95%;
  background: #fff;
  margin: 0 auto;
  padding: 10px 17px 100px;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  overflow-x: auto;
  overflow-y: visible;
}

/* smaller box the family information is on */
.info{
  display: block;
  width: 800px;
  background: #f2f2f2;
  margin: 0 auto;
  padding: 10px 17px 10px 10px;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  font-size: 1.8em;
  margin-bottom: 10px;
}


/* this is what actually contains the info */
.table {
  display: table;
  margin: 0 auto;
  width: 99%;
  font-size: 1.2em;
  margin-bottom: 15px;
  border-collapse: collapse;
  overflow: visible;
}

/* one row of the variants */
.tablerow {
  display: table-row;
  overflow: visible;
  border: 1px solid gray;
  width: 100%;
}

/* headers are bigger and may in the future be clickable to sort accordingly */
.tableheader {
  display: table-cell;
  background: #f2f2f2;
  padding: 3px 10px;
  margin-bottom: 25px;
  font-size: 1.8em;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
}

/* in the following, each column gets specified to increase readability */

.position {
  display: table-cell;
  padding: 3px 10px;
  font-size: 1.4em;
  height: 100%;
  text-align: center;
  vertical-align: middle;
}

.variants {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  overflow: visible;
  white-space: nowrap;
  
}

.stacked {
  display: table;
  height: 50%;
  width: 100%;

}

.center {
  display: table-cell;
  vertical-align: middle;
  width: 100%;
  padding: 0px 5px;
}


.consequences {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  padding: 3px 10px;
}

.gene {
  display: table-cell;
  padding: 3px 15px;
  height: 100%;
  vertical-align: middle;
  font-size: 1.4em;
  font-weight: bold;
}

.transcripts {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}

.list {
  height: 100%;
  width: 100%;
  display: table;
  table-layout: fixed;
}
.row {
  display: table-row;
  overflow: visible;
  vertical-align: middle;
}
.entry {
  display: table-cell;
  vertical-align:middle;
  padding: 0% 1% 0% 1%;
  white-space: nowrap;
  text-overflow: ellipsis;
  overflow: hidden;
}

.cdspos {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}

.exon {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}



.hgvs {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}

.hgvs .list .row{
  display: table-row;
  vertical-align: middle;
}

.polyphen {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}
.polyphen .list .row{
  display: table-row;
  vertical-align: middle;
}

.sift {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}
.sift .list .row{
  display: table-row;
  vertical-align: middle;
}

.allelefreq {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}



/* Tooltip container */
.tooltip_gene, .tooltip_allelefrq ,.tooltip_qual{
    position: relative;
    display: inline-block;
    border-bottom: 1px dotted black; /* If you want dots under the hoverable text */
    
}



.tooltiptext{
    visibility: hidden;
    overflow: auto;
    min-width: 400px;
    background-color: #ffb380;
    color: black;
    text-align: left;
    padding: 5px 10px;
    border-radius: 6px;
    font-size: 12pt;
    font-weight: normal;
    
    /* Position the tooltip text - see examples below! */
    position: absolute;
    z-index:1;
    
    /* shadow */
    box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    
    opacity: 0.95;
    filter: alpha(opacity=95);

}

/* Tooltip text */
.tooltip_gene .tooltiptext {
    top: -5px;
    left: 105%;
 
}


/* Tooltip text */
.tooltip_allelefrq .tooltiptext {
    top: -5px;
    right: 105%;
    min-width: 120px;
    
 
}

/* Show the tooltip text when you mouse over the tooltip container */
.tooltip_allelefrq:hover .tooltiptext, .tooltip_gene:hover .tooltiptext {
    visibility: visible;
}


.clin {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  padding: 0% 1% 0% 1%;
  white-space: nowrap;
  text-overflow: ellipsis;
  overflow: hidden;
}

</style>


<body>
  <div class="wrapper">
      <!-- add info about patients -->
      <h1>Homozygous variants of sample XXX</h1>
      <h2>Tue Jan 23 09:01:56 2018</h2>
      <div class="info">
	
	  Patient only<br>
	
      </div>
      <!-- variants table start -->
      <div class="table">
	<!-- table header start -->
	<div class="tablerow">
	  <div class="tableheader">
	    Position
	  </div>
	  <div class="tableheader">
	    Variant
	  </div>
	  <div class="tableheader">
	    Cons
	  </div>
	  <div class="tableheader">
	    Gene
	  </div>
	  <div class="tableheader">
	    Transcript
	  </div>
	  <div class="tableheader">
	    HGVSC
	  </div>
	  <div class="tableheader">
	    HGVSP
	  </div>
	  <div class="tableheader">
	    PolyPhen
	  </div>
	  <div class="tableheader">
	    SIFT
	  </div>
	  <div class="tableheader">
	    AF
	  </div>
	  <div class="tableheader">
	    Clin
	  </div>
	</div>
	<!-- table header stop -->
	<!-- var loop start -->
	
	  <div class="tablerow" >
	    <!-- position start -->
	    <div class="position">
	      <a href="http://gnomad.broadinstitute.org/region/1-117635467-117635507">1:117635487</a>
	    </div>
	    <!-- position stop -->
	    <!-- variants start -->
	    <div class="variants">
	      
		
		  G->T
		
	      
	    </div>
	    <!-- variants stop -->
	    <!-- consequences start -->
	    <div class="consequences" style="background: rgb(196, 197, 198);">
	      
		synonymous
	      
	    </div>
	    <!-- consequences stop -->
	    <!-- gene start -->
	    <div class="gene" >
	      
	      
	      
		
		  <div class="tooltip_gene">
		    <a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=TTF2" >
		      TTF2
		    </a>
		    <span class="tooltiptext">GeneCards Summary<hr>
TTF2 (Transcription Termination Factor 2) is a Protein Coding gene.
Diseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.
Among its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.
GO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.
An important paralog of this gene is HLTF.</span>
		  </div>
		
	    </div>
	    <!-- gene stop -->
	    <!-- transcripts start -->
	    <div class="transcripts">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000369466">ENST00000369466
		      </a>
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- transcripts stop -->
	    <!-- exon start -->
	<!--    <div class="exon">
	      <div class="list">
		
	      </div>
	    </div>-->
	    <!-- exon stop -->
	    <!-- hgvsc start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.2940G>T
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsc stop -->
	    <!-- hgvsp start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.2940G>T(p.%3D)
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsp stop -->
	    <!-- polyphen start -->
	    <div class="polyphen">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- polyphen stop -->
	    <!-- sift start -->
	    <div class="sift">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- sift stop -->
	    <!--.allelefreq start -->
	    <div class="allelefreq">
	      
		
		  <div class="tooltip_allelefrq">
		    0.00000
		    <span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>0</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>0</span><hr>inhouse:<span style='float:right;'>0.00118</span></span>
		  </div>
		
	      
	    </div>
	    <!--.allelefreq stop -->
	    <!--.allelefreq start -->
	    <div class="clin">
	      
		
	      
	    </div>
	    <!--.allelefreq stop -->
	  </div>
	  <!-- table row stop-->
	
	 	
	  <div class="tablerow" >
	    <!-- position start -->
	    <div class="position">
	      <a href="http://gnomad.broadinstitute.org/region/1-149898435-149898475">1:149898455</a>
	    </div>
	    <!-- position stop -->
	    <!-- variants start -->
	    <div class="variants">
	      
		
		  
		      <a href="https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs143105666">G->A</a>
		  
		
	      
	    </div>
	    <!-- variants stop -->
	    <!-- consequences start -->
	    <div class="consequences" style="background: rgb(196, 197, 198);">
	      
		synonymous
	      
	    </div>
	    <!-- consequences stop -->
	    <!-- gene start -->
	    <div class="gene" >
	      
	      
	      
		
		  <div class="tooltip_gene">
		    <a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=SF3B4" >
		      SF3B4
		    </a>
		    <span class="tooltiptext">GeneCards Summary<hr>
SF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.
Diseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.
Among its related pathways are mRNA Splicing - Major Pathway and Gene Expression.
GO annotations related to this gene include nucleic acid binding and nucleotide binding.
</span>
		  </div>
		
	    </div>
	    <!-- gene stop -->
	    <!-- transcripts start -->
	    <div class="transcripts">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000457312">ENST00000457312
		      </a>
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000271628">ENST00000271628
		      </a>
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- transcripts stop -->
	    <!-- exon start -->
	<!--    <div class="exon">
	      <div class="list">
		
	      </div>
	    </div>-->
	    <!-- exon stop -->
	    <!-- hgvsc start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.390C>A
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			c.519C>A
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsc stop -->
	    <!-- hgvsp start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.390C>A(p.%3D)
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			c.519C>A(p.%3D)
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsp stop -->
	    <!-- polyphen start -->
	    <div class="polyphen">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- polyphen stop -->
	    <!-- sift start -->
	    <div class="sift">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- sift stop -->
	    <!--.allelefreq start -->
	    <div class="allelefreq">
	      
		
		  <div class="tooltip_allelefrq">
		    0.00021
		    <span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>57</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>277082</span><hr>inhouse:<span style='float:right;'>0.00236</span></span>
		  </div>
		
	      
	    </div>
	    <!--.allelefreq stop -->
	    <!--.allelefreq start -->
	    <div class="clin">
	      
		
	      
	    </div>
	    <!--.allelefreq stop -->
	  </div>
	  <!-- table row stop-->
	 	
	<!-- var loop stop -->
      </div>
      <!-- variant table stop -->
    </div>
</body>
</html>
Sebastian Hesse asked Sep 12 '18 14:09





1 Answer

Here's the best I can offer you. Note that the output includes the "tooltip text" that pops up when you hover over the data in the Gene and AF columns, and that rows with several transcripts have their entries run together in a single cell.

library(rvest)

# I saved your sample to my Desktop as test.html
pg = read_html('~/Desktop/test.html')

# count rows (including header):
n_rows = pg %>% html_nodes('div.tablerow') %>% length

# sprintf-friendly format to get the %d-th node matching
#   //div[@class="tablerow"] (same as div.tablerow in CSS)
#   All of the /div after this are columns
xp_fmt = '//div[@class="tablerow"][%d]/div'

# div.tableheader nodes contain column names
col_names = pg %>% html_nodes(xpath = sprintf(xp_fmt, 1L)) %>% 
  html_text %>% trimws

# rows 2:n contain the actual data; gsub strips
#   leading/trailing whitespace and deletes any internal
#   run of two or more whitespace characters
rows = lapply(2:n_rows, function(ii) {
  pg %>% html_nodes(xpath = sprintf(xp_fmt, ii)) %>% 
    html_text %>% gsub('^\\s+|\\s{2,}|\\s+$', '', .)
})

# can't forget those pesky factors
DF = as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(DF) = col_names
DF
#      Position Variant       Cons
# 1 1:117635487    G->T synonymous
# 2 1:149898455    G->A synonymous
#                                                                                                                                                                                                                                                                                                                                                                                                                                                     Gene
# 1 TTF2GeneCards Summary\nTTF2 (Transcription Termination Factor 2) is a Protein Coding gene.\nDiseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.\nAmong its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.\nGO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.\nAn important paralog of this gene is HLTF.
# 2                                                       SF3B4GeneCards Summary\nSF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.\nDiseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.\nAmong its related pathways are mRNA Splicing - Major Pathway and Gene Expression.\nGO annotations related to this gene include nucleic acid binding and nucleotide binding.
#                       Transcript            HGVSC
# 1                ENST00000369466        c.2940G>T
# 2 ENST00000457312ENST00000271628 c.390C>Ac.519C>A
#                            HGVSP PolyPhen SIFT
# 1               c.2940G>T(p.%3D)              
# 2 c.390C>A(p.%3D)c.519C>A(p.%3D)              
#                                                         AF
# 1       0.00000allele countsht: 0hm: 0wt: 0inhouse:0.00118
# 2 0.00021allele countsht: 57hm: 0wt: 277082inhouse:0.00236
#   Clin
# 1     
# 2     
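
The question also asks for one row per transcript, and the output above still carries the tooltip text. Here is a minimal sketch of one way to handle both, assuming the same class names as the sample file (tooltiptext, transcripts, hgvs, entry). It drops the tooltip spans with xml2's xml_remove() before reading any text, then collects each div.entry separately so that multi-transcript variants expand into several rows. The inline HTML is a hypothetical, trimmed-down stand-in for the report with only three columns, and the helper names one_row and cell_text are made up for illustration.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the report: header row plus one data row
html <- '
<div class="table">
  <div class="tablerow">
    <div class="tableheader">Gene</div>
    <div class="tableheader">Transcript</div>
    <div class="tableheader">HGVSC</div>
  </div>
  <div class="tablerow">
    <div class="gene">
      <div class="tooltip_gene">
        <a href="#">SF3B4</a>
        <span class="tooltiptext">GeneCards Summary</span>
      </div>
    </div>
    <div class="transcripts"><div class="list">
      <div class="row"><div class="entry"><a href="#">ENST00000457312
      </a></div></div>
      <div class="row"><div class="entry"><a href="#">ENST00000271628
      </a></div></div>
    </div></div>
    <div class="hgvs"><div class="list">
      <div class="row"><div class="entry"> c.390C>A </div></div>
      <div class="row"><div class="entry"> c.519C>A </div></div>
    </div></div>
  </div>
</div>'

pg <- read_html(html)

# Remove every tooltip span from the parsed document (modifies pg in place)
xml_remove(html_nodes(pg, "span.tooltiptext"))

# All rows except the header row
data_rows <- html_nodes(pg, "div.tablerow")[-1]

# Build one small data frame per variant; the single Gene value is
# recycled across however many transcript entries the row contains
one_row <- function(row) {
  cell_text <- function(css) trimws(html_text(html_nodes(row, css)))
  data.frame(
    Gene       = cell_text("div.gene"),
    Transcript = cell_text("div.transcripts div.entry"),
    HGVSC      = cell_text("div.hgvs div.entry"),
    stringsAsFactors = FALSE
  )
}

variants <- do.call(rbind, lapply(data_rows, one_row))
variants
```

In the full file both the HGVSC and HGVSP cells use class="hgvs", so they would have to be told apart by their position within the row rather than by class alone.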

This doesn't matter here, since all of your columns appear to be character type, but a more sophisticated approach would write the rows to a regular text file (e.g. CSV) and then use read.table (or better, data.table's fread) to read the text back in and auto-detect the column types.
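
Column types can also be guessed in memory, without a round trip through a file. This is a minimal sketch, assuming an all-character data frame like the one above; the two-column DF here is a made-up stand-in, and base R's type.convert() is swapped in for the read.table/fread route described in the note. On the real data, the AF column would first need its tooltip text removed before it could parse as a number.

```r
# Hypothetical all-character data frame, shaped like the scraped result
DF <- data.frame(
  Position = c("1:117635487", "1:149898455"),
  AF       = c("0.00000", "0.00021"),
  stringsAsFactors = FALSE
)

# type.convert() guesses a type for each column;
# as.is = TRUE keeps non-numeric strings as character instead of factor
typed <- utils::type.convert(DF, as.is = TRUE)
str(typed)
```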

MichaelChirico answered Oct 23 '22 14:10
