Convert HTML-table to Excel
The code below fetches the HTML-table at https://rasmusrhl.github.io/stuff, and converts it to Excel-format.
The problem is that:
Solution
Thank you all for your great contributions. The varied anwers helped me understand, that for my purposes a workaround was the best solution: Because I generate the HTML-tables myself, I can control the CSS of each cell. CSS codes exists that instruct Excel how to interpret cell contents: http://cosicimiento.blogspot.dk/2008/11/styling-excel-cells-with-mso-number.html, also explained in this question: Format HTML table cell so that Excel formats as text?
In my case the CSS should be text, which is mso-number-format:\"\\@\"
. It is integrated in R code below:
library(htmlTable)
library(nycflights13)
library(dplyr)
nycflights13::planes %>%
slice(1:10) %>% mutate( seats = seats*1.0001,
s1 = c("1-5", "5-10", "1/2", "1/10", "2-3", "1", "1.0", "01", "01.00", "asfdkjlæ" ),
s2 = c("(10)", "(12)", "(234)", "(00)", "(01)", "(098)", "(01)", "(01.)", "(001.0)", "()" )) -> df
rle_man <- rle(df$manufacturer)
css_matrix <- matrix( data = "mso-number-format:\"\\@\"", nrow = nrow(df), ncol = ncol(df))
css_matrix[,1] <- "padding-left: 0.4cm;mso-number-format:\"\\@\""
css_matrix[,2:10] <- "padding-left: 1cm;mso-number-format:\"\\@\""
css_matrix[,5] <- "padding-left: 2cm;mso-number-format:\"\\@\""
htmlTable( x = df,
rgroup = rle_man$values, n.rgroup = rle_man$lengths,
rnames = FALSE, align = c("l", "r" ),
cgroup = rbind( c("", "Some text goes here. It is long and does not break", "Other text goes here", NA),
c( "", "Machine type<br>(make)", "Specification of machine", "Other variables")),
n.cgroup = rbind( c(1,8,2, NA),
c(1, 3, 5, 2)),
css.cell = css_matrix ) -> html_out
temp_file <- tempfile( pattern = "table", fileext = ".html" )
readr::write_file( x = html_out, path = temp_file)
utils::browseURL( temp_file)
That HTML-file can be dragged and dropped into Excel with all cells interpreted as text. Note, only dragging-and-dropping the html-file into excel works, it does not work to open the table in a browser and copy-pasting it into excel.
The only thing missing from this method is the horizontal lines, but I can live with that.
Below is VBA with the same effect as dragging and dropping:
Sub importhtml()
'
' importhtml Macro
'
'
With ActiveSheet.QueryTables.Add(Connection:= _
"URL;file:///C:/Users/INSERTUSERNAME/Desktop/table18b85c0a20f3html.HTML", Destination:=Range("$a$1"))
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = False
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlEntirePage
.WebFormatting = xlWebFormattingAll
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = True
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With
End Sub
For a client side solution
So run this code after the first block of code, it rewrites the final two columns.
Sub Test2()
'* tools references ->
'* Microsoft HTML Object Library
Dim oHtml4 As MSHTML.IHTMLDocument4
Set oHtml4 = New MSHTML.HTMLDocument
Dim oHtml As MSHTML.HTMLDocument
Set oHtml = Nothing
'* IHTMLDocument4.createDocumentFromUrl
'* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
Set oHtml = oHtml4.createDocumentFromUrl("https://rasmusrhl.github.io/stuff/", "")
While oHtml.readyState <> "complete"
DoEvents '* do not comment this out it is required to break into the code if in infinite loop
Wend
Debug.Assert oHtml.readyState = "complete"
Dim oTRs As MSHTML.IHTMLDOMChildrenCollection
Set oTRs = oHtml.querySelectorAll("TR")
Debug.Assert oTRs.Length = 17
Dim lRowNum As Long
For lRowNum = 3 To oTRs.Length - 1
Dim oTRLoop As MSHTML.HTMLTableRow
Set oTRLoop = oTRs.Item(lRowNum)
If oTRLoop.ChildNodes.Length > 1 Then
Debug.Assert oTRLoop.ChildNodes.Length = 14
Dim oSecondToLastColumn As MSHTML.HTMLTableCell
Set oSecondToLastColumn = oTRLoop.ChildNodes.Item(12)
ActiveSheet.Cells(lRowNum + 2, 13).Value2 = "'" & oSecondToLastColumn.innerText
Dim oLastColumn As MSHTML.HTMLTableCell
Set oLastColumn = oTRLoop.ChildNodes.Item(13)
ActiveSheet.Cells(lRowNum + 2, 14).Value2 = "'" & oLastColumn.innerText
End If
'Stop
Next lRowNum
ActiveSheet.Columns("M:M").EntireColumn.AutoFit
ActiveSheet.Columns("N:N").EntireColumn.AutoFit
End Sub
For a Server Side Solution
Now that we know you control the source script and that it is in R then one can change the R script to style the final columns with mso-number-format:'\@' . Here is a sample R script that achieves this, one builds a CSS matrix of the same dimensions as the data and passes the CSS matrix as a parameter into htmlTable
. I have not tampered with your R source instead I give here a simple illustration for you to interpret.
A=matrix(c("(2)","(4)","(3)","(1)","(5)","(7)"),nrow=2,ncol=3,byrow=TRUE)
css_matrix <- matrix(data="",nrow=2,ncol=3)
css_matrix[,3] <- "mso-number-format:\"\\@\""
htmlTable(x=A,css.cell=css_matrix)
Opening in Excel I get this
Robin Mackenzie adds
you might mention in your server-side solution that OP just needs to add css_matrix[,10:11] <- "mso-number-format:\"\@\"" to their existing R code (after the last css_matrix... line) and it will implement your solution for their specific problem
Thanks Robin
To get the tabular data (keeping the format as it is) from that page, you can try like below:
Sub Fetch_Data()
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim posts As Object, post As Object, elem As Object
Dim row As Long, col As Long
With http
.Open "GET", "https://rasmusrhl.github.io/stuff/", False
.send
html.body.innerHTML = .responseText
End With
Set posts = html.getElementsByClassName("gmisc_table")(0)
For Each post In posts.Rows
For Each elem In post.Cells
col = col + 1: Cells(row + 1, col).NumberFormat = "@": Cells(row + 1, col) = elem.innerText
Next elem
col = 0
row = row + 1
Next post
End Sub
Reference to add to the library:
1. Microsoft HTML Object Library
2. Microsoft XML, v6.0 'or whatever version you have
This is how that portion looks like when get parsed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With