Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert html to plain text in VBA

I have an Excel sheet with cells containing html. How can I batch convert them to plaintext? At the moment there are so many useless tags and styles. I want to write it from scratch but it will be far easier if I can get the plain text out.

I can write a script to convert html to plain text in PHP so if you can't think of a solution in VBA then maybe you can sugest how I might pass the cells data to a website and retrieve the data back.

like image 218
Mark Avatar asked Mar 16 '11 15:03

Mark


2 Answers

Yes! I managed to solve my problem as well. Thanks everybody/

In my case, I had this sort of input:

<p>Lorem ipsum dolor sit amet.</p>

<p>Ut enim ad minim veniam.</p>

<p>Duis aute irure dolor in reprehenderit.</p>

And I did not want the result to be all jammed together without breaklines.

So I first splitted my input for every <p> tag into an array 'paragraphs', then for each element I used Tim's answer to get the text out of html (very sweet answer btw).

In addition I concatenated each cleaned 'paragraph' with this breakline character Crh(10) for VBA/Excel.

The final code is:

Public Function HtmlToText(ByVal sHTML As String) As String
    Dim oDoc As HTMLDocument
    Dim result As String
    Dim paragraphs() As String

    If IsNull(sHTML) Then
      HtmlToText = ""
      Exit Function
    End If

    result = ""
    paragraphs = Split(sHTML, "<p>")

    For Each paragraph In paragraphs
        Set oDoc = New HTMLDocument
        oDoc.body.innerHTML = paragraph
        result = result & Chr(10) & Chr(10) & oDoc.body.innerText
    Next paragraph

    HtmlToText = result
End Function

like image 71
ofundefined Avatar answered Sep 18 '22 13:09

ofundefined


A very simple way to extract text is to scan the HTML character by character, and accumulate characters outside of angle brackets into a new string.

Function StripTags(ByVal html As String) As String
    Dim text As String
    Dim accumulating As Boolean
    Dim n As Integer
    Dim c As String

    text = ""
    accumulating = True

    n = 1
    Do While n <= Len(html)

        c = Mid(html, n, 1)
        If c = "<" Then
            accumulating = False
        ElseIf c = ">" Then
            accumulating = True
        Else
            If accumulating Then
                text = text & c
            End If
        End If

        n = n + 1
    Loop

    StripTags = text
End Function

This can leave lots of extraneous whitespace, but it will help in removing the tags.

like image 22
Todd Avatar answered Sep 21 '22 13:09

Todd