I'm searching for a way to fetch some data from a website using LibreOffice's Calc.
I was using Google Sheets with the IMPORTXML function before, but because it's very unreliable I want to use Calc instead.
My functions there looked like this:
=IMPORTXML(E2; "//h3[@class='product-name']")
=IMPORTXML(E2; "//span[@class='price']")
As you can already guess, the URL was in E2 (f.i. http://www.killis.at/gin/monkey-47-gin-distiller-s-cut-2016-0-5-lt.html).
In Calc I've tried =FILTERXML(WEBSERVICE(E2);"//h3[@class='product-name']") only to get #VALUE! as a result.
My LibreOffice's version is 6.0.4.2 with German locale. I use English function names with ";" as a separator.
So what would be the equivalent for this function in Calc and how would the appropriate command for the product name and price look like?
The problem is that while IMPORTXML claims to be able parsing tag soup HTML, which is not true in all cases, the FILTERXML needs a valid XML stream per definition. And tag soup HTML is not a valid XML stream. To be honest, HTML mostly is the opposite of a valid XML stream.
So the only way would be using third party tag soup parsers or taking the HTML tag soup as string and using string operations to find the needed parts of the string.
The second approach could look like this:
Public Function GETFROMHTML(sURL as String, sStartTag as String) as String
on error goto onErrorExit
oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sUrl))
dim delimiters() as long
sContent = oInpDataStream.readString(delimiters(), false)
lStartPos = instr(1, sContent, sStartTag )
if lStartPos = 0 then
GETFROMHTML = "tag " & sStartTag & " not found"
exit function
end if
lEndPos = instr(lStartPos, sContent, "</")
lStartPos = lStartPos + 1 + len(sStartTag)
sText = trim(replace(replace(mid(sContent, lStartPos, lEndPos-lStartPos), chr(10), ""), chr(13), ""))
GETFROMHTML = sText
onErrorExit:
on error goto 0
End Function
Used in Calc cells like so:
=GETFROMHTML(E2; "<h3 class=""product-name""")
or
=GETFROMHTML(E2; "<span class=""price""")
Using a Sub could look like this:
sub getProductNameAndPrice()
on error resume next
oDoc = ThisComponent
oSheet = oDoc.CurrentController.ActiveSheet
for r = 0 to 9 'row 1 to 10 (0 based)
sURL = oSheet.getCellByPosition(4, r).String 'get string value from column 4 (E)
oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sUrl))
if not isNull(oInpDataStream.InputStream) then
dim delimiters() as long
sContent = oInpDataStream.readString(delimiters(), false)
sStartTag = "<h3 class=""product-name"""
lStartPos = instr(1, sContent, sStartTag)
if lStartPos <> 0 then
lEndPos = instr(lStartPos, sContent, "</")
lStartPos = lStartPos + 1 + len(sStartTag)
sText = trim(replace(replace(mid(sContent, lStartPos, lEndPos-lStartPos), chr(10), ""), chr(13), ""))
oSheet.getCellByPosition(5, r).String = sText
end if
sStartTag = "<span class=""price"""
lStartPos = instr(1, sContent, sStartTag)
if lStartPos <> 0 then
lEndPos = instr(lStartPos, sContent, "</")
lStartPos = lStartPos + 1 + len(sStartTag)
sText = trim(replace(replace(mid(sContent, lStartPos, lEndPos-lStartPos), chr(10), ""), chr(13), ""))
oSheet.getCellByPosition(6, r).String = sText
end if
end if
next
on error goto 0
end sub
This code takes the URLs from column E rows 1 to 10 and writes product name in column F and price in column G of the row.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With