Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to fetch data from a website using LibreOffice's Calc?

I'm searching for a way to fetch some data from a website using LibreOffice's Calc.

I was using Google Sheets with the IMPORTXML function before, but because it's very unreliable I want to use Calc instead.

My functions there looked like this:

=IMPORTXML(E2; "//h3[@class='product-name']")

=IMPORTXML(E2; "//span[@class='price']")

As you can already guess, the URL was in E2 (f.i. http://www.killis.at/gin/monkey-47-gin-distiller-s-cut-2016-0-5-lt.html).

In Calc I've tried =FILTERXML(WEBSERVICE(E2);"//h3[@class='product-name']") only to get #VALUE! as a result.

My LibreOffice's version is 6.0.4.2 with German locale. I use English function names with ";" as a separator.

So what would be the equivalent for this function in Calc and how would the appropriate command for the product name and price look like?

like image 351
DMT Avatar asked Oct 20 '25 14:10

DMT


1 Answers

The problem is that while IMPORTXML claims to be able parsing tag soup HTML, which is not true in all cases, the FILTERXML needs a valid XML stream per definition. And tag soup HTML is not a valid XML stream. To be honest, HTML mostly is the opposite of a valid XML stream.

So the only way would be using third party tag soup parsers or taking the HTML tag soup as string and using string operations to find the needed parts of the string.

The second approach could look like this:

Public Function GETFROMHTML(sURL as String, sStartTag as String) as String

   on error goto onErrorExit
   oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
   oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
   oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sUrl))
   dim delimiters() as long
   sContent = oInpDataStream.readString(delimiters(), false)

   lStartPos = instr(1, sContent, sStartTag )
   if lStartPos = 0 then
     GETFROMHTML = "tag " & sStartTag & " not found"
     exit function
   end if   
   lEndPos = instr(lStartPos, sContent, "</")
   lStartPos = lStartPos + 1 + len(sStartTag)
   sText = trim(replace(replace(mid(sContent, lStartPos, lEndPos-lStartPos), chr(10), ""), chr(13), ""))
   GETFROMHTML = sText

 onErrorExit:
   on error goto 0
End Function

Used in Calc cells like so:

=GETFROMHTML(E2; "<h3 class=""product-name""")

or

=GETFROMHTML(E2; "<span class=""price""")

Using a Sub could look like this:

sub getProductNameAndPrice()
   on error resume next

   oDoc = ThisComponent
   oSheet = oDoc.CurrentController.ActiveSheet

   for r = 0 to 9 'row 1 to 10 (0 based)

     sURL = oSheet.getCellByPosition(4, r).String 'get string value from column 4 (E)

     oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
     oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
     oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sUrl))
     if not isNull(oInpDataStream.InputStream) then 
       dim delimiters() as long
       sContent = oInpDataStream.readString(delimiters(), false)

       sStartTag = "<h3 class=""product-name"""

       lStartPos = instr(1, sContent, sStartTag)
       if lStartPos <> 0 then
         lEndPos = instr(lStartPos, sContent, "</")
         lStartPos = lStartPos + 1 + len(sStartTag)
         sText = trim(replace(replace(mid(sContent, lStartPos, lEndPos-lStartPos), chr(10), ""), chr(13), ""))

         oSheet.getCellByPosition(5, r).String = sText
       end if 

       sStartTag = "<span class=""price"""

       lStartPos = instr(1, sContent, sStartTag)
       if lStartPos <> 0 then
         lEndPos = instr(lStartPos, sContent, "</")
         lStartPos = lStartPos + 1 + len(sStartTag)
         sText = trim(replace(replace(mid(sContent, lStartPos, lEndPos-lStartPos), chr(10), ""), chr(13), ""))

         oSheet.getCellByPosition(6, r).String = sText
       end if   

     end if

   next

   on error goto 0
end sub

This code takes the URLs from column E rows 1 to 10 and writes product name in column F and price in column G of the row.

like image 89
Axel Richter Avatar answered Oct 22 '25 08:10

Axel Richter