
XPath to get text of a specific length

Tags:

xpath

I am trying to create an XPath query that always returns 549 characters of text. The text should be about the relevant subject; in the example below that is oranges, apples, or pears. If no elements on the page contain these words, I would like the query to fall back to text that is easier to target (less specific).

To clarify: the query should first look for elements that contain a particular kind of text. If the query below finds 549 or more characters, we are done. If it finds nothing, or fewer than 549 characters, I would like it to take ANY text on the page that is in paragraph form (anything except text from buttons, links, menus, etc.) and return 549 characters of that. If the first result is shorter than 549 characters, I would like to concatenate the two results with the following in the middle: ...

   substring(normalize-space(//*[self::p or self::div][contains(text(),'apples') or contains(text(),'oranges') or contains(text(),'pears')]), 0, 549)

I have been trying to work this out for quite a while and I would appreciate any suggestions!

Many thanks in advance!

Asked by AnchovyLegend on Jul 16 '13 at 22:07


1 Answer

Yes. XPath has a string-length() function that you can use in your predicate. Note that substring() counts positions from 1, so the start argument should be 1 rather than 0:

substring(normalize-space(//*[string-length(text()) > 549 and (... other conditions ...)]), 1, 549)
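
A note on the start argument: XPath's substring() returns the characters whose 1-based position p satisfies round(start) <= p < round(start) + round(length). A start of 0 with a length of 549 therefore returns only 548 characters, and a start of 1 is needed to get all 549. Below is a minimal Python sketch of that rule for finite arguments only (NaN and infinity are not handled). The helper names xpath_round and xpath_substring are made up for illustration and are not part of any library.

```python
import math


def xpath_round(x: float) -> int:
    """Round half up, which is how XPath's round() treats .5 values."""
    return math.floor(x + 0.5)


def xpath_substring(s: str, start: float, length: float) -> str:
    """Emulate XPath substring(s, start, length) for finite arguments."""
    first = xpath_round(start)
    end = first + xpath_round(length)  # exclusive upper bound
    # Character positions are counted from 1, as in XPath.
    return "".join(ch for pos, ch in enumerate(s, start=1) if first <= pos < end)


sample = "x" * 600
print(len(xpath_substring(sample, 0, 549)))  # start 0 loses one character
print(len(xpath_substring(sample, 1, 549)))  # start 1 gives the full length
```

The two print calls show the difference in length between a start of 0 and a start of 1 on a 600-character string.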

See "Is there an 'if-then-else' statement in XPath?" for how to write the conditional that decides whether you need to add the ellipsis.

Adapting an example from the above SO question (the if/then/else syntax requires XPath 2.0):

if (fn:string-length(normalize-space(//*[self::p or self::div][contains(text(),'apples')])) > 549)
        then (concat(fn:substring(normalize-space(//*[self::p or self::div][contains(text(),'apples')]), 1, 549), "..."))
        else (normalize-space(//*[self::p or self::div][contains(text(),'apples')]))

This seems to me to be really complicated in XPath. If you can use XQuery, you'll have a much more readable transform:

for $text in //*[self::p or self::div]/normalize-space(.)
where contains($text, 'apples') or ...
return
    if (string-length($text) > 549) then
        concat(substring($text, 1, 549), "...")
    else
        $text

I suspect this can actually be optimized further (for readability, maintenance) with multiple and nested for statements to deal with the various fruits you need.
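
For comparison, here is a rough Python sketch of the whole flow described in the question (keyword match first, paragraph fallback, ellipsis in the middle), using the standard library's xml.etree.ElementTree. It rests on several assumptions: the page is well-formed XML with no namespace on the p and div tags, the fallback is taken from p elements only, the separator is " ... ", and the whole text of an element is searched rather than just its first text node as contains(text(), ...) does. The names block_texts and excerpt are hypothetical.

```python
import xml.etree.ElementTree as ET

LIMIT = 549
KEYWORDS = ("apples", "oranges", "pears")


def block_texts(root, tags):
    """Yield the whitespace-normalized text of each element whose tag is in `tags`."""
    for el in root.iter():
        if el.tag in tags:
            text = " ".join("".join(el.itertext()).split())
            if text:
                yield text


def excerpt(xml_source, limit=LIMIT):
    root = ET.fromstring(xml_source)
    # First choice: the first <p> or <div> that mentions one of the keywords.
    preferred = next(
        (t for t in block_texts(root, ("p", "div")) if any(k in t for k in KEYWORDS)),
        "",
    )
    if len(preferred) >= limit:
        return preferred[:limit]
    # Fallback: the first other paragraph on the page.
    fallback = next((t for t in block_texts(root, ("p",)) if t != preferred), "")
    if not preferred:
        return fallback[:limit]
    if not fallback:
        return preferred
    # Too short on its own: join both pieces and cut to the limit.
    return (preferred + " ... " + fallback)[:limit]
```

Calling excerpt() on a page whose matching paragraph is short returns that paragraph, the separator, and then as much of the fallback paragraph as fits within the limit.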

If using XSL:

<xsl:template match="//*[self::p or self::div][contains(text(),'apples') or ...]">
    <xsl:variable name="text" select="normalize-space( . )" />
    <xsl:choose>
        <xsl:when test="string-length( $text) &gt; 549">
            <xsl:value-of select="substring( $text, 1, 549)"/>...
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$text"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

If your processor supports XPath 2.0, you can also use the matches() function with a regular expression, to avoid having so many contains() predicates:

//*[self::p or self::div][matches(text(),'(apples|oranges|bananas)')]
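
matches() succeeds when the pattern is found anywhere in the string, so the alternation behaves like a chain of contains() calls joined with "or". The same check in Python, as a small sketch (the name mentions_fruit is made up for illustration):

```python
import re

# One alternation instead of several separate substring tests.
FRUIT = re.compile(r"apples|oranges|bananas")


def mentions_fruit(text: str) -> bool:
    """True if any of the fruit names occurs anywhere in `text`."""
    return FRUIT.search(text) is not None
```

Because the search is unanchored, a word such as "pineapples" also matches; add word boundaries to the pattern if that is not wanted.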

Finally, be aware that using // and * in the XPath is inefficient, because together they make the processor visit every element, and you will see the performance impact on large documents. I have a feeling there's a way to optimize this, but unfortunately I don't have the time to research it.

Answered by PaulProgrammer on Sep 28 '22 at 17:09