
XPATH getting all tags without <script> and </script> tags

Tags:

html

tags

xpath

I am having trouble getting all the HTML tags except <script> or <script ... /> using XPath.

For example, in this part of the HTML code, I want to remove:

<script type="text/javascript" src="http://www.google.com/coop/cse/brand?form=cse-search-box&amp;lang=fr"/>

from this code:

<li><!-- Search Google -->
<center>
                     <form action="http://www.google.fr/cse" id="cse-search-box" target="_blank">
                        <div>
                           <input type="hidden" name="cx" value="partner-pub-0959382714089534:mw3ssl65jk1"/>
                           <input type="hidden" name="ie" value="ISO-8859-1"/>
                           <input type="text" name="q" size="31"/>
                           <input type="submit" name="sa" value="Rechercher"/>
                        </div>
                     </form>
                     <script type="text/javascript"
                             src="http://www.google.com/coop/cse/brand?form=cse-search-box&amp;lang=fr"/>
                  </center>
                  <!-- Search Google --></li>

I'm generating an XML file using Web-Harvest, and then I have to remove some specific tags. I have tried a lot of XPath expressions (I'm working in the body of the HTML):

  • //body//*[not(name() = 'script')]

  • //body//*[not(self::script)]

  • //body//*[not(starts-with(name(),'script'))]

  • //body//*[not(contains(name(),'script'))]

but none of them work.

Note that //body//*[name() = 'script'] works, but I want the opposite.

Do you have any ideas?

Or, more generally, if you know how to remove all the <script> and <script/> tags using XPath, I'm also interested :-)

Thanks in advance.

jbed asked Apr 20 '11 09:04



2 Answers

Well, first of all, XPath selects nodes in an existing document; it does not remove them. The path //body//* that you start with selects all child and descendant elements of the body element. Even if you add a predicate, as in //body//*[not(self::script)], that path still selects elements such as li and center, which are not script elements themselves but which contain a script element. So //body//*[not(self::script)] is the right approach for selecting only non-script elements, but it does not help if you want, for instance, the original center element with its script element removed. That is not something pure XPath can do for you; you would need to move to XSLT to transform the document and remove any script elements that way.
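To make the point concrete, here is a minimal sketch in Python using only the standard library. It is not Web-Harvest or a real XPath engine: the list comprehension is an assumed stand-in for //body//*[not(self::script)], and the document is a trimmed copy of the fragment from the question wrapped in a body element.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the question's fragment, wrapped in a body element.
DOC = """<body><li><center>
  <form action="http://www.google.fr/cse" id="cse-search-box"><div/></form>
  <script type="text/javascript"
          src="http://www.google.com/coop/cse/brand?form=cse-search-box&amp;lang=fr"/>
</center></li></body>"""

body = ET.fromstring(DOC)

# Rough equivalent of //body//*[not(self::script)]: every descendant
# element of body whose own name is not "script".
selected = [el for el in body.iter() if el is not body and el.tag != "script"]
selected_names = [el.tag for el in selected]

# None of the selected nodes is a script element itself ...
print(selected_names)  # ['li', 'center', 'form', 'div']

# ... but the selected center element is the original, unmodified node,
# so serializing it still shows its script child.
center = next(el for el in selected if el.tag == "center")
center_markup = ET.tostring(center, encoding="unicode")
print("<script" in center_markup)  # True
```

The selection itself is correct; what is missing is a step that builds a new tree without the script nodes.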

Martin Honnen answered Sep 23 '22 04:09


XPath is just a query language for XML documents, and as such it cannot alter in any way the XML document(s) being queried.

The most convenient way to produce a new XML document that is different from the initial XML document is by using XSLT.

This short and simple XSLT transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="script"/>
</xsl:stylesheet>

when applied to the provided XML document:

<li>
    <!-- Search Google -->
    <center>
        <form action="http://www.google.fr/cse"
              id="cse-search-box" target="_blank">
            <div>
                <input type="hidden" name="cx"
                value="partner-pub-0959382714089534:mw3ssl65jk1"/>
                <input type="hidden" name="ie" value="ISO-8859-1"/>
                <input type="text" name="q" size="31"/>
                <input type="submit" name="sa" value="Rechercher"/>
            </div>
        </form>
        <script type="text/javascript"
        src="http://www.google.com/coop/cse/brand?form=cse-search-box&amp;lang=fr"/>
    </center>
    <!-- Search Google -->
</li>

produces the wanted, correct result:

<li><!-- Search Google -->
   <center>
      <form action="http://www.google.fr/cse" id="cse-search-box" target="_blank">
         <div>
            <input type="hidden" name="cx" value="partner-pub-0959382714089534:mw3ssl65jk1"/>
            <input type="hidden" name="ie" value="ISO-8859-1"/>
            <input type="text" name="q" size="31"/>
            <input type="submit" name="sa" value="Rechercher"/>
         </div>
      </form>
   </center><!-- Search Google -->
</li>
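If an XSLT processor is not at hand, the same removal can be sketched with a small tree walk. This is a different technique from the stylesheet above, shown in Python with the standard library only; the helper name strip_elements and the shortened sample document are assumptions made for illustration. Note that the default ElementTree parser drops comments and that remove() also discards any text directly following the removed element, so this is not a byte-for-byte replacement for the identity transform.

```python
import xml.etree.ElementTree as ET

def strip_elements(root, tag):
    """Remove every descendant element named `tag` from the tree, in place."""
    # ElementTree nodes have no parent pointer, so visit each potential
    # parent and drop the matching children from it. Both loops iterate
    # over snapshots so the tree can be modified safely while walking.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return root

# Shortened copy of the fragment from the question (hypothetical sample).
SAMPLE = """<li><center>
  <form action="http://www.google.fr/cse" id="cse-search-box" target="_blank">
    <div><input type="text" name="q" size="31"/></div>
  </form>
  <script type="text/javascript"
          src="http://www.google.com/coop/cse/brand?form=cse-search-box&amp;lang=fr"/>
</center></li>"""

cleaned = strip_elements(ET.fromstring(SAMPLE), "script")
result = ET.tostring(cleaned, encoding="unicode")
print(result)
```

The printed markup keeps li, center, form, div and input, and no longer contains the script element.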

Dimitre Novatchev answered Sep 23 '22 04:09