
How to use sed (or awk) for link canonicalization, fetching the filename?

Tags: bash, sed, awk

I have a 200-page site and would like to add canonical links to it.

I use my FTP client to download the site into a local directory, and I would like a canonical link tag right under the <head> tag of each page.

So, for page 1, I would like to transform

<head>

into

<head>
<link rel="canonical" href="http://www.site.com/page1.htm" />

and use sed to do this for every page in the local directory (page1.htm, page2.htm... page200.htm). Thank you.

Sergiof4 asked Nov 19 '25 14:11


1 Answer

sed and awk are not designed to process HTML; see "RegEx match open tags except XHTML self-contained tags".
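To illustrate the point, here is a minimal line-based sketch. The function name add_canonical_naive and its two arguments are made up for this example and are not part of any tool. It only inserts the link after a line containing the literal string <head>, so a page using <head lang="en">, <HEAD>, or a tag split across lines is silently left untouched:

```shell
# Hypothetical helper: print the page with a canonical link after "<head>".
#   $1: path to an HTML file
#   $2: base URL without a trailing slash (assumed, e.g. http://www.site.com)
add_canonical_naive() {
    awk -v url="$2/$(basename "$1")" '
        { print }
        # Matches only the exact text "<head>", and only the first time
        /<head>/ && !seen {
            print "<link rel=\"canonical\" href=\"" url "\" />"
            seen = 1
        }
    ' "$1"
}
```

Calling add_canonical_naive page1.htm http://www.site.com prints the modified page to standard output. It does not check whether a canonical link is already present, which is one more reason to prefer a real HTML/XML tool.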

Demo using XSLT, bash and xmlstarlet:

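cd /where/HTML_pages/exists
# The pages in the question are named page1.htm ... page200.htm
for file in *.htm; do
    # The here-document delimiter is unquoted, so the shell expands $file inside it
    xmlstarlet transform --html <(cat <<EOF
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="html" encoding="utf-8"/>
    <!-- identity template: copy everything as-is -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="head">
        <xsl:copy>
            <!-- keep the attributes and children of head -->
            <xsl:apply-templates select="@*|node()"/>
            <!-- add the link only if no canonical link exists yet -->
            <xsl:if test="not(link[@rel='canonical'])">
                <link rel="canonical" href="http://www.site.com/$file"/>
            </xsl:if>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
EOF
    ) "$file" > "/tmp/$file" && mv "/tmp/$file" "$file"
done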

Edit

An even better, pure XSLT solution, still using xmlstarlet; bash is no longer mandatory:

File xsl.xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:output method="html" encoding="utf-8" />
   <!-- declare the "link" parameter passed on the command line -->
   <xsl:param name="link" />
   <!-- we are not building the HTML from scratch,
         so we copy what already exists -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()" />
      </xsl:copy>
   </xsl:template>
   <!-- looking for the "head" tag -->
   <xsl:template match="head">
      <xsl:copy>
         <!-- keep the attributes and children of "head" -->
         <xsl:apply-templates select="@*|node()" />
         <!-- if no canonical "link" tag exists yet... -->
         <xsl:if test="not(link[@rel='canonical'])">
            <!-- we add the new "link" tag... -->
            <link>
               <xsl:attribute name="rel">
                  <!-- with a fixed string attribute... -->
                  <xsl:text>canonical</xsl:text>
               </xsl:attribute>
               <xsl:attribute name="href">
                  <!-- and a dynamic string attribute ("link" parameter) -->
                  <xsl:value-of select="$link" />
               </xsl:attribute>
            </link>
         </xsl:if>
      </xsl:copy>
   </xsl:template>
</xsl:stylesheet>

Shell code:

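cd /where/HTML_pages/exists
# The pages in the question are named page1.htm ... page200.htm
for file in *.htm; do
    # -s passes "link" to the stylesheet as a string parameter
    xmlstarlet transform \
        --html \
        xsl.xslt \
        -s "link=http://www.site.com/$file" "$file" > "/tmp/$file" &&
            mv "/tmp/$file" "$file"
done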

This appends the link element to <head>, using the current file name to build the href.
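To check the result afterwards, something like the following sketch could list the pages that still have no canonical link. The function name list_missing_canonical is made up for this example, and it assumes the attribute is serialized as rel="canonical" with double quotes:

```shell
# Hypothetical helper: print every .htm file in a directory that does not
# contain rel="canonical" (case-insensitive, plain text search).
#   $1: directory holding the downloaded pages
list_missing_canonical() {
    for f in "$1"/*.htm; do
        # Skip the unexpanded pattern when the directory has no .htm files
        [ -e "$f" ] || continue
        grep -qi 'rel="canonical"' "$f" || printf '%s\n' "$f"
    done
}
```

Running list_missing_canonical /where/HTML_pages/exists should print nothing once every page has been processed.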

Gilles Quenot (14 revs, 2 users 81%) answered Nov 22 '25 05:11


