Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do i exclude everything but text/html from a heritrix crawl?

On: Heritrix Usecases there is an Use Case for "Only Store Successful HTML Pages"

My Problem: i dont know how to implement it in my cxml File. Especially: Adding the ContentTypeRegExpFilter to the ARCWriterProcessor => set its regexp setting to text/html.*. ... There is no ContentTypeRegExpFilter in the sample cxml Files.

like image 832
dgAlien Avatar asked Feb 26 '23 11:02

dgAlien


2 Answers

Kris's answer is only half the truth (at least with Heritrix 3.1.x that I'm using). A DecideRule return ACCEPT, REJECT or NONE. If a rule returns NONE, it means that this rule has "no opinion" about that (like ACCESS_ABSTAIN in Spring Security). Now ContentTypeMatchesRegexDecideRule (as all other MatchesRegexDecideRule) can be configured to return a decision if a regex matches (configured by the two properties "decision" and "regex"). The setting means that this rule returns an ACCEPT decision if the regex matches, but returns NONE if it does not match. And as we have seen - NONE is not an opinion so that shouldProcessRule will evaluate to ACCEPT because no decisions have been made.

So to only archive responses with text/html* Content-Type, configure a DecideRuleSequence where everything is REJECTed by default and only selected entries will be ACCEPTed.

This looks like this:

 <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
   <property name="shouldProcessRule">
     <bean class="org.archive.modules.deciderules.DecideRuleSequence">
       <property name="rules">
         <list>
           <!-- Begin by REJECTing all... -->
           <bean class="org.archive.modules.deciderules.RejectDecideRule" />
           <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
             <property name="decision" value="ACCEPT" />
             <property name="regex" value="^text/html.*" />
           </bean>
         </list>
       </property>
     </bean>
   </property>
   <!-- other properties... -->
 </bean>

To avoid that images, movies etc. are downloaded at all, configure the "scope" bean with a MatchesListRegexDecideRule that REJECTs urls with well known file extensions like:

<!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
      <property name="decision" value="REJECT"/>
      <property name="listLogicalOr" value="true" />
      <property name="regexList">
       <list>
         <value>.*(?i)(\.(avi|wmv|mpe?g|mp3))$</value>
         <value>.*(?i)(\.(rar|zip|tar|gz))$</value>
         <value>.*(?i)(\.(pdf|doc|xls|odt))$</value>
         <value>.*(?i)(\.(xml))$</value>
         <value>.*(?i)(\.(txt|conf|pdf))$</value>
         <value>.*(?i)(\.(swf))$</value>
         <value>.*(?i)(\.(js|css))$</value>
         <value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
       </list>
      </property>
</bean>
like image 54
James Avatar answered May 01 '23 00:05

James


The use cases you cite are somewhat out of date and refer to Heritrix 1.x (filters have been replaced with decide rules, very different configuration framework). Still the basic concept is the same.

The cxml file is basically a Spring configuration file. You need to configure the property shouldProcessRule on the ARCWriter bean to be the ContentTypeMatchesRegexDecideRule

A possible ARCWriter configuration:

  <bean id="warcWriter" class="org.archive.modules.writer.ARCWriterProcessor">
    <property name="shouldProcessRule">
      <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
        <property name="decision" value="ACCEPT" />
        <property name="regex" value="^text/html.*">
      </bean>
    </property>
    <!-- Other properties that need to be set ... -->
  </bean>

This will cause the Processor to only process those items that match the DecideRule, which in turn only passes those whose content type (mime type) matches the provided regular expression.

Be careful about the 'decision' setting. Are you ruling things in our out? (My example rules things in, anything not matching is ruled out).

As shouldProcessRule is inherited from Processor, this can be applied to any processor.

More information about configuring Heritrix 3 can be found on the Heritrix 3 Wiki (the user guide on crawler.archive.org is about Heritrix 1)

like image 35
Kris Avatar answered May 01 '23 01:05

Kris