How can I parse a WARC file?

Tags: java, warc

I downloaded the ClueWeb09_English_Sample.warc file from this page, then wrote the contents of the WARC file to a text file using the code given on the following web page. I want to parse the text file to get at the content of the pages in it, but I do not know how to parse it. Is there any way to parse a WARC file without converting it to text?

I want to parse the following text:

WARC/0.18
WARC-Type: warcinfo
WARC-Date: 2009-04-119T12:48:17-0400
WARC-Record-ID: d4360e52-06c3-41c8-bb13-62db3a622ca7
Content-Type: application/warc-fields
Content-Length: 218

software: Nutch 1.0-dev (modified for clueweb09)
isPartOf: clueweb09-
description: clueweb09 crawl with WARC output
format: WARC file version 0.18
conformsTo: http://www.archive.org/documents/WarcFileFormat-0.18.html

WARC/0.18
WARC-Type: response
WARC-Date: 2009-03-67T15:35:34-0700
WARC-Identified-Payload-Type: 
WARC-TREC-ID: clueweb09-en0040-54-00000
WARC-Target-URI: http://www.smartwebby.com/DreamweaverTemplates/templates/business_general_template59.asp
WARC-Warcinfo-ID: d4360e52-06c3-41c8-bb13-62db3a622ca7
WARC-Record-ID: <urn:uuid:721f9a28-6b9a-44c1-bccd-8c7accb514cd>
Content-Type: application/http;msgtype=response
Content-Length: 21064

HTTP/1.1 200 OK
Content-Type: text/html
X-Powered-By: ASP.NET
Server: Microsoft-IIS/6.0
MicrosoftOfficeWebServer: 5.0_Pub
Cache-control: private
Date: Fri, 30 Jan 2009 18:08:20 GMT
Connection: close
Set-Cookie: COOtempname=; path=/
Content-Length: 20807

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd">
<html><!-- InstanceBegin template="/Templates/dreamweaver_template.dwt.asp" codeOutsideHTMLIsLocked="false" --><head><!-- InstanceBeginEditable name="doctitle" -->
<title>Template 59 [Business/General] - Sharp Business Template</title><!-- InstanceEndEditable --><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link rel="stylesheet" type="text/css" href="/styles.css"><link rel="stylesheet" type="text/css" href="/templates.css"><!-- InstanceBeginEditable name="pagemetas" --> 
<META content="Category: Business Templates, General Web Templates. Sharp Simple design with warm colors and neat navigation" name="Description">
<META content="Template 59 Business/General Business Templates General Web Templates Sharp Business Template Sharp Simple design with warm colors neat navigation" name="Keywords"><!-- InstanceEndEditable --></head>
<body><div id="header"><a href="http://www.smartwebby.com/"><img src='/images/new/smartwebby_logo.gif' width='206' height='40' alt='Best Web Design, Global Web Designers in Chennai, India' hspace="20" vspace="8" border="0"></a></div>
<div id="hnav" align="right"><div class="bnav"><form action="http://www.smartwebby.com/site_search.asp" id="cse-search-box" class="gsearch"><div><input type="hidden" name="cx" value="partner-pub-7253144749736841:opeytdpqnvq"><input type="hidden" name="cof" value="FORID:9"><input type="hidden" name="ie" value="ISO-8859-1"><input type="text" name="q" size="15"> <input type="submit" name="sa" value="Search"></div></form></div><a href="/clientarea/default.asp" class="bnav"><img src="/images/client_login.jpg" width="40" height="40" border="0" alt="Client Login" vspace="5"></a><a href="https://www.smartwebby.com/view_cart.asp" class="bnav"><img src="/images/view_cart.jpg" width="40" height="40" border="0" alt="View your Shopping Cart" vspace="5"></a><a href="/DreamweaverTemplates/faqs.asp" class="bnav"><img src="/images/help_faqs.jpg" width="40" height="40" border="0" alt="Help &amp; Frequently Asked Questions" vspace="5"></a><a href="mailto:[email protected]" class="bnav"><img src="/images/email_us.jpg" width="40" height="40" border="0" alt="Email Us" vspace="5"></a></div><div id='htabs'><div id='tabbg'><a href='/' class='tab ml'>Home</a> <a href='/services.asp' class='tab ml'>Services</a> <a href='/portfolio.asp' class='tab ml'>Portfolio</a> <a href='/rates.asp' class='tab ml'>Web Design Pricing</a> <span class='tabOn ml'><a href='/DreamweaverTemplates/'>Dreamweaver Templates</a> </span><a href='/web_applications.asp' class='tab ml'>Web Applications</a> <a href='/resources.asp' class='tab ml'>Free Tutorials</a> <a href='/about_us.asp' class='tab ml'>About</a> <a href='/contactus.asp' class='tab ml'>Contact Us</a> <span class='tab nopad'></span></div></div><div id='tablinks'><a href='/DreamweaverTemplates/about.asp'>About our Dreamweaver Templates</a> | <a href='/DreamweaverTemplates/buy_templates.asp'>How to Buy</a> | <a href='/DreamweaverTemplates/'>Catalog</a> | <a href='/DreamweaverTemplates/terms_of_use.asp'>Terms of Use</a> | <a 
href='/DreamweaverTemplates/customization_guide.asp'>Customization Help</a> | <a href='/DreamweaverTemplates/TemplateCustomizationService.asp'>Customization Service</a> | <a href='/DreamweaverTemplates/FreeDreamweaverTemplates.asp'>Free Dreamweaver Templates</a> | <a href='/DreamweaverTemplates/terms_of_use_free.asp'>Terms of Use (Free)</a></div><div id='hmenu'><div class='mrow'><a href='/DreamweaverTemplates/BeautyFashionTemplates.asp'>Beauty Templates</a> | <a href='/DreamweaverTemplates/BusinessTemplates.asp' class='onpage'>Business Templates</a> | <a href='/DreamweaverTemplates/ChurchChristianTemplates.asp'>Christian Templates</a> | <a href='/DreamweaverTemplates/CSSTemplates.asp'>CSS Templates</a> | <a href='/DreamweaverTemplates/EducationTemplates.asp'>Education Templates</a> | <a href='/DreamweaverTemplates/FamilyPersonalTemplates.asp'>Family Templates</a> | <a href='/DreamweaverTemplates/FlashTemplates.asp'>Flash Templates</a> | <a href='/DreamweaverTemplates/FreeDreamweaverTemplates.asp'>Free Dreamweaver Templates</a></div><div class='mrow'><a href='/DreamweaverTemplates/FoodTemplates.asp'>Food Templates</a> | <a href='/DreamweaverTemplates/GeneralWebTemplates.asp'>General Templates</a> | <a href='/DreamweaverTemplates/GovernmentMilitaryTemplates.asp'>Government Templates</a> | <a href='/DreamweaverTemplates/HealthMedicalTemplates.asp'>Health/Medical Templates</a> | <a href='/DreamweaverTemplates/HiTechTemplates.asp'>Hi-Tech Templates</a> | <a href='/DreamweaverTemplates/KidsChildcareTemplates.asp'>Kids/Childcare Templates</a> | <a href='/DreamweaverTemplates/LowCostTemplates.asp'>Low-cost Budget Templates</a></div><div class='mrow'><a href='/DreamweaverTemplates/PersonalWebTemplates.asp'>Personal Templates</a> | <a href='/DreamweaverTemplates/PetsTemplates.asp'>Pets Templates</a> | <a href='/DreamweaverTemplates/PhotographyTemplates.asp'>Photography Templates</a> | <a href='/DreamweaverTemplates/ProfessionalsTemplates.asp'>Profession Templates</a> | <a 
href='/DreamweaverTemplates/RealEstateTemplates.asp'>Real Estate Templates</a> | <a href='/DreamweaverTemplates/SportsTemplates.asp'>Sports Templates</a> | <a href='/DreamweaverTemplates/TelecomTemplates.asp'>Telecom Templates</a> | <a href='/DreamweaverTemplates/TravelTemplates.asp'>Travel Templates</a></div></div><div id='hlinks'><a href='/services.asp'>Professional Web Design Services</a> <a href='/design_packages.asp'>Web Design Packages</a> <a href='/professional_logo_designing.asp#packages'>Logo Design Packages</a> <a href='/DreamweaverTemplates/'>Dreamweaver Web Templates</a> <a href='/web_site_design/default.asp'>Web Design Guide</a> <a href='/web_site_design/web_design_tools.asp'>Best Web Design Software</a></div>
<div id="content" class="text"><!-- InstanceBeginEditable name="tempinfo" -->
<h1>Template 59 - Business/General</h1><!-- InstanceEndEditable -->
<div class="picboxL" align="center"><div class="red">Template HTML View Screenshot - 1024px screen width</div><img src="/images/spacer.gif" alt="" width="1" height="15" hspace="0" vspace="0"><br><!-- InstanceBeginEditable name="1024view" --><img src="/images/dreamweaver_templates/HTML_1024_view/temp59_business_general.gif" width="400" height="290" alt="Template 59 [Business/General] - 1024px screen width view"><!-- InstanceEndEditable --><br><a href="javascript:;" onClick="OpenTemplate('temp59_business_general')"><img src="/images/800res.gif" width="184" height="27" border="0" alt="View for 800x600 Resolution"></a> &nbsp; <a href="javascript:;" onClick="OpenTemplate2('temp59_business_general')" class="link"><img src="/images/1024res.gif" width="184" height="27" border="0" alt="View for 1024x768 Resolution"></a></div>
<div class="whiteboxR"><div align="center"><strong> &nbsp; &nbsp; &nbsp; Preview sample web page : </strong> <a href="javascript:;" onClick="OpenTemplate('temp59_business_general')" class="link">For 800x600 Resolution</a> | <a href="javascript:;" onClick="OpenTemplate2('temp59_business_general')" class="link">For 1024x768 Resolution</a></div><div class="curveboxT"><div class="curveboxL"><div class="curveboxR"><div class="curveboxTR"><img src="/images/home/box_t.jpg" alt="" width="51" height="56" align="bottom"></div><div class="curveboxC"><!-- InstanceBeginEditable name="features" --><span class="red">Key Features of this  Sharp Business Template
          <div id="certs" align="center"><img src='/images/icons/w3c_html_valid.gif' alt='W3C Certified: Valid HTML 4.01 Transitional' width='52' height='26'> <img src='/images/icons/top_browsers_tested.gif' alt='Cross Browser Compatible: Tested in IE 5+, Firefox 1+, Opera 7+, Netscape 6+, Safari 3' width='52' height='26'> <img src='/images/icons/drop_down_menus.gif' alt='Javascript Drop-Down Menus' width='52' height='26'> <img src='/images/icons/stretch_layout.gif' alt='Stretch Layout to fit all screen resolutions' width='52' height='26'> <img src='/images/icons/text_links_nav.gif' alt='Text Links Navigation' width='52' height='26'> </div></span>
        <ul>
        <li>Sharp  Simple design with warm colors and neat 
          navigation</li>
        <li>Easy-to-edit Drop-down Menus &amp; Text Links    </li>
         <li>All <b>16</b> linked HTML pages included</li>
          <li>Cross Browser Compatible : <span class='bluelk'>Tested for Internet Explorer 5+, Netscape 6+, Opera 7+, Firefox 1.0+, Safari 3</span></li>
        <li>Designed to stretch and fit all resolutions (800 x 600 and higher screen resolutions)</li>
        </ul>
    <!-- InstanceEndEditable -->
          Buy Now for Only <strong>$39.95</strong>! &nbsp; <a href="/addtocart.asp?pid=198"><img src="/images/add2cart.gif" alt="Add to Cart" width="148" height="38" border="0" align="middle"></a><br><a href="#software" class="link">Software Required</a> <a href="javascript:;" onClick="OpenHelp('software_req')" class="super">[?]</a> &nbsp; <a href="#zip" class="link">Source Files Included</a> <a href="javascript:;" onClick="OpenHelp('source_files')" class="super">[?]</a> &nbsp; <a href='/DreamweaverTemplates/BusinessTemplates.asp' class='link'>More Business Templates</a></div><div class="curveboxB"><div class="curveboxBR"><img src="/images/home/box_l.jpg" width="51" height="30" alt=""></div></div></div></div></div></div>
<div class="bluebox"><div class="bluesub">Why buy our <a href="/DreamweaverTemplates/" target="_blank" class="bluesub">High-Quality Professional Dreamweaver Templates</a>?</div>
<ul class="bullet"><li>Save time &amp; money! Choose from a variety of website designs to find the perfect ready-to-use Adobe Dreamweaver &amp; Fireworks Template for your site. </li><li>Our Dreamweaver Templates are Cross Browser Compatible, Optimized for low load-time and W3C Standard Compliant (valid CSS & HTML code).</li><li>Each dreamweaver template download comes with an easy-to follow Customization Guide that will help to get your web site up within a couple of days!</li>  <li> Fully automated purchase process - Buy and download your Dreamweaver Template instantly on your credit card purchase approval!<script type="text/javascript" language="JavaScript" src="/DreamweaverTemplates/scripts.js"></script></li></ul></div><p align="center" class="red">Template HTML View - Actual Size Screenshot for 800px screen width</p><div align="center"><!-- InstanceBeginEditable name="800view" --><img src="/images/dreamweaver_templates/HTML_view/temp59_business_general.gif" width="790" height="579" alt="Dreamweaver Template 59 [Business/General] - Actual Size Screenshot for 800px screen width"><!-- InstanceEndEditable --></div><div align="center"><br><strong>Template 59 [Business/General] HTML sample web page Screenshot</strong></div><span class="red">Please Note:</span> The above image has been optimized for lower GIF file size, hence some parts of it may look blurred or distorted. View the <a href="javascript:;" onClick="OpenTemplate('temp59_business_general')">template sample web page</a> to look at the actual optimized template graphics without any distortion. <!-- InstanceBeginEditable name="pagedesc" -->In the sample page, mouseover the top horizontal text links to view the drop-down menus effect.<!-- InstanceEndEditable --><br><br><div class="bluebox"><!-- InstanceBeginEditable name="software_zip" --><a name="software"></a><span class="bluesub">Software Required for the customization of Template 59 [Business/General]:</span>
    <ul class="bullet">
      <li>Adobe Dreamweaver (MX 2004 or above)</li>
      <li>Adobe Fireworks (MX 2004 or above)<a name="zip"></a></li>
    </ul>
<span class="bluesub">The Zip Download for Template 59 [Business/General] includes the following files:</span>
    <ul class="bullet">
      <li> The web site design layout source (Fireworks .PNG file)</li>
      <li>The Dreamweaver web template (Dreamweaver .DWT file)</li>
      <li>All <strong>16</strong> HTML pages shown as links in the template <a href="javascript:;" onClick="OpenTemplate('temp59_business_general')">sample web page</a>  (Dreamweaver .HTM files)</li>
      <li>The external cascading style sheet (.CSS file)</li>
      <li>JavaScript files for DHTML effects like drop-down menus, slideshows, etc. (.JS files) </li>
      <li>All graphics &quot;as is&quot; viewable in the template <a href="javascript:;" onClick="OpenTemplate('temp59_business_general')">sample web page</a> (optimized .GIF &amp; .JPG files) </li>
      <li>Our easy-to-follow Customization Guide and the End User License Agreement (EULA)</li>
      <li>A font folder that includes the fonts used in the template Fireworks layout</li>
      </ul>
<!-- InstanceEndEditable --></div>
<div class="picboxDF" align="center"><p class="red">Template Adobe Dreamweaver View - index.htm page</p><!-- InstanceBeginEditable name="DWview" --><img src="/images/dreamweaver_templates/dreamweaver_view/temp59_business_general.gif" width="400" height="290" alt="Template 59 [Business/General] - Adobe Dreamweaver View"><!-- InstanceEndEditable --><p>Template 59 [Business/General] Dreamweaver Screenshot </p></div>
<div class="picboxDF" align="center"><p class="red">Template Adobe Fireworks View - template.png source file </p><!-- InstanceBeginEditable name="FWview" --><img src="/images/dreamweaver_templates/fireworks_view/temp59_business_general.gif" width="400" height="290" alt="Template 59 [Business/General] - Adobe Fireworks View"><!-- InstanceEndEditable --><p>Template 59 [Business/General] Fireworks Screenshot </p></div><h4 align="center" class="clear100">SmartWebby.com Dreamweaver Templates - Categories</h4>
<div align="center" class="grey"><a href="/DreamweaverTemplates/BeautyFashionTemplates.asp" class="grey">Beauty Templates</a> | <a href="/DreamweaverTemplates/BusinessTemplates.asp" class="grey">Business Templates</a> [Pg: <a href="/DreamweaverTemplates/BusinessTemplates.asp" class="grey">1</a>, <a href="/DreamweaverTemplates/DreamweaverBusinessTemplates.asp" class="grey">2</a>, <a href="/DreamweaverTemplates/business_templates.asp" class="grey">3</a>, <a href="/DreamweaverTemplates/dreamweaver_business_templates.asp" class="grey">4</a>] | <a href="/DreamweaverTemplates/ChurchChristianTemplates.asp" class="grey">Christian Templates</a> | <a href="/DreamweaverTemplates/CSSTemplates.asp" class="grey">CSS Templates (tableless)</a> [Pg: <a href="/DreamweaverTemplates/CSSTemplates.asp" class="grey">1</a>, <a href="/DreamweaverTemplates/DreamweaverCSSTemplates.asp" class="grey">2</a>] | <a href="/DreamweaverTemplates/EducationTemplates.asp" class="grey">Education Templates</a>
<br>
<a href="/DreamweaverTemplates/FamilyPersonalTemplates.asp" class="grey">Family/Personal Templates</a> [Pg: <a href="/DreamweaverTemplates/FamilyPersonalTemplates.asp" class="grey">1</a>, <a href="/DreamweaverTemplates/FamilyTemplates.asp" class="grey">2</a>] | <a href="/DreamweaverTemplates/FlashTemplates.asp" class="grey">Flash Templates</a> | <a href="/DreamweaverTemplates/FoodTemplates.asp" class="grey">Food Templates</a> | <a href="/DreamweaverTemplates/FreeDreamweaverTemplates.asp" class="grey">Free Dreamweaver Templates</a> | <a href="/DreamweaverTemplates/GeneralWebTemplates.asp" class="grey">General Templates</a> | <a href="/DreamweaverTemplates/GovernmentMilitaryTemplates.asp" class="grey">Government Templates</a>
<br>
<a href="/DreamweaverTemplates/HealthMedicalTemplates.asp" class="grey">Health/Medical Templates</a> | <a href="/DreamweaverTemplates/HiTechTemplates.asp" class="grey">Hi-Tech Templates
</a> | <a href="/DreamweaverTemplates/KidsChildcareTemplates.asp" class="grey">Kids Templates</a> | <a href="/DreamweaverTemplates/LowCostTemplates.asp" class="grey">Low-cost/Budget Templates</a> [Pg: <a href="/DreamweaverTemplates/LowCostTemplates.asp" class="grey">1</a>, <a href="/DreamweaverTemplates/LowCostBudgetTemplates.asp" class="grey">2</a>] | <a href="/DreamweaverTemplates/PersonalWebTemplates.asp" class="grey">Personal Web Templates</a> | <a href="/DreamweaverTemplates/PetsTemplates.asp" class="grey">Pets/Animals Templates</a>
<br>
<a href="/DreamweaverTemplates/PhotographyTemplates.asp" class="grey">Photography Templates</a> | <a href="/DreamweaverTemplates/ProfessionalsTemplates.asp" class="grey">Professionals Templates</a> | <a href="/DreamweaverTemplates/RealEstateTemplates.asp" class="grey">Real Estate Templates</a> | <a href="/DreamweaverTemplates/SportsTemplates.asp" class="grey">Sports Templates</a> | <a href="/DreamweaverTemplates/TelecomTemplates.asp" class="grey">Telecom Templates</a> | <a href="/DreamweaverTemplates/TravelTemplates.asp" class="grey">Travel Templates</a></div>
</div><div id="clearfloats"></div><div id="fmenu">

<div class='mrow'><strong><a href='/services.asp'>Services</a></strong> &gt; <a href='/web_services/design.asp'>CSS Web Design</a> | <a href='/professional_logo_designing.asp'>Professional Logo Design</a> | <a href='/web_services/web_programming.asp'>ASP.net, ASP &amp; PHP Programming</a> | <a href='/web_services/flash_animation_programming.asp'>Flash Animation &amp; Programming</a> | <a href='/affordable_web_hosting_plans.asp'>Reliable Web Hosting</a> | <a href='/website_maintenance_packages.asp'>Website Maintenance</a></div>
<div class='mrow'><strong><a href='/portfolio.asp'>Portfolio</a></strong> &gt; <a href='/design_portfolio.asp'>Web Design Portfolio</a> | <a href='/programming_portfolio.asp'>Web Programming Portfolio</a> | <a href='/logo_design_portfolio.asp'>Print &amp; Logo Design Portfolio</a> | <a href='/flash_portfolio.asp'>Flash Animation Portfolio</a> | <a href='/outsource_portfolio.asp'>Outsource Clients Portfolio</a> | <a href='/client_quotes.asp'>Client Testimonials</a></div>
<div class='mrow'><strong><a href='/rates.asp'>Web Design Pricing</a></strong> &gt; <a href='/rates.asp'>Design Rates</a> | <a href='/design_packages.asp'>Custom Web Design Pricing</a> | <a href='/professional_logo_designing.asp#packages'>Logo Design Pricing</a> | <a href='/professional_logo_designing.asp#bcl'>Business Card &amp; Letterhead Pricing</a> | <a href='/affordable_web_hosting_plans.asp#windows'>Web Hosting Plans</a> | <a href='/website_maintenance_packages.asp#packages'>Website Maintenance Plans</a></div>
<div class='mrow'><strong><a href='/web_applications.asp'>Web Applications</a></strong> &gt; <a href='/web_products/flash_survey/default.asp'>Smart Survey</a> | <a href='/web_products/flash_poll/multi_poll.asp'>Smart Multi Poll</a> | <a href='/web_products/flash_poll/default.asp'>Smart Poll</a> | <a href='/web_products/flash_guestbook/default.asp'>Smart Guest Book (ASP</a>/<a href='/web_products/PHP/flash_guestbook/default.asp'>PHP</a>) | <a href='/web_products/instant_quote/default.asp'>Smart Quote</a> | <a href='/web_site_design/free_web_tools.asp'>Free Web Applications</a> | <a href='/custom_flash_web_applications.asp'>Custom Flash Applications</a></div>

<div class='mrow'><strong><a href='/resources.asp'>Free Tutorials</a></strong> &gt; <a href='/web_site_design/default.asp'>Web Design Tutorials</a> | <a href='/Flash/default.asp'>Flash Tutorials</a> | <a href='/web_site_design/dreamweaver_template.asp'>Dreamweaver Tutorials</a> | <a href='/web_site_design/dreamweaver_template.asp'>Fireworks Tutorials</a> | <a href='/website_promotion/default.asp'>SEO &amp; Promotion Tutorials</a> | <a href='/DHTML/default.asp'>Javascript Tutorials</a> | <a href='/PHP/default.asp'>PHP MySQL Tutorials</a></div>
</div>
<div id="footer"><div id="footerT" align="center" class="bluedk">Copyright &copy; 2001-2008 Jandus Technologies - <a href="http://www.smartwebby.com/">www.smartwebby.com</a> - All Rights Reserved. &nbsp; &nbsp; <a href="/privacy_policy.asp">Privacy Policy</a> | <a href="/site_map.asp">Site Map</a> | <a href="#">Page Top</a>  &nbsp; &nbsp; &nbsp; <div id="footerR"><img src="/images/new/jandus_technologies_logo.gif" width="127" height="48" alt="Jandus Technologies logo"></div></div>
<div id="footerB"> <img src="/images/new/w3c_css.gif" alt="Valid CSS!" width="53" height="22" align="middle" border="0"> <img src="/images/new/w3c_html.gif" alt="Valid HTML 4.01 Transitional" width="53" height="22" align="middle" border="0"></div></div><script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script><script type="text/javascript">
var pageTracker = _gat._getTracker("UA-536043-1");
pageTracker._trackPageview();
</script></body><!-- InstanceEnd --></html>
Asked by user3487667, Nov 26 '14 15:11


2 Answers

The code I was looking for is in this class and project.

Answered by user3487667, Oct 06 '22 14:10


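If you are comfortable with Python, try the following code. It is short and easy to run and, most importantly, it lets you parse the content of each record. Note that it is written for Python 2 (it uses print statements) and relies on the third-party boto, gzipstream and warc packages.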

import boto
from boto.s3.key import Key
from gzipstream import GzipStreamFile
import warc

if __name__ == '__main__':
  # Let's use a random gzipped web archive (WARC) file from the 2014-15 Common Crawl dataset
  ## Connect to Amazon S3 using anonymous credentials
  conn = boto.connect_s3(anon=True)
  pds = conn.get_bucket('aws-publicdatasets')
  ## Start a connection to one of the WARC files
  k = Key(pds)
  k.key = 'common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00000-ip-10-147-4-33.ec2.internal.warc.gz'

  # The warc library accepts file like objects, so let's use GzipStreamFile
  f = warc.WARCFile(fileobj=GzipStreamFile(k))
  for num, record in enumerate(f):
    if record['WARC-Type'] == 'response':
      # Print the target URL and the content length, then the full record payload
      print record['WARC-Target-URI'], record['Content-Length']
      print record.payload.read()
      print '=-=-' * 10
    if num > 100:
      break

Refer to https://github.com/commoncrawl/gzipstream/blob/master/examples/streaming_commoncrawl_from_s3.py for the original code.
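
If the file is already on disk, as with the ClueWeb09 sample in the question, the record layout shown there (a version line, named headers, a blank line, then Content-Length bytes of payload) can also be read with the standard library alone. The following is a minimal Python 3 sketch, not a full WARC implementation: `iter_warc_records` is a hypothetical helper name, the sample bytes are made up for illustration, and the code assumes every record carries an accurate Content-Length header.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream.

    Assumption: each record has a correct Content-Length header.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {"version": line.strip().decode("ascii")}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line (or end of stream) ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly Content-Length bytes; the payload stays as raw bytes.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Made-up two-record sample in the same shape as the dump in the question.
sample = (
    b"WARC/0.18\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
    b"WARC/0.18\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 4\r\n"
    b"\r\n"
    b"body"
    b"\r\n\r\n"
)
records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], len(payload))
```

For a real file, pass a binary file object instead of the `BytesIO` sample, for example `open(path, "rb")`, or `gzip.open(path, "rb")` for a `.warc.gz` file. Reading in binary mode matters because Content-Length counts bytes, not characters.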

For more information on the WARC format, see http://bibnum.bnf.fr/WARC/.
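
For a `response` record like the one in the question, the payload is itself a raw HTTP response: a status line, HTTP headers, a blank line, and then the page. Getting to the page content therefore takes one more split. The sketch below shows one way to do it in Python 3; `split_http_response` is a hypothetical helper, the sample payload is a shortened, made-up version of the record in the question, and the code does not handle chunked or compressed bodies.

```python
import re


def split_http_response(payload):
    """Split a raw HTTP response (bytes) into (status_line, headers, body)."""
    # Headers end at the first blank line; accept CRLF or bare LF line endings.
    parts = re.split(rb"\r?\n\r?\n", payload, maxsplit=1)
    head = parts[0]
    body = parts[1] if len(parts) > 1 else b""
    lines = head.decode("iso-8859-1").splitlines()
    status_line = lines[0] if lines else ""
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # lower-case names for lookup
    return status_line, headers, body


# Made-up payload modelled on the response record in the question.
payload = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Server: Microsoft-IIS/6.0\r\n"
    b"\r\n"
    b"<html><title>Template 59</title></html>"
)
status, headers, body = split_http_response(payload)
print(status)
print(headers["content-type"])
print(body[:6])
```

The body is returned as bytes on purpose: the character set is declared in the HTTP headers or in the HTML itself (iso-8859-1 in the page from the question), so decoding is best left to an HTML parser or done once the charset is known.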

Answered by Derek Chia, Oct 06 '22 12:10