<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>archive-crawler at Yahoo! Groups</title>
    <link>http://tech.groups.yahoo.com/group/archive-crawler/</link>
    <description>archive-crawler</description>

    <item>
      <title>Recrawling In Heritrix3</title>
      <pubDate>Sun, 08 Nov 2009 22:05:18 GMT</pubDate>
      <dc:creator>Matthew Warhaftig</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6140</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6140</guid>
      <description>Hi, in H3 I am trying to set up crawl jobs that use FetchHistoryProcessor/PersistStoreProcessor/PersistLoadProcessor to discard duplicate content. I can get</description>
    </item>
    <item>
      <title>Re: adding new profiles at build time</title>
      <pubDate>Tue, 03 Nov 2009 00:06:07 GMT</pubDate>
      <dc:creator>pbaclace</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6139</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6139</guid>
      <description>After some testing I determined that conf/profiles is created lazily if either a new profile is created or if the default profile is edited in the web UI. To</description>
    </item>
    <item>
      <title>Re: Question about QueueOverbudgetDecideRule</title>
      <pubDate>Mon, 02 Nov 2009 12:40:07 GMT</pubDate>
      <dc:creator>olintocattaneo</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6138</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6138</guid>
      <description>Replying to myself just in case anyone competent missed this. Olinto</description>
    </item>
    <item>
      <title>Re: wrong document &quot;crawl-order&quot; Heritrix</title>
      <pubDate>Sat, 31 Oct 2009 18:12:24 GMT</pubDate>
      <dc:creator>parseram34</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6137</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6137</guid>
      <description>I was rebooting and now the http://127.0.0.1:8080 address shows &quot;Failed to connect&quot;. http://localhost:8080 doesn&#39;t work either. When I start the terminal its</description>
    </item>
    <item>
      <title>Re: [archive-crawler] Heritrix 3.0.0-beta test release now available</title>
      <pubDate>Fri, 30 Oct 2009 16:06:18 GMT</pubDate>
      <dc:creator>Søren Vejrup Carlsen</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6136</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6136</guid>
      <description>Hi Gordon. I can&#39;t find the tool to migrate 1.X configurations to 3.X style configurations. I have downloaded the heritrix-3.0.0-beta-dist.tar.gz from</description>
    </item>
    <item>
      <title>Re: wrong document &quot;crawl-order&quot; Heritrix</title>
      <pubDate>Thu, 29 Oct 2009 19:36:07 GMT</pubDate>
      <dc:creator>Gordon Mohr</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6135</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6135</guid>
      <description>When and where does this error appear? (For example: at the time Heritrix is launched, at the time you try to start a crawl, at the time you edit settings,</description>
    </item>
    <item>
      <title>wrong document &quot;crawl-order&quot; Heritrix</title>
      <pubDate>Thu, 29 Oct 2009 10:02:17 GMT</pubDate>
      <dc:creator>parseram34</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6134</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6134</guid>
      <description>Could you please help me? I get the following error message in Heritrix: Wrong document type &#39;Crawl-order&#39; in</description>
    </item>
    <item>
      <title>Re: analysing progress-statistics.log</title>
      <pubDate>Thu, 29 Oct 2009 00:55:29 GMT</pubDate>
      <dc:creator>Gordon Mohr</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6133</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6133</guid>
      <description>The &#39;queued&#39; URIs are almost certainly on some hosts that are not responding. Heritrix is trying them every 15 minutes, but then putting them back on the queue</description>
    </item>
    <item>
      <title>analysing progress-statistics.log</title>
      <pubDate>Wed, 28 Oct 2009 14:46:44 GMT</pubDate>
      <dc:creator>Pranay Pandey</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6132</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6132</guid>
      <description>Hi, I had set up a crawl job to run for 6 hours using H3-beta. I had it configured to be least polite, and the number of parallel queues was set to 5. After</description>
    </item>
    <item>
      <title>Re: WARC Implementation</title>
      <pubDate>Tue, 27 Oct 2009 16:16:09 GMT</pubDate>
      <dc:creator>steve@...</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6131</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6131</guid>
      <description>hi Roger, the latest versions of Heritrix deliver warc output in format: &quot;WARC File Format 1.0&quot; which conforms to the ISO 28500 specification, an ISO standard</description>
    </item>
    <item>
      <title>WARC Implementation</title>
      <pubDate>Tue, 27 Oct 2009 16:05:58 GMT</pubDate>
      <dc:creator>Coram, Roger</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6130</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6130</guid>
      <description>Hi, Is there any documentation on the Heritrix implementation of WARC beyond just the source code? i.e. elements from the specification in-/excluded, which</description>
    </item>
    <item>
      <title>Re: Eliminating Crawl Garbage</title>
      <pubDate>Tue, 27 Oct 2009 16:05:39 GMT</pubDate>
      <dc:creator>steve@...</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6129</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6129</guid>
      <description>hi Tram, can you be more specific about what you consider &quot;junk&quot;? the default profile includes a TransclusionDecideRule which tells the crawler to transitively</description>
    </item>
    <item>
      <title>Which web crawler pre select links before downloading and convert in</title>
      <pubDate>Tue, 27 Oct 2009 15:44:08 GMT</pubDate>
      <dc:creator>alphonse.smith16</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6128</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6128</guid>
      <description>Hello all, I am new to this group. I am looking for a web crawler that can get a list of links from a webpage and convert a downloaded website into an MHT file. It</description>
    </item>
    <item>
      <title>Eliminating Crawl Garbage</title>
      <pubDate>Fri, 23 Oct 2009 02:04:50 GMT</pubDate>
      <dc:creator>tristram.bethea</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6127</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6127</guid>
      <description>Hello guys, I&#39;ve been using Heritrix to do some crawls with about 10 seeds. I find that I am getting excessively large amounts of trash in the data I collect.</description>
    </item>
    <item>
      <title>Question about QueueOverbudgetDecideRule</title>
      <pubDate>Thu, 22 Oct 2009 20:31:58 GMT</pubDate>
      <dc:creator>olintocattaneo</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6126</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6126</guid>
      <description>Hello, I&#39;m trying to get QueueOverbudgetDecideRule to work, but I don&#39;t seem to be able to. Is this module still functional, or maybe I have added it to a</description>
    </item>

  </channel>
</rss>
