<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>archive-crawler at Yahoo! Groups</title>
    <link>http://tech.groups.yahoo.com/group/archive-crawler/</link>
    <description>archive-crawler</description>

    <item>
      <title>crawler-commons project</title>
      <pubDate>Fri, 20 Nov 2009 16:48:19 GMT</pubDate>
      <dc:creator>stack</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6161</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6161</guid>
      <description>Hey crawlers: I was at ApacheCon in Oakland in early November and was present during a meeting of a few of the open source crawler projects (Ken Krugle for Bixo, </description>
    </item>
    <item>
      <title>SV: [archive-crawler] Re: Avoiding overloading webservers hosting ma</title>
      <pubDate>Thu, 19 Nov 2009 16:44:59 GMT</pubDate>
      <dc:creator>Bjarne Andersen</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6160</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6160</guid>
      <description>Our main challenge is that we need the queues separated on TLD (foo.com and bar.com) to use the quota-enforcer to limit the number of bytes on each TLD but at the</description>
    </item>
    <item>
      <title>Re: Avoiding overloading webservers hosting many virtual servers</title>
      <pubDate>Thu, 19 Nov 2009 16:08:34 GMT</pubDate>
      <dc:creator>kristsi25</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6159</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6159</guid>
      <description>John pretty much sums up the problem. The way I&#39;ve dealt with this has been on a case by case basis. Each time we detect this situation, we override the</description>
    </item>
    <item>
      <title>Re: Avoiding overloading webservers hosting many virtual servers</title>
      <pubDate>Thu, 19 Nov 2009 15:41:02 GMT</pubDate>
      <dc:creator>John Lekashman</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6158</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6158</guid>
      <description>Hi, It is not actually possible to guarantee this. There is no real way to distinguish for sure where the actual physical hardware that hosts a name is. There</description>
    </item>
    <item>
      <title>Avoiding overloading webservers hosting many virtual servers</title>
      <pubDate>Thu, 19 Nov 2009 15:03:04 GMT</pubDate>
      <dc:creator>Søren Vejrup Carlsen</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6157</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6157</guid>
      <description>Hi all. Is it possible in Heritrix 1.14.3 to avoid overloading webservers hosting many virtual servers? We currently have the problem that those webservers</description>
    </item>
    <item>
      <title>Re: wrong document &quot;crawl-order&quot; Heritrix</title>
      <pubDate>Thu, 19 Nov 2009 08:01:02 GMT</pubDate>
      <dc:creator>parseram34</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6156</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6156</guid>
      <description>I reinstalled Heritrix and got the same message again after configuring my first job: Wrong document type &#39;crawl-order&#39; in</description>
    </item>
    <item>
      <title>Re: Contributing code to Heritrix?</title>
      <pubDate>Wed, 18 Nov 2009 15:57:02 GMT</pubDate>
      <dc:creator>Tomas Ukkonen</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6155</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6155</guid>
      <description>Hi Kris, Thank you for your reply. I will use JIRA as you suggested and attach patches against the latest 1.14.x revision from the Heritrix repository. Regards, --</description>
    </item>
    <item>
      <title>Internet Archive needs a world wide crawl tech lead</title>
      <pubDate>Tue, 17 Nov 2009 23:15:22 GMT</pubDate>
      <dc:creator>Alexis</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6154</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6154</guid>
      <description>Hi, IA is hiring a tech lead for our new web crawling initiative.  You can read the job description here: http://www.archive.org/about/webjobs.php#wwcengineer </description>
    </item>
    <item>
      <title>IDN support of heritrix</title>
      <pubDate>Mon, 16 Nov 2009 11:27:36 GMT</pubDate>
      <dc:creator>takeru sasaki</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6153</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6153</guid>
      <description>Hi, I want to know about IDN support in Heritrix. (&quot;Internationalized domain name&quot; http://en.wikipedia.org/wiki/Internationalized_Domain_Name) I tried to</description>
    </item>
    <item>
      <title>Re: Contributing code to Heritrix?</title>
      <pubDate>Mon, 16 Nov 2009 10:38:47 GMT</pubDate>
      <dc:creator>kristsi25</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6152</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6152</guid>
      <description>I&#39;d suggest opening up an issue (probably multiple issues in your case) in the crawler issue base (https://webarchive.jira.com/browse/HER) and to then attach</description>
    </item>
    <item>
      <title>Re: Recrawling In Heritrix3</title>
      <pubDate>Sat, 14 Nov 2009 02:07:16 GMT</pubDate>
      <dc:creator>Matthew Warhaftig</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6151</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6151</guid>
      <description>Good advice, thank you Gordon.  Adding the recrawl processors to the chain bean and pointing PersistLoadProcessor directly to my existing history (no preload)</description>
    </item>
    <item>
      <title>Contributing code to Heritrix?</title>
      <pubDate>Fri, 13 Nov 2009 11:38:40 GMT</pubDate>
      <dc:creator>Tomas Ukkonen</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6150</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6150</guid>
      <description>Hi, at the National Library of Finland we have made some improvements to Heritrix: - document classification (content-based) deciderules support (e.g.</description>
    </item>
    <item>
      <title>Re: heritrix as a spider library?</title>
      <pubDate>Fri, 13 Nov 2009 10:21:07 GMT</pubDate>
      <dc:creator>raffaele messuti</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6149</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6149</guid>
      <description>... not in Java, but Ruby:  http://anemone.rubyforge.org/ just use it from the shell: $ anemone url-list http://crawler.archive.org &gt; seeds.txt -- </description>
    </item>
    <item>
      <title>heritrix as a spider library?</title>
      <pubDate>Thu, 12 Nov 2009 16:13:35 GMT</pubDate>
      <dc:creator>pierce403</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6148</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6148</guid>
      <description>I am looking for a simple way to spider web pages from within an app I am working on.  I know heritrix is not intended to be used as a library, but would using</description>
    </item>
    <item>
      <title>Re: heritrix2 bad html parsing?</title>
      <pubDate>Tue, 10 Nov 2009 22:54:08 GMT</pubDate>
      <dc:creator>Gordon Mohr</dc:creator>
      <link>http://tech.groups.yahoo.com/group/archive-crawler/message/6147</link>
      <guid isPermaLink="true">http://tech.groups.yahoo.com/group/archive-crawler/message/6147</guid>
      <description>Heritrix cannot execute Javascript, so its link-extraction with respect to Javascript uses a crude heuristic of trying strings that might be relative URIs</description>
    </item>

  </channel>
</rss>
