<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>bixo-dev at Yahoo! Groups</title>
    <link>http://groups.yahoo.com/group/bixo-dev/</link>
    <description>Bixo Web Mining Toolkit</description>

    <item>
      <title>Re: Advise for fetcher policy</title>
      <pubDate>Wed, 22 May 2013 17:10:50 GMT</pubDate>
      <dc:creator>Ken Krugler</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1321</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1321</guid>
      <description>... A 404 should result in UrlStatus.HTTP_NOT_FOUND What you do with those entries in the crawlDB is up to your processing code. In the DemoCrawlTool, the</description>
    </item>
    <item>
      <title>Re: Advise for fetcher policy</title>
      <pubDate>Wed, 22 May 2013 16:55:51 GMT</pubDate>
      <dc:creator>Pat Ferrel</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1320</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1320</guid>
      <description>The miner is getting urls to 404s. So Pinterest is allowing removal of pages but leaving the links to those pages around. If leaving them in the crawldb marked</description>
    </item>
    <item>
      <title>Re: Advise for fetcher policy</title>
      <pubDate>Wed, 22 May 2013 16:35:48 GMT</pubDate>
      <dc:creator>Ken Krugler</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1319</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1319</guid>
      <description>... One other thought - we do &quot;batch&quot; fetching of URLs, using keep-alive to optimize the connection that we create with the server. Pinterest might not like</description>
    </item>
    <item>
      <title>Re: Advise for fetcher policy</title>
      <pubDate>Wed, 22 May 2013 15:06:28 GMT</pubDate>
      <dc:creator>Pat Ferrel</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1318</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1318</guid>
      <description>Again, thanks. You are the only one I know with crawler-fu skills. I&#39;ll take a look at the headers in the simple fetcher. I suspect that Pinterest is not</description>
    </item>
    <item>
      <title>Re: Advise for fetcher policy</title>
      <pubDate>Wed, 22 May 2013 13:53:03 GMT</pubDate>
      <dc:creator>Ken Krugler</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1317</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1317</guid>
      <description>Hi Pat, ... In bixo there&#39;s a FetchAndParseTool that you can use to fetch individual URLs. I&#39;d try that, as another way to test. Some random ideas? </description>
    </item>
    <item>
      <title>Re: Advise for fetcher policy</title>
      <pubDate>Tue, 21 May 2013 23:58:15 GMT</pubDate>
      <dc:creator>Pat Ferrel</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1316</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1316</guid>
      <description>Thanks. With the below settings I also changed the retry to 0 and the log level to trace. It looks like perfectly good urls are not getting fetched. Thinking I</description>
    </item>
    <item>
      <title>Re: Advise for fetcher policy</title>
      <pubDate>Fri, 17 May 2013 22:00:13 GMT</pubDate>
      <dc:creator>Ken Krugler</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1315</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1315</guid>
      <description>Hi Pat, ... It will try to fetch every URL, but it will only make one HttpClient request for each URL. HttpClient will retry multiple times, and if the server</description>
    </item>
    <item>
      <title>Advise for fetcher policy</title>
      <pubDate>Fri, 17 May 2013 14:00:08 GMT</pubDate>
      <dc:creator>Pat Ferrel</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1314</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1314</guid>
      <description>Hi guys, I&#39;m back to crawling Pinterest to update my experimental recommender. I created a merged miner/crawler, which was working fine if slowly. I added an</description>
    </item>
    <item>
      <title>Re: building bixo</title>
      <pubDate>Thu, 09 May 2013 18:00:37 GMT</pubDate>
      <dc:creator>Pat Ferrel</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1313</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1313</guid>
      <description>Hmm, comment out the test and it completes without errors. Maybe openDNS is the problem? On May 9, 2013, at 10:31 AM, Pat Ferrel &lt;pat.ferrel@...&gt; wrote: </description>
    </item>
    <item>
      <title>building bixo</title>
      <pubDate>Thu, 09 May 2013 17:31:10 GMT</pubDate>
      <dc:creator>Pat Ferrel</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1312</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1312</guid>
      <description>It&#39;s been a while since I did a new build of bixo. For some reason, though I haven&#39;t changed the code, I&#39;m getting all sorts of test errors. I was getting an</description>
    </item>
    <item>
      <title>Re: democrawler - domain</title>
      <pubDate>Thu, 04 Apr 2013 23:41:08 GMT</pubDate>
      <dc:creator>Ken Krugler</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1311</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1311</guid>
      <description>Hi Jeff, ... By default if you provide a -domain parameter, then URL filtering is set up such that only URLs for that domain are accepted (all other URLs are</description>
    </item>
    <item>
      <title>Double-normalization of URLs?</title>
      <pubDate>Thu, 04 Apr 2013 23:36:23 GMT</pubDate>
      <dc:creator>Ken Krugler</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1310</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1310</guid>
      <description>Hi Vivek, I was looking at the DemoCrawlWorkflow source, and noticed this snippet: Pipe urlFromOutlinksPipe = new Pipe(&quot;url from outlinks&quot;,</description>
    </item>
    <item>
      <title>democrawler - domain</title>
      <pubDate>Tue, 02 Apr 2013 17:22:07 GMT</pubDate>
      <dc:creator>jeffjeffrsn</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1309</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1309</guid>
      <description>Hi Everyone, I noticed that the democrawler stays at one domain. ... I&#39;ve got the domain example.com. At this domain there are outlinks to test.example.com,</description>
    </item>
    <item>
      <title>Re: html content - parsePipe.getTailPipe() (DemoCrawlTool)</title>
      <pubDate>Tue, 02 Apr 2013 17:02:54 GMT</pubDate>
      <dc:creator>jeffjeffrsn</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1308</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1308</guid>
      <description>Hi Chris, Thanks for the answer. Now I&#39;m subclassing the baseparser. Thanks, - Jeff</description>
    </item>
    <item>
      <title>Re: html content - parsePipe.getTailPipe() (DemoCrawlTool)</title>
      <pubDate>Tue, 02 Apr 2013 14:12:42 GMT</pubDate>
      <dc:creator>Chris Schneider</dc:creator>
      <link>http://groups.yahoo.com/group/bixo-dev/message/1307</link>
      <guid isPermaLink="true">http://groups.yahoo.com/group/bixo-dev/message/1307</guid>
      <description>Hi Jeff, I am not sure what you meant when you wrote &quot;added a new Pipe to the tail of the parsePipe&quot;. If you did add a tail pipe containing only</description>
    </item>

  </channel>
</rss>
