Web crawling software - Summary

Author: Internet contribution  Date: 2007-02-11

Hi,

Thanks to all who contributed to my initial question: "What software
package do you recommend for performing a large web crawl on
'off-the-shelf' hardware?"

The list of recommended software is below, with the number of
references indicated:

* (4) Heritrix - http://crawler.archive.org/
* (2) WIRE - http://www.cwr.cl/projects/WIRE/
* (1) Labrador - http://www.dcs.gla.ac.uk/~craigm/labrador/
* (1) Poacher (could not find the URL)
* (1) Larbin - http://larbin.sourceforge.net/index-eng.html
* (1) Nutch - http://lucene.apache.org/nutch/
* (1) wget - http://www.gnu.org/software/wget/
* (1) WebQL - http://www.ql2.com/academic

Below are all the answers unedited.
Thanks again to all who contributed.

Sérgio Nunes

---
From: Hemant Joshi

Have you looked at Labrador on line at
http://www.dcs.gla.ac.uk/~craigm/labrador/
--
From: Dragomir R. Radev

Have you used poacher?
--
From: haozhi Ye

Larbin is really good for high performance crawling and I have used it
to crawl 10 Million web pages with several machines in a week.

http://larbin.sourceforge.net/index-eng.html

Another one may be Heritrix, but I've never tried it. It's written in
Java so I doubt its performance. http://crawler.archive.org/
--
From: Filippo Menczer

I have compiled a list of resources for a course on Web mining, and
the list includes pointers to a few off-the-shelf crawlers such as
Heritrix: http://informatics.indiana.edu/fil/Class/b659/resources.html#sw
--
From: ChaTo (Carlos Alberto-Alejandro CASTILLO-Ocaranza)

Nutch -- Java, written to work with Lucene, has a partnership with
Apache Foundation

Heritrix -- Java, written for the Internet Archive

WIRE -- C++, written for Web characterization studies [by our group at
CWR.cl]

Check also this list, it's quite complete:
http://en.wikipedia.org/wiki/Web_crawler#Examples_of_Web_crawlers
--
From: Milad Shokouhi

You may also want to have a look at WGET:
http://www.gnu.org/software/wget/
--
From: Nicholas Kushmerick

hi Sergio, you might be interested in WebQL, an industrial-strength Web
data extraction/integration tool from QL2 Software. Among other features,
WebQL has built-in Web crawling capabilities.

WebQL licenses are available free of charge for non-commercial academic
research/education activities -- see www.ql2.com/academic for details.
--
From: Michel Beigbeder

Never tested any one but I've heard about
heritrix http://crawler.archive.org/
wire http://www.cwr.cl/projects/WIRE/
I think that Nutch/Lucene also embed a Web crawler.

(Otherwise I used wget and curl but they have not the same goal, I
suppose you know them.)

You should have a look to Castillo's PhD:
http://www.chato.cl/534/article-63160.html