Hi,
Thanks to all who contributed to my initial question: "What software
package do you recommend for performing a large web crawl on
'off-the-shelf' hardware?"
The list of recommended software is below, with the number of
references indicated:
* (4) Heritrix - http://crawler.archive.org/
* (2) WIRE - http://www.cwr.cl/projects/WIRE/
* (1) Labrador - http://www.dcs.gla.ac.uk/~craigm/labrador/
* (1) Poacher (could not find the URL)
* (1) Larbin - http://larbin.sourceforge.net/index-eng.html
* (1) Nutch - http://lucene.apache.org/nutch/
* (1) wget - http://www.gnu.org/software/wget/
* (1) WebQL - http://www.ql2.com/academic
All the answers are reproduced below, unedited.
Thanks again to all who contributed.
Sérgio Nunes
---
From: Hemant Joshi
Have you looked at Labrador on line at
http://www.dcs.gla.ac.uk/~craigm/labrador/
--
From: Dragomir R. Radev
Have you used poacher?
--
From: haozhi Ye
Larbin is really good for high performance crawling and I have used it
to crawl 10 Million web pages with several machines in a week.
http://larbin.sourceforge.net/index-eng.html
Another one may be Heritrix, but I've never tried it. It's written in
Java so I doubt its performance. http://crawler.archive.org/
--
From: Filippo Menczer
I have compiled a list of resources for a course on Web mining, and
the list includes pointers to a few off-the-shelf crawlers such as
Heritrix: http://informatics.indiana.edu/fil/Class/b659/resources.html#sw
--
From: ChaTo (Carlos Alberto-Alejandro CASTILLO-Ocaranza)
Nutch -- Java, written to work with Lucene, has a partnership with
Apache Foundation
Heritrix -- Java, written for the Internet Archive
WIRE -- C++, written for Web characterization studies [by our group at
CWR.cl]
Check also this list, it's quite complete:
http://en.wikipedia.org/wiki/Web_crawler#Examples_of_Web_crawlers
--
From: Milad Shokouhi
You may also want to have a look at WGET:
http://www.gnu.org/software/wget/
--
From: Nicholas Kushmerick
hi Sergio, you might be interested in WebQL, an industrial-strength Web
data extraction/integration tool from QL2 Software. Among other features,
WebQL has built-in Web crawling capabilities.
WebQL licenses are available free of charge for non-commercial academic
research/education activities -- see www.ql2.com/academic for details.
--
From: Michel Beigbeder
Never tested any one but I've heard about
heritrix http://crawler.archive.org/
wire http://www.cwr.cl/projects/WIRE/
I think that Nutch/Lucene also embed a Web crawler.
(Otherwise I used wget and curl but they have not the same goal, I
suppose you know them.)
You should have a look to Castillo's PhD:
http://www.chato.cl/534/article-63160.html

