Status
This page is related to research on Search Engine Spamming at Yahoo! Research Barcelona. We currently host a reference collection for Web spam research, which is being used for the Web Spam Challenge 2007.
Datasets
The goal of our dataset activity is to make available reference collections that should be:
- Large: the collections should include many examples of spam and non-spam content.
- Clean: the collections should contain few classification errors.
- Uniform: the collections should represent a uniform random sample over a set of pages or hosts.
- Broad: the collections should include as many different Web spam aspects as possible.
- Open: the collections should be freely available for researchers.
A first such collection was generated at the Università di Roma "La Sapienza" and is currently hosted by Yahoo! Research Barcelona. See datasets >>.
Code
Source code is available for Truncated PageRank and Adaptive Estimation of Supporters, the algorithms proposed in a WebKDD'06 paper.

