Tasks and how to evaluate them while making use of link information
This post is primarily reconstructed from a comment I made on Stephen Green's weblog, which I thought would be good to recreate here.
I think there may be some confusion in Stephen Green's recent post, arising from his reading of a paper by Nick Craswell, Dave Hawking and Steve Robertson, which I'll try to clear up. Nick Craswell is actually in town (on holidays from Cambridge) and I've chatted to him about this since writing the original comment. He had some additional thoughts on the matter as well.
At TREC, the Information Retrieval / Enterprise Search group at CSIRO coordinated a series of tracks involving search tasks over large-scale collections of Web data, from about 1997 to 2004.
During that time, we evolved our understanding of the interplay between classic ad hoc information retrieval evaluation methods, metrics and tasks and those appropriate to the Web.
Early on, we applied classic ad hoc retrieval TREC methods, metrics and tasks (referred to as subject search tasks in the 2001 SIGIR paper of Craswell, Hawking and Robertson) to this Web data. An example subject search task might be captured by a short topic such as "channel tunnel".
Over time, we discovered that link-based methods did not provide great benefit for such subject search tasks, when evaluated using classic measures (such as average precision, which requires large numbers of relevance judgements for each query). Classic ad hoc track evaluation methods use a human assessor to form binary judgements (relevant/not relevant) over a large pool of documents, and across a set of queries.
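To make the measure concrete, here is a minimal Python sketch of average precision for a single query under binary relevance judgements. The function name and the toy document IDs are made up for illustration, and this is not the official TREC evaluation code; it only shows why the measure depends on having judgements for many documents per query.

```python
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision for one query with binary (relevant / not relevant) judgements.

    ranking:  document IDs in the order the system returned them
    relevant: IDs of all documents judged relevant for this query
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where this relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Toy example (hypothetical IDs): two of three relevant documents are retrieved.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(average_precision(ranking, relevant))
```

In this toy case the relevant documents appear at ranks 2 and 4, giving (1/2 + 2/4) / 3, or about 0.33. Note that the divisor is the total number of relevant documents, which is why the assessor has to judge a large pool for each query.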
In fact, there were two things wrong here: subject search tasks are only occasionally performed on the Web, and average precision is not a good metric for most of the kinds of tasks that do get performed.
The 2001 SIGIR paper investigates a particular kind of task (home page finding, which is really a subset of known item finding). It determines that the algorithm which had done so well at TREC on pure content for subject search (Okapi/BM25) can also do really well on home page finding when each document is represented just by the combined anchor text of the links pointing to that document's URL.
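As a rough sketch of what an anchor-text-only representation might look like, the Python below groups the anchor text of incoming links by target URL to form one surrogate document per page. The function name, the link tuple layout and the example.org URLs are assumptions for illustration; the paper's actual construction and its BM25 parameters are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each link is (source_url, target_url, anchor_text); this layout is an assumption.
Link = Tuple[str, str, str]


def build_anchor_documents(links: Iterable[Link]) -> Dict[str, str]:
    """Build one surrogate document per target URL from incoming anchor text."""
    anchors = defaultdict(list)
    for _source_url, target_url, anchor_text in links:
        text = anchor_text.strip()
        if text:  # skip links with empty anchor text, e.g. image-only links
            anchors[target_url].append(text)
    # Concatenate all anchor texts that point at the same URL.
    return {url: " ".join(texts) for url, texts in anchors.items()}


# Hypothetical link data.
links = [
    ("http://a.example.org/x", "http://example.org/", "Example Org"),
    ("http://b.example.org/y", "http://example.org/", "example home page"),
    ("http://a.example.org/x", "http://example.org/about", "about us"),
    ("http://c.example.org/z", "http://example.org/", "   "),
]
for url, text in build_anchor_documents(links).items():
    print(url, "->", text)
```

A ranking function such as BM25 would then be run over these surrogate texts instead of over the page content.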
By 2004, the TREC Web track was considering home page finding, known page finding (specific pages which aren't home pages), and topic distillation (providing a set of relevant home pages about the topic). This mix of tasks is believed to be far more indicative of a lot of Web search activity.
The evaluation metric for known item search is typically the mean reciprocal rank of the first relevant item (and often there is exactly one relevant item, as remarked by Jeremy). Topic distillation can be judged with a number of metrics. One of the lessons from TREC is that you may need to use multiple metrics, since different metrics can lead to different outcomes. For more information see the overview of the TREC-2004 Web track.
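For comparison with average precision, here is a minimal Python sketch of mean reciprocal rank over a set of queries. The function names, query IDs and document IDs are hypothetical, and scoring a query as zero when no relevant item is retrieved is an assumption of this sketch rather than a statement about the official TREC scripts.

```python
from typing import Dict, Sequence, Set


def reciprocal_rank(ranking: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Dict[str, Sequence[str]], judgements: Dict[str, Set[str]]
) -> float:
    """Average the reciprocal rank over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(
        reciprocal_rank(ranking, judgements.get(query_id, set()))
        for query_id, ranking in runs.items()
    )
    return total / len(runs)


# Toy example: the right page is found at rank 2, at rank 1, and not at all.
runs = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"], "q3": ["m", "n"]}
judgements = {"q1": {"b"}, "q2": {"x"}, "q3": {"k"}}
print(mean_reciprocal_rank(runs, judgements))  # (1/2 + 1 + 0) / 3 = 0.5
```

Only the position of the first correct answer matters here, so a single judged page per query is enough, which suits home page and known page finding.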
So to return to Stephen's original post: the 18.5 million page VLC2 collection is just fine to use for link-based methods. What matters is what you are measuring and how you are measuring it. So 18.5 million pages just aren't sufficient to show a benefit for link-based methods when the task is subject search evaluated using average precision.
To comment briefly on Jeremy's analysis, it's not clear that Brin and Page's 1998 paper addresses evaluation, other than informally. In fact, they state that an "extensive user study or results analysis" is out of scope. They do have many good points about the difficulties of academic evaluation of Web search. Our group has a set of papers relating to Web search evaluation that may be of interest in this arena.
In discussions with Nick, he felt that WT2g, the smallest of all the Web collections we put together, could even be sufficient for making use of link information. However, it was never used with a series of tasks and evaluation metrics which could demonstrate this, so it's only a conjecture. There was, however, some work by Nick and Dave Hawking examining the link information available, reported in their chapter in the TREC book, which seemed to support this.

