3.1 Overview of research
(4/10/05) Here is a quote from Albert Einstein that puts quantitative research into a good perspective: "Not everything that can be counted counts, and not everything that counts can be counted." (http://www.quotationspage.com/quote/26950.html)
3.1.1 Comparative research
3.1.2 Other types of research
3.2 Classifications of evaluation
(4/12/04) Two other classifications of evaluation pertinent to information retrieval (IR) are worth mentioning. Patton's (1990) well-known volume on qualitative research has a frequently cited table that describes different types of research and their characteristics from the most general to the most practical.
| Type | Purpose | Focus | Desired Results | Desired Generalization |
|---|---|---|---|---|
| Basic research | Knowledge as end in itself | Questions deemed important by intellectual interest | Contribution to theory | Global |
| Applied research | Understand nature of human problems | Questions deemed important by society | Formulate problem-solving and interventions | Limited to application context |
| Summative evaluation | Determine effectiveness of human interventions | Goals of the intervention | Judgments and generalizations about interventions | All interventions with similar goals |
| Formative evaluation | Improve an intervention | Strengths and weaknesses of intervention | Recommendations for improvement | Specific setting studied |
| Action research | Solve problems | Problems of organization or community | Solve problems quickly and effectively | Here and now |
Another classification emanates from the medical informatics literature and is focused on system development (Stead et al., 1994). It also creates a matrix, although the dimensions are system development and site/level of evaluation. Only certain sites/levels of evaluation are appropriate for specific levels of system development.
| Definition | Laboratory - bench | Laboratory - field | Remote field - validity | Remote field - efficacy | |
|---|---|---|---|---|---|
| Specification | * | * | | | |
| Component development | * | | | | |
| Combination of components into a system | * | * | * | | |
| Integration of system into environment | * | * | * | * | |
| Routine use | * | * | * |
Patton, M. (1990). Qualitative Evaluation and Research Methods (2nd edition). Newbury Park, CA. Sage Publications.
Stead, W., Haynes, R., Fuller, S., Friedman, C., Travis, L., Beck, J., Fenichel, C., Chandrasekeran, B., Buchanan, B., Abola, E., Sievert, M., Gardner, R., Messerle, J., Jaffe, C., Pearson, W. and Abarbanel, R. (1994). Designing medical informatics research and library-resource projects to increase what is learned. Journal of the American Medical Informatics Association, 1: 28-33.
(4/9/05) Despont-Gros et al. (2005) have developed a classification of user interactions with clinical information systems based on a review of the human-computer interaction literature. They found that variables assessed in studies included:
- Acceptance
- Affective response
- Crossed (dependences between multiple variables)
- Impact
- Satisfaction
- Success, effectiveness, or performance
- Task technology fit
- Questionnaires
- Interviews
- Observation
- Experiments
- Literature review
- Recommendations
3.2.1 Lancaster and Warner
3.2.2 Fidel and Soergel
3.2.2.1 Setting
3.2.2.2 User
3.2.2.3 Request
3.2.2.4 Database
3.2.2.5 Search system
3.2.2.6 Searcher
3.2.2.7 Search process
3.2.2.8 Search outcome
3.2.3 Hersh and Hickam
3.2.3.1 Was the system used?
3.2.3.2 For what was the system used?
3.2.3.3 Were the users satisfied?
3.2.3.4 How well did they use the system?
3.2.3.5 What factors were associated with successful or unsuccessful use of the system?
3.2.3.6 Did the system have an impact?
3.2.4 Simulation in evaluation experiments
(4/9/05) Although not an IR evaluation study per se, Dresselhaus et al. (2004) compared different approaches to assessing variation among clinicians in the quality of preventive care provided. They used standardized patients (trained actors) as well as computerized clinical vignettes. The measures from the standardized patients included abstraction of the medical record and reports from the standardized patients. Measures from the standardized patients and the clinical vignettes were equally effective in predicting the quality of preventive care provided. However, the clinical vignettes were also noted to be less expensive as well as more easily controlled for case mix at a given site.
Dresselhaus, T., Peabody, J., et al. (2004). An evaluation of vignettes for predicting variation in the quality of preventive care. Journal of General Internal Medicine, 19: 1013-1018.
3.3 Relevance-based evaluation
3.3.1 Recall and precision
(4/12/04) Another relevance-based measure introduced several decades ago attempted to account for the cost of having to assess nonrelevant documents. Cooper (1968) defined the expected search length (ESL) as a measurement of retrieval performance that calculated how many nonrelevant documents had to be seen by the user to obtain a specified number of relevant documents. More recently, Losee (1996) introduced the average search length (ASL), which is the "expected number of documents obtained in retrieving a relevant document, the mean position of a relevant document."
Cooper, W. (1968). Expected search length: a single measure of retrieval effectiveness based on the weak ordering action of retrieval systems. American Documentation, 19: 30-41.
Losee, R. (1996). Evaluating retrieval performance given database and query characteristics: analytical determination of performance surfaces. Journal of the American Society for Information Science, 47: 95-105.
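The two search-length measures above can be illustrated with a minimal sketch. It assumes a strictly ranked output in which each document is simply marked relevant or not; Cooper's ESL is defined for weakly ordered output and takes an expectation over ties, which this simplification leaves out. The function names are hypothetical.

```python
def search_length(ranking, n_wanted):
    """Count nonrelevant documents seen before the n_wanted-th relevant one.

    `ranking` is a list of booleans (True = relevant) in rank order.
    Returns None if the ranking holds fewer than n_wanted relevant documents.
    """
    nonrelevant_seen = 0
    relevant_found = 0
    for is_relevant in ranking:
        if is_relevant:
            relevant_found += 1
            if relevant_found == n_wanted:
                return nonrelevant_seen
        else:
            nonrelevant_seen += 1
    return None


def average_search_length(ranking):
    """Mean 1-based rank position of the relevant documents."""
    positions = [rank for rank, is_relevant in enumerate(ranking, start=1) if is_relevant]
    if not positions:
        return None
    return sum(positions) / len(positions)


# Relevant documents sit at ranks 2, 5, and 6 in this toy ranking.
ranking = [False, True, False, False, True, True]
length_for_two = search_length(ranking, 2)
mean_position = average_search_length(ranking)
```

In this toy ranking the user passes three nonrelevant documents before reaching the second relevant one, and the mean position of a relevant document is 13/3.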
(4/12/04) Soboroff et al. (2001) proposed the measurement of recall and precision without human relevance judgments. Noting past work by Voorhees (2000, described in the text) demonstrating that differences in judgments did not affect the relative performance of systems, they selected random documents from the retrieval pool of multiple searches on each topic. Their results were most effective when they did not eliminate duplicates from selection (in essence giving more frequently retrieved documents a greater chance of being selected as relevant). They found that their results were most effective in separating high-performing and low-performing systems from those in the middle, but that they were less successful at identifying the truly best (or worst) systems from among the top (or bottom) performing systems.
Soboroff, I., Nicholas, C. and Cahan, P. (2001). Ranking retrieval systems without relevance judgments. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA. ACM Press. 66-73.
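The idea of drawing pseudo-relevance judgments at random from the pool can be sketched as follows. This is only an illustration under stated assumptions: the number of documents to draw (`n_picks`), the fixed seed, and the function names are hypothetical, and the details of the sampling procedure in the paper are not reproduced here.

```python
import random


def sample_pseudo_relevant(runs, n_picks, seed=0):
    """Pick pseudo-relevant documents at random from the pooled output.

    `runs` is a list of ranked lists of document ids, one per system, for
    one topic. Duplicates are kept in the pool, so a document retrieved by
    many systems is more likely to be drawn.
    """
    pool = [doc_id for run in runs for doc_id in run]  # duplicates retained
    rng = random.Random(seed)
    picks = rng.sample(pool, min(n_picks, len(pool)))
    return set(picks)


def precision_against(run, pseudo_relevant):
    """Precision of one run measured against the sampled pseudo-judgments."""
    if not run:
        return 0.0
    return sum(1 for doc_id in run if doc_id in pseudo_relevant) / len(run)


# Toy example: three systems, one topic; "d2" is retrieved by all three.
runs = [["d1", "d2", "d3"], ["d2", "d3", "d4"], ["d2", "d5", "d6"]]
pseudo = sample_pseudo_relevant(runs, n_picks=3)
scores = [precision_against(run, pseudo) for run in runs]
```

Systems would then be ranked by such scores averaged over topics, and that ranking compared with the one obtained from human judgments.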
3.3.1.1 Similarities to medical diagnostic test evaluation
(4/13/03) Another medical measurement analogy from recall and precision has been defined by Bachmann et al. (2002): the number needed to read (NNR), which is the inverse of precision, i.e., 1/precision. The NNR defines the total number of articles that must be read to find each relevant one. This analogy can actually be carried back to the medical measurement realm, with the inverse of the positive predictive value (equivalent of precision) representing the number needed to test.
Bachmann, L., Coray, R., et al. (2002). Identifying diagnostic studies in MEDLINE: reducing the number needed to read. Journal of the American Medical Informatics Association, 9: 653-658.
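As a small worked example of the NNR, the following sketch computes it directly from retrieval counts. The function name and the counts are hypothetical and chosen only for illustration.

```python
def number_needed_to_read(relevant_retrieved, total_retrieved):
    """NNR = 1 / precision = total retrieved / relevant retrieved."""
    if relevant_retrieved == 0:
        return float("inf")  # precision is zero, so the NNR is unbounded
    return total_retrieved / relevant_retrieved


# A search returning 100 articles, 25 of them relevant, has precision 0.25.
nnr = number_needed_to_read(25, 100)
```

With a precision of 0.25, four articles must be read on average to find each relevant one.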
3.3.1.2 Practical issues in measuring recall and precision
(4/9/05) Another measure commonly used to combine recall and precision is the F measure. In its general form it uses a parameter β that weights precision or recall more heavily. When the two are weighted equally (β = 1), the measure is the harmonic mean of recall and precision and is sometimes called the F1 measure.
Hripcsak, G. and Rothschild, A. (2005). Agreement, the F-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12: in press.
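A minimal sketch of the F measure follows. It assumes the commonly used general form F = (1 + β²)PR / (β²P + R), which is not spelled out in the text above; in this form β greater than 1 gives more weight to recall and β less than 1 gives more weight to precision. The function name is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the F1 measure; beta > 1 favors recall, beta < 1 precision.
    """
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # nothing relevant retrieved: define F as zero
    return (1 + b2) * precision * recall / denominator


# A search with low precision but high recall.
f1 = f_measure(0.25, 0.75)
f_recall_weighted = f_measure(0.25, 0.75, beta=2.0)
f_precision_weighted = f_measure(0.25, 0.75, beta=0.5)
```

Because recall is the stronger of the two values in this example, weighting recall more heavily raises the score and weighting precision more heavily lowers it.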
3.3.1.3 The special case of ranked output
3.3.1.5 Enhancements to recall and precision
3.3.2 What is relevance?
3.3.2.1 Topical relevance
3.3.2.2 Situational relevance
3.3.3 Research about relevance judgments
3.3.4 Limitations of relevance-based measures
3.3.5 Alternatives to relevance-based measures
(4/12/04) Joachims (2002) has introduced a new approach to evaluation for the Web based on "clickthrough data." It is based on the premise that the links a user clicks on in the results listing from a Web search engine are a measure of relevance. One search engine or system is therefore "better" than another if more links are clicked from its output. He proposes two types of experiments:
- Regular clickthrough data - The user's query is sent to two search engines, with the complete rankings from one system or the other randomly presented to the user.
- Unbiased clickthrough data - The user's query is sent to two search engines, but in this approach the results are mixed together (although order within each set is maintained).
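The mixing of two result lists can be sketched as below. This is a simplified illustration, not the procedure from the paper: it alternates strictly between the two rankings, skips documents already placed (so a document returned by both engines keeps its earliest position), and credits a click to every engine that returned the clicked document. The function names and the crediting rule are assumptions.

```python
from itertools import zip_longest


def interleave(ranking_a, ranking_b):
    """Merge two rankings by alternating between them, skipping duplicates."""
    merged, seen = [], set()
    for pair in zip_longest(ranking_a, ranking_b):
        for doc_id in pair:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def count_clicks(clicked, ranking_a, ranking_b):
    """Count the clicked links that each engine had returned."""
    clicks_a = sum(1 for doc_id in clicked if doc_id in ranking_a)
    clicks_b = sum(1 for doc_id in clicked if doc_id in ranking_b)
    return clicks_a, clicks_b


engine_a = ["a1", "a2", "a3"]
engine_b = ["b1", "b2", "a1"]
shown = interleave(engine_a, engine_b)
clicks = count_clicks(["b1", "a1"], engine_a, engine_b)
```

Under these assumptions, the engine with more credited clicks over many queries would be judged the better one.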
Borlund (2003) has proposed a user-based, interactive model for evaluation that designs the evaluation to recreate as realistically as possible the real-world searching environment and allows a more dynamic approach to assigning relevance judgments.
Joachims, T. (2002). Evaluating retrieval performance using clickthrough data. Proceedings of the SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, Tampere, Finland. ACM Press. http://www.cs.cornell.edu/People/tj/publications/joachims_02b.pdf .
Borlund, P. (2003). The IIR evaluation model: a framework for evaluation of interactive retrieval systems. Information Research, 8: 3. http://informationr.net/ir/8-3/paper152.html .
3.4 The Text Retrieval Conference
(4/9/05) A number of new tracks have been introduced to TREC since publication of the book:
- Genomics Track - The purpose of the track is to study IR in the genomics (biomedical research) domain (Hersh, 2003; Hersh, 2004). The TREC Genomics Track is obviously relevant to health and biomedical IR, and in fact is chaired by the author.
- HARD Track - The goal of the High Accuracy Retrieval from Documents (HARD) track is to retrieve documents by leveraging additional information about the searcher and/or the search context (Allan, 2003). This is done in several ways. One is to provide some additional metadata about the user, the search context, and the expected result. Another is to allow each group to create a "clarification form" for the relevance assessor that solicits information from the user about the query.
- Robust Retrieval Track - The task in the track is traditional ad hoc retrieval, but the focus is on topics that systems performed poorly on in past ad hoc tasks (Voorhees, 2003). Two new scoring measures have been introduced. One counts the number of topics for which no relevant documents were retrieved in the top ten, while the other computes mean average precision over the worst quarter of topics.
- Terabyte Track - The goal of this track is to assess the scalability of IR techniques for very large collections, i.e., a terabyte in size. A half-terabyte of Web pages crawled from the .GOV domain has been collected for the document collection.
- Enterprise Track - The purpose of this track is to study the user who is searching the data of an organization to complete some task.
- Spam Track - Aiming to assess one aspect of adversarial IR, this track assesses the ability of systems to detect spam email.
- The Cross-Language Track has spawned two TREC-like initiatives of its own:
  - Cross-Language Evaluation Forum (CLEF) - Focused on European languages, the 2004 event added an interactive track as well as an image retrieval task (Clough et al., 2005). The latter includes a medical image retrieval task whose organizers include the author.
  - NTCIR - Focused on East Asian languages (predominantly Japanese and Chinese), this forum also provides a full spectrum of IR tasks, including retrieval, question-answering, Web searching, and text summarization.
- The Video Track has evolved into a separate TRECVID initiative.
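The two scoring measures described for the Robust Retrieval Track can be sketched as follows. This is an illustration only: the function names, the data layout (dictionaries keyed by topic), and the rounding of the "worst quarter" (integer division, at least one topic) are assumptions rather than the track's official definitions.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def topics_with_no_relevant_in_top(results, qrels, cutoff=10):
    """Number of topics with no relevant document in the top `cutoff` ranks."""
    return sum(
        1
        for topic, ranking in results.items()
        if not any(doc_id in qrels[topic] for doc_id in ranking[:cutoff])
    )


def map_over_worst_quarter(results, qrels):
    """Mean average precision over the worst-scoring quarter of topics."""
    scores = sorted(average_precision(r, qrels[t]) for t, r in results.items())
    n_worst = max(1, len(scores) // 4)
    return sum(scores[:n_worst]) / n_worst


# Toy data: ranked output and relevant sets for four topics.
results = {
    "t1": ["d1", "d2"],
    "t2": ["d1", "d2"],
    "t3": ["d1", "d2"],
    "t4": ["d1", "d2", "d3"],
}
qrels = {"t1": {"d1"}, "t2": {"d2"}, "t3": {"d3"}, "t4": {"d1", "d3"}}
failed_topics = topics_with_no_relevant_in_top(results, qrels)
worst_quarter_map = map_over_worst_quarter(results, qrels)
```

Both measures emphasize the topics on which a system does badly, rather than the average over all topics.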
Allan, J. (2003). HARD Track Overview in TREC 2003 - High Accuracy Retrieval from Documents. The Twelfth Text REtrieval Conference - TREC 2003, Gaithersburg, MD. National Institute of Standards and Technology. 24-37. http://trec.nist.gov/pubs/trec12/papers/HARD.OVERVIEW.pdf.
Clarke, C., Craswell, N., et al. (2004). Overview of the TREC 2004 Terabyte Track. The Thirteenth Text REtrieval Conference Proceedings - TREC 2004, Gaithersburg, MD. National Institute of Standards and Technology. in press. http://trec.nist.gov/pubs/trec13/papers/TERA.OVERVIEW.pdf.
Clough, P., Sanderson, M., et al., eds. (2005). Overview of the CLEF cross language image retrieval track (ImageCLEF) 2004. Multilingual Information Access for Text, Speech, and Images: Result of the Fifth CLEF Evaluation Campaign, Lecture Notes in Computer Science. Heidelberg, Germany. Springer-Verlag. http://ir.shef.ac.uk/cloughie/papers/imageclef2004.pdf.
Fuhr, N. and Lalmas, M. (2004). Report on the INEX 2003 Workshop. SIGIR Forum, 38(1): 46-51. http://www.sigir.org/forum/2004J/10_fuhr.pdf.
Hersh, W., Bhuptiraju, R., et al. (2004). Enhancing access to the bibliome: the TREC Genomics Track. MEDINFO 2004 - Proceedings of the Eleventh World Congress on Medical Informatics, San Francisco, CA. IOS Press, 773-777.
Hersh, W., Bhuptiraju, R., et al. (2004). TREC 2004 genomics track overview. The Thirteenth Text Retrieval Conference: TREC 2004, Gaithersburg, MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec13/papers/GEO.OVERVIEW.pdf.
Kando, N. and Adachi, J. (2004). Report from the NTCIR Workshop 3. SIGIR Forum, 38(1): 10-16. http://www.sigir.org/forum/2004J/4_kando.pdf.
Voorhees, E. (2003). Overview of the TREC 2003 Robust Retrieval Track. The Twelfth Text REtrieval Conference - TREC 2003, Gaithersburg, MD. National Institute of Standards and Technology. 69-77. http://trec.nist.gov/pubs/trec12/papers/ROBUST.OVERVIEW.pdf.
(4/9/05) The utility measure developed for the TREC Filtering Track was used for the text categorization subtask of the TREC 2004 Genomics Track (Hersh, 2004). This measure contains coefficients for the utility of retrieving a relevant and retrieving a nonrelevant document. The Genomics Track used a version that was normalized by the best possible score:
Unorm = Uraw / Umax
The framework for these measures was based on the following table of possibilities:
| | Relevant (classified) | Not relevant (not classified) | Total |
|---|---|---|---|
| Retrieved | True positive (TP) | False positive (FP) | All retrieved (AR) |
| Not retrieved | False negative (FN) | True negative (TN) | All not retrieved (ANR) |
| | All positive (AP) | All negative (AN) | |
For a test collection of documents to categorize, Uraw was calculated as follows:
Uraw = (ur * TP) + (unr * FP)
where:
- ur = relative utility of relevant document
- unr = relative utility of nonrelevant document
Boundary situations give the following values of Unorm:
- Completely perfect prediction - Unorm = 1
- All documents designated positive (triage everything) - 1 > Unorm > 0
- All documents designated negative (triage nothing) - Unorm = 0
- Completely imperfect prediction - Unorm < 0
| Situation | Unorm - Training | Unorm - Test |
|---|---|---|
| Completely perfect prediction | 1.0 | 1.0 |
| Triage everything | 0.27 | 0.33 |
| Triage nothing | 0 | 0 |
| Completely imperfect prediction | -0.73 | -0.67 |
The measure Umax was calculated by assuming all relevant documents were retrieved and no nonrelevant documents were retrieved:
Umax = ur * AP
(This happens to equal AN.)
Hersh, W., Bhuptiraju, R., et al. (2004). TREC 2004 genomics track overview. The Thirteenth Text Retrieval Conference: TREC 2004, Gaithersburg, MD. National Institute of Standards and Technology. http://trec.nist.gov/pubs/trec13/papers/GEO.OVERVIEW.pdf.
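The normalized utility and its boundary situations can be worked through with a short sketch. The collection size (10 positive and 90 negative documents) and the coefficients (ur = 20, unr = -1) are hypothetical values chosen for illustration; they are not stated in the text above, and the function name is likewise an assumption.

```python
def normalized_utility(tp, fp, all_positive, ur, unr):
    """Unorm = Uraw / Umax, with Uraw = ur*TP + unr*FP and Umax = ur*AP."""
    u_raw = ur * tp + unr * fp
    u_max = ur * all_positive
    return u_raw / u_max


# Hypothetical collection and coefficients, for illustration only.
AP, AN = 10, 90      # all positive, all negative documents
UR, UNR = 20, -1     # utility of a relevant and of a nonrelevant document

perfect = normalized_utility(tp=AP, fp=0, all_positive=AP, ur=UR, unr=UNR)
everything = normalized_utility(tp=AP, fp=AN, all_positive=AP, ur=UR, unr=UNR)
nothing = normalized_utility(tp=0, fp=0, all_positive=AP, ur=UR, unr=UNR)
imperfect = normalized_utility(tp=0, fp=AN, all_positive=AP, ur=UR, unr=UNR)
```

With these values the four situations give 1.0, 0.55, 0.0, and -0.45, which follows the pattern of the boundary situations. Note that triaging everything yields a positive score only when ur * AP exceeds the total penalty for the negative documents.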
3.5 Measures of agreement
Hripcsak and Rothschild (2005) investigated the relationship of kappa to the F measure. They showed that when positive cases (i.e., relevant documents) are rare relative to negative cases, the two measures approach each other mathematically. This is particularly useful in situations (more common in assessment of natural language understanding systems) where the true number of negative cases is unknown but large.
Hripcsak, G. and Rothschild, A. (2005). Agreement, the F-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12: in press.
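The convergence of kappa and the F measure can be checked numerically with a small sketch. It assumes two raters and a 2x2 agreement table, with the F measure between raters computed as 2a / (2a + b + c); the counts and function names are hypothetical.

```python
def f_measure_agreement(a, b, c):
    """F measure between two raters: 2a / (2a + b + c)."""
    return 2 * a / (2 * a + b + c)


def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table.

    a = both raters positive, b and c = the two kinds of disagreement,
    d = both raters negative.
    """
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)


# Hold the positive agreements and the disagreements fixed and let the
# number of jointly negative cases grow.
a, b, c = 20, 5, 5
f_value = f_measure_agreement(a, b, c)
kappas = [cohen_kappa(a, b, c, d) for d in (10, 1_000, 1_000_000)]
```

In this example the F measure is 0.8 regardless of the number of negative cases, while kappa rises toward 0.8 as that number grows, so the F measure can stand in for kappa when the negative count cannot be determined.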

