1.1 Basic definitions
(4/4/06) For those of us who have worked in this field for over a decade, the "mainstreaming" of IR continues to amaze. Over 100 million Americans have used a search engine (Fallows, 2004). Search engines receive about 500 million searches per day from around the world, and on any given day, about 41% of all Internet users submit nearly 60 million queries to search engines (Rainie, 2005). Further evidence that IR is "mainstream" comes from the Google search engine. The word "Google" has become a verb (e.g., "Did you Google him?"), while teenagers and others pass the time Googlewhacking, a game in which one tries to get Google to retrieve one and only one page. There are over 476,000 Googlewhacks and counting (http://www.googlewhack.com/tally.pl). Political columnists ask whether Google is a deity (Friedman, 2003), while the software giant Microsoft has declared that search will be the most important computer application of the near future (Ferguson, 2005) and is on a "search and destroy" mission against Google (Vogelstein, 2005). The Google Zeitgeist (http://www.google.com/press/zeitgeist.html) gives a glimpse into what the world wants to know about.
Biomedicine is being affected by the growth of IR as well. The leaders of the National Library of Medicine have laid out a vision for medical libraries ten years hence, predicting that the "place" will be preserved but that most of the information will be interactive and electronic (Lindberg, 2005). A leading neuroscientist, citing the advances in the Human Genome Project and related areas, has observed that biology is now an "information science," with many advances likely to come from using data to form and test hypotheses (Insel, 2003). Major medical journals report that search engines, most notably Google, are the main route by which visitors reach their on-line articles (Giustini, 2005; Steinbrook, 2006).
Fallows, D., Rainie, L., et al. (2004). Data Memo on Search Engines. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Data_Memo_Searchengines.pdf. Friedman, T. (2003). Is Google God? New York Times. June 29, 2003. 13. http://www.nytimes.com/2003/06/29/opinion/29FRIE.html. Ferguson, C. (2005). What's Next for Google? Technology Review. January, 2005. 38-46. Insel, T., Volkow, N., et al. (2003). Neuroscience networks: data-sharing in an information age. PLoS Biology, 1: E17. http://biology.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pbio.0000017. Giustini, D. (2005). How Google is changing medicine. British Medical Journal, 331: 1487-1488. Lindberg, D. and Humphreys, B. (2005). 2015 - the future of medical libraries. New England Journal of Medicine, 352: 1067-1070. Rainie, L. and Shermak, J. (2005). Search Engine Use November 2005. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_SearchData_1105.pdf. Steinbrook, R. (2006). Searching for the right search - reaching the medical literature. New England Journal of Medicine, 354: 4-7. Vogelstein, F. (2005). Search and Destroy. Fortune. May 2, 2005. http://money.cnn.com/magazines/fortune/fortune_archive/2005/05/02/8258478/.
(3/15/05) Another important term worth defining up front is health information literacy. The Medical Library Association (MLA, www.mlanet.org) has spoken most eloquently about this term, noting that it differs from health literacy and information/computer literacy. They define health information literacy as the set of abilities needed to:
- Recognize a health information need
- Identify likely information sources and use them to retrieve relevant information
- Assess the quality of the information and its applicability to a specific situation
- Analyze, understand, and use the information to make good health decisions
They have developed a Web site devoted to this topic, which includes a variety of resources and plans for action (http://www.mlanet.org/resources/healthlit/).
(4/7/05) Saranto and Hovenga (2004) reviewed the literature for papers attempting to define the concept of information literacy. They found that the concept is not really defined in the literature and is most often used as a synonym for "computer literacy" or related concepts. They advocate organizational efforts to define the term and its related skills more precisely. McCray (2005) recently reviewed the health literacy literature and noted that most of it focused on low literacy and its impact on understanding health information. She identified a number of categories of articles on the topic, including:
- Methods to assess literacy and the related topic of the readability of texts
- The mismatch between the readability of health information and the literacy of those for whom it is intended
- The difficulty patients with low literacy have in the health care system, from accessing care to understanding their treatment plans to their worse clinical outcomes
- The impact of new information technologies
McCray, A. (2005). Promoting health literacy. Journal of the American Medical Informatics Association, 12: 152-163. Saranto, K. and Hovenga, E. (2004). Information literacy - what is it about? Literature review of the concept and context. International Journal of Medical Informatics, 73: 503-513.
1.2 Comparisons with other types of computer applications
1.3 Models of IR
(4/4/06) Another way to view IR and related areas is to think of how people usually find and process scientific information. I call this the "Hersh Funnel" (see the figure below) and have published it in several places (Hersh, 2005a; Hersh, 2005b; Hersh, 2006). Users of scientific information usually begin by searching all of the literature to find a set of documents, some of which are likely to be relevant. These documents are usually reviewed manually to determine which ones are definitely relevant. However, there is currently much research on finding the definitely relevant literature automatically, through processes called information extraction or text mining. Typically, people then structure knowledge out of the documents that are definitely relevant.
Hersh, W. (2005a). Evaluation of biomedical text mining systems: lessons learned from information retrieval. Briefings in Bioinformatics, in press. Hersh, W. (2005b). Information retrieval and digital libraries. In Chen, H., et al., eds. Medical Informatics: Knowledge Management and Data Mining in Biomedicine. New York, NY. Springer-Verlag. 237-275. Hersh, W., Bhupatiraju, R., et al. (2006). Enhancing access to the bibliome: the TREC 2004 Genomics Track. Journal of Biomedical Discovery and Collaboration, 1: 3. http://www.j-biomed-discovery.com/content/1/1/3.
(3/20/04) Just how much information is out there? Lyman and Varian (2003) have attempted to quantify the amount of information on electronic media and its flow. They found that the sum of information on physical electronic media is about five exabytes (or 5,000 petabytes or 5 million terabytes). This is equivalent to about one-half million new libraries the size of the US Library of Congress. The majority of this information (72%) is stored on magnetic media, primarily hard disks, with most of the remainder on film and a small proportion on paper (about 1.5 petabytes, or roughly 0.0015 exabytes).
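The unit conversions in the paragraph above can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes, where each step up is a factor of 1,000; the function and constant names are illustrative only and do not come from the report.

```python
# Decimal (SI) storage units: each prefix is 1,000 times the previous one.
TB_PER_PB = 1_000  # 1 petabyte = 1,000 terabytes
PB_PER_EB = 1_000  # 1 exabyte  = 1,000 petabytes


def exabytes_to_petabytes(eb):
    """Convert exabytes to petabytes."""
    return eb * PB_PER_EB


def exabytes_to_terabytes(eb):
    """Convert exabytes to terabytes."""
    return eb * PB_PER_EB * TB_PER_PB


def petabytes_to_exabytes(pb):
    """Convert petabytes to exabytes."""
    return pb / PB_PER_EB


if __name__ == "__main__":
    total_eb = 5  # overall estimate quoted from Lyman and Varian (2003)
    print(exabytes_to_petabytes(total_eb))  # 5000
    print(exabytes_to_terabytes(total_eb))  # 5000000
    print(petabytes_to_exabytes(1.5))       # 0.0015 (the paper share)
```

Working in a single base unit such as terabytes also makes it easy to compare these totals with the per-medium figures listed for paper and the Internet.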
In a given year, the distribution of paper content around the world is as follows:
- Office documents - 279-1,379 terabytes
- Newspapers - 27-138 terabytes
- Mass market periodicals - 10-52 terabytes
- Books - 8-39 terabytes
- Journals - 1.3-6 terabytes
The amount of information on the Internet includes:
- "Surface" Web (fixed Web pages) - 167 terabytes
- "Deep" Web (database-driven Web pages) - 91,850 terabytes
- Email - 440,606 terabytes
- Instant messaging - 274 terabytes
Lyman, P. and Varian, H. (2003). How Much Information. Berkeley, CA, University of California Berkeley. http://www.sims.berkeley.edu/research/projects/how-much-info-2003/.
(3/31/04) Another model was recently put forth to guide information seeking and retrieval research (Jarvelin and Wilson, 2003). These authors note that scientific theories are useful for a variety of functions, as put forth by Bunge (1967):
- Systemization of knowledge - integrating, generalizing, explaining, and expanding
- Guiding research - defining relevant problems, identifying data to collect, proposing new research
- Mapping part of reality - representing or modelling objects and their relationships
They state that different types of information are needed in these information tasks:
- Problem information - the "structure, properties, and requirements of the problem"
- Domain information - the "known facts, concepts, laws, and theories in the domain of the problem"
- Problem-solving information - lays out how problems should be formulated and how problem and domain information should be used
An example of this in health care might be the information task of choosing an appropriate diagnostic test. The problem information recognizes that the task originates from the goal of making a medical diagnosis. The domain information brings forth the knowledge of diagnostic tests for patients who have symptoms similar to those of the patient at hand. The problem-solving information leads the clinician to apply the domain knowledge to this specific patient, resulting in a decision of which test (if any) to order.
Bunge, M. (1967). Scientific Research. Heidelberg. Springer-Verlag. Jarvelin, K. and Wilson, T. (2003). On conceptual models for information seeking and retrieval research. Information Research, 9(1). http://informationr.net/ir/9-1/paper163.html.
1.3.1 The information world
1.3.2 Users
1.3.3 Health decision making
1.4 IR resources
1.4.1 People
(3/15/05) A major pioneer in the IR field was Gerard Salton, a professor of computer science at Cornell University. Dr. Salton invented many of the techniques commonly called "automated retrieval" and is cited throughout the book. Unfortunately, he passed away in 1995 just as the Web searching world was taking off and adopting many of the techniques he developed. Shortly before his death, a conference was held in his honor, celebrating his contributions. Many of the talks from this meeting were captured and are available in the Open Video Project (http://www.open-video.org/) archive. To find the Salton videos, go to this site and search on "Salton." Salton's own talk is available at http://www.open-video.org/details.php?videoid=7057. Although I did not know him well personally, I was certainly drawn into this field by his writings. I was also impressed by his continued engagement in the field right up until his death. (Most of us old-timers recall him and Karen Sparck Jones sitting in the front row at conferences, critiquing presentations and each other's thoughts.)
(3/20/04) The book notes that individuals from a variety of disciplines comprise the field of IR. A well-known computer scientist who is among the leaders from that discipline recently gave a keynote lecture discussing the relationship between IR and computer science (Croft, 2003). Croft noted that IR has always been a small part of the overall computer science field but has a common heritage with the database systems area. He also noted that the field grew and was validated by the success of Web search engines in the 1990s. Finally, he laid out some known successes of the field:
- Search engines have become a significant means by which society accesses information.
- IR has long championed the "statistical" approach to using language, which has now been adopted by other areas of computer science, such as natural language processing.
- IR has focused on large-scale evaluation more extensively than other areas of computer science, which have come to adopt many of these techniques.
- IR has also focused on the importance of the user and interaction as part of its process.
- The global goals of information access and contextual retrieval (see below) are part of the vision of other grand research goals for computer science (e.g., Gray, 2003).
Gray, J. (2003). What next? A dozen information technology research goals. Journal of the ACM, 50: 41-57. Croft, W. (2003). Salton Award Lecture - Information retrieval and computer science: an evolving relationship. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada. ACM Press. 2-3.
(4/4/06) A recent paper by Moffat et al. (2005) provides a list of the most important IR research papers that are "recommended reading" for research students.
Moffat, A., Zobel, J., et al. (2005). Recommended reading for IR research students. SIGIR Forum, 39(2). http://www.acm.org/sigir/forum/2005D/2005d_sigirforum_moffat.pdf.
1.4.2 Organizations
1.4.3 Journals
(5/6/03) There are three on-line journals devoted to information retrieval and digital library research:
(3/20/04) There are also some new on-line biomedical journals with a strong focus on IR-related issues:
1.4.4 Texts
(4/4/06) For those interested in image retrieval and its associated issues, an overview textbook is Visual Information Retrieval (Del Bimbo, 1999). Another overview book is Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization (Jackson, 2002). The book provides a succinct but comprehensive overview of natural language processing, document retrieval, information extraction, text categorization, and text mining. A comprehensive three-volume reference on health and medical information on the Web, available both in print and on CD-ROM, is The MLA Encyclopedic Guide to Searching and Finding Health Information on the Web (Anderson, 2004). A couple of books have been published recently on Web searching (Hock, 2004; Poremsky, 2004) and another describes using the Web for research (Schlein, 2004). Another recent book addresses the overlap between IR and information seeking in context (Ingwersen, 2005).
There are fewer books on searching MEDLINE and other NLM resources these days, but excellent help can be found in the tutorials and help files on the PubMed site:
Anderson, P. and Allee, N., eds. (2004). The MLA Encyclopedic Guide to Searching and Finding Health Information on the Web. New York, NY. Neal-Schuman Publishers. Del Bimbo, A. (1999). Visual Information Retrieval. San Francisco, CA. Morgan Kaufmann Publishers. Hock, R. (2004). The Extreme Searcher's Internet Handbook: A Guide for the Serious Searcher. New York, NY. Information Today. Ingwersen, P. and Jarvelin, K. (2005). The Turn - Integration of Information Seeking and Retrieval in Context. Dordrecht, The Netherlands. Springer. Jackson, P. and Moulinier, I. (2002). Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization. Amsterdam, Holland. John Benjamins Publishing. Poremsky, D. (2004). Google and Other Search Engines. Berkeley, CA. Peachpit Press. Schlein, A. (2004). Find It Online, Fourth Edition: The Complete Guide to Online Research. Tempe, AZ. Facts on Demand Press.
1.4.5 Tools
(4/4/06) An up-to-date list of open-source search engines is at http://www.searchtools.com/tools/tools-opensource.html. One system of note is Lucene (Gospodnetic, 2005), which is written in Java and is now a project of the Apache Software Foundation. Another new IR system for research use is Zettair. Written by a group known for their accomplishments in index compression and search speed, this system is fast and flexible.
Gospodnetic, O. and Hatcher, E. (2005). Lucene in Action. Greenwich, CT. Manning Publications.
1.5 The Internet and World Wide Web
(3/20/04) Internet and Web usage is highly prevalent among Americans. In 2003, it was found that 126 million or 63% of Americans were on-line (Fox and Fallows, 2003). Usage was divided roughly equally between the sexes. Although use in all groups was growing, there were still digital divides by age, income, and ethnicity. Activities such as electronic commerce, downloading music, and on-line banking led in terms of growth. The most common uses of the Internet by Americans were found to be:
- Email - 102 million users
- Searching for answers to specific questions - 100 million users
- Instant messaging - 52 million users
- Doing work or research for job on-line - 61 million users
Fox, S. and Fallows, D. (2003). Internet Health Resources: Health searches and email have become more commonplace, but there is room for improvement in searches and overall Internet access. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/reports/toc.asp?Report=95.
(4/4/06) Search engine use is very high among Internet users. In a survey done in May-June, 2004, Fallows et al. (2004) found that 84% of Internet users have used a search engine (extrapolating from the usage statistics cited above, that translates into 107 million people) and that 87% of people say they find what they want most of the time. This memo also presented some facts gleaned from tracking the top 25 search engines:
- Americans conduct 3.9 billion searches per month
- The average user performs 33 searches per month, spending about 41 minutes at search engine sites
- The average visit to a search engine results in 4.4 searches
- The most popular search engine is Google (used for 47% of searches), followed by Yahoo (26%)
A recent analysis of search engine users shows that while they are enthusiastic about and trusting of search engines, they are also unaware of and naive about certain aspects of them (Fallows, 2005). A large majority of users report confidence in their searching abilities (92%) and say that their searches are successful most of the time (87%). However, 62% are unaware of the difference between paid and unpaid results.
One Google scientist notes that Google receives about 200 million searches per day (Singhal, 2004). Extrapolating from Google's market share, this means about 500 million searches are done per day. Of Google's 200 million searches each day, 100 million are unique. Searches average 2.4 words and are entered in 90 different languages. About 10-20% of the pages in Google's database change each month.
The Web has truly become "world-wide." According to Internet World Stats (http://www.internetworldstats.com), some 1.0 billion of the world's 6.5 billion people use the Internet (15.9%). Usage is of course higher in developed regions and countries such as the United States (68.1%), Oceania/Australia (52.1%), and Europe (35.9%). But it is growing even more rapidly in places like Latin America (14.8%, and as high as 35.7% in Chile and 26.4% in Argentina) and China (8.5%).
One irony that few IR "old timers" could ever have fathomed is the need, in the Web era, for the study of "adversarial" IR, that is, the development of techniques to prevent retrieval of certain content. One group of adversarial IR applications is the prevention of "spam" (i.e., unwanted) pages or emails (Metaxas, 2005). Singhal (2004) notes there is a continual tit-for-tat battle between those who develop search engines and those who try to "game" them. Indeed, there is a large market for attempting to drive traffic to one's Web site via search engines and other means, sometimes called "search engine marketing" (e.g., Moran and Hunt, 2005). Another form of adversarial IR is "filtering," with the usual goal of preventing linkage to pornography sites. Of course, most approaches to such filtering are imperfect and can lead to blocking of legitimate medical Web sites (Richardson et al., 2002). Indeed, one filter even blocks access to the Web site of the town of Toppenish, WA, because the letters of a blocked word appear in the middle of the town name (Anonymous, 2003).
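The Toppenish case is an instance of a filter matching a blocked string anywhere inside a longer word. The sketch below contrasts naive substring matching with whole-word matching. The block list and the example page titles are made up for illustration and are not taken from any real filtering product, whose rules are typically proprietary.

```python
import re

# Hypothetical block list; real filters use much larger lists.
BLOCKED_WORDS = ["sex"]


def blocked_by_substring(text):
    """Naive filter: block if a listed word appears anywhere in the text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED_WORDS)


def blocked_by_whole_word(text):
    """Stricter filter: block only when a listed word stands alone."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(word) + r"\b", lowered) is not None
        for word in BLOCKED_WORDS
    )


if __name__ == "__main__":
    for title in ["Town of Essex visitor guide", "Sex education resources"]:
        print(title, blocked_by_substring(title), blocked_by_whole_word(title))
```

Whole-word matching avoids the place-name problem in this toy example, but it would still block the second title, which is the kind of legitimate health content that word-based filtering can catch.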
Another concern about search engines is the growing desire of governments to monitor their usage (Hansell, 2006). Although such monitoring is ostensibly meant to thwart the very real threat of terrorism, many are concerned about governments knowing our searching interests. Some governments, most notably China, have also required search engines to filter pages containing certain words (such as democracy). At the current time, privacy laws that protect things like email and library check-outs do not protect queries to search engines.
Fallows, D., Rainie, L., et al. (2004). Data Memo on Search Engines. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Data_Memo_Searchengines.pdf. Fallows, D. (2005). Search Engine Users. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Searchengine_users.pdf. Hansell, S. (2006). Online Trail Can Lead To Court. New York Times. February 4, 2006. C1. http://www.nytimes.com/2006/02/04/technology/04privacy.html. Singhal, A. (2004). Challenges in Running a Commercial Web Search Engine. http://www.research.ibm.com/haifa/Workshops/searchandcollaboration2004/papers/haifa.pdf. Metaxas, P. and DeStefano, J. (2005). Web spam, propaganda and trust. First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan. http://airweb.cse.lehigh.edu/2005/metaxas.pdf. Moran, M. and Hunt, B. (2005). Search Engine Marketing, Inc.: Driving Search Traffic to Your Company's Web Site. Englewood Cliffs, NJ. Prentice Hall. Richardson, C., Resnick, P., et al. (2002). Does pornography-blocking software block access to health information on the Internet? Journal of the American Medical Association, 288: 2887-2894. Anonymous (2003). The Insider: 'Bess' Internet porn filter a little too easily offended. Seattle Post-Intelligencer. July 7, 2003. http://seattlepi.nwsource.com/business/129611_insider07.html.
(3/20/04) The popularization of the Web arguably began with the release of the Mosaic Web browser. The tenth anniversary of the release of this software was recently celebrated at the US National Science Foundation (Anonymous, 2003).
Anonymous (2003). Mosaic Web Browser Celebrates 10th Birthday. National Science Foundation. http://www.nsf.gov/od/lpa/news/03/pr0343.htm.
(1/10/03) According to Gartner, the one-billionth PC was sold in 2002: http://www.intel.com/pressroom/archive/releases/20020701corp.htm
(3/1/03) In Feb. 2003, the National Science Foundation released a report on Cyberinfrastructure, recommending that the organization spend an additional $1 billion per year developing the nation's "cyberinfrastructure" to support scientific research. The report argues that investment in a comprehensive cyberinfrastructure will profoundly change what scientists and engineers do, how they do it, and who participates.
(4/5/03) The presentation about Web searching by Travis and Broder (2001) cited in the book was recently published: Broder A, A taxonomy of Web search, SIGIR Forum, 2002, 36(2): 3-10. http://www.acm.org/sigir/forum/F2002/broder.pdf
In this paper, Broder notes that classic IR is driven by the user's information need, but that Web searching is often not informational. Instead, the user's intent might be navigational (e.g., finding a specific page) or transactional (e.g., purchasing something, downloading a file, or checking the status of an account). Broder notes that navigational searches are similar to what classic IR calls a "known-item search," since they usually have only one correct answer. He also states that "hub" pages (see section 1.5.2) with lists of links that get to the target in one click may be acceptable. In transactional queries, the user needs not only to reach a site but also to interact with it once he or she gets there.
Broder analyzed the frequency of these types of Web search among users of the AltaVista search engine by two means: a pop-up survey window and a search log analysis. He noted the limitations of each: pop-up survey takers were self-selected and may not represent all users or their needs, and it is usually difficult to know a user's exact intent from the query statement entered into a search engine. Based on his data, he estimates the following approximate distribution of types of Web search:
- Informational - 39-48%
- Navigational - 20-24%
- Transactional - 30-36%
In other words, fewer than half of the searches on the Web (at least those entered into AltaVista) are classic informational IR searches.
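To make the three categories concrete, here is a toy rule-based classifier. It is only a sketch: the cue words and the ".com" heuristic are illustrative guesses, not anything described in the paper, and the paper's own caveat applies, namely that a user's true intent often cannot be read from the query text alone.

```python
# Cue words are illustrative assumptions, not taken from the paper.
TRANSACTIONAL_CUES = {"buy", "purchase", "download", "order", "login"}
NAVIGATIONAL_CUES = {"homepage", "home", "site", "website", "www"}


def classify_query(query):
    """Guess whether a query is informational, navigational, or transactional."""
    tokens = query.lower().split()
    if any(token in TRANSACTIONAL_CUES for token in tokens):
        return "transactional"
    if any(token in NAVIGATIONAL_CUES or token.endswith(".com") for token in tokens):
        return "navigational"
    # With no cue present, fall back to the classic IR assumption.
    return "informational"


if __name__ == "__main__":
    for q in [
        "download acrobat reader",
        "national library of medicine homepage",
        "treatment of hypertension",
    ]:
        print(q, "->", classify_query(q))
```

A log analysis along these lines would likely mislabel many queries, which is one reason the reported distribution is given as ranges rather than single figures.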
Broder also describes what he calls three generations of search engines on the Web. The first generation uses mostly on-page data from static HTML pages and is very close to classic IR. The second generation uses off-page, Web-specific data such as link analysis, anchor text, and click-through data. He cites the Google PageRank algorithm as an example of this and notes that it supports informational as well as navigational queries. The third generation attempts to discern the "need behind the query" based on semantic analysis of the user's input and determination of their context. He gives the example of the user entering the name of a city and the system returning a hotel reservation page, map server, weather server, etc. The aim of this generation would be to support all transactional searches in addition to those that are informational and navigational. He does not state so explicitly, but an implication is that the Semantic Web (see section 10.2.4) will be helpful for this type of search.
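PageRank, the link-analysis example named above, can be illustrated with a short power-iteration sketch. This is the simplified textbook formulation, with an assumed damping factor of 0.85 and a made-up three-page link graph; it is not a description of how Google's production system works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute simplified PageRank scores.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical graph: pages a and b both link to c, and c links back to a.
    example = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(example).items()):
        print(page, round(score, 3))
```

In this small graph, page c ends up with the highest score because two pages link to it, and page b the lowest because nothing links to it. The score depends only on link structure and not on the query, which is what makes it "off-page" evidence.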
(3/20/04) A collection of leaders in the field recently held a workshop for defining the research agenda for the IR field (Allan et al., 2003). This workshop was motivated in part by the notion that current Web search engines are so effective that further research and development in the field are not warranted. However, this workshop noted that Web searching has become mainstream and successful, but is not the entire IR picture:
- Web searching and IR are not equivalent - Web searching is at best a part of overall information access.
- Web queries do not represent all information needs - Users do much more than search for the Web pages and other content indexed by Web search engines.
- Web search engines are effective for some types of queries in some contexts - There are many times when users are looking for more specific and/or different information that resides on the Web.
This workshop came to the conclusion that there are two general long-term challenges for the field:
- Global information access - Information needs should be satisfied through "natural, efficient interaction with an automated system that leverages world-wide structured and unstructured data in any language."
- Contextual retrieval - Search technologies and knowledge should be used together to find the most appropriate content for a user′s information need.
In other words, research in IR must aim to create systems that seamlessly search across the appropriate content at the appropriate time. The paper gives a well-cited example of a user entering a query for "Taj Mahal." If the user's system knew that he or she was going to attend an academic conference in India, it would provide him or her with information about the famous landmark. However, if the user was planning a trip to Atlantic City or enjoyed jazz music, the system would preferentially present information about the casino or jazz musician respectively.
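The "Taj Mahal" example can be sketched as a re-ranking step that prefers results whose topics overlap with what the system knows about the user. Everything in the sketch below is hypothetical: the Result class, the topic labels, and the overlap score are stand-ins chosen for illustration, and the workshop report does not prescribe any particular method.

```python
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    topics: frozenset  # hypothetical topic labels attached to each result


def rerank(results, user_context):
    """Order results by how many topic labels they share with the user's context.

    Ties keep their original order because Python's sort is stable.
    """
    return sorted(
        results,
        key=lambda result: len(result.topics & user_context),
        reverse=True,
    )


if __name__ == "__main__":
    results = [
        Result("Taj Mahal, the monument in Agra", frozenset({"india", "travel", "landmark"})),
        Result("Taj Mahal casino in Atlantic City", frozenset({"atlantic city", "casino"})),
        Result("Taj Mahal, the musician", frozenset({"jazz", "music"})),
    ]
    print(rerank(results, frozenset({"jazz"}))[0].title)
    print(rerank(results, frozenset({"india", "conference"}))[0].title)
```

The hard research problems lie in what this sketch takes for granted: obtaining a reliable picture of the user's context and deciding how strongly it should override the query itself.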
The report then outlined what workshop attendees considered to be the major challenge areas for IR research:
- Retrieval models - Web search engines tend to have a "one size fits all" model that does not take into account other tasks that the user wishes to perform, such as answer questions, browse specific collections of information, find certain types of content, etc.
- Cross-language IR - While English was initially the predominant language of the Web, less than half of all pages are in English and at some point in the future, other languages might surpass it. Systems need to find content in other languages when appropriate and provide the user a summary so he or she can determine whether to expend the resources to translate it.
- Web search - While Web search is not the only type of IR application, it is certainly very popular, and further research must continue to improve it.
- User modeling - Different users have diverse needs, even when searching for the same "topic." This is certainly true in health care, where a patient, primary care physician, and subspecialist all might want information on the same topic but bring different levels of reading ability, prior knowledge, and so forth to the information seeking process.
The report also lists the following areas as major challenges, but they really represent specific instances of the above general challenges:
- Filtering, topic detection and tracking, and classification - Users need intelligent tools to sift through large quantities of information to follow topics and threads.
- Summarization - Users can also be aided by systems that more effectively summarize information instead of just presenting lists of specific content.
- Question-Answering - A common use of IR systems is to answer a specific question, yet most systems just present content that the user must read to find that answer.
- Metasearch - Users often want to search over multiple resources that may not all be represented in a single index.
- Multimedia - While early IR was focused on text, real users are also interested in finding other types of content, such as images, videos, and sounds.
Allan, J., Aslam, J., et al. (2003). Challenges in information retrieval and language modeling. SIGIR Forum, 37(1): 31-47. http://www.sigir.org/forum/S2003/ir-challenges2.pdf.
1.5.1 Hypertext and Hypermedia
1.5.2 The Web and health care
(2/12/03) According to the American Medical Association, physician use of the Web continues to grow beyond the figures cited in the second edition. Some other facts from this report include:
- Two-thirds of physicians who use the Web do so daily.
- The average physician user of the Web spends 7.1 hours per week using it
- About 65% of physicians over 60 years of age use the Web, showing use is not limited to younger physicians
- About 30% of physicians have a Web site for their practice
A press release describing the report is at: http://www.ama-assn.org/ama/pub/article/1616-6473.html
(4/4/06) A more recent summary of physician Internet usage data found that 98% of US physicians use the Internet while half own personal digital assistants (PDAs) (Anonymous, 2005). A proposed new model of continuing medical education gives credit for documented information seeking during clinical care (Davis, 2004).
Anonymous (2005). Physician Internet Use Statistics. http://www.max.md/pdf/PhysicianInternetUseStatistics.pdf. Davis, N. and Willis, C. (2004). A new metric for continuing medical education credit. Journal of Continuing Education in the Health Professions, 24: 139-144.
(3/20/04) Another study of physician use showed that those who were more active clinically (i.e., saw more patients per week) spent more time on-line (Taylor and Leitman, 2001).
Taylor, H. and Leitman, R. (2001). The Increasing Impact of eHealth on Physician Behavior. Harris Interactive. http://www.harrisinteractive.com/news/newsletters/healthnews/HI_HealthCareNews2001Vol1_iss31.pdf.
(4/4/06) Another growing category of IR system users is biomedical researchers. This is due in large part to new "high-throughput" biotechnologies, such as gene microarrays. These technologies not only generate large amounts of data, but also identify new information that must be explored, e.g., the microarray experiment that uncovers increased expression of genes previously unknown to be related to a physiological or disease process. There is growing awareness that IR and other techniques, such as text mining, are important tools for researchers (Jensen, 2006; Hunter, 2006).
But literature retrieval and analysis are difficult for scientists. Barnes and Gary (2003) say, "Few areas of biological research call for a broader background in biology than the modern approach to genetics. This background is tested to the extreme in the selection of candidate genes for involvement with a disease process… Literature is the most powerful resource to support this process, but it is also the most complex and confounding data source to search."
Barnes, M. and Gary, R. (2003). Bioinformatics for Geneticists. West Sussex, England. John Wiley & Sons. Hunter, L. and Cohen, K. (2006). Biomedical language processing: what's beyond PubMed? Molecular Cell, 21: 589-594. Jensen, L., Saric, J., et al. (2006). Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews - Genetics, 7: 119-129.
(4/4/06) Another group of heavy users of the Web for health and biomedical information is consumers and patients. One area of debate concerns how often they use the Web to seek health information. Some early reports put the figure as high as 80% of all Internet users, such as a Harris Interactive Poll (2002). This poll found about 80% of all adults who are online sometimes used the Web to look for health care information. About 18% said they did so "often", while most did so "sometimes" (35%) or "hardly ever" (27%). This 80% of all those online amounted to 110 million users nationwide. This compared with 54 million in 1998, 69 million in 1999 and 97 million in 2001. On average, those who ever looked for health care information online did so three times every month. About half (53%) of those who looked for health care information used a portal or search engine that allowed them to search for the health information they wanted across many different Web sites. About a quarter (26%) went directly to a site that focused only on health-related topics, and one in eight (12%) went first to a general site that focused on many topics and may have had a section on health issues.
Taylor, H. (2002). Cyberchondriacs Update. Harris Interactive. http://www.harrisinteractive.com/harris_poll/index.asp?PID=299.
The Pew Internet & American Life Project has published a number of studies on information seeking, related not only to health care but also to search engines in general:
- Fox, S. and Rainie, L. (2000). The Online Health Care Revolution: How the Web Helps Americans Take Better Care of Themselves. Pew Internet & American Life Project. http://www.pewinternet.org/reports/toc.asp?Report=26
- Fox, S. and Rainie, L. (2000). Vital Decisions: How Internet users decide what information to trust when they or their loved ones are sick. Pew Internet & American Life Project. http://www.pewinternet.org/reports/toc.asp?Report=59
- Fox, S. (2002). Search Engines: A Pew Internet Project Data Memo. Pew Internet & American Life Project. http://www.pewinternet.org/reports/toc.asp?Report=64
- Fox, S. and Fallows, D. (2003). Internet Health Resources: Health searches and email have become more commonplace, but there is room for improvement in searches and overall Internet access. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/reports/toc.asp?Report=95
- Fallows, D., Rainie, L., et al. (2004). Data Memo on Search Engines. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Data_Memo_Searchengines.pdf
- Fallows, D. (2005). Search Engine Users. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_Searchengine_users.pdf
- Rainie, L. and Shermak, J. (2005). Search Engine Use November 2005. Washington, DC, Pew Internet & American Life Project. http://www.pewinternet.org/pdfs/PIP_SearchData_1105.pdf
The Fox, 2003 report found that 73 million US users have searched for specific health information (58% of all users) and 93 million have carried out a search related to health information generally (74% of all users). Both the Fox, 2003 and Rainie, 2005 reports found that 80% of users had searched specifically for health information. A recent analysis of all the Pew data found several factors associated with a higher likelihood of Internet health searching, including female gender, part-time employment, other Internet use, specific health problems, and helping others deal with health problems (Rice, 2006).
Other studies, however, have taken exception to these high rates of use. Most notably, a study in JAMA claimed that only 40% of Internet users actually used the Web to seek health information (Baker et al., 2003). This study also found that only a third of those who sought health information reported that the information affected a decision about their health or health care. A number of letters to the editor pointed out limitations of this study, the most notable one being that study participants came from a pool of users offered free access to WebTV, a form of Internet access used by a very small fraction of all users. Another study found that 31% of all Americans (not just those on-line) have used the Internet to search for health information over a 12-month period (Murray et al., 2003). About 8% of these individuals took information to their physician, although two-thirds wanted their physician's opinion as opposed to specific treatment. Additional research has found that only a minority of Americans (38.2%) seek health information generally, with the most common sources being books or magazines (23.0%), friends or relatives (19.7%), the Internet (16.1%), and television and radio (11.3%) (Tu and Hargraves, 2003).
Baker, L., Wagner, T., et al. (2003). Use of the Internet and e-mail for health care information: results from a national survey. Journal of the American Medical Association, 289: 2400-2406. Murray, E., Lo, B., et al. (2003). The impact of health information on the internet on the physician-patient relationship: patient perceptions. Archives of Internal Medicine, 163: 1727-1734. Rice, R. (2006). Influences, usage, and outcomes of Internet health information searching: multivariate results from the Pew surveys. International Journal of Medical Informatics, 75: 8-28. Tu, H. and Hargraves, J. (2003). Seeking Health Care Information: Most Consumers Still on the Sidelines. Washington, DC, Center for Studying Health System Change. http://www.hschange.org/CONTENT/537/.
Surveys of users of on-line health information show that they believe there is room for improvement (Anonymous, 2003). A 2002 survey by Manhattan Research of about 3,000 users of on-line health information found the following:
- 65% believe accuracy of on-line health information needs to increase.
- 64% believe the quality of such information must improve.
- 22% have difficulty reading and understanding on-line health information.
- 51% have a hard time determining credibility of this information.
- 81% state that content reviewed by a health care professional increases their likelihood of trusting the information they find.
- 80% say that separation of content from advertising drives their trust in the information.
Anonymous (2003). Americans Expect More of Their Online Health Information Resources. New York, NY, Manhattan Research. http://www.manhattanresearch.com/Credibility,%20Accuracy,%20and%20Readability%20(052703).pdf.
1.5.3 Models of the World Wide Web
(3/20/04) Work on modeling the Web continues to grow. An entire book recently appeared on the topic, covering such areas as text analysis, link analysis, and human behavior (Baldi et al., 2003). A related area of work (and a book) involves "mining" the Web for information and knowledge (Chakrabarti, 2003).
Baldi, P., Frasconi, P., et al. (2003). Modeling the Internet and the Web - Probabilistic Methods and Algorithms. West Sussex, England. John Wiley & Sons. Chakrabarti, S. (2003). Mining the Web - Discovering Knowledge from Hypertext Data. San Francisco, CA. Morgan Kaufmann.
(3/15/05) Further analysis of the Web has provided insights into general phenomena of networked systems. Barabási (2002) has studied this extensively, noting similarities in different types of networks. He has also noted this phenomenon in the organization of the living cell (Barabási, 2004).
Barabási, A. (2002). Linked: The New Science of Networks. Cambridge, MA. MIT Press. Barabási, A. and Oltvai, Z. (2004). Network biology: understanding the cell's functional organization. Nature Reviews - Genetics, 5: 101-113.
1.6 A sample document database for examples