Information science interest in hyperlinks started in 1996 and has been mainly driven by analogies with citations in journal articles. The fields of bibliometrics and scientometrics make extensive use of citations both to help assess the quality of academic work and to trace patterns of scholarly communication (Borgman & Furner, 2002; Wouters, 1999). The underlying assumptions are that more important or higher-quality articles will tend to be cited more, and that citations often indicate that the work in the cited article has been built upon or otherwise used by the citing article (Cronin, 1984). In fact, reasons for citing are extremely diverse (Borgman & Furner, 2002), but citation analysis remains an effective, if controversial, tool that is used for a variety of purposes (Garfield, 1979; Oppenheim, 1997; Moed, 2002).
Two early articles examined the use of hyperlinks to track Web information (Larson, 1996; Rodríguez Gairín, 1997). An extensive theoretical discussion by Almind and Ingwersen (1997) also set the foundations for and gave a name to the new field of Webometrics. The event that triggered Webometrics was the deployment by commercial search engines such as AltaVista of an interface that allowed anybody to count links between large web spaces with a simple command. This made it possible to think about creating techniques to exploit this new facility and to begin to speculate about and investigate potential new applications. The information scientists who noted this potential naturally turned to their own disciplines to look for applications, and the apparently close analogy between hyperlinks and citations, both being the referencing of one document by another, gave them a ready-made set of research questions and techniques through the adaptation of citation analysis.
Rousseau (1997) popularized the term sitation for a Web hyperlink, foregrounding the citation analogy. Aguillo (1998) at the same time started the e-journal Cybermetrics and began an extensive investigation into various aspects of Web and Internet use, including hyperlinks. The analogy between hyperlinks and citations has continued to generate interest within information science, including speculations about the kind of information that they could reveal in different contexts (Borgman & Furner, 2002; Björneborn & Ingwersen, 2001; Cronin, 2001; Davenport & Cronin, 2000; Thelwall, 2002i).
The starting point for Webometrics, then, was the attempt to apply citation analysis to the Web context. Since citation analysis tracks (to some extent) scholarly communication, some researchers have sought to use hyperlink counts as a measure of the extent of online communication between the owners of two or more sets of Web pages. Other citation analysis attempts to evaluate bodies of work through their citation counts, which has led to a second type of Webometrics approach: to see whether link counts can be valid measures of online impact. This has led to investigations into whether pages attract hyperlinks primarily for the quality or interest level of their contents, so that hyperlink counts would measure some kind of online impact.
The starting points in terms of methods are not just formulae and algorithms for computing useful information, but also a wide range of data validation techniques, partly a legacy of the continued controversy surrounding evaluative citation analysis.
Data collection: Web crawlers and commercial search engines
Early Webometric studies used commercial search engine advanced queries to obtain hyperlink counts, primarily from AltaVista and AllTheWeb. For example, entering the query host:wlv.ac.uk AND link:knaw.nl into the AltaVista advanced query section could be used to count the number of pages with domain names ending in wlv.ac.uk that contain a hyperlink that includes the text knaw.nl. This could be expected to give a count of Wolverhampton University (http://www.wlv.ac.uk) pages that hyperlink to the Koninklijke Nederlandse Akademie van Wetenschappen (KNAW, Royal Netherlands Academy of Arts and Sciences, http://www.knaw.nl), either the main site or a subdomain in each case. Such queries gave easy access to useful data from the huge search engine databases. One well-known drawback is that no search engine can index the whole Web (Thelwall, 2002h), and actual coverage in 1999 appeared to be below 16% for the major search engines of the time (Lawrence & Giles, 1999). If a commercial search engine is used, then clearly this is a limitation that must be accepted and discussed in the study.
A second major problem, however, was immediately discovered: that the results returned by search engines fluctuated irregularly, sometimes dramatically (Bar-Ilan, 1999; Mettrop & Nieuwenhuysen, 2001; Rousseau, 1999; Snyder & Rosenbaum, 1999; Thelwall, 1999). In order to combat this, Rousseau (1999) proposed multiple search rounds and an averaging process.
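The averaging idea can be illustrated with a minimal Python sketch. It assumes a plain arithmetic mean over repeated runs of the same query; the source says only that multiple search rounds and "an averaging process" were proposed, so the exact procedure, the function name and the counts below are all hypothetical.

```python
from statistics import mean, pstdev


def summarize_rounds(counts):
    """Summarize hit counts obtained by repeating one query several times.

    counts: list of hit counts, one per search round (hypothetical data).
    Returns the mean count, the population standard deviation as a rough
    indicator of fluctuation, and the number of rounds.
    """
    if not counts:
        raise ValueError("need at least one search round")
    return {
        "mean": mean(counts),
        "spread": pstdev(counts),
        "rounds": len(counts),
    }


# Invented counts for the same query submitted on five occasions.
rounds = [120, 135, 90, 128, 127]
summary = summarize_rounds(rounds)
print(summary)
```

Reporting the spread alongside the mean makes the degree of fluctuation visible, rather than hiding it inside a single averaged figure.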
Additionally, search engines have exhibited peculiarities in behavior in the past. Snyder and Rosenbaum (1999) reported that AltaVista had the following problems: "The AltaVista metaterm 'link' is intended to retrieve the total number of pages each of which has at least one link to a specified page. In practice, the metaterm frequently fails to retrieve all, or even most of the linkpages. Conversely, the 'link' command sometimes retrieves pages that do not contain the link specified" (p. 380). A more recent systematic comparison of AltaVista's results with those of a specialist crawler found it to be highly reliable for UK university sites (Thelwall, 2001a). This particular problem seems to have disappeared, although there is no guarantee that other problems will not occur in the future.
A logical alternative is to create a specialist information science Web crawler to ensure reliable access to data, as called for by Bar-Ilan (2001). Such a tool has now been developed and extensively used (Thelwall, 2001ef) but this is not on a scale to rival commercial search engines, being capable of crawling all universities in a single country within a month, but not capable of giving significant international coverage. As a result, and helped by improvements in search engine reliability (Thelwall, 2001g; Vaughan & Thelwall, 2003), subsequent research has used both approaches.
In situations where the results are suspected to be unreliable, or where the query sent cannot be specific enough to capture the data, human inspection can be used as an additional filtering step, or a random sample can be checked to estimate the accuracy of the results (Cronin et al., 1998). This approach could be used, for example, to ensure that the hyperlink still connected to the desired Web site (or Web page) in case the site (or page) had moved or been deleted since the hyperlink was created. In this case, error messages such as "The server does not have a DNS entry" or "404 URL Not Found" would often be obtained.
Data analysis methodologies
A key article that provoked many follow-up investigations was that of Ingwersen (1998), which introduced the Web Impact Factor (WIF). This is a metric designed to assess the impact of an area of the Web based upon counting the number of hyperlinks to it. In fact, many different variants were proposed and tested, but the most initially successful were the external absolute WIF and the external relative WIF. The former is simply the number of pages outside the area being measured that contain a hyperlink to it. The latter is this figure divided by the number of pages in the target area. Both of these omit hyperlinks between pages within the target area, since these are often for navigation purposes and are therefore not useful indicators of external impact.
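The two external WIF definitions can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Ingwersen's procedure: the "area" is taken to be every host equal to or ending in a given domain name, the link list is assumed to be available as (source URL, target URL) pairs, and the function names and sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def in_area(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def external_wifs(links, domain, pages_in_area):
    """Return (external absolute WIF, external relative WIF) for one area.

    links: iterable of (source_url, target_url) pairs.
    pages_in_area: number of pages in the target area (the denominator).
    """
    # Distinct pages outside the area that link to any page inside it;
    # links from inside the area are ignored as probable navigation links.
    linking_pages = {
        src for src, dst in links
        if in_area(dst, domain) and not in_area(src, domain)
    }
    absolute = len(linking_pages)
    relative = absolute / pages_in_area if pages_in_area else float("nan")
    return absolute, relative


# Invented link data for illustration only.
links = [
    ("http://www.knaw.nl/a.html", "http://www.wlv.ac.uk/"),
    ("http://www.knaw.nl/a.html", "http://www.wlv.ac.uk/research/"),
    ("http://www.scit.wlv.ac.uk/b.html", "http://www.wlv.ac.uk/"),
    ("http://example.org/c.html", "http://www.scit.wlv.ac.uk/x.html"),
]
absolute, relative = external_wifs(links, "wlv.ac.uk", pages_in_area=4)
print(absolute, relative)
```

The second pair shares its source page with the first and is counted once, and the third pair is internal to the area, so only two external linking pages remain.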
These or other hyperlink metrics have been applied to various Web spaces (Ingwersen, 1998). A summary of the different spaces analyzed is listed in Table 1. Key findings will be summarized separately later.
| Research Areas | Articles |
| --- | --- |
| Journal articles | Goodrum et al. (2002) |
| Journal Web sites | Smith (1999a), Harter & Ford (2000), Vaughan & Hysen (2002), Vaughan & Thelwall (2003) |
| Countries | Ingwersen (1998), Thelwall (2001b) |
| Sectors within collections of countries | Leydesdorff & Curran (2000) |
| National collections of universities | Smith (1999a), Thelwall & Harries (2003), Thelwall & Wilkinson (2003a), Thelwall (2000, 2001ad, 2002a-f, 2003ab) |
| International collections of universities | Smith (1999b), Polanco et al. (2001), Smith & Thelwall (2002), Thelwall & Smith (2002), Thelwall & Tang (2003), Thelwall et al. (2003) |
| Academic departmental Web sites | Darmoni et al. (2000), Thomas & Willett (2000), Chu et al. (2002), Douyère et al. (2002), Li et al. (2002), Tang & Thelwall (2003) |
| Other academic-related collections of sites or pages | Hernandez-Borges et al. (1999), Cui (1999), Björneborn (2001), Soualmia et al. (2002) |
| Commercial Web sites | Thelwall (2001bc) |
Table 1. Research areas and articles.
One problem with the relative WIF is that counts of pages within an area of the Web can be substantially less reliable than hyperlink counts. This is due to various factors including mirror site inclusion and design decisions about Web page sizes and format (Thelwall, 2001a). In response to this, academic WIFs have also used university staff numbers as their denominator (Thelwall, 2001a, 2003b), giving improved results. An alternative approach has been to dispense with the WIF denominator altogether and to model the data using the raw hyperlink counts, which arguably gives more intuitive results (Thelwall, 2002b).
Other analytic approaches have included a variety of multivariate statistical techniques and pathfinder network scaling (Thelwall, 2002e), but these have all been found wanting when applied to national academic Webs because of a poor fit between the underlying assumptions of the statistical tests and the multiple trends present in the data. A simpler approach, applicable to small collections of sites, is the network diagram, where arrows are drawn connecting sites with thickness proportional to a hyperlink-based calculation (Thelwall, 2001b). Even this approach has its limitations, however, with four potential arrow thickness metrics giving different results and problems of scale occurring with domains of different sizes (Thelwall & Smith, 2002). The four metrics used for arrow thickness are as follows.
- Total link counts
- Total links divided by total number of pages in the target site(s)
- Total links divided by total number of pages in the source site(s)
- Total links divided by total number of pages in the source and target site(s)
Each method gives a different perspective on the data set and it can be useful to produce all four. Total link counts paint a picture of the total linking among the set whereas the last one factors out size so that the underlying tendency to link can be seen. In contrast, dividing by target site pages gives an indication of which sites attract the most links per page, and where these links come from. Dividing by source size gives an indication of which sites host the most links per page, and the sites that they target.
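The four arrow thickness metrics can be computed side by side, as in the following Python sketch. One point is an assumption: the text does not say whether the fourth metric divides by the sum or the product of the source and target page counts, and the sketch assumes the product. The function name and the figures are hypothetical.

```python
def arrow_thickness(link_count, source_pages, target_pages):
    """Return the four candidate arrow thickness values for one site pair.

    link_count: hyperlinks from the source site(s) to the target site(s).
    source_pages, target_pages: page counts of the two sides (both > 0).
    """
    if source_pages <= 0 or target_pages <= 0:
        raise ValueError("page counts must be positive")
    return {
        "total": link_count,
        "per_target_page": link_count / target_pages,
        "per_source_page": link_count / source_pages,
        # ASSUMPTION: normalise by the product of the two sizes; the
        # source text does not specify product versus sum.
        "per_source_and_target": link_count / (source_pages * target_pages),
    }


# Invented figures: 200 links from a 1000-page site to a 500-page site.
thickness = arrow_thickness(200, source_pages=1000, target_pages=500)
for name, value in thickness.items():
    print(name, value)
```

Producing all four values for every pair of sites allows the same diagram to be redrawn under each metric and the resulting pictures compared.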
Validating Web hyperlink counts: Correlations and motivations
When a new data source is found, it is important to validate it in terms of reliability, representativity and its potential uses. Typically, this process involves theoretical studies of the data source as well as statistical exercises to compare the data with other, better-known quantities (Oppenheim, 2000). As a result, early Webometric studies compared WIFs with other impact measures. It was found that WIFs did not correlate highly with journal impact factors for e-journals (Harter & Ford, 2000; Smith, 1999a) and did not seem to be related to the research impact or quality of universities or departments (Smith, 1999a; Thelwall, 2000; Thomas & Willett, 2000). These results were a major blow for Webometrics, but then a whole series of positive results showed that the approaches did have potential. A study of 25 UK universities made the breakthrough by showing a significant correlation between WIFs and average research quality, both for AltaVista data and specialist crawler data (Thelwall, 2001a). Since then, further significant results have been found for a larger group of 109 UK universities (Thelwall, 2002d), Australian universities (Smith & Thelwall, 2002) and Taiwanese universities (Thelwall & Tang, 2003). In addition, significant correlations have now been found for journal Web sites (Vaughan & Hysen, 2002; Vaughan & Thelwall, 2003) and academic departments (Chu et al., 2002; Li et al., 2002; Tang & Thelwall, 2003). As a result of these studies, it can now be concluded that counts of hyperlinks to academic-related Web sites are frequently strongly associated with research quality. This does not imply, however, that there is a cause-and-effect relationship between the two. To address this issue, it is necessary to find out why hyperlinks are created and to study motivations for hyperlinking.
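A correlation test of this kind can be sketched as follows. The choice of Spearman's rank correlation is an assumption made for illustration, since the text does not state which coefficient each study used, and the link counts and quality ratings are invented numbers, not data from any cited study.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Invented inlink counts and research quality ratings for five institutions.
link_counts = [120, 450, 300, 900, 60]
quality = [2.1, 3.0, 3.4, 4.5, 1.8]
rho = spearman(link_counts, quality)
print(round(rho, 3))
```

A high coefficient on real data would indicate association only; as the surrounding text stresses, it would say nothing about why the hyperlinks were created.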
Some of the earlier correlation-based studies included exercises that attempted to identify the motivations behind hyperlink creation. After investigating various hypotheses to explain these (including the Matthew effect) (Thelwall, 2001a, 2002b, 2003b), it seems that there is greater Web-related activity in institutions that produce more research.
At the individual researcher level, Kim (2000) explored the influences on a researcher's hyperlinking behavior, based on a focus interview with 15 scholars who included external (also called outgoing) hyperlinks in their academic papers. Although the hyperlinks in scholarly electronic articles are created as the result of a variety of motivations, Kim found that scholarly and social motivations are as strong as technological reasons. At the interorganizational level, in a case study of ArXiv.org and SPIRES-HEP (http://www.slac.stanford.edu/spires/hep), which are a Web-accessible pre-print article server and an extensive bibliographic database respectively, Kling et al. (2001) found that hyperlinks between the two organizations arose from a variety of social, economic, and technological relationships, such as joint grant research and a Web site interface. Although hyperlinking technology was born of the advancement of computer technologies, whether and how hyperlinking among people and organizations is established is socially and culturally determined in the particular context (Hine, 2000).
The most direct academic hyperlink motivation study took a random sample of 414 hyperlinks between UK universities and classified them by the apparent motivation for their creation (Wilkinson et al., 2003). It was found that although fewer than 1% of hyperlinks targeted formal scholarly publications, such as a journal article or conference paper, over 90% targeted material that was in some way related to research or other scholarly activity, such as teaching. This shows that Web hyperlinks are best viewed as data about informal scholarly communication. In fact they may be the most publicly available data source for informal scholarly communication, and hence have great potential.
It is likely that in the future, much more detailed studies of hyperlinking motivation will be undertaken in an attempt to gain a better understanding of the phenomenon, but it is expected that the results will be complex and highly context-dependent (Thelwall, 2002g).
Alternative document models
One discovery from the studies that analyzed hyperlinks between universities (Thelwall, 2001a; 2002b; 2003b) was that there were many cases in which one site contained thousands of hyperlinks to another, all created for essentially the same reason, and that this would show up as an anomaly in the data. A typical example would be the Web site of a collaborative research project where each page contains a standard hyperlinks bar that includes a hyperlink to the home page of each partner institution. This violates the implicit assumption of hyperlink analysis that each hyperlink should be of approximately the same importance as the others. As an example, clearly 1000 automatically generated hyperlinks should carry less weight than 1000 created by the decisions of different academics.
In order to circumvent this problem, alternative document models (ADMs) were created (Thelwall, 2002d), which aggregate hyperlinks together based upon directories, domains and whole sites instead of the page, as used in all previous research. The ADMs produced much better correlations with research productivity (Thelwall, 2002; Thelwall & Harries, 2003; Thelwall & Wilkinson, 2003a), showing the value of this approach. Their main drawback is that they are difficult to use if raw data is obtained from commercial search engines, since for each hyperlink count the exact hyperlink source and target URL must be known to perform the aggregation process. Programs are now freely available online to perform ADM aggregations on crawler data (Thelwall, 2001e).
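The aggregation step can be illustrated with a short Python sketch. It is not the freely available software mentioned above. It assumes that each link is known as a (source URL, target URL) pair, that a link is counted once per distinct pair of source and target units, and that site membership is decided from a supplied list of site domain names; all names and URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def adm_unit(url, model, site_domains=()):
    """Map a URL to its aggregation unit under one document model.

    model: "page", "directory", "domain" or "site".
    site_domains: hypothetical list of site names, e.g. ("wlv.ac.uk",).
    """
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    if model == "page":
        return host + path
    if model == "directory":
        directory = path if path.endswith("/") else posixpath.dirname(path)
        if not directory.endswith("/"):
            directory += "/"
        return host + directory
    if model == "domain":
        return host
    if model == "site":
        for domain in site_domains:
            if host == domain or host.endswith("." + domain):
                return domain
        return host  # unknown hosts are treated as their own site
    raise ValueError("unknown document model: " + model)


def count_links(links, model, site_domains=()):
    """Count distinct (source unit, target unit) pairs under a model."""
    return len({
        (adm_unit(src, model, site_domains), adm_unit(dst, model, site_domains))
        for src, dst in links
    })


# Invented case: 1000 project pages carrying the same navigation bar link.
links = [
    ("http://project.wlv.ac.uk/pages/p%d.html" % i, "http://www.knaw.nl/")
    for i in range(1000)
]
sites = ("wlv.ac.uk", "knaw.nl")
counts = {m: count_links(links, m, sites)
          for m in ("page", "directory", "domain", "site")}
print(counts)
```

In this invented case the page model reports 1000 links, while the directory, domain and site models each collapse the replicated navigation link to a single link, which is the effect that the ADMs are intended to achieve.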
ADMs have been found flawed in two contexts. Sheffield University appears to use different domain names less frequently than other UK universities, turning it into an anomaly for the domain ADM (Thelwall, 2002a). Essex University hosts a database of hyperlinks to thousands of different German Web sites, creating an external hyperlinks anomaly (Thelwall, 2003a).
Key findings
Webometrics studies have yielded a number of interesting results (Smith & Thelwall, 2002; Tang & Thelwall, 2003; Thelwall, 2001c, 2002a, 2002c, 2002e, 2002g; Thelwall et al., 2003; Thelwall & Smith, 2002; Thelwall & Vaughan, 2003). Among other indications, these results point to the need to study the Internet in a way that is sensitive to fields, as a dynamic, differentiated space, in which geographical notions still matter.
From these findings the concerns with online relationships and impact can be seen. What they do not show is the concern with data validation and methodological development that has characterized most Webometrics research and has been discussed above.
The key contributions of Webometrics to hyperlink analysis have been the development of methods for data collection, processing and validation. In addition, a range of general results has been generated about how the Web is used, primarily in academia, and establishing factors that influence web use or impact, as measured by hyperlink counts.
Clearly, the conclusion must be drawn that great care has to be taken in collecting data, processing it to remove the anomalies (e.g. using the Alternative Document Models if possible), and interpreting the results. Failure to do this risks arriving at incorrect results and misleading conclusions.