Hyperlink Analysis: The Webometrics Approach

Information science interest in hyperlinks started in 1996 and has been mainly driven by analogies with citations in journal articles. The fields of bibliometrics and scientometrics make extensive use of citations both to help assess the quality of academic work and to trace patterns of scholarly communication (Borgman & Furner, 2002; Wouters, 1999). The underlying assumptions are that more important or higher-quality articles will tend to be cited more, and that citations often indicate that the work in the cited article has been built upon or otherwise used by the citing article (Cronin, 1984). In fact, reasons for citing are extremely diverse (Borgman & Furner, 2002), but citation analysis remains an effective, if controversial, tool that is used for a variety of purposes (Garfield, 1979; Oppenheim, 1997; Moed, 2002).

Two early articles examined the use of hyperlinks to track Web information (Larson, 1996; Rodríguez Gairín, 1997). An extensive theoretical discussion by Almind and Ingwersen (1997) also set the foundations for and gave a name to the new field of Webometrics. The event that triggered Webometrics was the deployment by commercial search engines such as AltaVista of an interface that allowed anybody to count links between large web spaces with a simple command. This made it possible to think about creating techniques to exploit this new facility and to begin to speculate about and investigate potential new applications. The information scientists who noted this potential naturally turned to their own disciplines to look for applications, and the apparently close analogy between hyperlinks and citations, both being the referencing of one document by another, gave them a ready-made set of research questions and techniques through the adaptation of citation analysis.

Rousseau (1997) popularized the term sitation for a Web hyperlink, foregrounding the citation analogy. At the same time, Aguillo (1998) started the e-journal Cybermetrics and began an extensive investigation into various aspects of Web and Internet use, including hyperlinks. The analogy between hyperlinks and citations has continued to generate interest within information science, including speculations about the kind of information that they could reveal in different contexts (Borgman & Furner, 2002; Björneborn & Ingwersen, 2001; Cronin, 2001; Davenport & Cronin, 2000; Thelwall, 2002i).

The starting point for Webometrics, then, was the attempt to apply citation analysis to the Web context. Since citation analysis tracks (to some extent) scholarly communication, some researchers have sought to use hyperlink counts as a measure of the extent of online communication between the owners of two or more sets of Web pages. Other citation analysis attempts to evaluate bodies of work through their citation counts, which has led to a second type of Webometrics approach: to see whether link counts can be valid measures of online impact. This has led to investigations into whether pages attract hyperlinks primarily for the quality or interest level of their contents, so that hyperlink counts would measure some kind of online impact.

The starting points in terms of methods are not just formulae and algorithms for computing useful information, but also a wide range of data validation techniques, partly a legacy of the continued controversy surrounding evaluative citation analysis.

Data collection: Web crawlers and commercial search engines

Early Webometric studies used the advanced queries of commercial search engines, primarily AltaVista and AllTheWeb, to obtain hyperlink counts. For example, entering the query host:wlv.ac.uk AND link:knaw.nl into the AltaVista advanced query section could be used to count the number of pages with domain names ending in wlv.ac.uk that contain a hyperlink that includes the text knaw.nl. This could be expected to give a count of Wolverhampton University (http://www.wlv.ac.uk) pages that hyperlink to the Koninklijke Nederlandse Akademie van Wetenschappen (KNAW, Royal Netherlands Academy of Arts and Sciences, http://www.knaw.nl), either the main site or a subdomain in each case. Such queries gave easy access to useful data from the huge search engine databases. One well-known drawback is that no search engine can index the whole Web (Thelwall, 2002h), and actual coverage in 1999 appeared to be below 16% for the major search engines of the time (Lawrence & Giles, 1999). If a commercial search engine is used, then clearly this is a limitation that must be accepted and discussed in the study.

A second major problem, however, was immediately discovered: that the results returned by search engines fluctuated irregularly, sometimes dramatically (Bar-Ilan, 1999; Mettrop & Nieuwenhuysen, 2001; Rousseau, 1999; Snyder & Rosenbaum, 1999; Thelwall, 1999). In order to combat this, Rousseau (1999) proposed multiple search rounds and an averaging process.
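
The idea of repeating a query over several rounds and averaging the counts can be sketched as follows. This is only an illustration: the function name `smooth_counts` and the sample counts are hypothetical, and the exact averaging procedure used in the cited study is not given here, so the sketch assumes a plain arithmetic mean and also reports the median as a more outlier-resistant alternative.

```python
from statistics import mean, median


def smooth_counts(round_counts):
    """Summarise the hit counts returned by repeated runs of one query.

    round_counts: list of integer counts, one per search round.
    Returns both the mean and the median, since a single aberrant
    round can pull the mean a long way.
    """
    if not round_counts:
        raise ValueError("need at least one search round")
    return {"mean": mean(round_counts), "median": median(round_counts)}


# Hypothetical counts from five rounds of the same query (not real data).
rounds = [120, 118, 45, 121, 119]
summary = smooth_counts(rounds)
print(summary)
```

With one round far below the others, the mean drops noticeably while the median stays close to the typical value, which is one reason a study might report both.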

Additionally, search engines have exhibited peculiarities in behavior in the past. Snyder and Rosenbaum (1999) reported that AltaVista had the following problems:
"The AltaVista metaterm 'link' is intended to retrieve the total number of pages each of which has at least one link to a specified page. In practice, the metaterm frequently fails to retrieve all, or even most of the linkpages. Conversely, the 'link' command sometimes retrieves pages that do not contain the link specified" (p. 380).
A more recent systematic comparison of AltaVista's results with those of a specialist crawler found it to be highly reliable for UK university sites (Thelwall, 2001a). This particular problem seems to have disappeared, although there is no guarantee that other problems will not occur in the future.

A logical alternative is to create a specialist information science Web crawler to ensure reliable access to data, as called for by Bar-Ilan (2001). Such a tool has now been developed and extensively used (Thelwall, 2001ef) but this is not on a scale to rival commercial search engines, being capable of crawling all universities in a single country within a month, but not capable of giving significant international coverage. As a result, and helped by improvements in search engine reliability (Thelwall, 2001g; Vaughan & Thelwall, 2003), subsequent research has used both approaches.

In situations where the results are suspected to be unreliable, or where the query sent cannot be specific enough to capture the data, human observation can be used either as an additional filtering step or, applied to a random sample, to estimate the accuracy of the results (Cronin et al., 1998). This approach could be used, for example, to ensure that the hyperlink still connected to the desired Web site (or Web page) in case the site (or page) had moved or been deleted since the hyperlink was created. In this case, error messages such as "The server does not have a DNS entry" or "404 URL Not Found" would often be obtained.
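
A minimal sketch of the random-sample check is given below. It assumes that a human checker labels each sampled result as valid or not, and it uses a normal-approximation confidence interval, which is a choice made for this illustration rather than something specified in the studies cited. The function names and the example URLs are hypothetical.

```python
import math
import random


def sample_for_checking(urls, sample_size, seed=0):
    """Draw a reproducible random sample of results for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(urls, min(sample_size, len(urls)))


def accuracy_estimate(judgements, z=1.96):
    """Estimate the share of valid results from human judgements.

    judgements: list of booleans (True = the result was judged valid).
    Returns (proportion, lower, upper) using a normal approximation,
    which is only reasonable for fairly large samples.
    """
    n = len(judgements)
    if n == 0:
        raise ValueError("no judgements supplied")
    p = sum(judgements) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical result list and hypothetical human judgements.
results = [f"http://site.example/page{i}.html" for i in range(500)]
to_check = sample_for_checking(results, 100)
judged = [True] * 80 + [False] * 20
print(accuracy_estimate(judged))
```

The fixed seed makes the sample repeatable, so that a second checker can inspect exactly the same pages.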

Data analysis methodologies

A key article that provoked many follow-up investigations was that of Ingwersen (1998), which introduced the Web Impact Factor (WIF). This is a metric designed to assess the impact of an area of the Web based upon counting the number of hyperlinks to it. In fact many different variants were proposed and tested, but the most successful initially were the external absolute WIF and the external relative WIF. The former is simply the number of pages outside the area being measured that contain a hyperlink to it. The latter is this figure divided by the number of pages in the target area. Both of these omit hyperlinks between pages within the target area, since these are often for navigation purposes and are therefore not useful indicators of external impact.
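
The two external WIF variants described above can be sketched as follows, assuming the link data is available as (source URL, target URL) pairs. The function names, the `a.example` area, and the page count are hypothetical and serve only to illustrate the definitions.

```python
from urllib.parse import urlparse


def in_example_area(url, area="a.example"):
    """True if the URL's host is the area's domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == area or host.endswith("." + area)


def external_wifs(links, in_area, area_page_count):
    """Return (external absolute WIF, external relative WIF).

    links: iterable of (source_url, target_url) pairs.
    in_area: predicate deciding whether a URL lies in the target area.
    area_page_count: number of pages in the target area.
    """
    if area_page_count <= 0:
        raise ValueError("area_page_count must be positive")
    # Count distinct outside pages that link in; internal links are ignored.
    external_sources = {
        src for src, dst in links if in_area(dst) and not in_area(src)
    }
    absolute = len(external_sources)
    return absolute, absolute / area_page_count


# Hypothetical link data (not real measurements).
links = [
    ("http://b.example/1", "http://a.example/x"),
    ("http://b.example/1", "http://a.example/y"),  # same source page again
    ("http://c.example/2", "http://www.a.example/x"),
    ("http://a.example/x", "http://a.example/y"),  # internal, excluded
]
absolute_wif, relative_wif = external_wifs(links, in_example_area, 4)
print(absolute_wif, relative_wif)
```

Because the definition counts pages rather than links, a source page that links to the area several times contributes only once.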

These or other hyperlink metrics have been applied to various Web spaces (Ingwersen, 1998). A summary of the different spaces analyzed is listed in Table 1. Key findings will be summarized separately later.


Research area | Articles
Journal articles | Goodrum et al. (2002)
Journal Web sites | Smith (1999a), Harter & Ford (2000), Vaughan & Hysen (2002), Vaughan & Thelwall (2003)
Countries | Ingwersen (1998), Thelwall (2001b)
Sectors within collections of countries | Leydesdorff & Curran (2000)
National collections of universities | Smith (1999a), Thelwall & Harries (2003), Thelwall & Wilkinson (2003a), Thelwall (2000, 2001ad, 2002a-f, 2003ab)
International collections of universities | Smith (1999b), Polanco et al. (2001), Smith & Thelwall (2002), Thelwall & Smith (2002), Thelwall & Tang (2003), Thelwall et al. (2003)
Academic departmental Web sites | Darmoni et al. (2000), Thomas & Willett (2000), Chu et al. (2002), Douyère et al. (2002), Li et al. (2002), Tang & Thelwall (2003)
Other academic-related collections of sites or pages | Hernandez-Borges et al. (1999), Cui (1999), Björneborn (2001), Soualmia et al. (2002)
Commercial Web sites | Thelwall (2001bc)
Table 1. Research areas and articles.


One problem with the relative WIF is that counts of pages within an area of the Web can be substantially less reliable than hyperlink counts. This is due to various factors, including mirror site inclusion and design decisions about Web page sizes and format (Thelwall, 2001a). In response to this, academic WIFs have also used university staff numbers as their denominator (Thelwall, 2001a, 2003b), giving improved results. An alternative approach has been to dispense with the WIF denominator altogether and to model the data using the raw hyperlink counts, which arguably gives more intuitive results (Thelwall, 2002b).

Other analytic approaches have included a variety of multivariate statistical techniques and pathfinder network scaling (Thelwall, 2002e), but these have all been found wanting when applied to national academic Webs, because of a poor fit between the underlying assumptions of the statistical tests and the multiple trends present in the data. A simpler approach, applicable to small collections of sites, is the network diagram, in which arrows are drawn connecting sites with thickness proportional to a hyperlink-based calculation (Thelwall, 2001b). Even this approach has its limitations, however, with four potential arrow thickness metrics giving different results and problems of scale occurring with domains of different sizes (Thelwall & Smith, 2002). The four metrics used for arrow thickness are as follows.
  • Total link counts
  • Total links divided by total number of pages in the target site(s)
  • Total links divided by total number of pages in the source site(s)
  • Total links divided by total number of pages in the source and target site(s)
Each method gives a different perspective on the data set and it can be useful to produce all four. Total link counts paint a picture of the total linking among the set whereas the last one factors out size so that the underlying tendency to link can be seen. In contrast, dividing by target site pages gives an indication of which sites attract the most links per page, and where these links come from. Dividing by source size gives an indication of which sites host the most links per page, and the sites that they target.
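
The four arrow thickness metrics can be computed side by side, as in the sketch below. The text does not say how the source and target page counts are combined in the fourth metric; this sketch assumes their product, which should be checked against the original study before use. The function name and the numbers are hypothetical.

```python
def arrow_thickness(link_count, source_pages, target_pages):
    """Compute the four candidate arrow thickness values for one site pair.

    link_count: links from the source site to the target site.
    source_pages, target_pages: page counts of the two sites.
    ASSUMPTION: the combined normalisation divides by the product of the
    two page counts; the source text does not specify the combination.
    """
    if source_pages <= 0 or target_pages <= 0:
        raise ValueError("page counts must be positive")
    return {
        "total": link_count,
        "per_target_page": link_count / target_pages,
        "per_source_page": link_count / source_pages,
        "per_source_and_target": link_count / (source_pages * target_pages),
    }


# Hypothetical figures for one pair of sites (not real data).
metrics = arrow_thickness(link_count=200, source_pages=1000, target_pages=500)
print(metrics)
```

Producing all four values for every pair of sites makes it easy to draw the four diagrams from the same data and compare the perspectives they give.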

Validating Web hyperlink counts: Correlations and motivations

When a new data source is found, it is important to validate it in terms of reliability, representativeness and its potential uses. Typically, this process involves theoretical studies of the data source as well as statistical exercises to compare the data with other, better-known quantities (Oppenheim, 2000). As a result, early Webometric studies compared WIFs with other impact measures. It was found that WIFs did not correlate highly with journal impact factors for e-journals (Harter & Ford, 2000; Smith, 1999a) and did not seem to be related to the research impact or quality of universities or departments (Smith, 1999a; Thelwall, 2000; Thomas & Willett, 2000). These results were a major blow for Webometrics, but then a whole series of positive results showed that the approaches did have potential. A study of 25 UK universities made the breakthrough by showing a significant correlation between WIFs and average research quality, both for AltaVista data and specialist crawler data (Thelwall, 2001a). Since then, further significant results have been found for a larger group of 109 UK universities (Thelwall, 2002d), Australian universities (Smith & Thelwall, 2002) and Taiwanese universities (Thelwall & Tang, 2003). In addition, significant correlations have now been found for journal Web sites (Vaughan & Hysen, 2002; Vaughan & Thelwall, 2003) and academic departments (Chu et al., 2002; Li et al., 2002; Tang & Thelwall, 2003). As a result of these studies, it can now be concluded that counts of hyperlinks to academic-related Web sites are frequently strongly associated with research quality. This does not imply, however, that there is a cause-and-effect relationship between the two. To address this issue, it is necessary to find out why hyperlinks are created and to study motivations for hyperlinking.
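
A correlation test of this kind can be sketched in a few lines. The text does not say which correlation coefficient the cited studies used; the sketch below assumes Spearman's rank correlation, a common choice for skewed count data, and the inlink counts and quality scores are invented purely to show the calculation.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical inlink counts and research quality scores (not real data).
inlinks = [10, 200, 35, 4000]
quality = [1.0, 3.5, 2.0, 5.0]
rho = spearman(inlinks, quality)
print(rho)
```

A coefficient alone says nothing about significance; with real data a significance test appropriate to the sample size would still be needed.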

Some of the earlier correlation-based studies included exercises that attempted to identify the motivations behind hyperlink creation. After investigating various hypotheses to explain these (including the Matthew effect) (Thelwall, 2001a, 2002b, 2003b), it seems that there is greater Web-related activity in institutions that produce more research.

At the individual researcher level, Kim (2000) explored the influences on a researcher's hyperlinking behavior, based on a focus interview with 15 scholars who included external (also called outgoing) hyperlinks in their academic papers. Although the hyperlinks in scholarly electronic articles are created as the result of a variety of motivations, Kim found that scholarly and social motivations are as strong as technological reasons. At the interorganizational level, in a case study of ArXiv.org and SPIRES-HEP (http://www.slac.stanford.edu/spires/hep), which are a Web-accessible pre-print article server and an extensive bibliographic database respectively, Kling et al. (2001) found that hyperlinks between the two organizations arose from a variety of social, economic, and technological relationships, such as joint grant research and a Web site interface. Although hyperlinking technology was born of the advancement of computer technologies, whether and how hyperlinking among people and organizations is established is socially and culturally determined in the particular context (Hine, 2000).

The most direct academic hyperlink motivation study took a random sample of 414 hyperlinks between UK universities and classified them by the apparent motivation for their creation (Wilkinson et al., 2003). It was found that although fewer than 1% of hyperlinks targeted formal scholarly publications, such as a journal article or conference paper, over 90% targeted material that was in some way related to research or other scholarly activity, such as teaching. This shows that Web hyperlinks are best viewed as data about informal scholarly communication. In fact they may be the most publicly available data source for informal scholarly communication, and hence have great potential.

It is likely that in the future, much more detailed studies of hyperlinking motivation will be undertaken in an attempt to gain a better understanding of the phenomenon, but it is expected that the results will be complex and highly context-dependent (Thelwall, 2002g).

Alternative document models


One discovery from the studies that analyzed hyperlinks between universities (Thelwall, 2001a; 2002b; 2003b) was that there were many cases in which one site contained thousands of hyperlinks to another, all created for essentially the same reason, and that this would show up as an anomaly in the data. A typical example would be the Web site of a collaborative research project where each page contains a standard hyperlink bar that includes a hyperlink to the home page of each partner institution. This violates the implicit assumption of hyperlink analysis that each hyperlink should be of approximately the same importance as the others. As an example, 1000 automatically generated hyperlinks should clearly carry less weight than 1000 created by the decisions of different academics.

In order to circumvent this problem, alternative document models (ADMs) were created (Thelwall, 2002d), which aggregate hyperlinks together based upon directories, domains and whole sites instead of the page, as used in all previous research. The ADMs produced much better correlations with research productivity (Thelwall, 2002; Thelwall & Harries, 2003; Thelwall & Wilkinson, 2003a), showing the value of this approach. Their main drawback is that they are difficult to use if raw data is obtained from commercial search engines, since for each hyperlink count the exact hyperlink source and target URL must be known to perform the aggregation process. Programs are now freely available online to perform ADM aggregations on crawler data (Thelwall, 2001e).
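
The aggregation step can be illustrated with a short sketch. This is not the freely available software mentioned above; it is a hypothetical reimplementation under two stated assumptions: that links between the same pair of aggregated units are counted once, and that a "site" can be approximated by the last few labels of the host name (three for names such as a.ac.uk). All URLs are invented.

```python
from urllib.parse import urlparse


def adm_unit(url, model, site_labels=3):
    """Map a URL to its unit under the page, directory, domain or site model."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    if model == "page":
        return host + path
    if model == "directory":
        # Drop the file name, keeping everything up to the last slash.
        return host + path.rsplit("/", 1)[0] + "/"
    if model == "domain":
        return host
    if model == "site":
        # ASSUMPTION: the site is the last `site_labels` labels of the host.
        return ".".join(host.split(".")[-site_labels:])
    raise ValueError(f"unknown model: {model}")


def adm_link_count(links, model, site_labels=3):
    """Count links after aggregation, counting each unit pair only once."""
    pairs = {
        (adm_unit(src, model, site_labels), adm_unit(dst, model, site_labels))
        for src, dst in links
    }
    return len(pairs)


# Hypothetical crawl: a project area with a repeated link bar, plus one
# independent link from another domain of the same university.
links = [
    ("http://www.a.ac.uk/proj/p1.html", "http://www.b.ac.uk/"),
    ("http://www.a.ac.uk/proj/p2.html", "http://www.b.ac.uk/"),
    ("http://www.a.ac.uk/proj/p3.html", "http://www.b.ac.uk/"),
    ("http://cs.a.ac.uk/staff/x.html", "http://www.b.ac.uk/"),
]
counts = {m: adm_link_count(links, m) for m in ("page", "directory", "domain", "site")}
print(counts)
```

In this toy example the three link-bar pages collapse into a single directory-level link, so the repeated navigation links no longer dominate the count.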

ADMs have been found flawed in two contexts. Sheffield University appears to use different domain names less frequently than other UK universities, turning it into an anomaly for the domain ADM (Thelwall, 2002a). Essex University hosts a database of hyperlinks to thousands of different German Web sites, creating an external hyperlinks anomaly (Thelwall, 2003a).

Key findings

Webometrics studies have yielded a number of interesting results (Smith & Thelwall, 2002; Tang & Thelwall, 2003; Thelwall, 2001c, 2002a, 2002c, 2002e, 2002g; Thelwall et al., 2003; Thelwall & Smith, 2002; Thelwall & Vaughan, 2003). Among other indications, these results point to the need to study the Internet in a way that is sensitive to fields, as a dynamic, differentiated space, in which geographical notions still matter.

From these findings the concerns with online relationships and impact can be seen. What they do not show is the concern with data validation and methodological development that has characterized most Webometrics research and has been discussed above.


The key contributions of Webometrics to hyperlink analysis have been the development of methods for data collection, processing and validation. In addition, a range of general results has been generated about how the Web is used, primarily in academia, and establishing factors that influence web use or impact, as measured by hyperlink counts.

Clearly, the conclusion must be drawn that great care has to be taken in collecting data, processing it to remove the anomalies (e.g. using the Alternative Document Models if possible), and interpreting the results. Failure to do this risks arriving at incorrect results and misleading conclusions.