首页 | 人工智能 | 数据挖掘知识 | 相关研究方向 | 编程技术 | 电脑常识 | 互联网资源 | 交流论坛 | 免费书籍资料下载 | 论文下载 | 文档资料 | 在线手册
人工智能: 信息检索 商业智能 搜索引擎技术与新闻 神经网络 生物信息学 模式识别 知识工程 本体理论与方法 机器学习 决策支持 自然语言理解 专家系统 >>更多
数据挖掘知识:
数据挖掘论文 数据挖掘其他 数据挖掘工具与应用 时序模式 相关研究人员主页 相关方向求职招聘信息 文本挖掘 学位论文 异类 预测 web数据挖掘 >>更多
相关研究方向: 联机分析 信息抽取 小波变换 数据仓库 access数据库 DB2数据库 Mysql数据库 Oracle数据库 SqlServer数据库 Sysbase数据库 统计分析 >>更多
主页>数据挖掘知识>文本挖掘>

To Randomize or Not To Randomize: Space Optimal Summaries fo

Keywords

link-analysis, similarity search, scalability, data streams 字串8

1 Introduction

The idea of using hyperlink mining algorithms in Web search engines appears since the beginning of the success of Google′s PageRank [24]. Hyperlink based methods are based on the assumption that a hyperlink $ u 	o v$ implies that page $ u$ votes for $ v$ as a quality page. In this paper we address the computational issues [13,17,11,12] of personalized PageRank [24] and SimRank [16]. 字串9

Personalized PageRank (PPR) [24] enters user preferences by assigning more importance to the neighborhood of pages at the user′s selection. Jeh and Widom [16] introduced SimRank, the multi-step link-based similarity function with the recursive idea that two pages are similar if pointed to by similar pages. Notice that both measures are hard to compute over massive graphs: naive personalization would require on the fly power iteration over the entire graph for a user query; naive SimRank computation would require power iteration over all pairs of vertices.

字串3

We give algorithms with provable performance guarantees based on computation with sketches [7] as well as simple deterministic summaries; see Table 1 for a comparison of our methods with previous approaches. We may personalize to any single page from which arbitrary page set personalization follows by linearity [13]. Similarly, by our SimRank algorithm we may compute the similarity of any two pages or the similarity top list of any single page. Motivated by search engine applications, we give two-phase algorithms that first compute a compact database from which value or top list queries can be answered with a low number of accesses. Our key results are summarized as follows: 字串8

$ ullet$
We give practical methods for serving unrestricted on-line personalized PageRank (Section 2.1) as well as SimRank queries with space a reasonable constant per vertex (Section 3). The methods are based on deterministic rounding.
$ ullet$
We give a theoretically optimal algorithm for personalized PageRank value queries (Section 2.2) based on randomized sketching. Given an additive error $ epsilon $ and the probability $ delta $ of an incorrect result, we improve the disk usage bound from $ nlog nlogleft(1/deltaight)/epsilon^2$ [11,12] to $ nlogleft(1/deltaight)/epsilon$.
$ ullet$
We give theoretically optimal algorithms for SimRank value and top list queries (Section 3.1) by a nontrivial reduction of SimRank to personalized PageRank.
$ ullet$
We improve the communication complexity based lower
字串1

bounds of [11,12] for the size of the database (Section 4); our bounds are matched by our algorithms. Our sketch-based algorithms use optimal space; surprisingly for top list queries deterministic rounding is already optimal in itself.
$ ullet$
In Section 5 we experimentally analyze the precision of approximation over the Stanford WebBase graph and conclude that our summaries provide better approximation for the top personalized PageRank scores than previous methods.

字串2

字串5

Table 1: Comparison of personalized PageRank algorithms for graphs of $ n$ vertices, additive error $ epsilon $ and error probability $ delta $.

字串8

egin{table}%[ht!] egin{tabularx}{linewidth}{vert p{2.6cm}vert Xvert p{1... ...arx}$*$ optimal for top queries  $**$ optimal for value queries end{table}

1.1 Related Results

The scalable computation of personalized PageRank was addressed by several papers [13,18,17] that gradually increase the choice for personalization. By Haveliwala′s method [13] we may personalize to the combination of 16 topics extracted from the Open Directory Project. The BlockRank algorithm of Kamvar et al. [18] speeds up personalization to the combination of hosts. The state of the art Hub Decomposition algorithm of Jeh and Widom [17] computed and encoded personalization vectors for approximately 100K personalization pages.

字串4

To the best of our knowledge, the only scalable personalized PageRank algorithm that supports the unrestricted choice of the teleportation vector is the Monte Carlo method of [11]. This algorithm samples the personalized PageRank distribution of each page simultaneously during the precomputation phase, and estimates the personalized PageRank scores from the samples at query time. The drawback of the sampling approach is that approximate scores are returned, where the error of approximation depends on the random choice. In addition the bounds involve the unknown variance, which can in theory be as large as $ Omega(1)$, and hence we need $ Theta({1 over {epsilon^2}}log{1/delta})$ random samples. Indeed a matching sampling complexity lower bound for telling binomial distributions with means $ 1/2pmepsilon$ apart [1] indicates that one can not reduce the number of samples when approximating personalized PageRank. Similar findings of the superiority of summarization or sketching over sampling is described in [5]. The algorithms presented in Section 2 outperform the Monte Carlo method by significantly reducing the error. 字串4

We also address the computational issues of SimRank, a link-based similarity function introduced by Jeh and Widom [16]. The power iteration SimRank algorithm of [16] is not scalable since it iterates on a quadratic number of values, one for each pair of Web pages; in [16] experiments on graphs with no more than 300K vertices are reported. Analogously to personalized PageRank, the scalable computation of SimRank was first achieved by sampling [12]. Our new SimRank approximation algorithms presented in Section 3 improve the precision of computation. 字串7

The key idea of our algorithms is that we use lossy representation of large vectors either by rounding or sketching. Sketches are compact randomized data structures that enable approximate computation in low dimension. To be more precise, we adapt the Count-Min Sketch of Cormode and Muthukrishnan [7], which was primarily introduced for data stream computation. We use sketches for small space computation; in the same spirit Palmer et al. [25] apply probabilistic counting sketches to approximate the sizes of neighborhoods of vertices in large graphs. Further sketching techniques for data streams are surveyed in [23]. Lastly we mention that Count-Min Sketch and the historically first sketch, the Bloom filter [2] stem from the same idea; we refer to the detailed survey [4] for further variations and applications.

字串3

Surprisingly, it turns out that sketches do not help if the top $ t$ highest ranked or most similar nodes are queried; the deterministic version of our algorithms show the same performance as the randomized without even allowing a small probability of returning a value beyond the error bound. Here the novelty is the optimal performance of the deterministic method; the top $ t$ problem is known to cause difficulties in sketch-based methods and always increases sketch sizes by a factor of $ Omega (log n)$. By using $ Omega (log n)$ times larger space we may use a binary search structure or we may use $ p$ sketches accessed $ n^{1/p}$ times per query [7]. Note that $ n^{1/p}$ queries require an error probability of $ O(delta/n^p)$ that again increase sketch sizes by a factor of $ Omega (log n)$.

字串7

In Section 4 we show that our algorithms build optimal sized databases. To obtain lower bounds on the database size, we apply communication complexity techniques that are commonly used for space lower bounds [21]. Our reductions are somewhat analogous to those applied by Henzinger et al. [14] for space lower bounds on stream graph computation.

字串3

1.2 Preliminaries

We briefly introduce notation, and recall definitions and basic facts about PageRank, SimRank and the Count-Min sketch.

字串7


Personalized PageRank

Let us consider the web as a graph. Let $ n$ denote the number of vertices and $ m$ the number edges. Let $ d^+ (v)$ and $ d^- (v)$ denote the number of edges leaving and entering $ v$, respectively. Details of handling nodes with $ d^+ (v) = 0$ and $ d^- (v) = 0$ are omitted. 字串4

In [24] the PageRank vector $ p = (p(1)$, ..., $ p(n))$ is defined as the solution of the following equation $ p (u) = c cdot r (u) + (1 - c) cdot sum_{v : (vu) in E} p (v)/d^+ (v)$, where $ r = (r (1)$, ..., $ r (n))$ is the teleportation vector and $ c$ is the teleportation probability with a typical value of $ c approx 0.15$. If $ r$ is uniform, i.e. $ r (u) = 1/n$ for all $ u$, then $ {p}$ is the PageRank. For non-uniform $ {r}$ the solution $ {p}$ is called personalized PageRank; we denote it by PPR$ _r$. Since PPR$ _r$ is linear in $ r$ [13,17], it can be computed by linear combination of personalization to single points $ v$, i.e. to vectors $ r = chi_v$ consisting of all 0 except for node $ v$ where $ chi_v (v) = 1$. Let PPR$ _v=$PPR$ _{chi_v}$.

字串7

An alternative characterization of PPR$ _u(v)$ [10,17] is based on the probability that a length $ k$ random walk starting at node $ u$ ends in node $ v$. We obtain PPR$ _u(v)$ by choosing $ k$ random according to the geometric distribution: 字串9

PPR$displaystyle ^{[k]}_u (v) = sum_{v_0=u, v_1, ldots, v_k = v} hspace{-.4cm} {1 / (d^+ (v_0) cdots d^+ (v_{k-1}))};$ (1)
the summation is along walks starting at $ u$ and ending in $ v$. Thus
PPR$displaystyle _u (v) = sum_{k′ = 0}^infty c(1-c)^{k′}$ PPR$displaystyle ^{[k′]}_u (v).$ (2)

字串1


Similarly we get PPR$ ^{(k)}_u$ if we sum up only to $ k$ instead of $ infty$. An equivalent reformulation of the path summing formula (2) is the Decomposition Theorem proved by Jeh and Widom [17]:
PPR$displaystyle _u = c chi_u + (1-c) cdot sum_{v : (uv) in E}$ PPR$displaystyle _v / d^+ (u).$ (3)
The Decomposition Theorem immediately gives rise to the Dynamic Programming approach [17] to compute personalized PageRank that performs iterations for $ k=1$, 2, ...with PPR$ ^{(0)}_u = c cdot chi_u$:
PPR$displaystyle ^{(k)}_u = c chi_u + (1-c) cdot sum_{v : (uv) in E}$ PPR$displaystyle ^{(k-1)}_v / d^+ (u).$ (4)

字串7

SimRank

Jeh and Widom [16] define SimRank by the following equation very similar to the PageRank power iteration such that Sim$ ^{(0)} (u_1,u_2) = chi_{u_1} (u_2)$ and

字串6

Sim$displaystyle ^{(k)} (u_1,u_2) = egin{cases}(1-c) cdot {sum mbox{Sim}^{(k-... ... cdot d^- (u_2)} & mbox{if }u_1
ot=u_2\ 1& mbox{if $u_1=u_2$.} end{cases}$ (5)
where summation is for $ v_1,v_2 : (v_1 u_1),(v_2 u_2) in E$.

字串4
Count-Min Sketch

The Count-Min Sketch [7] is a compact randomized approximate representation of non-negative vector $ x = (x_1$, ..., $ x_n)$ such that a single value $ x_j$ can be queried with a fixed additive error $ epsilon>0$ and a probability $ delta>0$ of returning a value out of this bound. The representation is a table of depth $ d = ln {1/delta}$ and width $ w = e/epsilon$. One row $ C$ of the table is computed with a random hash function $ h: {1,ldots,n} 	o {1, ldots, w}$. The $ i$th entry of the row $ C$ is defined as $ C_i = sum_{j : h (j) = i} x_j$. Then the Count-Min sketch table of $ x$ consists of $ d$ such rows with hash functions chosen uniformly at random from a pairwise-independent family. 字串7

Theorem 1 (Cormode, Muthukrishnan [7]) Let $ hat x_j =min_{C} {C_{h (j)}}$ where the minimum is taken over the $ d$ rows of the table. Then $ hat x_j ge x_j$ and Prob$ (hat x_j > x_j + epsilonvertvert xvertvert _1) le delta$ hold.

Count-Min sketches are based on the principle that any randomized approximate computation with one sided error and bias $ epsilon′$ can be turned into an algorithm that has guaranteed error at most $ e cdotepsilon′$ with probability $ 1-delta$ by running $ ln 1/delta$ parallel copies and taking the minimum. The proof simply follows from Markov′s inequality and is described for the special cases of sketch value and inner product in the proofs of Theorems 1 and 2 of [7], respectively. 字串3


2 Personalized PageRank

We give two efficient realizations of the dynamic programming algorithm of Jeh and Widom [17]. Our algorithms are based on the idea that if we use an approximation for the partial values in certain iteration, the error will not aggregate when summing over out-edges, instead the error of previous iterations will decay with the power of $ 1-c$. Our first algorithm in Section 2.1 uses certain deterministic rounding optimized for smallest runtime for a given error, while our second algorithm in Section 2.2 is based on Count-Min sketches [7].

字串1

The original implementation of dynamic programming [17] relies on the observation that in the first $ k$ iterations of dynamic programming only vertices within distance $ k$ have non-zero value. However, the rapid expansion of the $ k$-neighborhoods increases disk requirement close to $ n^2$ after a few iterations, which limits the usability of this approach2. Furthermore, an external memory implementation would require significant additional disk space.

字串2

A simple example showing the
superiority of dynamic programming over power iterations
for small space computations

字串7

Figure 1: A simple example showing the superiority of dynamic programming over power iterations for small space computations.

字串4

We may justify why dynamic programming is the right choice for small-space computation by comparing dynamic programming to power iteration over the graph of Fig. 1. When computing PPR$ _u (w)$, power iteration moves top-down, starting at $ u$, stepping into its neighbors $ v_1, v_2, ldots$ and finally adding up all their values at $ w$. Hence when approximating, we accumulate all error when entering the large in-degree node $ w$ and in particular we must compute PPR$ _u (v_i)$ values fairly exact. Dynamic programming, in contrast, moves bottom up by computing the trivial PPR$ _w$ vector, then all the PPR$ _{v_i}$, then finally averages all of them into PPR$ _u$. Because of averaging we do not amplify error at large in-degrees; even better by looking at (4) we notice that the effect of earlier steps diminishes exponentially in $ (1-c)$. In particular even if there are edges entering $ u$ from further nodes, we may safely discard all the small PPR$ _u (v_i)$ values for further computations, thus saving space over power iteration where we require the majority of these values in order to compute PPR$ _u (w)$ with little error. 字串7

We measure the performance of our algorithms in the sense of intermediate disk space usage. Notice that our algorithms are two-phase in that they preprocess the graph to a compact database from which value and top list queries can be served real-time; preprocessing space and time is hence crucial for a search engine application. Surprisingly, in this sense rounding in itself yields an optimal algorithm for top list queries as shown by giving a matching lower bound in Section 4. The sketching algorithm further improves space usage by a factor of $ log n$ and is hence optimal for single value queries. For finding top lists, however, we need additional techniques such as binary searching as in [7] that loose the $ log n$ factor gain and use asymptotically the same amount of space as the deterministic algorithm. Since the deterministic rounding involves no probability of giving an incorrect answer, that algorithm is superior for top list queries.

字串5

The key to the efficiency of our algorithms is the use of small size approximate $ widehat{mbox{PPR}}_u^{(k)} (v)$ values obtained either by rounding and handling sparse vectors or by computing over sketches. In order to perform the update step of Algorithm 1 we must access all $ widehat{mbox{PPR}}_v$ vectors; the algorithm proceeds as if we were multiplying the weighted adjacency matrix $ A_{u v} = 1/d^+ (u)$ for $ (u v) in E$ with the vector $ {widehat{mbox{PPR}}_u (w) : u in V}$ parallel for all values of $ w$. We may use (semi)external memory algorithms [27]; efficiency will depend on the size of the description of the vectors. 字串9

The original algorithm of Jeh and Widom defined by equation (4) uses two vectors in the implementation. We remark that a single vector suffices since by using updated values within an iteration we only speed convergence up. A similar argument is given by McSherry [22] for the power iteration, however there the resulting sequential update procedure still requires two vectors.

字串5


2.1 Rounding


egin{algorithm} % latex2html id marker 1010 [t] small caption{PageRank by ... ...mbox{PPR}}_v/d^+ (u)Big) $ ENDFOR ENDFOR end{algorithmic}end{algorithm}
字串8

In Algorithm 1 we compute the steps of the dynamic programming personalized PageRank algorithm (4) by rounding all values down to a multiple of the prescribed error value $ epsilon $. As the sum of PPR$ _u(v)$ for all $ v$ equals one, the rounded non-zeroes can be stored in small space since there may be at most $ 1/epsilon$ of them. 字串9

We improve on the trivial observation that there are at most $ 1/epsilon$ rounded non-zero values in two ways as described in the next two theorems. First, we observe that the effect of early iterations decays as the power of $ (1-c)$ in the iterations, allowing us to similarly increase the approximation error $ epsilon_k$ for early iterations $ k$. We prove correctness in Theorem 2; later in Theorem 4 it turns out that this choice also weakens the dependency of the running time on the number of iterations. Second, we show that the size of the non-zeroes can be efficiently bit-encoded in small space; while this observation is less relevant for a practical implementation, this is key in giving an algorithm that matches the lower bound of Section 4. 字串2

Theorem 2 Algorithm 1 returns values between PPR$ _u (v) - 2epsilon/c$ and PPR$ _u(v)$.
Proof. By induction on iteration $ k$ of Algorithm 1 we show a bound that is tighter for $ k=k_{max}$ than that of the Theorem:
$displaystyle vert$PPR$displaystyle _u (w) - widehat{mbox{PPR}}_u (w)vert < {{1 over {1-sqrt{1-c}}} epsilon_k}. $
By the choice of $ epsilon $ and $ k_{max}$ we have $ epsilon_0 = 1$, thus the $ k=0$ case is immediate since $ 0 le$

上一篇:第四章: 网络游戏火热中洗牌   下一篇:第五章: 搜索引擎不可知的未来
版权申明:本站信息收集自互联网,仅供学习参考使用。若有违法转摘您的作品请email我们及时删除!  
用户名: 新注册) 密码: 匿名评论 所有评论
评论内容:(不能超过250字,需审核后才会公布,请自觉遵守互联网相关政策法规。
Google
8 热门推荐
  • More data isn’t always a good thing in
  • Text Categorization
  • Finding Advertising Keywords on Web Page
  • Communities from Seed Sets
  • Overview of Text Summarization History
  • Porter Stemming Algorithm
  • Sequential Minimal Optimization
  • 句子相似度计算在FAQ中的应用
  • 弱指导的统计隐含语义分析及其在跨语言信息
  • 基于文本概念和kNN 的跨语种文本过滤
  • 8 阅读排行
     
    版权所有:数据挖掘研究院 2004-2006 未经授权禁止复制或建立镜像
    增值电信业务经营许可证编号:皖B2-20040042 文网文:[2005]027号