RSS
热门关键字:  数据挖掘  数据仓库  商业智能  人工智能  搜索引擎

Semantically Enhanced Collaborative Filtering on the Web(2)

来源: 作者:unkonwn 时间:2004-12-02 点击:

Computing Predictions

After computing the similarity between items, we select a set of $k$ most similar items to the target item and generate a predicted value for the target item. We use a weighted sum as follows.
数据挖掘研究院

  数据挖掘研究院

egin{displaymath}
M_{a,t} = frac{{sumlimits_{j = 1}^k {(M_{a,j} 	imes sim(i_j,i_t))}}} {{sumlimits_{j = 1}^k {sim(i_j,i_t)}}}
end{displaymath}

Here, $M_{a,t}$ denotes the prediction value of target user $u_a$ on target item $i_t$. Only the $k$ most similar items ($k$ nearest neighbors of item $i_t$) are used to generate the prediction. Despite their effectiveness, item-based CF algorithms still suffer from the problems associated with data sparsity, and they still lack the ability to provide recommendations or predictions for new or recently added items. To deal with these problems, we introduce an approach for semantically enhanced collaborative filtering in which structured semantic knowledge about items, extracted automatically from the Web, is used in conjunction with user-item ratings (or weights) to create a combined similarity measure for item comparisons. This approach is discussed in the next section. 数据挖掘实验室


数据挖掘研究院

Using Semantic Knowledge to Enhance Collaborative Filtering

数据挖掘研究院

In this section, we first discuss the issue of extracting structured semantic attributes from the Web to populate instances of domain-specific ontology classes corresponding to items. We then present our approach to integrate the extracted semantic knowledge into the item-based collaborative filtering framework.

Extracting Domain Semantics from the Web

In order to obtain semantic information about items used in the collaborative filtering process, we must extract domain-level structured objects as semantic entities contained within Web pages on one or more Web sites. This task involves the automatic extraction and classification of objects of different types into classes based on an underlying reference domain ontology. 数据挖掘研究院

An ontology provides a set of well-founded constructs that define significant concepts and their semantic relationships. An example of an ontology is a relational schema for a database involving multiple tables and foreign keys semantically connecting these relations. Such constructs can be leveraged to build meaningful higher level knowledge in a particular domain. Domain ontologies for a Web site usually include concepts, subsumption relations between concepts (concept hierarchies), and other relations among concepts that exist in the domain represented by the Web site. In this paper, we do not directly deal with the problems of automatic ontology acquisition and learning. Rather, we assume the existence of a pre-defined reference ontology for a specific domain based on which the semantic attributes of items can be extracted. Our goal is to use this semantic knowledge about items together with item ratings (or weights in the context of Web usage data) to create a combined similarity measure for item-based collaborative filtering.

数据挖掘研究院

The problem of extracting instances of the ontology classes from Web pages is an interesting problem in its own right and has been studied extensively. This process can be viewed as the classification of objects embedded in one or more Web pages into classes specified as part of a reference ontology. For example, in [10] a text classifier is learned for each "semantic feature" based on a small manually labeled data set. First Web pages are extracted from different Web sites that belong to a similar domain, and then the semantic features are manually labeled. This small labeled data set is fed into a learning algorithm as training data to learn the mappings between Web objects and concept labels. Craven et al. [7] adopt a combined approach of statistical text classification and first-order text classification in recognizing concept instances. In that study, the learning process is based on both page content and linkage information. The problems and issues related to using ontologies in the context of Web mining has been discussed in [3]. 数据挖掘研究院

In our approach, we have used domain-specific wrapper agents that use text mining and heuristic rules to extract class and attribute instances from Web sites based on a pre-specified reference ontology. At the present time, we do not use a general ontology representation language, such as DAML+OIL [12]. Rather, we represent the ontology classes as part of the schema for a relational database. Our simple representation scheme does not take into account complex relationships among classes (such as inheritance), but is adequate for specifying the attributes associated with classes (relations). Our wrapper agents use the relational schema for classes and simple heuristics based on textual cues to extract attribute values and populate instances of these classes (tuples). In the future, we intend to extend our work by incorporating more general ontology languages that can capture (and allow reasoning with) a richer set of structural relationships among classes and objects. The implementation details of the wrapper agents is beyond the scope of the present work and will be discussed elsewhere. 数据挖掘研究院

Figure 1: Portion of the ontological representation for a movie Web site
egin{figure}
            centerline{psfig{file=movie-ontology.eps,width=3.75 in}}
            end{figure}

As an example, let us consider a movie Web site such as the Internet Movie Database (www.imdb.com). This Web site includes a collection of pages containing information about movies, actors, directors, etc. A collection of pages describing a specific movie might include attribute information such as the movie title, genre, actors, director, etc. These represent the attributes associated with a class that represents movies in our reference ontology. A domain ontology for this site may contain the classes Movie, Actor and Director along with their attributes. In our representation, some of the attributes represent properties of a given class and others represent reference slots corresponding to other classes. For instance, the "Actor" attribute of the Movie class represents a reference to the class (relation) Actor and, in the relational representation, is specified as a foreign key in the Movie relation. Figure 1 depicts the class Movie and its attributes. An actor or director′s attribute information may include name, filmography (a set of movies), gender, nationality, etc. The dotted arrows in attributes such as "Actor" and "Director" indicated that they represent references to other classes in the ontology. The collection of Web pages in the site represent a group of embedded objects that are the instances of these classes.

In order to facilitate the computation of item similarities, generally, the extracted class instances will need to be converted into a vector representation. In our case, the values of semantic attributes associated with class instances are collected into a relational table whose rows represent the $n$ items, and whose columns correspond to each of the extracted attributes. Additional preprocessing tasks, such as normalization and discretization (for continuous attributes), can be performed on the data in order to provide a uniform representation. This process generally results in the addition of attributes, for example, representing different intervals in a continuous range, or representing each unique discrete value for categorical attributes in the original data. The final result is a $n 	imes d$ matrix $S$, where $d$ is the total number of unique semantic attributes. We call this matrix the semantic attribute matrix.

Integrating Semantic Similarity with Collaborative Filtering

As noted earlier, the item-based CF framework provides a computational advantage over user-based approaches, since item similarities can be computed offline, prior to the online task of generating recommendations. But, this framework also provides another important advantage. Since the computation of item similarities is independent of the methods used for generating predictions or recommendations, other sources of evidence about items (in addition to item ratings or weights) can be used for performing the similarity computations.

The integration of semantic similarities for items with rating (or usage-based) similarities provides two primary advantages. First, the semantic attributes for items provide additional clues about the underlying reasons for which a user may or may not be interested in particular items (something that is hidden behind the rating values in the usual context). This, in turn, allows the system to make inferences based on this additional source of knowledge, possibly improving the accuracy of recommendations. Secondly, in cases where little or no rating (or usage) information is available (such as in the case of newly added items, or in very sparse data sets), the system can still use the semantic similarities to provide reasonable recommendations for users. 数据挖掘研究院

In the following we describe our approach for integrating semantic similarities into the standard item-based collaborative filtering framework. Our approach involves first performing latent semantic analysis on the semantic attribute matrix obtained using the process described in Section 3.1. This is necessary in order to reduce noise and to collapse highly correlated attributes. We then compute item similarities, both based on the reduced semantic attribute matrix, as well as based on the user-item ratings (or usage) matrix. Finally, we use a combined similarity measure, as a linear combination of the two similarities to perform item-based collaborative filtering.

数据挖掘研究院

Using Latent Semantic Analysis on Semantic Attributes.

Latent Semantic Indexing (LSI) [4] is a dimensionality reduction technique which is widely used in information retrieval (IR). Many IR applications have shown that performing latent semantic analysis, including in document indexing, can improve the accuracy of information retrieval. Given a term-document frequency matrix, LSI is used to decompose it into two matrices of reduced dimensions and a diagonal matrix of singular values. Each dimension in the reduced space is a latent variable (or factor) representing groups of highly correlated index terms. Reducing the dimensionality of the original matrix reduces the amount of noise in the data as well as its sparsity, thereby, improving retrieval based on the computation of similarities between the indexed documents and user queries. Here we apply this idea to create a reduced dimension space for the semantic attributes associated with items.

Singular Value Decomposition (SVD) is a well known technique used in LSI to perform matrix decomposition. In our case, we perform SVD on the semantic attribute matrix $S_{n 	imes d}$ by decomposing it into three matrices: 数据挖掘研究院

egin{displaymath}
S_{n 	imes d} = U_{n 	imes r} ullet Sigma _{r 	imes r} ullet V_{r 	imes d}
end{displaymath}

where $U$ and $V$ are two orthogonal matrices; $r$ is the rank of matrix $S$, and $Sigma$ is a diagonal matrix of size $r 	imes r$, where its diagonal entries contain all singular values of matrix $S$ and are stored in decreasing order. One advantage of SVD is that it provides the best lower rank approximation of the original matrix $S$ [4]. We can reduce the diagonal matrix $Sigma$ into a lower-rank diagonal matrix $Sigma_{k 	imes k}$ by only keeping $k$ ($k < r$) largest values. Accordingly, we reduce $U$ to $U′$ and $V$ to $V′$. Then the matrix $S′ = U′ ullet Sigma′ ullet V′$ is the rank-$k$ approximation of the original matrix $S$. In the above process, $U′$ consists of the first $k$ columns of the matrix $U$ corresponding to the $k$ highest order singular values. In the resulting semantic attribute matrix, $S′$, each item is, thus, represented by a set of $k$ latent variables, instead of the original $d$ attributes. This results in a much less sparse matrix, improving the results of similarity computations, as well as the computational cost associated with the process. Furthermore, the generated latent variables represent groups of highly correlated attributes in the original data, thus potentially reducing the amount of noise associated with the semantic information. As we will illustrate in the next section, performing latent semantic analysis on the semantic space, generally leads to substantial gains in prediction accuracy based on the semantic attributes.

Predictions Based on a Combined Similarity Measure.

The semantic similarity measure SemSim(ip , iq), for a pair of items ip and iq, is computed using the standard vector-based cosine similarity on the reduced semantic space. This process can be viewed as multiplying the matrix $S′$ by its transpose and normalizing each corresponding row and column vector by its norm. This results in a $n 	imes n$ square matrix in which an entry (i,j) corresponds to the semantic similarity of items i and j.

Similarly, we compute item similarities based on the user-item matrix $M$. As noted in Section 2, in the case of usage data, we use the cosine similarity measure. In the case of ratings data (such as movie ratings) we employ the adjusted cosine similarity in order to take into account the variances in user ratings. We denote the rating (or usage) similarity between two items ip and iq as RateSim(ip , iq).

数据挖掘研究院

Finally, we combine these two similarity measures to get CombinedSim as their linear combination: 数据挖掘研究院

egin{displaymath}
CombinedSim(i_p,i_q) = alphacdot SemSim(i_p,i_q) + (1-alpha)cdot RateSim(i_p,i_q)
end{displaymath}

where $alpha$ is a semantic combination parameter specifying the weight of semantic similarity in the combined measure. If $alpha = 0$, then we have CombinedSim(ip , iq) = RateSim(ip , iq), in other words we have the standard item-based filtering. On the other hand, if $alpha = 1$, then only the semantic similarity is used which, essentially, results in a form of content-based filtering. Finding the appropriate value for $alpha$ is not a trivial task, and is usually highly dependent on the characteristics of the data. We choose the proper value by performing sensitivity analysis for particular data sets in our experimental section below.

In order to compute predicted ratings or recommendations, we use the weighted sum approach discussed in Section 2. Specifically, 数据挖掘研究院

egin{displaymath}
M_{a,t} = frac{{sumlimits_{j = 1}^k {(M_{a,j} 	imes CombinedSim(i_j,i_t))}}} {{sumlimits_{j = 1}^k {sim(i_j,i_t)}}},
end{displaymath}

where, $M_{a,t}$ denotes the prediction value of target user $u_a$ on target item $i_t$. 数据挖掘实验室


最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?