Bootstrapping Semantics on the Web: Meaning Elicitation from
Keywords
Semantic Web, Meaning elicitation, Schema matching
1 Introduction
There is a general agreement that "encoding some of the semantics of web resources in a machine processable form" would allow designers to implement much smarter applications for final users, including information integration and interoperability, web service discovery and composition, semantic browsing, and so on. In a nutshell, this is what the Semantic Web is about. However, it is less obvious how such a result can be achieved in practice, possibly starting from the current web. Indeed, providing explicit semantics for already existing data and information sources can be extremely time- and resource-consuming, and may require skills that users (including web professionals) may not have.
Our work starts from the observation that in many Web sites, web-based applications (such as web portals, e-marketplaces, search engines), and in the file system of personal computers, a wide variety of schemas (such as taxonomies, directory trees, Entity Relationship schemas, RDF Schemas) are published which (i) convey a clear meaning to humans (e.g. they help in the navigation of large collections of documents), but (ii) convey only a small fraction (if any) of their meaning to machines, as their intended meaning is not formally/explicitly represented. Well-known examples are: classification schemas (or directories), used for organizing and navigating large collections of documents; database schemas (e.g. Entity-Relationship), used for describing the domain about which data are provided; RDF schemas, used for defining the terminology used in a collection of RDF statements. As an example, imagine that a multimedia repository uses a taxonomy like the one depicted in Figure 1 to classify pictures. For humans, it is straightforward to understand that any resource classified at the end of the path:
is likely to be a color picture of some lake in Trentino. However, this path is of little use for a standard search engine (perhaps the labels on the path can be matched with keywords, but this would not solve the usual problems of keyword-based search), or for a more semantic-aware application, as the meaning of the path is only partially encoded in the path itself and in the labels used to name the elements of the path. Indeed, our understanding of the path heavily depends on a large amount of contextual and domain knowledge (e.g. that pictures can be in color or black and white, that lakes have some geographical location, that Trentino is a geographical location, that pictures typically have a subject, and so on); and it is only the use of this knowledge which allows humans to infer that a file garda-panorama.jpg appended at the end of the path above is very likely to be a color picture containing a view of a lake called Garda located in Trentino, a region on the Italian side of the Alps. Similar arguments could be used for other structures, like the ER schema in Figure 2 or the RDF Schema in Figure 3.
In the paper, we present a general methodology and an implementation to make this rich meaning available and usable by computer programs. This is a contribution to bootstrapping semantics on the Web, which can be used to automatically elicit knowledge from very common web objects. The paper has two main parts. In the first part we argue that, in making explicit the meaning of a schema, most approaches tend to focus on what we call structural meaning, but almost completely disregard (i) the linguistic meaning of components (typically encoded in the labels), and (ii) its composition with structural meaning; our thesis is that this approach misses the most important aspects of how meaning is encoded in schemas. The second part of the paper describes our method for eliciting meaning from schemas, and presents an implementation called CTXMATCH2. In conclusion, as an example of an application, we show how the results of this elicitation process can be used for schema and ontology matching and alignment.
2 Meaningful (?) schemas
Consider three very common types of schemas: hierarchical classifications (HCs), ER Schemas and RDF Schemas. Examples are depicted in Figures 1, 2 and 3, respectively.
- Hierarchical Classifications.
- HCs are labeled tree-like structures whose purpose is to organize/classify data (e.g. documents, web pages, multimedia assets, goods, activities, services). An example of a HC is depicted in Figure 1;
Figure 1: An example of directory structure
- Entity-Relation Schemas.
- Entity-Relation (ER) schemas are a widely used specification language for the conceptualization of the domain of data stored in a database. An example of ER is provided in Figure 2;
Figure 2: An example of ER Schema
- RDF Schemas.
- An RDF schema is a specification of a vocabulary that can be used for expressing RDF statements. A tiny example of an RDF Schema can be found in Figure 3.
Figure 3: An example of RDF document
![Figure 3: An example of RDF document](http://www2006.org/programme/files/xhtml/4066/4066-bouquet/4066-bouquet-img7.png)
They are used in different domains (document management, database design, vocabulary specification) to provide a structure which can be used to organize information sources. However, there is a second purpose which is typically overlooked, namely to provide humans with an easier access to those data. This is achieved mainly by labelling the elements of a schema with meaningful labels, typically from some natural language. This is why, in our opinion, it is very uncommon to find a taxonomy (or an ER schema, or an RDF Schema specification) whose labels are meaningless for humans. Imagine, for example, how odd (and maybe hopeless) it would be for a human to navigate a classification schema whose labels are meaningless strings, or to read an ER schema whose nodes are labeled with random strings. Of course, humans would still be able to identify and use some formal properties of such schemas (for example, in a classification schema, we can always infer that a child node is more specific than its parent node, because this belongs to the structural understanding of a classification), but we would have no clues about what the two nodes are about. Similar observations can be made for the two other types of schemas. So, our research interest can be stated as follows: can we define a method which can be used to automatically elicit and represent the meaning of a schema in a form that makes available to machines the same kind of rich meaning which is available to humans when going through a schema?
3 Structural and lexical analysis
We said that each node (e.g. in a HC) has an intuitive meaning for humans. For example, one node in Figure 1 can be easily interpreted as "pictures of mountains in Tuscany", whereas another can be interpreted as "color pictures of mountains in Trentino". However, this meaning is mostly implicit, and its elicitation may require a lot of knowledge which is either encoded in the structure of the schema, or must be extracted from external resources. In [5], we identified at least three distinct levels of knowledge which are used to elicit a schema's meaning:
- Structural knowledge:
- knowledge deriving from the arrangement of nodes in the schema;
- Lexical knowledge:
- knowledge about the meaning of words used to label nodes;
- Domain knowledge:
- real world knowledge about meanings and their relations.
Most past attempts focused only on the first level. A recent example is [19], in which the authors present a methodology for converting thesauri into RDF/OWL; the proposed method is very rich from a structural point of view, but labels are disregarded, and no background domain knowledge is used. As to ER schemas, a formal semantics is defined for example in [4], using Description Logics; again, the proposed semantics is completely independent from the intuitive meaning of expressions used to label single components. For RDF Schemas, the situation is slightly different. Indeed, the common understanding is that RDFS schemas are used to define the meaning of terms, and thus their meaning is completely explicit; however, we observe that even for RDFS the associated semantics (see http://www.w3.org/TR/rdf-mt/) is purely structural, which means that there is no special interpretation provided for the labels used to name classes or other resources.
However, as we argued above through a few examples, labels (together with their organization in a schema) appear to be one of the main sources of meaning for humans. So we think that considering only structural semantics is not enough, and may lead to at least two serious problems:
- we may be unable to discriminate between schemas that are structurally, but not semantically, isomorphic;
- we may be unable to make any conjecture on the meaning of edges connecting nodes (elements) of a schema.
The first issue can be explained through a simple example. Suppose we have some method for making explicit the meaning of paths in HCs that does not take the meaning of labels into account. Now imagine we apply this method to a path in Figure 1, and compare the result with the one obtained for a path in another schema (notice that typical HCs do not provide any explicit information about edges in the path). Whatever representation the method is capable of producing, the outcome for the two paths will be structurally isomorphic, as the two paths are structurally isomorphic. However, our intuition is that the two paths have a very different semantic structure: the first should result in a term where a class ("pictures") is modified/restricted by two attributes ("pictures of beaches located in Tuscany"); the second is a standard Is-A hierarchy, where the relation between the three classes is subsumption. The only way we can imagine to explain this semantic (but not structural) difference is by appealing to the meaning of labels. We grasp the meaning of the first path because we know that pictures have a subject (e.g. beaches), that beaches have a geographical location, and that Tuscany is a geographical location. All this is part of what we called lexical and domain knowledge. Without it, we would not have any reason to consider "pictures" as a class and "Tuscany" and "beaches" as values for attributes of pictures. Analogously, we know that (a sense of the word) "dog" in English refers to a subclass of the class denoted by (a sense of the word) "mammals" in English, and similarly for "animals".
The second issue is closely related to the first one. How do we understand (intuitively) that the first path refers to pictures of beaches located in Tuscany, and not e.g. to pictures working for Tuscany teaching beaches? After all, the edges between nodes are not qualified, and therefore any structurally possible relation is in principle admissible. The answer is trivial: because, among other things, we know that pictures do not work for anybody (but they may have a subject), and that Tuscany cannot be the teacher of a beach (but can be the geographical location of a beach). It is only this body of background knowledge which allows humans to conjecture the correct relation between the meanings of node labels. If we disregard it, there is no special reason to prefer one interpretation to the other.
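The two problems can be illustrated with a small Python sketch. It is only an illustration, not the method proposed in this paper: the label order of the first path, the second path ANIMALS/MAMMALS/DOGS, and the `IS_A` table are assumptions made for the example, and treating every edge that is not a known subclass link as a "modifier" is a deliberate simplification.

```python
# Toy lexical/domain knowledge (hypothetical): known subclass pairs (child, parent).
IS_A = {("DOGS", "MAMMALS"), ("MAMMALS", "ANIMALS")}


def structural_skeleton(path):
    """Label-blind reading: keep only the shape of the path."""
    return tuple(f"n{i}" for i in range(len(path)))


def label_aware_reading(path):
    """Classify each edge using knowledge about the labels it connects.

    Simplification: any edge that is not a known subclass link is
    treated as a modifier/attribute edge.
    """
    return [
        "is-a" if (child, parent) in IS_A else "modifier"
        for parent, child in zip(path, path[1:])
    ]


pictures = ["PICTURES", "TUSCANY", "BEACHES"]  # assumed label order
animals = ["ANIMALS", "MAMMALS", "DOGS"]       # assumed second path

# Same shape, so a purely structural method cannot tell the paths apart.
print(structural_skeleton(pictures) == structural_skeleton(animals))
# With label knowledge the two paths receive different readings.
print(label_aware_reading(pictures))
print(label_aware_reading(animals))
```

The structural skeletons coincide, while the label-aware readings differ, which is the distinction the argument above relies on.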
The examples above should be sufficient to support the conclusion that any attempt to design a methodology for eliciting the meaning of schemas (basically, for reconstructing the intuitive meaning of any schema element into an explicit and formal representation of such a meaning) cannot be based exclusively on structural semantics, but must seriously take into account at least lexical and domain knowledge about the labels used in the schema. The methodology we propose in the next section is an attempt to do this.
4 Meaning representation
Intuitively, the problem of semantic elicitation can be viewed as the problem of computing and representing the (otherwise implicit) meaning of a schema in a machine understandable way. Clearly, meaning for human beings has very complex aspects, directly related to human cognitive and social abilities. Trying to reconstruct the entire and precise meaning of a term would probably be a hopeless goal, so our intuitive characterization must be read as referring to a reasonable approximation of meaning.
In our method, meanings are represented in a formal language (called WDL, for WordNet Description Logic), which is the result of combining two main ingredients: a logical language (in this paper, we use a logical language which belongs to the family of Description Logics [2]), and IDs of lexical entries in a dictionary (more specifically, from WordNet [8], a well-known electronic lexical database). Description logics are a family of logical languages that are defined starting from a set of primitive concepts, relations and individuals, with a set of logical constructors, and have been proved to provide a good compromise between expressivity and computability. They are supported by efficient reasoning services (see for instance [14]). WordNet is the largest and most shared online lexical resource, whose design is inspired by psycholinguistic theories of human lexical memory. WordNet associates with any word "word" a list of senses (equivalent to entries in a dictionary), each of which denotes a possible meaning of "word".
The core idea of WDL is to use a DL language for representing structural meaning, and any additional constraints (axioms) we might have from domain knowledge; and to use WordNet to anchor the meaning of labels in a schema to lexical meanings, which are listed and uniquely identified as WordNet senses. Indeed, the primitives of any DL language do not have an "intended" meaning; this is evident from the fact that, as in standard model-theoretic semantics, the primitive components of DL languages (i.e. concepts, roles, individuals) are interpreted, respectively, as generic sets, relations or individuals from some domain. What we need to do is to "ground" their interpretation to the WordNet sense that best represents their intended meaning in the label. So, for example, a label like "lake" can be interpreted as a generic class in a standard DL semantics, but can also be assigned an intended meaning by attaching it to the first sense in WordNet (which in version 2.0 is defined as "a body of (usually fresh) water surrounded by land").
The advantage of WDL w.r.t. a standard DL encoding is that assigning an intended meaning to a label allows us to import automatically a body of (lexical) knowledge which is associated with a given meaning of a word used in a label. For example, from WordNet we know that there is a relation between the class "lakes" and the class "bodies of water", which in turn is a subclass of physical entities. In addition, if an ontology is available where classes and roles are also lexicalized (an issue that we do not address directly here, but details can be found in [17]), then we can also import and use additional domain knowledge about a given (sense of a) word, for example that lakes can be holiday destinations, that Trentino has plenty of lakes, even that a lake called "Lake Garda" is partially located in Trentino, and so on and so forth.
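As a rough illustration of how anchoring a label to a sense lets a program import lexical knowledge, the following Python sketch walks a toy hypernym table. The table, the `word#n` identifier format, and the function names are assumptions made for this example; the table is not real WordNet data and collapses intermediate levels.

```python
# Toy hypernym table (hypothetical, NOT real WordNet data):
# each sense points to its more general sense.
HYPERNYM = {
    "lake#1": "body_of_water#1",
    "body_of_water#1": "physical_entity#1",
}


def ancestors(sense):
    """Return the chain of more general senses, nearest first."""
    chain = []
    while sense in HYPERNYM:
        sense = HYPERNYM[sense]
        chain.append(sense)
    return chain


def subsumed_by(specific, general):
    """True if `specific` equals `general` or lies below it in the table."""
    return specific == general or general in ancestors(specific)


print(ancestors("lake#1"))
print(subsumed_by("lake#1", "physical_entity#1"))
```

Once a label such as "lake" is attached to a sense, subsumption facts of this kind become available without having been stated in the schema itself.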
Technically, the idea described above is implemented by using WORDNET senses as primitives for a DL language. A WDL language is therefore defined as follows:
- the sets of (names for) primitive concepts, roles and individuals of WDL are subsets of the set of WordNet senses;
- complex concepts can be defined with the following production rule
where , , and ;
- an axiom in WDL is an expression of the form $C \sqsubseteq D$, where $C$ and $D$ are complex concepts, or of the form $C(a)$, where $C$ is a concept and $a$ is an individual.
Some remarks are necessary.
- Unlike standard DL, in a WDL language concepts, roles, and individuals are not disjoint sets. This is required for modeling the fact that a word sense (for example, the one glossed as "a determination of the place where something is") can be both a concept and a role. Formally, this is not a problem, as the context where a primitive object occurs makes it possible to determine whether it must be considered a concept, a role, or an individual.
- WDL has two semantics: a formal semantics and an intended semantics. The formal semantics is an interpretation function that associates each primitive concept with a set of objects, each primitive role with a binary relation over those objects, and each individual with an object. The formal semantics of complex concepts and axioms can be defined inductively (see [2] for details).
- The intended semantics is a new (derived) sense, which might not be in WordNet, and that can be associated with a gloss obtained by combining the glosses of the components. So, the intended semantics of a complex concept built from the senses of "car", "color" and "red" is "a motor vehicle with four wheels; usually propelled by an internal combustion engine", which has "a visual attribute ... that results from the light it emits or transmit or reflect", which is "...the chromatic color resembling the hue of blood". In short, a red car.
Despite the fact that the intended semantics cannot be formally represented or easily determined by a computer, one should accept its existence and consider it at the same level as a "potential" WordNet sense. Under this hypothesis we can assume that expressions in WDL convey meanings, and can be used to represent meaning in a machine. Put differently, since the WDL primitives represent common-sense concepts, the complex concepts of WDL will also represent common-sense concepts, since common-sense concepts are closed under boolean operations and universal and existential role restriction.
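The idea of building complex concepts from senses, and of deriving a gloss by combining the glosses of the components, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class names, the `word#n` rendering, the restriction to conjunction and existential restriction, the shortened glosses, and the shape of the red-car expression are all hypothetical choices for illustration, not the definition of WDL.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Sense:
    """A primitive: a word sense with its dictionary gloss."""
    word: str
    index: int
    gloss: str

    def __str__(self):
        return f"{self.word}#{self.index}"


@dataclass(frozen=True)
class And:
    """Conjunction of two concepts."""
    left: "Concept"
    right: "Concept"

    def __str__(self):
        return f"({self.left} AND {self.right})"


@dataclass(frozen=True)
class Exists:
    """Existential restriction: some `role` filler belongs to `filler`."""
    role: Sense
    filler: "Concept"

    def __str__(self):
        return f"EXISTS {self.role}.{self.filler}"


Concept = Union[Sense, And, Exists]


def derived_gloss(concept):
    """Combine the glosses of the components into one derived gloss."""
    if isinstance(concept, Sense):
        return concept.gloss
    if isinstance(concept, And):
        return f"{derived_gloss(concept.left)}, {derived_gloss(concept.right)}"
    if isinstance(concept, Exists):
        return f"which has {concept.role.gloss}, which is {derived_gloss(concept.filler)}"
    raise TypeError(f"not a concept: {concept!r}")


# Shortened glosses, used only for illustration.
car = Sense("car", 1, "a motor vehicle with four wheels")
color = Sense("color", 1, "a visual attribute")
red = Sense("red", 1, "the chromatic color resembling the hue of blood")

red_car = And(car, Exists(color, red))
print(red_car)
print(derived_gloss(red_car))
```

Note that `color` appears here in role position although it is an ordinary sense, which mirrors the remark that concepts, roles and individuals need not be disjoint.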
Example 1 Let us give some examples of the use of WDL descriptions to represent the meanings of the nodes of the schemas introduced in the previous section.
- The meaning of the node labeled with "Publication", considered in the context of the ER schema of Figure 2, is a WDL concept whose intuitive semantics is "a copy of a printed work offered for distribution" that "a human being" "writes ... professionally ...".
- The meaning of the node labeled with "Paper" is a WDL concept whose intuitive semantics is "an essay (especially one written as an assignment)".
- Finally, the meaning of a node of the hierarchical classification of Figure 1 is a WDL concept whose intuitive semantics is "a visual representation produced on a surface of" "areas of sand sloping down to the water of a sea or lake" "situated in a particular spot or position" which is "a region in central Italy".
From this perspective, the problem of semantic elicitation can be thought of as the problem of finding a WDL expression for each element of a schema, so that the intuitive semantics of the expression is a good enough approximation of the intended meaning of the node.
This section is devoted to the description of a practical semantic elicitation algorithm. This algorithm has been implemented as a basic functionality of the CTXMATCH2 matching platform [17], and has been extensively tested in the 2nd Ontology Alignment Evaluation Initiative.
In the following we will adopt dedicated notation for the meaning of a node, for the label of the node, and for the meaning of the label associated with the node considered out of its context. The latter is also called the local meaning.
The algorithm for semantic elicitation is composed of three main steps. In the first step we use the structural knowledge on a schema to build a meaning skeleton. A meaning skeleton describes only the structure of the WDL complex concept that constitutes the meaning of a node. In the second step, we fill the nodes of the skeleton with the appropriate concepts and individuals, using linguistic knowledge. In the final step, we provide the roles, by exploiting domain knowledge.
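The three steps can be sketched in Python as follows. This is a simplified illustration and not the CTXMATCH2 implementation: the tuple encoding of skeletons, the `LEXICON` and `DOMAIN` tables, the sense identifiers, the `related_to` fallback, and the assumption that labels are listed head-first with each node modifying the previous one are all hypothetical.

```python
# Hypothetical toy resources standing in for lexical and domain knowledge.
LEXICON = {"PICTURES": "picture#1", "BEACHES": "beach#1", "TUSCANY": "Tuscany#1"}
DOMAIN = {
    ("picture#1", "beach#1"): "subject#1",
    ("beach#1", "Tuscany#1"): "location#1",
}


def build_skeleton(n_nodes):
    """Step 1: structure only, with slots ('m', i) for local meanings
    and ('R', i, j) for the still unknown relations."""
    skeleton = ("m", n_nodes - 1)
    for i in range(n_nodes - 2, -1, -1):
        skeleton = ("and", ("m", i), ("exists", ("R", i, i + 1), skeleton))
    return skeleton


def fill_concepts(sk, labels, lexicon):
    """Step 2: replace meaning slots by senses, using lexical knowledge."""
    if sk[0] == "m":
        return ("sense", lexicon[labels[sk[1]]])
    if sk[0] == "and":
        return ("and",
                fill_concepts(sk[1], labels, lexicon),
                fill_concepts(sk[2], labels, lexicon))
    _, (_, i, j), body = sk  # existential restriction
    relation_slot = ("R", lexicon[labels[i]], lexicon[labels[j]])
    return ("exists", relation_slot, fill_concepts(body, labels, lexicon))


def fill_roles(sk, domain):
    """Step 3: choose a role for each relation slot, using domain knowledge."""
    if sk[0] == "sense":
        return sk[1]
    if sk[0] == "and":
        return f"{fill_roles(sk[1], domain)} AND {fill_roles(sk[2], domain)}"
    _, (_, a, b), body = sk
    role = domain.get((a, b), "related_to")  # generic fallback role
    return f"EXISTS {role}.({fill_roles(body, domain)})"


def elicit(labels):
    skeleton = build_skeleton(len(labels))
    return fill_roles(fill_concepts(skeleton, labels, LEXICON), DOMAIN)


print(elicit(["PICTURES", "BEACHES", "TUSCANY"]))
```

Keeping the steps separate means the skeleton can be computed from structure alone, while the lexicon and the domain table can be replaced independently.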
Given a schema, the structural knowledge (structural semantics) associated with this schema provides the skeleton for the meaning of each node. Therefore our procedure will start from this skeleton, and will try to fill the gaps with the extra, implicit semantics, obtained from lexical and domain knowledge. In this section we will describe the structural knowledge which can be associated with each of the three types of schemas presented above, and how it can be used to produce a meaning skeleton.
Meaning skeletons are DL descriptions together with a set of axioms. The basic components of a meaning skeleton (i.e. the primitive concepts and roles) are the meanings of the single labels associated with nodes, and the semantic relations between pairs of nodes. In the rest of this section we show how the meaning skeletons of the types of schema considered in this paper are computed.
A number of alternative formalizations for HCs have been proposed (e.g., [15,,]). Despite their differences, they share the idea that, in a HC, the meaning of a node is a specification of the meaning of its father node. E.g., the meaning of a node labeled with "clubs", with a father node which means "documents about Ferrari cars", is "Ferrari fan clubs". In DL, this is encoded as a concept in which the meaning of the father node restricts the local meaning of the node, through some relation that connects the two. If the label of the node is instead, for instance, "F40" (a Ferrari model), then the meaning of the node is "documents about Ferrari F40 car"; in this case it is the meaning of the label of the node that acts as modifier of the meaning of the father node. In description logics this is formalized as a concept in which the local meaning of the node restricts the meaning of its father. The choice between the first or the second case essentially depends both on lexical knowledge, which provides the meaning of the labels, and domain knowledge, which provides candidate relations between the two meanings. The following table summarizes some meaning skeletons associated with the HC provided above:
[Table: meaning skeletons for nodes of the HC of Figure 1, with columns "node" and "meaning skeleton"; several nodes have two alternative skeletons, joined by "or".]
Notice that, since at this level we do not have knowledge to distinguish which node is the modifier of the other, we have to consider all the alternative meaning skeletons.
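The enumeration of alternatives can be sketched in Python as follows. The string notation is an assumption made for this example: `l(X)` stands for the local meaning of the node labeled X and `R(X,Y)` for the still unknown relation between two nodes. At each edge, either the child's label is the head and is restricted by the meaning built so far, or the meaning built so far is the head and is restricted by the child's label.

```python
from itertools import product


def alternative_skeletons(path):
    """Return every alternative meaning skeleton for the last node of `path`.

    A path with k nodes has k - 1 edges and therefore 2 ** (k - 1) readings.
    """
    results = []
    for choices in product(("child_head", "parent_head"), repeat=len(path) - 1):
        meaning = f"l({path[0]})"
        for parent, child, choice in zip(path, path[1:], choices):
            relation = f"R({parent},{child})"
            if choice == "child_head":
                # The child's label is modified by the meaning built so far.
                meaning = f"l({child}) AND EXISTS {relation}.({meaning})"
            else:
                # The meaning built so far is modified by the child's label.
                meaning = f"({meaning}) AND EXISTS {relation}.l({child})"
        results.append(meaning)
    return results


for skeleton in alternative_skeletons(["PICTURES", "TUSCANY", "BEACHES"]):
    print(skeleton)
```

Lexical and domain knowledge are then needed to discard the readings that make no sense, since structure alone gives no reason to prefer one over another.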
Unlike HCs, the formal semantics for ER schemata is widely shared. In [4], one can find a comprehensive survey of this area. Roughly speaking, any ER schema can be converted into an equivalent set of DL axioms, which express the formal semantics of such a schema. This formal semantics is defined independently from the meaning of the single nodes (labels of nodes). Every node is considered as an atom. To stress this fact in writing meaning skeletons for ER, we will assign to each node an anonymous identifier. For instance, we use five such identifiers to denote the 5 nodes of the schema of Figure 2.
If we apply the formal semantics described in [4] to the example of ER given above, we obtain the following meaning skeletons.
| node | label | meaning skeleton |
|---|---|---|
| | Publication | |
| | Author | plus the axioms |
| | Person | |
| | Article | |
| | Journal | |
This table can be read as follows: the meaning skeleton of the node labeled with "Publication" is a DL concept description denoting any set of objects which are related to at least another object of some other set. The node labeled with "Author" is associated with a binary relation that satisfies the associated domain and range axioms. A similar interpretation can be given to the other nodes. It is important to notice that the meaning skeleton is independent from the labels, and ER schemas which are structurally the same will produce meaning skeletons which are equal.
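The independence of ER meaning skeletons from labels can be sketched in Python as follows. This is an illustration only: the input format, the textual rendering of the domain and range axioms, and both example schemas (including the direction of the Author relation) are assumptions, not the contents of Figure 2.

```python
def er_skeleton(entities, relations):
    """Anonymize an ER schema and return its axioms as strings.

    entities:  list of entity names
    relations: list of (relation_name, domain_entity, range_entity)
    """
    names = list(entities) + [rel for rel, _, _ in relations]
    # Every node becomes an anonymous atom n1, n2, ...
    ids = {name: f"n{i}" for i, name in enumerate(names, start=1)}
    axioms = []
    for rel, dom, rng in relations:
        # Domain axiom: whatever has a `rel` filler belongs to `dom`.
        axioms.append(f"EXISTS {ids[rel]}.TOP SUBSUMED_BY {ids[dom]}")
        # Range axiom: every `rel` filler belongs to `rng`.
        axioms.append(f"TOP SUBSUMED_BY FORALL {ids[rel]}.{ids[rng]}")
    return axioms


# Two hypothetical schemas with the same structure but unrelated labels.
library = er_skeleton(["Publication", "Person"],
                      [("Author", "Publication", "Person")])
garage = er_skeleton(["Car", "Company"],
                     [("Maker", "Car", "Company")])

print(library)
print(library == garage)
```

Because the labels are replaced by anonymous identifiers before any axiom is produced, the two schemas yield the same skeleton, and the difference in meaning has to be recovered in the later, label-aware steps.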
The meaning skeleton of the RDF Schema described in Figure 3 is provided by the formal semantics for RDF Schema described for instance in [11]. Most commonly used RDFS constructs can be rephrased in terms of description logics, as discussed in [13]. As we did above, we report the meaning skeletons for some of the nodes of the RDF Schema of Figure 3 in a table, in which we "anonymize" the nodes, by giving them meaningless names.