RSS
热门关键字:  数据挖掘  数据仓库  商业智能  人工智能  搜索引擎
当前位置 :| 首页>编程技术>p2p>

P2P Search Engines-Introduction

来源: 作者:unkonwn 时间:2004-12-01 点击:

Introduction

How traditional Search Engines work and their disadvantages [Richard Lee]


In a couple of minutes you will know exactly how a peer-to-peer search engine works. In case you don′t already know however, this section will first introduce you to the concept of a search engine and describe how traditional, client-server based search engines work. If you already know all this, feel free to skip straight to the next section. 数据挖掘研究院

The World Wide Web (WWW) consists of literally billions of web pages, spread across thousands and thousands of servers all over the world. Since it would be physically impossible for an individual to sift through and examine all these pages to locate specific information, search engines exist to do this work for you.

数据挖掘研究院

Search engines use special software called "spiders" which roam around the web, automatically following hyperlinks from one document to the next and extracting important textual information from them. It uses this information to build up a huge index correlating keywords to web pages. Search engines can′t just invent urls from which spiders start their crawl however. Instead all searches originate from web pages specified manually by human users and the spider subsequently follows the links on these pages to others. You might expect that this would result a great deal of the web left undiscovered by spiders. this is indeed true - a very large portion of the web is completely invisible to spiders - still, google claims to have a massive 2 billion web pages in its database.

数据挖掘研究院

When a user enters a query into a search engine from their browser, their input is processed and used to search the database for occurrences of particular keywords. The web pages it finds are ordered using a ranking algorithm unique to each search engine.

Obviously the highest ranked page should be the one the user is most likely to be interested in. Ranks are usually assigned based on a weighted average of a number of relevance scores, which are derived from the number of occurrences of a word in a page, if it appears in the page title, or in a heading, or the url itself, or the meta tag, and so on. After these web pages are sorted based on relevance, the hit list is communicated back to the browser where the user can explore the results. 数据挖掘研究院

 

So today′s search engines do not actually search the web directly. They merely refer to a database located on a centralised server. Of course, since web pages can (and frequently do) change at any time, this means that when you “search the web” you are really searching through stale copies of web pages, which may no longer be relevant or even exist. You will only find out if this is the case however after time is spent (trying to) retrieve the current version of the web page. So to tackle the issue of web documents being modified, moved, deleted or renamed, the index database needs to be updated continually and comprehensively to maintain high quality search results and reduce the amount of broken links reported to the user.

数据挖掘研究院

  数据挖掘实验室

Naturally, searching a billion or more web pages for a specific piece of information is a very compute intensive task. This leads to very complex and expensive hardware and software strategies to reduce search times. Powerful, dedicated servers are needed both to feed urls to spiders during their crawl and to process the data (the order of megabytes per second is usual) returned by spiders. Google, as an example, has it’s own dedicated domain name server (dns), simply to bypass the overhead of querying a world wide dns. furthermore, companies like google have massive storage and bandwidth requirements, which don’t come cheap. On the software side, custom compression solutions (to reduce storage costs) and algorithms such as hashing (for fast indexing of the database) are amongst the techniques used to ensure efficiency and speed. 数据挖掘实验室

So while current search engines are adequate for general web searching, they still have many disadvantages. They are costly to build and maintain, they inadequately handle dynamic web pages whose content changes frequently, they are ignorant of the vast majority of web pages, which are unreachable by spiders, and the information they reference can quickly go out-of-date. Of course, all these problems are exacerbated by the rate at which the internet continues to grow, making it virtually impossible for any centralised search engine to repeatedly visit and index all publicly accessible web pages. 数据挖掘研究院


 

Peer-to-Peer Network Search Engines [Edmund So]

Effective discovery methods must rely on a larger variety of information about the desired resources, typically in the form of metadata.

数据挖掘研究院

The major types of discovery methods are described here 数据挖掘研究院

  • Flooding broadcast of queries

    The original gnutella implementation is an example of a flooding broadcast discovery mechanism. when a peer makes a query, the query is then broadcasted to all the neighbour peers. If its neighbour peers could not solve the query, then the query is broadcasted to neighbour’s neighbour peers. If a resource is found, that peer will send a message to the origin sender of the query, indicates it can solve the query, and then establish a peer-to-peer connection.

    数据挖掘研究院

    Each query has a time-to-live (ttl) counter. Typically, the ttl is set between 5 and 7, and the value is decremented by each node as it relays the message. Another counter tracks the number of hops. Once the ttl counter reaches zero, the query will be discarded.

    Due to the broadcast nature of each query, the system does not scale well (o(n2)), the bandwidth network assumption grows exponentially with a linear increase in the number of peers. Raising the number of peers in the system will cause the network quickly reach bandwidth saturation. 数据挖掘研究院

    This type of method has the advantage of flexibility in the processing of queries. Each peer can determine how it will process the query and respond accordingly. It is simple to design and efficient. unfortunately, it is suitable only for small networks. As well as that, this type of mechanism is very susceptible to malicious activity, rogue peers can send out large number queries, which produce a significant load on the network. 数据挖掘研究院

  • Selective forwarding systems

    Selective forwarding systems are more scalable then flooding broadcast systems. Instead of sending a query to all peers, it is selectively forward to specific peers who are considered likely to locate the resource. Peers will become super peer automatically if they have sufficient bandwidth and processing power, i.e. if a peer has broadband connection and higher processing power. Peers with dial-up connection (low bandwidth) will make queries to super peers. this type of systems use flow control algorithm(fca), it tries to apply a form of intelligent flow control to how a peer forward request and response messages and a sensible priority scheme to how it drops messages that won’t fit into the connections.

    数据挖掘研究院

    With this approach, it greatly reduces bandwidth limitations to scalability. but it is susceptible to malicious activity, a rogue peer can insert itself into the network at the various points and misroute queries, or discard them altogether. 数据挖掘研究院

    Each peer must also contain some amount of information used to route or direct queries received. the size of this information is negligible in a small network, but in large networks, this overhead may grow to levels that are unsupportable, hence it is not suitable for a large peer network. 数据挖掘研究院

  • Decentralized hash table networks 数据挖掘实验室

    In decentralized hash table networks, each file stored with in the system is given a unique id, typically an sha-1 hash of its content, which is used to identify and locate a resource. Given this unique id, a resource can be located quickly despite the size of the network. Since each resource is identified by this key, it is impossible to perform a fuzzy or keyword search within the network. if a peer is looking for a file from another peer, it must obtain this key first in order to retrieve the file.

    These systems are also susceptible to malicious activity by rouge peers. They may discard a query, insert large amount of frivolous data to clutter the keyspace, or flood the network with queries to degrade the performance.

    数据挖掘实验室

  • Centralized indexes and repositories 数据挖掘实验室

    Napster uses centralized indexed and repositories system. Index of all peers and their resources are kept on large indices on main server. A query is sent to a server, then the server will look-up the index, if the query can be solved, then the server will send a message to the origin query sender, about where he could get the file. 数据挖掘研究院

    Centralized indexes have provided the best performance for resource discovery. The server in centralized indexes and repositories system is expensive, the bandwidth and hardware required to support large networks of peers are expensive. 数据挖掘研究院

    If the server in the system fails to function properly, it brings down the network. in the case of napster, it has a cluster of servers, if one server fails, the rest of the servers will continue to support the network.

    Recent court rulings cast serious doubt about the liability involved in using centralized servers to index resources in a peer based network. It has been said that the recent legal precedents require any such system to monitor usage and activity of the network exactly to ensure that no types of copyright violations are occurring. The ability to monitor and enforce this requirement is quite challenging, and may be too much of a risk. 数据挖掘实验室

  • Distributed indexes and repositories

    数据挖掘研究院

    With this approach, we could eliminate the need for expensive centralized servers, the idea of distributed index is that, each content broker in the network keeps an index of local files as well as an index of some files stored in some neighbouring content broker, when a content broker receives a query from a peer, it first checks to see if the query can be satisfied locally. If it cannot, it uses the local index to decide which content broker to forward the request to. The index on each server is not static and changes as files move through the system. 数据挖掘实验室

    If it is well designed, it provides the best performance and scalability of any solution, it also has a very high tolerance of signal point failure, because a content broker only contains a relative small indexes in compare with centralized server, so if one content broker goes down, the network will still function properly.

    数据挖掘研究院

    The most difficult problem with this type of indexing systems is cache coherence of all the indexed data. if a file is changed locally by a peer, then the content broker would not aware that particular file has been changed, later, when another peer request that particular file, the content broker will return an out-of-date copy of that file. The overhead in keeping everything up-to-date and efficiently distributed is a major detriment to scalability.

    数据挖掘研究院

    Peers joining and leaving the network from time to time, when a peer leaves a network, all the resources indexes stored in that peer will become unavailable to other peers. 数据挖掘研究院

    Distributed indexing systems as they currently exist cannot provide robust discovery in large networks.

    数据挖掘实验室

  • Relevance driven network crawlers 数据挖掘研究院

    Relevance driven network crawlers are a different approach to the resource discovery problem. Instead of performing a specific query based on peer request, they use a database of existing information the peer has accumulated to determine which resources it encounters may or may not be relevant or interesting to the peer. 数据挖掘研究院

    Over time, a large amount of information is accrued which is analysed to determine what common elements the peer has found relevant. the crawler then traverses the networks, usually consisting html documents for new information, which matches the profile distilled from the previous peer information.

    数据挖掘研究院

    The time requires for the crawler to traverse a large amount of content is very long, it is not suitable for large networks. 数据挖掘研究院


 

Current P2P Search Implementations – Michael Collins

Two main models of p2p networks for file sharing have evolved: 数据挖掘研究院

Searching centralized server-client p2p networks 数据挖掘研究院

Searching on a centralized p2p network is made easy by the presence of a single central server system which maintains directories of the shared files stored on the respective pcs of every user on the network. When a user searchs for a file the central server creates a list of files matching the search request, by cross-checking the request with the server′s database of files belonging to users who are currently connected to the network. The central server then displays that list to the requesting user, the requesting user can then choose files from the list and make direct connections to the individual computers which currently posses that file. 数据挖掘研究院

Advantages of the server-client architecture 数据挖掘实验室

The principle advantage of the server-client architecture is the central index which locates files quickly and efficiently. Also because all clients have to be registered as part of the network search requests reach all logged on clients which ensures the search is as through as possible. 数据挖掘研究院

Disadvantages of the server-client architecture

The central server system provides a single point of failure and a visible target for legal attacks on the network. Also because the central server index is only updated periodically there is a possibility of a client recieving outdated information. 数据挖掘研究院

Server-client p2p networks currently operating today include:

  • opennap - Open source, a network of servers running the napster server-client protocol.

  • kazaa - uses a semi-centralized architecture, and is the underlying network for morpheus and grokster. A proprietary protocol, it uses a self-organizing network that automatically assigns more tasks to peers with higher bandwidth. 数据挖掘研究院

  • edonkey - also a semi-centralized network. anyone is free to run a server. There are loosely connected, separate index servers. when a client sends a search request to a server only that server is searched, once the search is completed the client can send the same request to the next server in its list if necessary. Once a file is in the clients download queue further servers are queried. This is because edonkey was the first client application to simultaneously download from multiple sources and it also supports sharing of partially downloaded files so other clients can start downloading a file from you whilst you are still downloading it. 数据挖掘研究院

  • filetopia - a spanish developed client-server application which uses strong public key encryption to ensure security and anonymity. 数据挖掘研究院

Searching decentralized p2p networks

The concept of decentralization is to remove the central structure of a network such that each peer can communicate as an equal to any other peer. When a peer (a) connects to a decentralized network it connects to another peer (b) to announce that it is live, b will then in turn announce to all peers it is connected to (c, d, e, f etc.) that a is alive, c, d, e, f etc repeating the pattern. Once a has announced that it is alive it can send a search request on to b, which in turn passes it on to c, d, e, f etc. If for example c has a copy of the file requested by a it transmits a reply to b which passes it back to a which can then open a direct connection to d and download the file. As illustrated by this flash animation.

Although this theoretically allows for an infinite network, in practice a time to live (TTL) is used to control the number of nodes a request can reach. 数据挖掘研究院

Advantages of a decentralized architecture 数据挖掘研究院

They are more rugged, because a single point of failure is eliminated. They are also harder to shut down.

数据挖掘研究院

  数据挖掘研究院

Disadvantages of a decentralized architecture 数据挖掘研究院

Searching a decentralized network is slower. You are not guaranteed to find a file even if it is on the network because it may be too far away for a search request to reach the peer which has it before the TTL expires. 数据挖掘研究院

Decentralized p2p networks currently operating today include: 数据挖掘研究院

  • Gnutella is by far the most popular alternative to the opennap network. The protocol has been reverse engineered and put under the control of an open source foundation. It has also been extended to support hashing, quality of service and multi-source downloading. 数据挖掘实验室

  • mnet grew out of a commercial product mojonation. it is a universal file space, however a filesharing application was the first application developed for it.

  • The freenet project is more then just a file sharing p2p application. it allows completely anonymouse information exchange. It also implements caching and each filename is hashed so servers do not know what they are storing and uses strong encryption.

  • gnunet is an attempt to create an anonymous, distributed, reputation based network in which all connections are authenticated and encrypted. 数据挖掘研究院


 

The origins of JXTA search – Sean Reilly

JXTA search originated with Gene Khan and a company he founded called infrasearch. Infrasearch was developed after khan realised that gnutella was a distributed searching network and could be used to access all manner of data completely independent of the of format (i.e. not just mp3s etc.). He found that typical web crawlers had stale data taking weeks for newly posted documents to be available on the web and also that crawlers didn’t access large databases that were open to the web especially after a “?” in the url. To solve these problems he came up with a prototype of infrasearch base on gnutella and using the gnutella backend. The basic idea behind infrasearch was to distribute the query to the edges of the network and let the intelligence of the peer it is being sent to process it in whatever format is appropriate for the query and respond. Infrasearch was bought by sun microsystems in march 2001 and the development team were incorporated into sun’s JXTA project, with the intention of using the ideas developed by infrasearch to develop JXTA search the search method used in project JXTA. The infrasearch team acquired by sun moved away from using the gnutella backend and developed their own xml based protocol which was based in some ways on the gnutella protocol, e.g. the notion of letting each peer process the queries as it sees fit, and the distribution of queries among peers. 数据挖掘实验室

JXTA search

The JXTA search engine consists of the following participants, 数据挖掘实验室

  • JXTA search information providers , anything that responds to requests formatted in the JXTA search qrp language information providers may be JXTA peers or web servers, such as cnn.com. 数据挖掘实验室

  • JXTA search consumers, anything that makes requests in the JXTA search qrp language. consumers may be JXTA peers or web sites with http client interfaces to the JXTA search network.

  • JXTA search hub, a mechanism that facilitates efficient query routing over the JXTA search network by handling message routing between consumers and providers. Providers register a description of themselves with the hub and wait for matching requests. Consumers query the network through the hub and await responses.

     

Hubs can also be providers or consumers they can be chained together in a network. 数据挖掘研究院

Applications send requests to their nearest search hub which then forwards these requests to appropriate service providers based on meta-data they provide when registering with the search hub. the provider then sends back the response to the hub that it requested it and it then travels from hub to hub until it reaches the application that requested it. 数据挖掘研究院

Wide and deep search methods

JXTA search has two complementary search techniques wide and deep search. 数据挖掘研究院

Wide and deep search in the JXTA network source: search.JXTA.org 数据挖掘研究院

  • Wide search

    In the JXTA search context wide search means sending data from search hub to search hub. These search hubs are an efficient compromise between a central server bottleneck issue on one extreme and the wasted bandwidth/resources issue of a completely connected network. Each hub is intended to be specialized in some way i.e. content application etc. and if the hub cannot answer the request (or if it is intended to go as wide as possible) it forwards the request to other hubs. 数据挖掘研究院

  • Deep search

    Web crawlers often have stale data and are only effective for static content such as home pages etc. dynamic pages such as news sites etc. are not effectively indexed by web crawlers. Deep search engines find data in large databases such as amazon or large news sites. JXTA search decides which queries should be sent to them and returns the results. This is intended by the JXTA developers to give a much wider better coverage of the data then conventional search engines.

The future of JXTA search

Suns project JXTA has been released to the open source community in an initiative that sun hope will help project JXTA become the number one peer2peer platform. This is a very important step in project JXTAs development because in the end it will be the number of people using the JXTA platform that decides whether or not project JXTA will be a success or not. At present sun claim to have over 10,000 members in the project JXTA community and still growing. including the open source community in the project will i think result in much bigger interest from the developer community and therefore many more applications using the JXTA p2p platform being written. hopefully in the near future we will see some successful implementations of JXTA search coming into widespread use and so we will be able to properly gauge the success or otherwise of JXTA search.

数据挖掘实验室


Further information and references: 数据挖掘研究院


数据挖掘研究院

“peer to peer”, bo leuf, addison wesley 2002 数据挖掘研究院

http://www.guardian.co.uk/online/story/0,3605,478269,00.html

数据挖掘研究院

http://searchenginewatch.com/sereport/01/03-p2p.html

http://www.howstuffworks.com/search-engine.htm

数据挖掘研究院

http://www.howstuffworks.com/news-item127.htm 数据挖掘研究院

http://www.howstuffworks.com/napster.htm 数据挖掘研究院

http://www.wired.com/news/technology/0,1282,41850,00.html

http://news.com.com/2100-1023-253141.html?legacy=cnet 数据挖掘研究院

http://searchenginewatch.com/sereport/00/06-music.html 数据挖掘研究院

http://mnet.sourceforge.net/doc.php

http://www.infoanarchy.org/wiki/wiki.pl?file_sharing 数据挖掘研究院

http://www.limewire.com/index.jsp/p2p 数据挖掘实验室

http://search.JXTA.org/JXTAsearch.pdf

数据挖掘研究院

http://www.openp2p.com/pub/a/p2p/2001/06/06/JXTAsearch.html

“distributed search in peer-to-peer networks” steve  waterhouse, david m.  doolin, gene  kan, yaroslav  faybishenko 数据挖掘研究院

http://computer.org/internet/

数据挖掘研究院


Aside: What is meta data? [Edmund So]

Metadata is sometimes defined literally as ‘data about date’, it has label like “title”, “author”, “type”, height” and “language” used to describe a book, person, article etc. 数据挖掘研究院

Example of metadata in an html document

数据挖掘研究院


数据挖掘研究院

<html>

数据挖掘研究院

<head> 数据挖掘研究院

<title>how does peer-to-peer search engines work</title>

数据挖掘研究院

<meta name=”description” content=”this article addresses…”>

数据挖掘研究院

<meta name=”creator” content=”metadata, rdf, peer-to-peer”>

数据挖掘研究院

<meta name=”publisher content=”trinity college dublin computer science department”>

<meta name=”date” content=”2002-11-24t00:12:00+00:00”> 数据挖掘研究院

<meta name=”type” content=”article”> 数据挖掘研究院

<meta name=”language” content=”en-ie” 数据挖掘研究院

….. 数据挖掘研究院

….. 数据挖掘实验室

</head>

数据挖掘研究院

….. 数据挖掘实验室

….. 数据挖掘实验室

</html> 数据挖掘研究院


数据挖掘研究院

The original html mechanism for embedding metadata has been proven limited. There is no built-in convention to control the names given to the various embedded metadata fields, and these fields are often ignored by search engine.

数据挖掘研究院

最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?