A web crawler (also known as a web spider or internet robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls web information according to certain rules. Other, less frequently used names include ant, automatic indexer, emulator, and worm.
Introduction
With the rapid development of the Internet, the World Wide Web has become a vast carrier of information, and how to effectively extract and use this information has become a great challenge. Search engines, such as the traditional general-purpose search engines AltaVista, Yahoo!, and Google, serve as tools that help people retrieve information and have become the entrance and guide for users accessing the World Wide Web. However, these general-purpose search engines have certain limitations, such as:
(1) Users in different fields and with different backgrounds often have different retrieval purposes and requirements, and the results returned by a general-purpose search engine contain a large number of webpages that the user does not care about.
(2) The goal of a general-purpose search engine is the largest possible coverage, so the contradiction between limited search engine server resources and unlimited network data resources will deepen further.
(3) As the forms of World Wide Web data grow richer and network technology continues to develop, multimedia data such as databases, pictures, and audio/video appear in large quantities. General-purpose search engines are often powerless with such information-dense and structured data and cannot discover or acquire it well.
(4) Most general-purpose search engines provide keyword-based retrieval and find it difficult to support queries based on semantic information.
To solve the above problems, focused crawlers, which grab related webpage resources in a directed way, have emerged as the times require. A focused crawler is a program that automatically downloads webpages; guided by an established crawl target, it selectively accesses webpages on the World Wide Web and their related links to obtain the needed information. Unlike a general-purpose web crawler, a focused crawler does not pursue large coverage; its goal is to crawl webpages related to a particular subject and to prepare data resources for subject-oriented user queries.
1 Working principles and key technologies of focused crawlers
A web crawler is a program that automatically extracts webpages; it downloads webpages from the World Wide Web for a search engine and is an important component of the search engine. A traditional crawler starts from the URLs of one or several initial webpages, obtains the URLs found on those pages, and, while crawling, continuously extracts new URLs from the current page and puts them into the queue until a certain stop condition of the system is met. The workflow of a focused crawler is more complex: it needs to filter out links unrelated to the topic according to a certain webpage analysis algorithm, keep the useful links, and put them into the queue of URLs waiting to be crawled. It then selects the next webpage URL to crawl from the queue according to a certain search strategy, and repeats the process until a certain stop condition of the system is reached. In addition, all webpages crawled by the crawler are stored by the system, then analyzed, filtered, and indexed for later querying and retrieval; for a focused crawler, the results of this analysis may also provide feedback and guidance for the subsequent crawling process.
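The crawl loop described above can be made concrete with a short sketch. The following Python code is a minimal illustration using only the standard library: it implements the traditional loop (seed URLs, a queue of URLs to fetch, extraction of new links, a stop condition) and leaves a placeholder is_relevant() hook where a focused crawler would apply its webpage analysis algorithm. The function names, the timeout, and the max_pages limit are illustrative assumptions, not details taken from the text.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def is_relevant(url, html):
    """Placeholder for the page-analysis / link-filtering step; a focused
    crawler would score the page or link against its target topic here."""
    return True  # a general-purpose crawler keeps everything


def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)          # URLs waiting to be fetched
    seen = set(seed_urls)             # avoid fetching the same URL twice
    pages = {}                        # fetched pages, later analyzed/indexed

    while queue and len(pages) < max_pages:   # system stop condition
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                  # skip unreachable pages
        if not is_relevant(url, html):
            continue                  # a focused crawler drops irrelevant pages
        pages[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)     # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)        # new URLs join the queue
    return pages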
Compared with a general-purpose web crawler, a focused crawler also needs to solve three major problems:
(1) the description or definition of the crawl target;
(2) the analysis and filtering of webpages or data;
(3) the search strategy for URLs.
The description and definition of the crawl target are the basis for deciding how the webpage analysis algorithm and the URL search strategy are formulated, while the webpage analysis algorithm and the candidate URL ranking algorithm are the key to determining both the form of service the search engine provides and the crawling behavior of the crawler. These two parts of the algorithm are closely related, as sketched below.
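One way to picture this relationship is the following sketch, an assumption-laden illustration rather than the specific method described in the text: each fetched page is scored against a set of topic keywords, and that score is used to order its out-links in a candidate URL priority queue, so the webpage analysis result directly drives the URL search strategy. The keyword list and the scoring formula are assumptions made only for this example.

import heapq
import re


def page_score(html, topic_keywords):
    """Page analysis: fraction of topic keywords that occur in the page text."""
    text = re.sub(r"<[^>]+>", " ", html).lower()   # crude tag stripping
    hits = sum(1 for kw in topic_keywords if kw in text)
    return hits / len(topic_keywords)


class CandidateQueue:
    """Candidate URL queue ordered by the relevance score inherited from the
    page on which each link was found (one simple ranking strategy)."""
    def __init__(self):
        self._heap = []

    def push(self, url, score):
        heapq.heappush(self._heap, (-score, url))  # highest score pops first

    def pop(self):
        score, url = heapq.heappop(self._heap)
        return url, -score

    def __bool__(self):
        return bool(self._heap)


# usage: after fetching and analyzing a page, enqueue its out-links
# queue = CandidateQueue()
# s = page_score(html, ["crawler", "search engine", "index"])
# for link in extracted_links:
#     queue.push(link, s)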
2 Description of the crawl target
Existing descriptions of the crawl targets of focused crawlers can be divided into three categories: based on target webpage features, based on a target data model, and based on domain concepts.
For crawlers whose targets are described by webpage features, the objects that are crawled, stored, and indexed are generally websites or webpages. According to how the seed samples are obtained, they can be divided into:
(1) pre-given initial seed samples to crawl;
(2) a pre-given directory of webpage categories and seed samples corresponding to each category, such as the Yahoo! classification structure;
(3) target samples determined by user behavior, which are further divided into:
a) crawl samples annotated during user browsing;
b) access patterns and related samples obtained by mining user logs.
Among them, the webpage features may be content features of the webpage, its link structure, and so on; a content-feature variant is sketched below.
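As a simple illustration of a content-feature target description based on seed samples, the sketch below builds a word-frequency profile from the seed sample pages and accepts a candidate page when its cosine similarity to that profile exceeds a threshold. The tokenizer, the threshold value, and the use of raw term counts (rather than, say, TF-IDF or link-structure features) are assumptions made only for this example.

import math
import re
from collections import Counter


def bag_of_words(text):
    """Tokenize into lowercase words and count term frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_target_profile(seed_texts):
    """Merge the word counts of all seed sample pages into one target profile."""
    profile = Counter()
    for text in seed_texts:
        profile.update(bag_of_words(text))
    return profile


def matches_target(page_text, profile, threshold=0.3):
    """Content-feature test: is the page similar enough to the seed samples?"""
    return cosine(bag_of_words(page_text), profile) >= threshold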