The name "web spider" is a very literal image. If the Internet is pictured as a spider's web, then the spider is a program crawling across that web. A web spider finds web pages by following links: it starts from one page of a site (usually the home page), reads its content, extracts the addresses of the other links on the page, and then follows those addresses to the next pages, looping endlessly until every page of the site has been crawled. If the entire Internet is regarded as one big website, a spider can use the same principle to crawl every page on the Internet.
For a search engine, crawling every page on the Internet is almost impossible. According to currently published figures, even the largest search engines have crawled only around 40 percent of all web pages. One reason is a technical bottleneck in crawling itself: it is impossible to traverse all pages, and many pages cannot be reached through links from other pages. The other reason is the cost of storage and processing: if the average page (including images) is about 20 KB, then 10 billion pages amount to 100 × 2000 GB (roughly 200 TB) of data. Even if that much can be stored, downloading it is still a problem: at 20 KB per second per machine, about 340 machines would have to download non-stop for a year to fetch all of those pages. In addition, such a huge volume of data affects how efficiently searches can be served. For these reasons, many search engine spiders crawl only the important pages, and when judging importance during the crawl, the link depth of a page is one of the main criteria.

When crawling the web, spiders generally follow one of two strategies: breadth-first or depth-first (shown in the figure below). Breadth-first means the spider first crawls all pages linked from the start page, then picks one of those linked pages and crawls all pages linked from it in turn. This is the most common approach, because it lets the spider crawl in parallel and improves crawl speed. Depth-first means the spider starts from the start page and follows one chain of links to the end before moving on to the next start link and following that chain. The advantage of this method is that the spider is easier to design. The figure below makes the difference between the two strategies clearer.
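As a minimal sketch of the two strategies (the link graph and page names below are made up purely for illustration), the following Python example prints the order in which a breadth-first and a depth-first spider would visit the pages of a small site:

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["news", "products", "about"],
    "news": ["article1", "article2"],
    "products": ["widget"],
    "about": [],
    "article1": [], "article2": [], "widget": [],
}

def crawl_breadth_first(start):
    """Visit every page linked from the current layer before going deeper."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def crawl_depth_first(start):
    """Follow one chain of links to its end before backtracking."""
    seen, order, stack = set(), [], [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push links in reverse so the first link on the page is explored first.
        for link in reversed(LINKS.get(page, [])):
            stack.append(link)
    return order

print("Breadth-first order:", crawl_breadth_first("home"))
print("Depth-first order:  ", crawl_depth_first("home"))
```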
Since it is impossible to crawl every page, some spiders set an access depth (a number of layers) for the less important parts of a site. For example, in the figure above, A is the start page at layer 0; B, C, D, E and F belong to layer 1; G and H belong to layer 2; and I belongs to layer 3. If the spider's access depth is set to 2, page I will never be visited. This is also why only part of a site's pages can be found through a search engine while the rest cannot. For web designers, a flat site structure helps search engines crawl more of the site's pages. A sketch of such a depth limit follows below.

When visiting a site, a spider often runs into encrypted data and access-permission problems; some pages can only be viewed with a member account. The site owner can of course use the robots protocol to keep spiders away entirely (explained in the sections below), but some sites, for example those that sell reports, want search engines to find their reports without letting everyone read them for free. Such sites need to supply the spider with an appropriate user name and password. With these credentials the spider can crawl the protected pages and make them searchable; when a searcher clicks through to view such a page, the same permission check is applied.

Website and web spider

A web spider consumes far more server resources than an ordinary visitor, and if it is not managed properly it can overload the server. In April this year, Taobao's servers became unstable because of heavy crawling by Yahoo's search engine spider. Does that mean websites and spiders cannot communicate? In fact, there are several ways for them to do so: the site administrator can find out where a spider came from and what it has done, and can also tell the spider which pages should not be crawled and which pages should be updated.
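To illustrate the access-depth limit with the layered example above, the sketch below assumes a link structure among pages A through I that matches the layer assignments in the text (which layer-1 pages link to G and H is an assumption). It crawls breadth-first but stops at a configurable layer, so with a limit of 2 the layer-3 page I is never visited:

```python
from collections import deque

# Assumed link graph following the layer description in the text:
# A is layer 0; B..F are layer 1; G and H are layer 2; I is layer 3.
LINKS = {
    "A": ["B", "C", "D", "E", "F"],
    "B": ["G"], "C": [], "D": ["H"], "E": [], "F": [],
    "G": ["I"], "H": [], "I": [],
}

def crawl_with_layer_limit(start, max_layer):
    """Breadth-first crawl that never visits pages deeper than max_layer."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])  # (page, layer)
    while queue:
        page, layer = queue.popleft()
        order.append((page, layer))
        if layer == max_layer:
            continue  # do not follow links out of the deepest allowed layer
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, layer + 1))
    return order

# With a limit of 2, page I (layer 3) is never crawled.
print(crawl_with_layer_limit("A", max_layer=2))
```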
Every web spider has its own name and identifies itself when crawling a site. When a spider requests a page, the request carries a User-agent field that identifies the spider. For example, Google's spider identifies itself as GoogleBot, Baidu's spider as BaiDuSpider, and Yahoo's spider as Inktomi Slurp. With access to the site's logs, the administrator can see which search engine spiders have visited, when they came, how much data they read, and so on. If the administrator finds that a particular spider is causing problems, its identifier makes it possible to contact the spider's owner. The original article illustrated this with a search engine access log from BlogChina dated May 15, 2004.

When a spider enters a website, it usually first requests a special text file, robots.txt, which normally sits in the root directory of the web server. In robots.txt the site administrator can declare which directories must not be crawled at all, or which directories are off-limits to particular spiders. For example, an administrator may not want executable files or temporary directories to be indexed, and can declare those directories as disallowed. The robots.txt syntax is very simple; if no directory is restricted, the file can consist of just the following two lines:

User-agent: *
Disallow:

Of course, robots.txt is only a convention. If a spider's designers choose not to follow it, the administrator cannot use it alone to stop the spider from reaching certain pages, but in practice spiders do follow the protocol, and administrators also have other means of denying a spider access to particular pages.
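As a small sketch of how a polite spider might combine these two mechanisms (the spider name and the site URL are placeholders), the example below checks robots.txt with Python's standard urllib.robotparser before fetching a page, and sends its own User-agent string with the request:

```python
import urllib.request
import urllib.robotparser

SPIDER_NAME = "ExampleSpider/1.0"     # hypothetical spider identity
SITE = "https://www.example.com"      # placeholder site

# Read the site's robots.txt from the server's root directory.
robots = urllib.robotparser.RobotFileParser()
robots.set_url(SITE + "/robots.txt")
robots.read()

def fetch(url):
    """Fetch a page only if robots.txt allows this spider to crawl it."""
    if not robots.can_fetch(SPIDER_NAME, url):
        print("robots.txt disallows", url)
        return None
    request = urllib.request.Request(url, headers={"User-Agent": SPIDER_NAME})
    with urllib.request.urlopen(request) as response:
        return response.read()

page = fetch(SITE + "/index.html")
```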
When a spider downloads a page, it also parses the page's HTML code and recognises the META tags in it. These tags can tell the spider whether the page needs to be crawled at all and whether the links on the page should be followed. For example, a META tag of the form <meta name="robots" content="noindex,follow"> says that the page itself should not be indexed, but that the links on the page should still be tracked.

Most sites today want search engines to crawl their pages as completely as possible, because that brings more visitors to the site through search. To help a spider crawl the site more thoroughly, the administrator can create a site map (Site Map). Many spiders use a sitemap.htm file as the entry point for crawling a site; the webmaster can put links to all of the site's pages in this file, so the spider can easily crawl the whole site without missing any pages, which also reduces the load on the server.

Content Extraction

What a search engine indexes is a text file, but the pages a spider crawls come in many formats: html, images, doc, pdf, multimedia, dynamic pages and so on. After these files have been fetched, the text information must be extracted from them. Extracting this information accurately matters for the precision of the search engine, and it also affects whether the spider can correctly follow the links to other pages.
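A minimal sketch of how a spider might read these META directives, using only Python's standard html.parser (the sample HTML is made up):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of a <meta name="robots" ...> tag."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives |= {d.strip().lower() for d in content.split(",")}

html = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
parser = RobotsMetaParser()
parser.feed(html)

index_page = "noindex" not in parser.directives    # should this page be indexed?
follow_links = "nofollow" not in parser.directives  # should its links be tracked?
print(index_page, follow_links)                     # False True
```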
Documents such as doc and pdf files are generated by software from professional vendors, and those vendors generally provide text-extraction interfaces for their formats. The spider only needs to call these plug-in interfaces to easily extract the text of a document together with the other relevant information about the file.
HTML documents are different. HTML has its own syntax, using markup tags to express fonts, colours, layout and position (table, font and heading tags, for example), and these tags have to be filtered out when the text is extracted. Filtering the tags is not difficult, because they follow fixed rules, and the information inside them can be picked out accordingly. While recognising this information, however, the spider also needs to record a good deal of layout information, such as the font size of a piece of text, whether it is a heading, whether it is bold, whether it is a keyword of the page, and so on; this information helps to judge how important each word is within the page. At the same time, an HTML page contains, besides the title and the body, many advertising links and shared navigation links that have nothing to do with the page's own text, and these useless links also have to be filtered out during content extraction. For example, suppose a site has a "Products" channel whose link appears in the navigation bar of every page. If those navigation links are not filtered, a search for "Products" will return every page of the site, which obviously produces a lot of junk results. Filtering such invalid links requires analysing the structure of a large number of pages statistically, extracting common patterns, and filtering them uniformly; a few important sites with unusual structures also need to be handled individually. This requires the spider's design to be reasonably extensible.
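As an illustrative sketch (not the extraction logic of any particular engine), the parser below strips HTML tags with Python's standard html.parser while recording whether each text fragment appeared inside a heading or bold tag, the kind of layout information the passage describes:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags and remember simple layout context for each text fragment."""
    def __init__(self):
        super().__init__()
        self.context = []     # currently open tags
        self.fragments = []   # (text, is_heading, is_bold)

    def handle_starttag(self, tag, attrs):
        self.context.append(tag)

    def handle_endtag(self, tag):
        # Pop the most recently opened matching tag, if any.
        for i in range(len(self.context) - 1, -1, -1):
            if self.context[i] == tag:
                del self.context[i]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            is_heading = any(t in ("h1", "h2", "h3") for t in self.context)
            is_bold = "b" in self.context or "strong" in self.context
            self.fragments.append((text, is_heading, is_bold))

html = "<html><body><h1>Products</h1><p>Our <b>new</b> widget is out.</p></body></html>"
extractor = TextExtractor()
extractor.feed(html)
for text, heading, bold in extractor.fragments:
    print(text, "| heading" if heading else "", "| bold" if bold else "")
```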
For multimedia files, images and similar documents, the content is usually judged from the anchor text of the links pointing to them (that is, the link text) and from any related file annotations. For example, if a link whose text reads "Maggie Cheung photo" points to a picture in bmp format, the spider knows the picture is a photo of Maggie Cheung, so that a search for "Maggie Cheung" or "photo" can find this image. In addition, many multimedia files carry file attributes, which can also be examined to better understand the file's content.

Dynamic pages have always been a problem for web spiders. Dynamic pages, as opposed to static ones, are generated automatically by a program; the benefits are that the style of the pages can be changed quickly and uniformly and that the pages take up less space on the web server, but they also make crawling harder. As development languages keep multiplying, there are more and more kinds of dynamic pages, such as asp, jsp and php. These types are still relatively easy for a spider to handle. Much harder are pages generated by scripting languages such as VBScript and JavaScript; to handle such pages well, the spider needs its own script interpreter. For sites whose data sits in a database, the information can only be obtained by querying the site's database, which makes crawling very difficult; if the designers of such a site want the data to be searchable, they need to provide a way to traverse the entire contents of the database.

Content extraction has always been an important part of spider technology. The whole system is generally built from plug-ins, managed by a plug-in management service: when a page in a particular format is encountered, the corresponding plug-in handles it. The advantage of this approach is good extensibility: whenever a new file type appears later on, its handler can simply be added as a new plug-in to the plug-in management service.

Update cycle
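A minimal sketch of such a plug-in architecture (the plug-in classes and the dispatch-by-extension rule are illustrative assumptions, not a description of any real crawler):

```python
class ExtractorPlugin:
    """Interface that every format-specific text extractor implements."""
    extensions: tuple = ()
    def extract_text(self, raw: bytes) -> str:
        raise NotImplementedError

class HtmlExtractor(ExtractorPlugin):
    extensions = (".html", ".htm")
    def extract_text(self, raw: bytes) -> str:
        # A real plug-in would strip tags and keep layout information.
        return raw.decode("utf-8", errors="ignore")

class PdfExtractor(ExtractorPlugin):
    extensions = (".pdf",)
    def extract_text(self, raw: bytes) -> str:
        # A real plug-in would call a vendor-provided PDF extraction library.
        return "<text extracted from pdf>"

class PluginManager:
    """Picks the right plug-in for a crawled file; new formats just register."""
    def __init__(self):
        self._plugins = {}
    def register(self, plugin: ExtractorPlugin):
        for ext in plugin.extensions:
            self._plugins[ext] = plugin
    def extract(self, url: str, raw: bytes) -> str:
        for ext, plugin in self._plugins.items():
            if url.lower().endswith(ext):
                return plugin.extract_text(raw)
        return ""  # unknown format: nothing extracted

manager = PluginManager()
manager.register(HtmlExtractor())
manager.register(PdfExtractor())
print(manager.extract("http://example.com/index.html", b"<html><body>hi</body></html>"))
```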
Because the content of websites changes frequently, a spider also has to keep updating the pages it has crawled. It rescans each site on a certain cycle to see which pages have changed and need to be re-crawled, which pages are newly added, and which pages have expired and become dead links.
The update cycle has a large effect on the recall of a search engine. If the cycle is too long, some newly created pages will never be searchable; if it is too short, the implementation becomes technically difficult and bandwidth and server resources are wasted. A search engine's spiders do not use the same update cycle for every site: important, frequently updated large sites get a short cycle (some news sites are updated every few hours), while less important sites get a long cycle and may be updated only once every month or two.

In general, when updating a site's content, the spider does not re-crawl every page. For most pages it only needs to check the page's attributes (mainly the date) and compare them with the attributes recorded at the last crawl; if they are the same, the page does not need to be updated.
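As a rough sketch of that attribute check (the URL, spider name and stored date are placeholders), the example below sends an HTTP request with an If-Modified-Since header built from the date recorded at the last crawl; a 304 response means the page has not changed and need not be fetched again:

```python
import urllib.request
import urllib.error

def needs_update(url, last_crawl_date):
    """Return True if the server says the page changed since the last crawl."""
    request = urllib.request.Request(
        url,
        headers={"If-Modified-Since": last_crawl_date,
                 "User-Agent": "ExampleSpider/1.0"},  # hypothetical spider name
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status == 200            # full response: page changed
    except urllib.error.HTTPError as err:
        if err.code == 304:                           # Not Modified: skip re-crawl
            return False
        raise

# Placeholder values for illustration.
print(needs_update("https://www.example.com/index.html",
                   "Sat, 15 May 2004 00:00:00 GMT"))
```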