Search engines crawl the web to discover and collect content, and understanding this process helps clarify the direction of SEO work. Drawing on published resources and personal understanding, this article explores the search engine crawl process and what it means for SEO.
First, a brief look at the workflow of a search engine spider, as shown:
The figure depicts the crawl process of a search engine in simplified form. Although a distributed collection system involves communication and coordination between crawlers, the process for a single crawler is roughly as shown. Each step is analyzed below:
The Total Link Library
The total link library stores the URLs the crawler has already fetched along with their fetch times, as well as newly discovered URLs. The scheduling system extracts new URLs, or URLs that need to be revisited, and hands them to crawlers to fetch. Every URL stored in the library is unique, which ensures the crawler never fetches the same page repeatedly and avoids falling into crawl loops.
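The dedup behavior described above can be sketched in a few lines. This is a minimal illustration with hypothetical names, not a real crawler component: a set records every URL ever seen, so the queue never receives a duplicate.

```python
# Minimal sketch (hypothetical names): a link library that stores every
# known URL exactly once, so the scheduler never hands a crawler a
# duplicate and the crawl cannot loop forever over the same pages.
from collections import deque

class LinkLibrary:
    def __init__(self):
        self.seen = set()      # every URL ever discovered (crawled or pending)
        self.queue = deque()   # URLs waiting to be crawled

    def add(self, url):
        """Enqueue a URL only if it has never been seen before."""
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

lib = LinkLibrary()
for u in ["http://a.com/", "http://a.com/b", "http://a.com/"]:  # one duplicate
    lib.add(u)
print(len(lib.seen))   # 2 unique URLs stored
print(lib.next_url())  # http://a.com/
```

A production system would also normalize URLs before the membership test, so that trivially different forms of the same address do not slip past the set.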
Relative to the entire Internet, SE resources are limited, and a full crawl is impossible: the SE needs to crawl the most important portion of the web at minimum cost, which requires a crawl-prioritization strategy. The scheduling system organizes the URLs to be crawled into queue structures, and the fetching strategy determines how those queues are ordered.
There are many crawl strategies, but they share one goal: crawl important pages first. Common examples: breadth-first traversal, depth-first traversal, PR priority, backlink priority, OPIC, and big-site priority.
Breadth-First Traversal Strategy
Breadth-first traversal starts from a seed page, places all links found on that page into the crawl queue, and proceeds level by level until finished. It does not rate page importance; pages are crawled in discovery order. The traversal path in the figure is: A → B → C → D → E → H → F → G.
Depth-First Traversal Strategy
Depth-first traversal starts from a seed page, picks one of its links and follows that chain down, crawling until the path ends, then backtracks and follows the next link, and so on. The traversal path in the figure is: A → B → C → F → G → D → E → H.
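The two orderings can be sketched on a small link graph. The graph below is purely illustrative (it is not the exact graph from the article's figure), so the printed orders apply only to this hypothetical example:

```python
# Sketch of the two traversal orders on a hypothetical link graph.
from collections import deque

graph = {
    "A": ["B", "C", "D"],
    "B": ["E", "H"],
    "C": ["F", "G"],
    "D": [], "E": [], "F": [], "G": [], "H": [],
}

def bfs(start):
    """Breadth-first: crawl level by level, in discovery order."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in graph[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def dfs(start):
    """Depth-first: follow one link chain to its end before backtracking."""
    order, seen = [], set()
    def visit(page):
        if page in seen:
            return
        seen.add(page)
        order.append(page)
        for link in graph[page]:
            visit(link)
    visit(start)
    return order

print(bfs("A"))  # ['A', 'B', 'C', 'D', 'E', 'H', 'F', 'G']
print(dfs("A"))  # ['A', 'B', 'E', 'H', 'C', 'F', 'G', 'D']
```

The only structural difference is the frontier data structure: a FIFO queue gives breadth-first order, a stack (here, recursion) gives depth-first order.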
PR Priority Strategy
This refers to the partial (non-complete) PR strategy. Because PR is an algorithm computed over all pages, while the crawler during collection only knows the PR of the pages fetched so far, it is called the partial PR priority strategy. Under this strategy, the crawl order of the URL queue is determined by these partial PR values. Of course, PR is not recomputed after every single page; instead, after some quantity X of pages has been crawled, a new partial PR is recomputed over all downloaded pages, and the download order of queued URLs is set accordingly. Before the next X pages are reached, a newly extracted URL may be more important than earlier ones; assigning it PR 0 and placing it at the end of the queue would be inappropriate. In that case, a provisional PR is computed for the page from all of its known backlinks, and it is inserted into the download queue at the corresponding position.
Backlink Priority Strategy
This strategy determines the crawl order of the URL queue by the number of other pages linking to each URL (its backlink count).
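A minimal sketch of this ordering, using hypothetical data: backlinks are counted from the outlinks of pages downloaded so far, and the pending queue is sorted richest-in-backlinks first.

```python
# Sketch: order the pending queue by backlink count — how many already
# downloaded pages link to each pending URL (hypothetical data).
from collections import Counter

# outlinks observed on pages downloaded so far
downloaded = {
    "p1": ["x", "y"],
    "p2": ["y", "z"],
    "p3": ["y"],
}

backlinks = Counter(link for links in downloaded.values() for link in links)
pending = ["x", "y", "z"]
# most-linked pages are crawled first
crawl_order = sorted(pending, key=lambda u: backlinks[u], reverse=True)
print(crawl_order)  # ['y', 'x', 'z'] — y has 3 backlinks
```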
OPIC Strategy
OPIC stands for Online Page Importance Computation. This strategy is similar to PR priority; in essence it assigns each page an "importance score." Before the algorithm starts, every page is given the same initial "cash" (Cash). When a page is downloaded, it distributes its cash equally among all the pages it links to, and its own cash is emptied. URLs in the crawl queue are then sorted by their accumulated cash, which stands for importance.
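The cash mechanism can be sketched directly; the three-page graph below is hypothetical and only illustrates the distribute-and-reset step:

```python
# Sketch of OPIC's cash mechanism on a hypothetical graph: every page
# starts with equal cash; downloading a page splits its cash equally
# among its outlinks and resets it to zero. Pending URLs are then
# crawled richest-first.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
cash = {page: 1.0 for page in graph}            # equal initial cash

def download(page):
    links = graph[page]
    if links:
        share = cash[page] / len(links)          # split cash equally
        for link in links:
            cash[link] += share
    cash[page] = 0.0                             # page's own cash is emptied

download("A")   # A's 1.0 splits: B and C each gain 0.5
download("B")   # B's 1.5 all flows to its only outlink, C

pending = ["C", "A"]
order = sorted(pending, key=lambda p: cash[p], reverse=True)
print(order)    # ['C', 'A'] — C has accumulated 3.0, A has 0.0
```

Unlike PR, the scores are maintained online as pages are downloaded, with no periodic batch recomputation.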
Big-Site Priority Strategy
The big-site priority strategy groups the URLs in the crawl queue by the site they belong to; sites with more pages waiting to be downloaded are crawled first.
Other strategies determine crawl order based on signals such as the URL's directory depth, its suffix, and features of the URL string.
In practice, these strategies are usually used in combination. They carry real significance for SEO and for getting pages indexed, for example: controlling inlink counts, controlling outlink counts, managing site structure and external-link weight (quantity, quality, nofollow, etc.), and keeping new content updated frequently.
A crawler is a program or script that downloads web content from specified URLs. Major search engines use distributed crawler architectures: crawling is distributed across data centers, across crawl servers within each data center, and across multiple crawler processes on each server.
Common distributed architectures are the master-slave distributed crawler and the peer-to-peer distributed crawler.
A master-slave distributed crawler has a dedicated URL-dispatch server, which assigns URLs from across the Internet to several crawl servers for download. This architecture places heavy demands on the dispatch server's performance; facing Internet-scale data, it easily becomes a system bottleneck.
A peer-to-peer distributed crawler has no URL-dispatch server; each server is responsible for crawling the URLs under a particular set of domains. The diversity of Internet domains is partitioned across servers by hash modulo or consistent hashing:
Hash modulo: with n crawl servers, the domain name is first hashed, the value is taken modulo n, and the remainder is the number of the server assigned that domain. For example, suppose there are five crawl servers numbered 0, 1, 2, 3, 4, and a domain name hashes to 16; 16 mod 5 leaves a remainder of 1, so that domain's URLs are handed to server 1 to crawl. This model is flawed, however: when a crawl server goes down, or average server load rises due to URL growth and crawl servers must be added, the modulus n changes, so the assignments for the entire system must be redistributed, wasting resources.
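A sketch of this scheme, and of its flaw. A stable hash (MD5 here) is used so the mapping is deterministic across processes; the domain names are made up for illustration:

```python
# Sketch of hash-modulo assignment across 5 crawl servers numbered 0-4.
# md5 gives a stable hash, unlike Python's randomized built-in hash().
import hashlib

def server_for(domain, n_servers=5):
    digest = hashlib.md5(domain.encode()).hexdigest()
    return int(digest, 16) % n_servers

print(server_for("example.com"))  # some server in 0..4

# The flaw: change n_servers and almost every domain is remapped.
domains = [f"site{i}.com" for i in range(1000)]
moved = sum(server_for(d, 5) != server_for(d, 6) for d in domains)
print(f"{moved} of 1000 domains reassigned after adding one server")
```

Going from 5 to 6 servers remaps roughly 5/6 of all domains, which is exactly the redistribution cost the text describes.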
Consistent hashing: the domain is hashed and mapped to a number between 0 and 2^32; the ends of the hash range are joined, i.e. the values 0 and 2^32 are treated as coincident, forming a conceptual ordered ring in which each server is responsible for a segment of values, as shown below. Suppose a site's domain hashes to a point whose crawling falls to server 2, but server 2 is down: the search continues clockwise, and the URL is crawled by the first server encountered, namely server 3, until server 2 returns to normal.
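A minimal ring sketch, with hypothetical server names; the point of the example is that removing one server only shifts its own arc clockwise to its neighbor, leaving the rest of the mapping intact:

```python
# Sketch of a consistent-hash ring over 0..2**32: each domain is served
# by the first server found clockwise from its hash point, so a downed
# server only hands its own arc to the next server on the ring.
import bisect
import hashlib

RING = 2 ** 32

def point(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING

class HashRing:
    def __init__(self, servers):
        self.nodes = sorted((point(s), s) for s in servers)

    def server_for(self, domain):
        keys = [p for p, _ in self.nodes]
        # first server clockwise from the domain's point, wrapping at 2**32
        i = bisect.bisect_right(keys, point(domain)) % len(keys)
        return self.nodes[i][1]

    def remove(self, server):
        self.nodes = [(p, s) for p, s in self.nodes if s != server]

ring = HashRing(["server0", "server1", "server2", "server3"])
owner = ring.server_for("example.com")
ring.remove(owner)                         # that server goes down...
fallback = ring.server_for("example.com")  # ...the next one clockwise takes over
print(owner, "->", fallback)
```

Real systems usually place many virtual points per server on the ring so that a failed server's load spreads across several neighbors instead of one; that refinement is omitted here for brevity.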
Because of the distributed crawler architecture, the same site will record visits from spiders on many different IPs; this is normal for a cooperative crawling system. Some SEOers believe that different IP ranges mean spiders of different weight, and that spiders from certain ranges are "demotion spiders." The crawl system may indeed assign weights to URLs to decide crawl order, but those weights are used only within the crawl system; the weighting rules used for ranking are far more complex. So the claim that certain spiders signal a demotion does not hold.
A spider's download process is similar to a browser's, except that the spider only downloads the HTML file: it does not render the page, does not load images or Flash, and under normal circumstances does not execute JS.
When crawling a site, the spider takes the site's network load into account and throttles crawling according to the site's bandwidth. Under normal circumstances, load control is per IP, so increasing a site's bandwidth is beneficial for SEO. Of course, for a site on a shared IP, this is difficult to control.
The download process can be divided into four steps: DNS resolution, TCP connection, server computation, and HTML download.
DNS resolution speed depends mainly on DNS server performance, and also somewhat on the resolution setup. For SEO, choose a proven, professional DNS service provider. This step is also where careless operations are most likely to block spider IPs, because heavy spider crawling closely resembles a DDoS attack. The ops team at Benniao's company once blocked the Google spider, causing the site to disappear from Google's index; a famous domestic IDC provider once unintentionally blocked the Baidu spider, causing a large number of sites hosted on its servers to vanish from Baidu's index.
TCP connection speed depends on how quickly the web server can accept the request. Generally, when the server handles a large number of simultaneous requests, congestion or refused connections occur, and the greater the traffic, the slower TCP connections become. It also relates to the chosen web server software (IIS, Apache, Nginx, etc.). For large sites, the problem can be solved by upgrading server configuration. Small sites on shared hosting should avoid sharing a server with forums, download sites, and other bandwidth-heavy sites.
Server computation speed depends on the site's program architecture, database efficiency, the efficiency of the programming language (for dynamic content), and concurrent processing capacity. The most common problem in this step is an inefficient database, which slows the site down and can even produce errors when pages are accessed. The servers at Benniao's company often throw errors from too many simultaneous database accesses, to everyone's frustration. Optimizing this step is mainly an operations and development matter: tune the stack properly, or have a somewhat better engineer rework the program.
HTML download speed depends mainly on file size and network bandwidth (more visible for large sites). For SEO, the HTML code itself can be optimized. Many sites suffer from bloated HTML in which the actual text is only a tiny fraction of the code. Some pages at Benniao's company had this problem too: one page feature, after being removed, left residual code behind, and a "More" button hid a full list of links for that feature (in the HTML, that hidden content even exceeded the page's visible body content!). For HTML optimization, regularly viewing the page source is a good habit.
The Total Page Library
Crawled pages are stored so that the SE can quickly generate snippets in search results, saving SE CPU and network resources, and so that information can be extracted to support subsequent indexing, ranking, and other processing (e.g. anchor-text extraction). This requires storage that handles large-scale data smoothly, supports both random and sequential access, and allows large-scale updates, all of which demands substantial engineering, but this part is basically irrelevant to SEO.
URL Extraction and Update
This step extracts URLs from fetched pages, then deduplicates and normalizes them, checks them against the URL database, adds genuinely new URLs, removes URLs of expired pages, and so on. For SEO: apply nofollow appropriately to pages with no crawl or ranking value, giving new pages more opportunities to be crawled and to receive weight; this is important.
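The extraction-and-normalization step can be sketched with the standard library. The page URL, HTML, and "known URLs" set below are hypothetical; note that links are pulled from the raw markup only, with no JS execution, matching how the spider downloads pages:

```python
# Sketch of the extraction step: pull hrefs out of fetched HTML,
# resolve them against the page URL, strip fragments, deduplicate,
# and keep only URLs not already in the URL database.
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    absolute = urljoin(self.base_url, value)    # resolve relative links
                    self.links.append(urldefrag(absolute).url)  # drop #fragment

html = '<a href="/b">B</a> <a href="c.html#top">C</a> <a href="/b">dup</a>'
parser = LinkExtractor("http://example.com/a/")
parser.feed(html)

known = {"http://example.com/a/c.html"}           # already in the URL database
new_urls = [u for u in dict.fromkeys(parser.links) if u not in known]
print(new_urls)  # ['http://example.com/b'] — the only genuinely new URL
```

A real system would apply much heavier normalization (lowercasing hosts, sorting query parameters, resolving redirects) before the database check, but the shape of the step is the same.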