By Bill Slawski
There are three major parts to what a search engine does.
The first is crawling the Web to revisit old pages and find new ones. The second is extracting data from those crawled pages to build an index. The third is presenting pages to searchers in response to their queries.
There’s been some interesting research published recently on the first of those parts.
Crawling the Web to discover new pages, and to identify changes in pages that a search engine already knows about, can be a challenge. The major issues that search engines face in crawling sites involve:
- How many pages they can crawl without becoming bogged down,
- How quickly they can crawl pages without overwhelming the sites that they visit, and
- How many resources they have to use to crawl and then revisit pages.
A search engine needs to be careful about how it spends its time crawling web pages, and about which pages it chooses to crawl, to keep these issues under control.
A recently published academic paper describes this important aspect of how a search engine works, the Web crawl, in more detail than most papers that have been published on the subject before.
Enter IRLBot at Texas A&M
The Department of Computer Science at Texas A&M University has been running a long-term research project known as IRLbot, which “investigates algorithms for mapping the topology of the Internet and discovering the various parts of the web.”
In April, researchers from the school will be presenting some of their recent research in Beijing, China, at the 17th International World Wide Web Conference (WWW2008).
The title of their presentation is IRLbot: Scaling to 6 Billion Pages and Beyond (pdf), and the focus of the paper is that primary function a search engine performs: crawling the Web and finding new Web pages.
Their research describes some interesting approaches to finding new pages on the Web, handling web sites with millions of pages, while also avoiding spam pages and infinite loops that could pose problems to web crawlers.
In a recent experiment that lasted 41 days, their crawler IRLbot ran on a single server and “successfully crawled 6.3 billion web pages,” at an average rate of approximately 1,789 pages per second. This is a pretty impressive feat, and it’s even more impressive because of some of the obstacles faced while finding those pages.
Problems Facing Crawling Programs
One challenge facing Web crawling programs is that they shouldn’t request too many pages from the same site at a time, or they could use up too many of the site’s resources and make it inoperable. Keeping that from happening is known as politeness, and search crawlers that aren’t polite often find themselves blocked by site owners, or complained about to the internet service provider hosting the crawler.
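The politeness idea described above can be sketched in a few lines. This is a hypothetical illustration, not IRLbot's actual code: it enforces a fixed minimum delay between requests to the same host (the `CRAWL_DELAY` value is an assumption for the example).

```python
import time
from urllib.parse import urlparse

CRAWL_DELAY = 2.0  # assumed minimum seconds between requests to one host

last_hit = {}  # host -> time the next request to that host may start

def polite_wait_time(url):
    """Return how many seconds to wait before politely fetching this URL."""
    host = urlparse(url).netloc
    now = time.monotonic()
    last = last_hit.get(host)
    # First visit to a host needs no delay; otherwise wait out the gap.
    wait = 0.0 if last is None else max(0.0, last + CRAWL_DELAY - now)
    last_hit[host] = now + wait  # reserve the slot this request will use
    return wait
```

A real crawler would sleep for the returned time (or reschedule the URL) before issuing the request, so that back-to-back URLs from one site are spread out while URLs from different hosts proceed immediately.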
As a crawling program works through a site, it needs to pay attention to a file on the site known as a robots.txt file, which provides directions on pages and directories that the crawling program shouldn’t visit, so that it doesn’t crawl pages that it isn’t supposed to see. The program also needs to track which pages it has already seen, so that it doesn’t try to crawl the same pages over and over again.
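Both bookkeeping checks just mentioned can be sketched with Python's standard robots.txt parser. This is a minimal illustration of the two rules, not the paper's implementation; the robots.txt lines and the user-agent name are made up for the example.

```python
from urllib import robotparser

# A tiny example robots.txt: everything under /private/ is off limits.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

seen = set()  # URLs already crawled, so none is fetched twice

def should_crawl(url, agent="example-crawler"):
    if url in seen:
        return False              # already visited this URL
    if not rp.can_fetch(agent, url):
        return False              # robots.txt forbids it
    seen.add(url)                 # record it before fetching
    return True
```

In practice a crawler fetches each site's real robots.txt (and re-fetches it periodically), and at billions of pages the `seen` set becomes a disk-backed structure, but the two checks stay the same.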
Avoiding Spam While Crawling
The crawling process described in the paper also tried to keep the crawling program from spending time on pages that are more likely to be spam. If a crawling program spends a lot of its time on spam pages and link farms, it has less time to spend on sites that may be more valuable to people searching for good results to their queries at search engines.
One key to the method used by this research team in determining how much attention a site should get from their web crawler was looking at the number of legitimate links into the site from other sites, which is what they refer to as domain reputation.
Why This Paper is Important
The authors of the paper tell us that there are:
…only a limited number of papers describing detailed web-crawler algorithms and offering their experimental performance.
The paper provides a lot of detail on the crawling process and the steps that the Texas A&M researchers took that enabled them to index multi-million page web sites, avoid spam pages, and remain “polite” while doing so. It explores the experiments that they conducted to test out ideas on how to handle very large sites, and crawls of very large numbers of pages.
They conducted their experiment using only a single computer. The major commercial search engines have considerably more resources to spend on crawling the web, but the issues involving managing which pages they choose to index, being polite to sites that they visit, and avoiding spam pages are problems that commercial search engines face too.
Learning about how search engines may crawl pages can help us understand how a search engine might treat individual sites during that process. If you are interested in learning about the web crawling process in depth, this paper is a good one to spend some time reading.