Advances in Crawling the Web

By Bill Slawski

There are three major parts to what a search engine does.

The first is crawling the Web to revisit old pages and find new pages. The second is taking those crawled pages and extracting data from them to index. The third is presenting pages to searchers in response to search queries.

There’s been some interesting research published recently on the first of those parts.

Crawling Challenges

Crawling the Web to discover new pages, and to identify changes in pages that a search engine already knows about, can be a challenge. The major issues that search engines face in crawling sites involve:

  • How many pages they can crawl without becoming bogged down,
  • How quickly they can crawl pages without overwhelming the sites that they visit, and
  • How many resources they have to use to crawl and then revisit pages.

A search engine needs to be careful about how it spends its time crawling web pages, and about which pages it chooses to crawl, to keep these issues under control.

A recently published academic paper describes this important aspect of how a search engine works, the Web crawl, in more detail than most papers that have been published on the subject before.

Enter IRLBot at Texas A&M

The Department of Computer Science at Texas A&M University has been running a long term research project known as IRLBot which “investigates algorithms for mapping the topology of the Internet and discovering the various parts of the web.”

In April, researchers from the school will be presenting some of their recent research in Beijing, China, at the 17th International World Wide Web Conference (WWW2008).

The title of their presentation is IRLbot: Scaling to 6 Billion Pages and Beyond (pdf), and the focus of the paper is this primary function that a search engine performs – crawling the Web and finding new Web pages.

Their research describes some interesting approaches to finding new pages on the Web, handling web sites with millions of pages, while also avoiding spam pages and infinite loops that could pose problems to web crawlers.

In a recent experiment that lasted 41 days, their crawler “IRLbot” ran on a single server and “successfully crawled 6.3 billion web pages” at an average download rate of approximately 1,789 pages per second. This is a pretty impressive feat, and it’s even more impressive because of some of the obstacles faced while finding those pages.

Problems Facing Crawling Programs

Politeness

One challenge facing Web crawling programs is that they shouldn’t ask for too many pages from the same site at a time, or they could use up too many of the site’s resources and make it inoperable. Keeping that from happening is known as politeness, and search crawlers that aren’t polite often find themselves blocked by site owners, or complained about to the internet service provider hosting the crawler.
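One simple way to picture politeness is a per-host delay: before fetching a page, the crawler checks when it last hit that host and waits if it was too recent. This is a minimal sketch of that idea; the 10-second delay and function names are illustrative choices, not values from the IRLbot paper.

```python
import time
from urllib.parse import urlparse

# Illustrative minimum delay between requests to the same host.
CRAWL_DELAY = 10.0  # seconds

# host -> timestamp of the most recent request to that host
last_hit = {}

def wait_politely(url):
    """Sleep if needed so the same host isn't asked for pages too often."""
    host = urlparse(url).netloc
    elapsed = time.monotonic() - last_hit.get(host, 0.0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_hit[host] = time.monotonic()
```

A crawler would call `wait_politely(url)` immediately before each fetch; requests to different hosts aren’t delayed by each other, so overall throughput can stay high while any single site sees only a trickle of traffic.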

URL Management

As a crawling program crawls a site, it needs to pay attention to a file on the site known as a robots.txt file, which tells it which pages and directories it shouldn’t visit, so that it doesn’t crawl pages that it isn’t supposed to see. The program also needs to track which pages it has already seen, so that it doesn’t try to crawl the same pages over and over again.
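Both of those checks can be sketched in a few lines using Python’s standard-library robots.txt parser. The robots.txt rules and the crawler’s user-agent name here are made up for illustration; they aren’t from the paper.

```python
from urllib import robotparser

# Parse an example robots.txt that blocks a /private/ directory.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# URLs already fetched, so nothing is crawled twice.
seen = set()

def should_fetch(url):
    """Fetch only URLs that robots.txt allows and that we haven't seen."""
    if not rp.can_fetch("example-crawler", url):
        return False
    if url in seen:
        return False
    seen.add(url)
    return True
```

At real scale the `seen` set is the hard part: with billions of URLs it can’t fit in memory as a plain set, and much of the IRLbot paper is about doing this duplicate check efficiently on disk.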

Avoiding Spam While Crawling

The crawling process described in the paper also tries to keep the crawling program from accessing pages that are more likely to be spam. If a crawling program spends a lot of its time on spam pages and link farms, it has less time to spend on sites that may be more valuable to people searching for good results at search engines.

One key to the method used by this research team in determining how much attention a site should get from their web crawler was the number of legitimate links into the site from other sites, which is what they refer to as domain reputation.

Why This Paper is Important

The authors of the paper tell us that there are:

…only a limited number of papers describing detailed web-crawler algorithms and offering their experimental performance.

The paper provides a lot of details on the crawling process and the steps that the Texas A&M researchers took that enabled them to index multi-million page web sites, avoid spam pages, and remain “polite” while doing so. It explores the experiments that they conducted to test out ideas on how to handle very large sites, and crawls of very large amounts of pages.

They conducted their experiment using only a single computer. The major commercial search engines have considerably more resources to spend on crawling the web, but the issues involving managing which pages they choose to index, being polite to sites that they visit, and avoiding spam pages are problems that commercial search engines face too.

Learning about how search engines may crawl pages can help us understand how a search engine might treat individual sites during that process. If you are interested in learning about the web crawling process in depth, this paper is a good one to spend some time reading.


4 thoughts on “Advances in Crawling the Web”

  1. I built myself a simple crawler and learnt most of this along the way… The problems you come up against are really interesting.
    I like the politeness one… for my first attempt I took down my own site by using multi-curl… ha ha, ahh the lessons we learn.

    Nice article, thanks….

  2. You’re welcome, James.

    Funny, taking down your own site. Guess it was a good thing that it wasn’t someone else’s. Thanks for sharing that story. I like the old usenet posts which talk about the early days of crawlers, and people trying to figure out why others were spending so much time grabbing pages from their sites.

    There are a lot of elements to what makes a search engine work that go on behind the scenes, so it’s great when someone provides as much depth and detail as the Texas A&M researchers did in their paper.

  3. Hi, Lee – Google places a higher priority on websites which have more inbound links and which have a longer reputation with them.

    It sounds like you may have just launched your site, in which case you have some groundwork to do to get on Google’s priority list.

    Imagine if you were launching a new business storefront – customers won’t come during the first few days if you don’t promote it.

    So, what have you been doing to promote your site? Have you issued any press releases? Invited any bloggers to drop by and comment? Have you provided any incentives like signups for free giveaways if someone visits your site and signs up? Hired a firm to promote you?

    Endsum: You will need some inbound links. Establishing a brand new site is one of the hardest of online marketing tasks.
