Google Expands Details on VisualRank – PageRank for Pictures

In April of this year (2008), at the 17th International World Wide Web Conference in Beijing, China, Google researchers presented findings from an experiment involving a new way of indexing images, one that relies to some degree on the actual content of the images rather than on the text and metadata associated with those pictures.

Our First Look at VisualRank

The paper, PageRank for Product Image Search (pdf), details the results of a series of experiments involving the retrieval of images for 2,000 of the most popular product queries that Google receives, such as the iPod and Xbox. The authors of the paper tell us that user satisfaction and the relevancy of results improved significantly in comparison to the results seen from Google’s image search.

News of this “PageRank for Pictures” or VisualRank spread quickly across many blogs including TechCrunch and Google Operating System, as well as media sources such as the New York Times and The Register from the UK.

The authors of that paper tell us that it makes three contributions to the indexing of pictures:

  1. We introduce a novel, simple, algorithm to rank images based on their visual similarities.
  2. We introduce a system to re-rank current Google image search results. In particular, we demonstrate that for a large collection of queries, reliable similarity scores among images can be derived from a comparison of their local descriptors.
  3. The scale of our experiment is the largest among the published works for content-based-image ranking of which we are aware. Basing our evaluation on the most commonly searched for object categories, we significantly improve image search results for queries that are of the most interest to a large set of people.

The process behind ranking images based upon visual similarities between them takes into account small features within the images, while adjusting for such things as differences in scale, rotation, perspective, and lighting. The paper shows an illustration of 1,000 pictures of the Mona Lisa, with the two largest images at the center of the illustration being the highest ranked results for a query for “mona lisa.”
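
To make that idea a bit more concrete, here is a minimal sketch, in Python, of the kind of random-walk ranking the paper describes: treat the images as nodes in a graph, treat visual similarity scores as edge weights, and let a PageRank-style iteration find the most “central” images. The similarity numbers in the toy matrix are made up, and this is only the general idea, not Google’s implementation.

```python
import numpy as np

def visual_rank(similarity, damping=0.85, iterations=50):
    """Rank images with a PageRank-style random walk over an
    image-similarity graph. similarity[i, j] is assumed to hold the
    visual similarity between images i and j (for example, a count of
    matching local descriptors); higher means more similar."""
    n = similarity.shape[0]
    # Normalize each column so it sums to 1, turning raw similarities
    # into transition probabilities for the random walk.
    col_sums = similarity.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    transition = similarity / col_sums
    # Start from uniform scores and iterate, mixing in a uniform
    # "teleport" term just as PageRank does with its damping factor.
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * transition.dot(scores)
    return scores

# Toy example: image 0 is visually similar to most of the others, so it
# floats to the top, much like the central Mona Lisa images in the paper.
sim = np.array([[0, 5, 4, 3],
                [5, 0, 1, 0],
                [4, 1, 0, 0],
                [3, 0, 0, 0]], dtype=float)
print(visual_rank(sim))
```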

A Second Look at VisualRank

In the conclusion to PageRank for Product Image Search, the authors noted some areas that they needed to explore further, such as how effectively their system might work in real-world circumstances on the Web, where mislabeled and spam images might appear, along with many duplicate and near-duplicate versions of images.

A new paper from the authors takes a deeper look at the algorithms behind VisualRank, and provides some answers to the problems of spam and duplicate images – VisualRank: Applying PageRank to Large-Scale Image Search (pdf).

The new VisualRank paper also expands upon the experimentation described in the first paper, which focused upon queries for images of products, to include queries for 80 common landmarks such as the Eiffel Tower, Big Ben, and the Lincoln Memorial.

This VisualRank approach still appears to rely initially upon older methods of ranking images, which look at things such as text and metadata (like alt text) associated with those images, to come up with a limited number of images to compare with each other. Once it receives those pictures in response to a query, a reranking of those images takes place based upon shared features and similarities between the images.

Conclusion

If you have a website where you include images to help visitors experience what your pages are about in a visual manner, hopefully you’re now asking yourself how well those pictures represent what your pages are about.

Being found for images on the web is another way that people can find your pages. And, the possibility that a search engine might include a picture from your page in search results next to your page title and description and URL is a very real one – Google has been doing it for News searches for a while.

How A Search Engine May Use Web Traffic Logs in Ranking Web Pages

By Bill Slawski

A newly granted patent from Yahoo describes how information collected from the usage log files of toolbars, ISPs, and web servers can be used to rank web pages, discover new pages, move a page into a higher tier of a multi-tier search engine, increase the weight of links and the relevance of anchor text for pages based upon those weights, and determine when a page was last changed or updated.

Yahoo search toolbar

When you perform a search at a search engine, and enter a query term to search with, there are a number of steps that a search engine will take before displaying a set of results to you.

One of them is to sort the results to be shown to you in an order based upon a combination of relevance and importance, or popularity.

Over the past few years, that “popularity” may have been determined by a search engine in a few different ways. One might be based upon whether or not a page is frequently selected from search results in response to a particular query.

Another might be based upon a count by a search engine crawling program of the number of links that point to a page, so that the more incoming links to a page, the more popular the page might be considered. Incoming links might even be treated differently, so that a link from a more popular page may count more than a link from a less popular page.

Problems with Click and Link Popularity

Those measures of the popularity of a page, based upon clicks in search results and links pointing to that page, are somewhat limited. It’s possible for a page to be very popular and still be assigned a low popularity weight by a search engine.

Example

A web page is created, and doesn’t have many links pointing to it from other sites. People find the site interesting, and send emails to people they know about the site. The site gets a lot of visitors, but few links. It becomes popular, but the search engines don’t know that, based upon a low number of links to the site, and little or no clicks in search results to the page. A search engine may continue to consider the page to be one of little popularity.

Using Network Traffic Logs to Enhance Popularity Weights

Instead of just looking at those links and clicks, what if a search engine started paying attention to actual traffic to pages, measured by looking at traffic information from web browser plugins, web server logs, traffic server logs, and log files from other sources such as Internet Service Providers (ISPs)?

A good question, and it’s possible that at least one search engine has been using such information for a few years.

Yahoo was granted a patent today, originally filed in 2002, that describes how search traffic information could be used to create popularity weights for pages, to rerank search results based upon actual traffic to those pages, and in a number of other ways.

Here are some of them:

  • The rank of a URL in search results might be influenced by the number of times the URL shows up in network traffic logs as a measure of popularity;
  • New URLs can be discovered by a search engine when they appear in network traffic logs;
  • More popular URLs can be placed into higher level tiers of a search index, based upon the number of times the URL appears in the network traffic logs;
  • Weights can be assigned to links, where the link weights are used to determine popularity and the indexing of pages, based upon the number of times a URL is present in network traffic logs; and,
  • Whether a page has been modified since the last time a search engine index was updated can be determined by looking at the traffic logs for a last modified date or an HTTP expiration date.

The patent granted to Yahoo is:

Using network traffic logs for search enhancement
Invented by Arkady Borkovsky, Douglas M. Cook, Jean-Marc Langlois, Tomi Poutanen, and Hongyuan Zha
Assigned to Yahoo
US Patent 7,398,271
Granted July 8, 2008
Filed April 16, 2002

Abstract

A method and apparatus for using network traffic logs for search enhancement is disclosed. According to one embodiment, network usage is tracked by generating log files. These log files among other things indicate the frequency web pages are referenced and modified. These log files or information from these log files can then be used to improve document ranking, improve web crawling, determine tiers in a multi-tiered index, determine where to insert a document in a multi-tiered index, determine link weights, and update a search engine index.

Network Usage Logs Improve Ranking Accuracy

The information contained in network usage logs can indicate how a network is actually being used, with popular web pages shown as being viewed more frequently than other web pages.

This popularity count could be used by itself to rank a page, or it could be combined with an older measure that uses such things as links pointing to the page, and clicks in search results.

Instead of looking at all traffic information for a page, visits over a fixed period of time may be counted, or new page views may be considered to be worth more than old page views.
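
As a rough illustration of how such a popularity weight might be put together, here is a small Python sketch that scores a page from the visit timestamps in a usage log, counting recent visits more heavily than old ones, and then blends that score with the older link and click measures. The half-life, the weights, and the logarithms are my own assumptions for the example, not details from the patent.

```python
import math
import time

def traffic_popularity(visit_timestamps, now=None, half_life_days=30.0):
    """Score a page's traffic popularity from a list of visit timestamps
    (seconds since the epoch), counting recent visits more heavily than
    old ones via exponential decay."""
    now = now if now is not None else time.time()
    decay = math.log(2) / (half_life_days * 86400)
    return sum(math.exp(-decay * (now - ts)) for ts in visit_timestamps)

def combined_popularity(visit_timestamps, inbound_links, result_clicks,
                        w_traffic=0.5, w_links=0.3, w_clicks=0.2):
    """Blend log-derived traffic with the older link and click measures.
    The weights are arbitrary, chosen purely for illustration."""
    return (w_traffic * traffic_popularity(visit_timestamps)
            + w_links * math.log1p(inbound_links)
            + w_clicks * math.log1p(result_clicks))

# A page visited often in the last few days scores well even if it has
# few inbound links and few clicks from search results.
now = time.time()
recent_visits = [now - hours * 3600 for hours in range(0, 200, 2)]
print(combined_popularity(recent_visits, inbound_links=3, result_clicks=1))
```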

Better Web Crawling

Usually a search engine crawling program discovers new pages to index by finding links to pages on the pages that they crawl. The crawling program may not easily find sites that don’t have many links pointing to them.

But, pages that show up in log files from ISPs or toolbars could be added to the queue of pages to be crawled by a search engine spider.

Pages that don’t have many links to them, but show up frequently in log information may even be promoted for faster processing by a search crawler.
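
A hedged sketch of that idea might look like the following: the crawl queue mixes URLs the crawler already knows about with new URLs found in the logs, and a page that appears often in the logs but has few known links gets bumped up for faster crawling. The scores and thresholds are invented purely for illustration.

```python
import heapq

def build_crawl_queue(known_urls, log_url_counts, link_counts):
    """Return a crawl queue (highest priority first) that mixes URLs a
    crawler already knows about with new URLs discovered in traffic
    logs. Pages that show up often in the logs but have few links get a
    boost, roughly in the spirit of the patent; the scoring is made up."""
    queue = []
    for url in set(known_urls) | set(log_url_counts):
        traffic = log_url_counts.get(url, 0)
        links = link_counts.get(url, 0)
        priority = traffic + links
        if traffic > 10 and links < 3:      # popular but poorly linked
            priority *= 2                   # promote for faster crawling
        heapq.heappush(queue, (-priority, url))
    return [heapq.heappop(queue)[1] for _ in range(len(queue))]

print(build_crawl_queue(
    known_urls=["http://example.com/a"],
    log_url_counts={"http://example.com/new": 40, "http://example.com/a": 2},
    link_counts={"http://example.com/a": 15}))
```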

Multi-Tiered Search Indexes

It’s not unusual for a search engine to have more than one tier of indexes, with a relatively small first-tier index which includes the most popular documents. Lower tiers get relatively larger, and have relatively less popular documents included within them.

A search query would normally be run against the top level tier first, and if not enough results for a query are found in the first tier, the search engine might run the query against the next level of tiers of the index.

Network usage logs could be used to determine which tier of a multi-tier index should hold a particular page. For instance, a page in the second-tier index could be moved up to the first-tier index if its URL shows up with a high frequency in usage logs. More factors than frequency of a URL in a usage log could be used to determine which tier to assign a document.
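
A minimal sketch of that kind of tier decision, assuming tier 0 is the small index of the most popular documents and higher numbers are the larger, less popular tiers, might look like this. The thresholds are invented, and, as the patent notes, other factors could be mixed in as well.

```python
def assign_tier(url, log_frequency, current_tier,
                promotion_threshold=1000, demotion_threshold=10):
    """Decide which tier of a multi-tier index a page belongs in, using
    how often its URL appears in network usage logs. Thresholds are
    invented for illustration only."""
    if log_frequency >= promotion_threshold and current_tier > 0:
        return current_tier - 1   # move up toward the small, popular tier
    if log_frequency <= demotion_threshold and current_tier < 2:
        return current_tier + 1   # drift down into the larger, less popular tiers
    return current_tier

# A second-tier page seen 5,000 times in the logs gets promoted to tier 0.
print(assign_tier("http://example.com/popular", 5000, current_tier=1))
```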

Usage Logs for Link Weights

One use search engines have for link information is to determine the popularity of a document: the number of incoming links to a page may be used as a measure of how popular that page is.

A weight may also be assigned based upon the relationship between words used in a link and the documents being linked to with that link. If there is a strong logical tie between a page and a word, then the relationship between the word and the page is given a relatively higher weight than if there wasn’t. This is known as a “correlation weight.” The word “zebra” used in the anchor text of a link would have a high correlation weight if the article it points to is about zebras. If the article is about automobiles, it would have a much lower correlation weight.

Links could also be assigned weights (“link weights”) based on looking at usage logs to see which links were selected to request a page. As the patent’s authors tell us:

Thus, those links that are frequently selected may be given a higher link weight than those links that are less frequently selected even when the links are to the same document.

In other words, pages pointed to by frequently followed links could be assigned higher popularity values than pages with more incoming links that are rarely followed.

Link Weights Used to Determine the Relevance of Pages for Anchor Text

If a word pointing to a page is in a link (as anchor text), and the link is one that is frequently followed, then the relevance of that page for the word in the anchor text may be increased in the search engine’s index.

For example, assume that a link to a document has the word “zebra”, and another link to the same document has the word “engine”. If the “zebra” link is rarely followed, then the fact that “zebra” is in a link to the document should not significantly increase the correlation weight between the word and the document. On the other hand, if the “engine” link is frequently followed, the fact that the word “engine” is in a frequently followed link to the document may be used to significantly increase the correlation weight between the word “engine” and the document.
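
One way to picture that adjustment is a small function that scales the correlation weight between an anchor-text word and a document by how often the particular link carrying that word is actually followed. The scaling formula below is an assumption made for illustration, not the patent’s math.

```python
def correlation_weight(base_relevance, times_link_followed, total_follows):
    """Scale the correlation weight between an anchor-text word and the
    page it points to by how often that particular link is actually
    followed, in the spirit of the zebra/engine example. The formula is
    an illustrative assumption."""
    if total_follows == 0:
        return base_relevance
    follow_share = times_link_followed / total_follows
    return base_relevance * (0.2 + 0.8 * follow_share)

# Two links point at the same document: "engine" is followed 95 times,
# "zebra" only 5, so "engine" ends up contributing far more relevance.
print(correlation_weight(1.0, times_link_followed=95, total_follows=100))
print(correlation_weight(1.0, times_link_followed=5, total_follows=100))
```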

Conclusion

This patent was originally filed back in 2002, and some of the processes it covers are also discussed in more recent patent filings and papers from the search engines, such as popularity information being used to determine which tier a page might be placed in within a multi-tier search engine.

Some of the processes it describes have been assumed by many to be processes that a search engine uses, such as discovering new pages from information gathered by search engine toolbars.

A few of the processes described haven’t been discussed much, if at all, such as the weight of a link (and the relevance of anchor text in that link) being increased if it is a frequently used link, and decreased if it isn’t used often.

It’s possible that some of the processes described in this patent haven’t been used by a search engine, but it does appear that search engines are paying more and more attention to the user information that they collect from places like toolbars and log files from different sources. This patent is one of the earliest from a major search engine to describe, in a fair amount of detail, how such user data could be used.

Another patent granted to Yahoo this week covers how anchor text can be used to determine the relevancy of a page for specific words. I’ve written about that over on SEO by the Sea, in Yahoo Patents Anchor Text Relevance in Search Indexing.

Faces and Landmarks: Two Steps Towards Smarter Image Searches

By Bill Slawski

There’s an old saying that goes, “A picture’s worth a thousand words.” The right image on a web page can communicate ideas that words may only begin to capture.

An image in a news article may transport a viewer into the middle of the story. A couple of sharp images, from different angles, may inspire someone to buy something online that they might have only purchased offline previously, like shoes or clothes. A portrait of a writer or a business owner or a researcher may bring an increased level of credibility and trust to a web site.

Search Engines and Images

All of the major search engines allow us to search for images in image search web databases. The search engines have also started blending images into their regular Web search results, to add color and diversity to search results, as well as providing a possible way of illustrating different concepts that might be related to a query term with those pictures.

A picture next to a news result may provide context for the news story very quickly, like in the Google search result below:

Google Search for Hulk

While search engines index pages and pictures and videos and a host of other objects that they find on the web, their approach to helping us find images has relied upon text, and upon matching the keywords that we enter into a search box. A search engine normally indexes images based upon the words that appear on the same pages as the pictures, in alternative text associated with the images, in captions for the pictures, in the text of the page’s address (URL), or in the words within links to the photo or to the page where that picture appears.
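
As a simple illustration of that traditional, text-only approach, here is a sketch that scores an image for a query using nothing but the textual signals just listed. The field names and weights are assumptions invented for the example.

```python
def text_signals_for_image(img, query):
    """Score an image for a query using only textual signals: page text,
    alt text, captions, the page URL, and anchor text of links to the
    image. Field names and weights are made up for illustration."""
    q = query.lower()
    signals = {
        "page_text": 1.0,
        "alt_text": 3.0,
        "caption": 2.5,
        "url": 1.5,
        "anchor_text": 2.0,
    }
    return sum(weight for field, weight in signals.items()
               if q in img.get(field, "").lower())

image = {"alt_text": "red shoes for running",
         "caption": "Our new red shoes",
         "page_text": "These shoes ship free.",
         "url": "/products/red-shoes"}
print(text_signals_for_image(image, "red shoes"))
```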

That reliance upon the words associated with images to index and rank pictures may be changing. Google recently released a paper about PageRank for Product Image Search that looks at similarities within the images themselves to rank pictures in a search. Microsoft just published a patent application on ranking images that looks at nontextual signals about images, such as the number of links pointing to the pictures, how frequently a picture appears upon a site, and the sizes and quality of the pictures, to help rank those images.

An Image from Google Street Views

A Google patent application from January described ways that a search engine might read text in images, including the words and signs it sees while collecting pictures for its Street Views project for Google Maps. The picture to the right shows the locations of text in a Street Views image that Google could use in its index.

Search engines are getting smarter about how they view, index, and rank images, and site owners should probably consider getting smarter about the images that they use on their pages to illustrate what they have to offer.

Making Room for Images in Search

What if we could send a picture to a search engine, and have it return related pictures, or news stories, or web pages? In a New York Times article from a couple of years back, The Route From Research to Start-Up, the founder of Nevenengineering described one of the technologies that he was working on:

Ultimately, the technology “will allow you to point your camera phone at a movie poster or a restaurant and get an immediate review of the film or the fare on your cellphone, which will tap into databases,” said Mr. Neven, who foresees one billion camera phones in use worldwide by 2010.

Imagine snapping a photo, and having a search engine provide you with information about the subject of that picture.

Google acquired Mr. Neven’s startup a couple of years ago, and in the Official Google blog, they told us that one use of the technologies transferred in the acquisition would be A better way to organize photos?

Having software that could look at your photo collection, and index and organize your images based upon what it sees in the pictures themselves is pretty amazing.

But the image recognition technology from Nevenengineering could do more than sort photos. It could also be used to search for information related to images.

And before the company developed a consumer related product, it started out as a biometrics company, providing technology for law enforcement and the military. A presentation on one of their technologies, SIMBA: Single Image Multi-Biometric Analysis (pdf), provides an idea of some of what the company has been capable of when it comes to recognizing faces and associating them with people. And the technology is capable of performing facial recognition in videos as well as still images.

Faces First, Other Image Features Later?

Google doesn’t offer the ability to search based upon images that you upload to the search engine. At least, they don’t yet. But it appears that they may have a start on technology that could turn that possibility into a reality at some point.

Last year, a post on the Google Operating System blog pointed out a way to Restrict Google Image Results to Faces, News by adding a string of text at the end of the addresses, or URLs, for each of those types of searches.

A patent application recently published by Google describes how the search engine can take facial images that it has associated with specific people’s names, because those images contain metadata about the identity of the people pictured, and use those pictures to build a statistical model of their faces.

That statistical model could then be used to associate those people’s names with other images that don’t contain metadata such as alternative text in alt tags, captions, or text upon the same pages. The patent application is:

Identifying Images Using Face Recognition
Invented by Jay Yagnik
Assigned to Google
US Patent Application 20080130960
Published June 5, 2008
Filed December 1, 2006

Abstract

A method includes identifying a named entity, retrieving images associated with the named entity, and using a face detection algorithm to perform face detection on the retrieved images to detect faces in the retrieved images. At least one representative face image from the retrieved images is identified, and the representative face image is used to identify one or more additional images representing the at least one named entity.
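
A rough sketch of the flow the abstract describes might look like the code below: average the face feature vectors from images already labeled with a person’s name into one representative model, then compare faces from unlabeled images against it. The upstream face detection and feature extraction are assumed to happen elsewhere, and the cosine-similarity test and threshold are my own simplifications, not details from the filing.

```python
import numpy as np

def build_face_model(labeled_face_vectors):
    """Average face feature vectors from images already associated with
    a person's name (via captions, alt text, and so on) into one
    representative vector -- a crude stand-in for the 'statistical
    model' in the filing. Extracting the vectors themselves is assumed
    to be done by an upstream face-detection step."""
    return np.mean(np.asarray(labeled_face_vectors, dtype=float), axis=0)

def matches_person(model, face_vector, threshold=0.8):
    """Decide whether a face from a metadata-free image belongs to the
    same person, using cosine similarity against the representative
    vector. The threshold is an arbitrary illustration."""
    v = np.asarray(face_vector, dtype=float)
    cosine = v.dot(model) / (np.linalg.norm(v) * np.linalg.norm(model))
    return cosine >= threshold

# Toy demo with made-up 3-dimensional "face vectors".
model = build_face_model([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])
print(matches_person(model, [0.85, 0.15, 0.05]))   # likely the same person
print(matches_person(model, [0.0, 0.1, 0.9]))      # likely someone else
```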

It makes sense for Google to try to focus upon faces first, before tackling other aspects of indexing images based upon the content of those pictures. If Google can master the indexing of images that it finds upon the Web that don’t have text or metadata associated with them, that may bring the search engine a step closer to being able to provide search results for images uploaded to Google by a searcher.

Breaking the problem of indexing and searching images down to one aspect of images, such as facial recognition, could allow the search engine to address image searching in incremental steps. Choosing facial images as a first step in developing a smarter image search technology does have some issues associated with it, especially from a privacy standpoint. Allowing people to upload images of faces, and to search upon those images, may raise a number of privacy issues that a search engine may not want to address.

Meanwhile, Yahoo Looks at Landmarks

Another approach to indexing and ranking images is underway at Yahoo, in a Flickr-related project that takes images that have been tagged with geographic terms and locations, and tries to cluster together images that are similar, based upon the locations identified in those tags. The tags associated with the images include both user-created annotations and automatic annotations from “location-aware cameraphones and GPS integrated cameras.”

Using automatically generated location data, along with software that can cluster similar images together, again goes beyond just looking at the words associated with pictures to learn what those pictures are about.
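
To picture the clustering side of this, here is a small sketch that groups geotagged photos into grid cells and surfaces the most common tag in each cell as a rough landmark label. Grid-based grouping is just one simple way to cluster; it is not the method the Flickr project actually uses.

```python
from collections import defaultdict, Counter

def cluster_geotagged_photos(photos, cell_size=0.01):
    """Group photos whose latitude/longitude tags fall into the same
    grid cell (roughly a kilometer across at cell_size=0.01 degrees of
    latitude), then surface the most common text tag in each cluster as
    a rough landmark label."""
    cells = defaultdict(list)
    for photo in photos:
        key = (round(photo["lat"] / cell_size), round(photo["lon"] / cell_size))
        cells[key].append(photo)
    clusters = []
    for members in cells.values():
        tags = Counter(tag for p in members for tag in p.get("tags", []))
        label = tags.most_common(1)[0][0] if tags else "unknown"
        clusters.append({"label": label, "size": len(members)})
    return clusters

photos = [
    {"lat": 48.8584, "lon": 2.2945, "tags": ["eiffel tower", "paris"]},
    {"lat": 48.8585, "lon": 2.2946, "tags": ["eiffel tower"]},
    {"lat": 51.5007, "lon": -0.1246, "tags": ["big ben", "london"]},
]
print(cluster_geotagged_photos(photos))
```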

Flickr Cluster for San Francisco

The narrow focus of this project, on images associated with well-known locations, again allows a smarter image search technology to be developed in incremental steps. It’s also possible that this choice of subject matter won’t raise the privacy concerns that Google’s focus upon faces may.

Conclusion

Approaches from search engines to indexing and ranking images may soon be incorporating technologies that move them away from a strict reliance upon text that appears on the same pages as the pictures, if they aren’t already.

Images are being shown in Web search results in increasing numbers, so changes like this happening in an emerging area of search should be something to keep a careful eye upon.

Images on a web site can help illustrate the ideas and concepts on web pages in a way that words alone can’t. If a picture captures the essence of a concept or query, whether through the text associated with it on the page or, increasingly, even in the absence of such text, it may start appearing in blended search results at one of the major search engines.

Using facial recognition technology and clustering images around landmarks, based upon geographical tags and similarities between pictures, are just two steps towards the development of image search technology on the web that relies less upon words, and more upon what is captured in the images themselves.

The right picture on a web page may become not only a way to illustrate the ideas being presented on that page, but also a way for people to find that page based upon the content of the image rather than just the words that surround it.

Yahoo on Web Mining and Improving the Quality of Web Sites

By Bill Slawski

A successful web site is one that fulfills the objectives of its owners and meets the expectations of the visitors that it was created to serve.

This is true of ecommerce web sites, news and informational sites, personal web pages, and even search engines. And, it’s a topic that even the search engines are exploring more deeply. A recent patent application from Yahoo tells us that:

The Web has been characterized by its rapid growth, massive usage, and its ability to facilitate business transactions. This has created an increasing interest for improving and optimizing websites to fit better the needs of their visitors. It is more important than ever for a website to be found easily on the Web and for visitors to reach effortlessly the content for which they are searching. Failing to meet these goals can mean the difference between success and failure on the Internet.

User Query Data Mining and Related Techniques, (US Patent Application 20080065631), by Ricardo Alberto Baeza-Yates and Barbara Poblete.

The patent filing discusses how information about queries that people use, collected from search boxes on a site (if one is used) and from search engines bringing people to a site, can provide useful and helpful information about how people use that site.

The collection of this kind of information is often referred to as Web Mining, and looking closely at the words people use to find information on a site can tell us something about the actual information needs of those visitors.

Search engines have studied searchers’ queries mostly to try to make search engines work better, but looking at the words people use to find a site, and to search within it once they have found it, could help to make the web sites themselves better.

The abstract of Yahoo’s patent filing notes:

Methods and apparatus are described for mining user queries found within the access logs of a website and for relating this information to the website’s overall usage, structure, and content. Such techniques may be used to discover valuable information to improve the quality of the website, allowing the website to become more intuitive and adequate for the needs of its users.

One tool that many site owners use on their pages is an analytics program, though often analytics are looked at only to see how much traffic is coming to a site, and possibly to determine which words people are using to find it. Analytics programs can play a stronger role, helping site owners improve the experience of the people visiting their pages, and the success of their sites.

Web Mining

The Yahoo patent is interesting in that it focuses less on how a search engine works, and more on how the owners of web sites can use the process of Web mining to discover patterns and relations in Web data. Web mining can be broken down into three main areas:

  • Content mining,
  • Structure mining, and;
  • Usage mining.

These relate to three kinds of data that can be found on a web site:

  • Content — the information that a web site provides to visitors, such as the text, images, and possibly video and audio that people see when they come to a site.
  • Structure data — this is information about how content is organized on a site, such as the links between pages, the organization of information on pages, the organization of the pages of the site itself, and the links to pages outside of the site.
  • Usage data — this information describes how people actually use the site, and may be reflected in the access log files of the server that the site is on, as well as data collected from specific applications on the site, such as people signing up for newsletters or registering with a site and using it in different ways.

Knowing which pages people visit and which pages they don’t can be helpful in figuring out whether there are problems with a site. Those patterns can uncover a need to rewrite pages, to reorganize links, or to make other changes.

Mining User Queries

Understanding query terms used to find a site and to search on the site can help improve the overall quality of a site. Yahoo’s approach would be to create a model to use to understand how people are accessing a site, and navigating through it:

According to specific embodiments of the invention, a model is provided for mining user queries found within the access logs of a website, and for relating this information to the website’s overall usage, structure, and content. The aim of this model is to discover valuable information which may be used to improve the quality of the website, thereby allowing the website to become more intuitive and adequate for the needs of its users.

This model presents a methodology of analysis and classification of different types of queries registered in the usage logs of a website, including both queries submitted by users to the website’s internal search engine and queries from global search engines that lead to documents on the website. As will be shown, these queries provide useful information about topics that interest users visiting the website. In addition, the navigation patterns associated with these queries indicate whether or not the documents in the site satisfied the user’s needs.

Queries uncovered might be related to categories drawn from such things as navigational information found on a site.

Traffic through the site could tell someone using this invention how effective the site was at meeting the information needs of the people using certain queries. It could also provide suggestions for:

  1. The addition of new content
  2. Changes or additions in words found in anchor text in links
  3. New links between related documents
  4. Revisions to links between unrelated documents

Information Scent

Visitors to a site will follow links that use words within the links that provide some level of confidence that the information being looked for will be upon the other side of those links (The Right Trigger Words as User Interface Engineering’s Jared Spool calls them). Likewise, when someone searches at a search engine, and sees a page title and a snippet of text for a site in search results, the words used in the title and snippet may persuade someone to visit the page. This is true both for search results from a search engine, and search results from an internal search for a specific site.

Understanding what kind of information is being searched for with a specific query, and how words are used in search results, on web pages, and in links to other pages, may provide some insight into making those search results, those pages, and that anchor text better.

The patent application describes how pages and the queries used to reach them can be classified based upon how they are typically used by a visitor – from external searches through a search engine, from internal searches through a web site search, or through navigation on the site itself.

It also classifies queries as successful or unsuccessful, based upon things such as whether someone visited a page in response to the display of a search result showing the page, or if they followed other links on pages visited to explore a site in more depth.

Seeing how pages are typically reached on a site in response to certain queries, and seeing which queries are successful and unsuccessful in bringing people to information that they want to find can help a site owner make positive changes to a site.
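
Here is a minimal sketch of how individual access-log entries might be classified along those lines: how the visitor arrived (an external search engine, the site’s internal search, or plain navigation) and whether the visit looks successful. The referrer parsing, the “q” parameter, the example.com hostname, and the success test are all simplifying assumptions of mine, not the patent’s definitions.

```python
from urllib.parse import urlparse, parse_qs

def classify_query_hit(log_entry):
    """Classify a single page view from a site's access log by how the
    visitor arrived: an external search engine, the site's own internal
    search, or plain navigation. Assumes the query lives in a 'q'
    parameter of the referrer URL."""
    referrer = urlparse(log_entry.get("referrer", ""))
    query = parse_qs(referrer.query).get("q", [None])[0]
    if query is None:
        return ("navigation", None)
    if referrer.netloc.endswith("example.com"):   # the site's own search
        return ("internal search", query)
    return ("external search", query)

def query_succeeded(session_pages, min_pages=2):
    """Call a query 'successful' if the visitor went on to view more
    pages of the site after landing, rather than leaving immediately.
    This is one plausible proxy, not the patent's exact test."""
    return len(session_pages) >= min_pages

print(classify_query_hit({"referrer": "http://www.google.com/search?q=student+application"}))
print(classify_query_hit({"referrer": "http://www.example.com/search?q=admission+test"}))
```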

Example

The patent application provides an example using a portal targeted at university students and future applicants.

It focuses upon exploring how effective the site is when searchers use the queries “university admission test” and “new student application,” both at external search engines and in the site’s own internal search. Two initial reports evaluated how effective the site was before any changes were made. Then, twenty of the top suggestions generated from the model described in this patent application were incorporated into the site’s content and structure:

The suggested improvements were made mainly to the top pages of the site and included adding Information Scent to link descriptions, adding new relevant links, and suggestions extracted from frequent query patterns, and class A and B queries.

Other improvements included broadening the content on certain topics using class C queries, and adding new content to the site using class D queries. For example the site was improved to include more admission test examples, admission test scores, and more detailed information on scholarships, because these were issues consistently showing up in class C and D queries.

The “class C” queries mentioned are ones where there was very little information available on the pages of the web site. The “class D” queries were ones for which there was no information available on the site.

One significant result of these changes showed an increase in traffic from external search engines of more than 20%, due to improvements in content, and in link text.

Conclusion

It’s interesting that a search engine would apply for a patent that explores how to use data mining to improve the quality, content, and navigation of a web site. It’s difficult to tell what Yahoo might do with the method described in this patent application – whether they will only use it internally, offer it to others for a fee, or make it available for free.

Many of the concepts described in this patent application are ones that site owners can presently use to improve how well their site meets their objectives, and the objectives of people visiting their pages.

Understanding the terms that people will try to use to find your pages, and the words and concepts that they expect to see on the pages of your site can make a difference in how successful your site may be.

Using analytics tools to understand how visitors who use certain queries will explore your pages and navigate from one page to another can provide even more value to both searcher and site owner, by pointing out changes that can be made to improve the experience of those visitors.

And those changes may just lead to more visits from search engines.

Advances in Crawling the Web

By Bill Slawski

There are three major parts to what a search engine does.

The first is crawling the Web to revisit old pages and find new pages. The second is taking those crawled pages, and extracting data from them to index. The third part is presenting pages to searchers in response to search queries.

There’s been some interesting research published recently on the first of those parts.

Crawling Challenges

Crawling the Web to discover new pages, and identify changes in pages that a search engine already knows about can be a challenge for a search engine. The major issues that search engines face in crawling sites involve:

  • How many pages they can crawl without becoming bogged down,
  • How quickly they can crawl pages without overwhelming the sites that they visit, and;
  • How many resources they have to use to crawl and then revisit pages.

A search engine needs to be careful about how it spends its time crawling web pages, and about which pages it chooses to crawl, to keep these issues under control.

A recently published academic paper describes this important aspect of how a search engine works, the Web crawl, in more detail than most papers that have been published on the subject before.

Enter IRLBot at Texas A&M

The Department of Computer Science at Texas A&M University has been running a long-term research project known as IRLBot, which “investigates algorithms for mapping the topology of the Internet and discovering the various parts of the web.”

In April, researchers from the school will be presenting some of their recent research in Beijing, China, at the 17th International World Wide Web Conference (WWW2008).

The title of their presentation is IRLbot: Scaling to 6 Billion Pages and Beyond (pdf), and the focus of the paper is this primary function that a search engine performs – crawling the Web and finding new Web pages.

Their research describes some interesting approaches to finding new pages on the Web and handling web sites with millions of pages, while also avoiding spam pages and the infinite loops that could pose problems for web crawlers.

In a recent experiment that lasted 41 days, their crawler “IRLbot” ran on a single server and successfully crawled 6.3 billion web pages at an average rate of approximately 1,789 pages per second. This is a pretty impressive feat, and it’s even more impressive because of some of the obstacles faced while finding those pages.

Problems Facing Crawling Programs

Politeness

One challenge facing Web crawling programs is that they shouldn’t ask for too many pages from the same site at a time, or they could use up too many of the site’s resources and make the site inoperable. Keeping that from happening is known as politeness, and search crawlers that aren’t polite often find themselves blocked by site owners, or complained about to the internet service provider hosting the crawler.
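
A minimal sketch of that kind of politeness, assuming a simple fixed delay between requests to the same host (the delay values here are arbitrary):

```python
import time
from urllib.parse import urlparse

class PolitenessThrottle:
    """Make the crawler wait a minimum delay between requests to the
    same host, so that no single site is overwhelmed. The default delay
    is just an illustrative choice."""
    def __init__(self, delay_seconds=10.0):
        self.delay = delay_seconds
        self.last_request = {}   # host -> time of the last fetch

    def wait_for(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_request.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request[host] = time.time()

throttle = PolitenessThrottle(delay_seconds=1.0)
throttle.wait_for("http://example.com/page1")
throttle.wait_for("http://example.com/page2")   # pauses ~1 second first
```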

URL Management

As a crawling program indexes a site, it needs to pay attention to a file on the site known as a robots.txt file, which provides directions on pages and directories that the crawling program shouldn’t visit, so that it doesn’t crawl pages that it isn’t supposed to see. The program also needs to track which pages it has seen, so that it doesn’t try to crawl the same pages over and over again.
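
Here is a small sketch of that bookkeeping in Python, using the standard library’s robots.txt parser and a set of already-seen URLs. A production crawler would cache the parsed robots.txt per host rather than re-reading it for every URL, as noted in the comments.

```python
from urllib import robotparser

class URLManager:
    """Track which URLs have already been crawled and honor each site's
    robots.txt before fetching."""
    def __init__(self, user_agent="ExampleBot"):
        self.user_agent = user_agent
        self.seen = set()

    def should_crawl(self, url, robots_url):
        if url in self.seen:
            return False          # never fetch the same page twice
        # A real crawler would cache the parsed robots.txt per host
        # instead of fetching and parsing it for every URL.
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()                 # fetches and parses robots.txt
        if not rp.can_fetch(self.user_agent, url):
            return False          # the site asked crawlers to stay out
        self.seen.add(url)
        return True

# Example usage (requires network access to fetch robots.txt):
# manager = URLManager(user_agent="ExampleBot")
# manager.should_crawl("http://example.com/page",
#                      robots_url="http://example.com/robots.txt")
```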

Avoiding Spam While Crawling

The crawling process described in the paper also tried to limit the crawling program from accessing pages that might more likely be spam pages. If a crawling program spends a lot of its time on spam pages and link farms, it has less time to spend on sites that may be more valuable to people searching for good results to their queries at search engines.

One key to the method this research team used to determine how much attention a site should get from their web crawler was looking at the number of legitimate links pointing to the site from other sites, which is what they refer to as domain reputation.
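
As an illustration only, a crawl budget could grow with a domain’s count of legitimate in-links along these lines; the logarithmic formula is a stand-in I made up, not the budgeting algorithm the paper actually uses.

```python
import math

def crawl_budget(domain, inlink_counts, base_budget=100):
    """Give each domain a crawl budget that grows with the number of
    legitimate links pointing at it from other sites -- the 'domain
    reputation' idea. The logarithmic formula is an illustrative
    stand-in, not the paper's actual budget function."""
    inlinks = inlink_counts.get(domain, 0)
    return int(base_budget * (1 + math.log1p(inlinks)))

inlinks = {"well-linked-site.com": 50000, "link-farm.example": 3}
print(crawl_budget("well-linked-site.com", inlinks))   # large budget
print(crawl_budget("link-farm.example", inlinks))      # stays small
```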

Why This Paper is Important

The authors of the paper tell us that there are:

…only a limited number of papers describing detailed web-crawler algorithms and offering their experimental performance.

The paper provides a lot of details on the crawling process and the steps that the Texas A&M researchers took that enabled them to index multi-million page web sites, avoid spam pages, and remain “polite” while doing so. It explores the experiments that they conducted to test out ideas on how to handle very large sites and crawls of very large numbers of pages.

They conducted their experiment using only a single computer. The major commercial search engines have considerably more resources to spend on crawling the web, but managing which pages to index, being polite to the sites that they visit, and avoiding spam pages are problems that those search engines face too.

Learning about how search engines may crawl pages can help us understand how a search engine might treat individual sites during that process. If you are interested in learning about the web crawling process in depth, this paper is a good one to spend some time reading.