Yahoo on Web Mining and Improving the Quality of Web Sites

by Bill Slawski

A successful web site is one that fulfills the objectives of its owners and meets the expectations of the visitors that it was created to serve.

This is true of ecommerce web sites, news and informational sites, personal web pages, and even search engines. And, it’s a topic that even the search engines are exploring more deeply. A recent patent application from Yahoo tells us that:

The Web has been characterized by its rapid growth, massive usage, and its ability to facilitate business transactions. This has created an increasing interest for improving and optimizing websites to fit better the needs of their visitors. It is more important than ever for a website to be found easily on the Web and for visitors to reach effortlessly the content for which they are searching. Failing to meet these goals can mean the difference between success and failure on the Internet.

User Query Data Mining and Related Techniques, (US Patent Application 20080065631), by Ricardo Alberto Baeza-Yates and Barbara Poblete.

The patent filing discusses how information about the queries people use, collected from a site’s own search box (if it has one) and from the search engines that bring people to the site, can reveal how people use that site.

The collection of this kind of information is often referred to as Web Mining, and looking closely at the words people use to find information on a site can tell us something about the actual information needs of those visitors.

Search engines have studied searchers’ queries mostly to try to make search engines work better, but looking at the words people use to find a site, and to search within it once they have found it, could help to make the web sites themselves better.

The abstract of Yahoo’s patent filing notes:

Methods and apparatus are described for mining user queries found within the access logs of a website and for relating this information to the website’s overall usage, structure, and content. Such techniques may be used to discover valuable information to improve the quality of the website, allowing the website to become more intuitive and adequate for the needs of its users.

One tool that many site owners use on their pages is an analytics program, though often it is consulted only to see how much traffic is coming to a site, and possibly to determine which words people are using to find it. Analytics programs can play a stronger role in helping site owners improve the experience of people visiting their pages, and the success of their sites.

Web Mining

The Yahoo patent is interesting in that it focuses less on how a search engine works, and more on how the owners of web sites can use the process of Web mining to discover patterns and relations in Web data. Web mining can be broken down into three main areas:

  • Content mining,
  • Structure mining, and
  • Usage mining.

These relate to three kinds of data that can be found on a web site:

  • Content — the information that a web site provides to visitors such as the text and images and possibly video and audio, that people see when they come to a site.
  • Structure data — this is information about how content is organized on a site, such as the links between pages, the organization of information on pages, the organization of the pages of the site itself, and the links to pages outside of the site.
  • Usage data — this information describes how people actually use the site, and may be reflected in the access log files of the server that the site is on, as well as data collected from specific applications on the site, such as people signing up for newsletters or registering with a site and using it in different ways.

Knowing which pages people visit and which pages they don’t can help in figuring out whether there are problems with a site. Those usage patterns can uncover a need to rewrite pages, reorganize links, or make other changes.
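As a rough illustration of usage mining, the sketch below counts page requests in a server access log and lists known pages that were never requested. It is not taken from the patent filing. It assumes log lines in the Common Log Format, and the sample log lines, page paths, and function names are all made up for the example.

```python
import re
from collections import Counter

# Matches the request portion of a Common Log Format line,
# e.g. "GET /admission.html HTTP/1.1", and captures the path.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def count_page_visits(log_lines):
    """Count how often each requested path appears in the log lines."""
    visits = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            visits[match.group(1)] += 1
    return visits


def unvisited_pages(all_pages, visits):
    """Return the known pages that never appear in the log, sorted."""
    return sorted(page for page in all_pages if visits[page] == 0)


# Hypothetical log lines and site pages, for illustration only.
sample_log = [
    '203.0.113.5 - - [10/Mar/2008:13:55:36 -0700] "GET /admission.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Mar/2008:13:57:02 -0700] "GET /index.html HTTP/1.1" 200 5120',
    '203.0.113.5 - - [10/Mar/2008:14:01:44 -0700] "GET /admission.html HTTP/1.1" 200 2326',
]
all_pages = ["/index.html", "/admission.html", "/scholarships.html"]

visits = count_page_visits(sample_log)
never_visited = unvisited_pages(all_pages, visits)
```

A page that turns up in `never_visited` is not necessarily a bad page. It may simply be hard to reach, which is the kind of structural problem this sort of report is meant to surface.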

Mining User Queries

Understanding query terms used to find a site and to search on the site can help improve the overall quality of a site. Yahoo’s approach would be to create a model to use to understand how people are accessing a site, and navigating through it:

According to specific embodiments of the invention, a model is provided for mining user queries found within the access logs of a website, and for relating this information to the website’s overall usage, structure, and content. The aim of this model is to discover valuable information which may be used to improve the quality of the website, thereby allowing the website to become more intuitive and adequate for the needs of its users.

This model presents a methodology of analysis and classification of different types of queries registered in the usage logs of a website, including both queries submitted by users to the website’s internal search engine and queries from global search engines that lead to documents on the website. As will be shown, these queries provide useful information about topics that interest users visiting the website. In addition, the navigation patterns associated with these queries indicate whether or not the documents in the site satisfied the user’s needs.
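One way to gather both kinds of queries is to read them out of the referrer URLs recorded in the access log. The sketch below is an assumption about how that might be done, not the method in the filing. The site host `www.example.edu` is hypothetical, and the parameter names cover only a few common conventions (`q`, and `p` as used by Yahoo search); other search engines may use different ones.

```python
from urllib.parse import urlparse, parse_qs

SITE_HOST = "www.example.edu"       # hypothetical host of the site being analyzed
QUERY_PARAMS = ("q", "p", "query")  # assumed names of search-term parameters


def extract_query(referrer):
    """Return (source, query) for a referrer URL, or None if it holds no query.

    source is "internal" when the referrer is the site's own search,
    and "external" when it is any other host.
    """
    parsed = urlparse(referrer)
    params = parse_qs(parsed.query)
    for name in QUERY_PARAMS:
        if name in params:
            query = params[name][0].strip().lower()
            if query:
                source = "internal" if parsed.netloc == SITE_HOST else "external"
                return source, query
    return None


external = extract_query("http://search.yahoo.com/search?p=university+admission+test")
internal = extract_query("http://www.example.edu/search?q=New+Student+Application")
no_query = extract_query("http://www.example.edu/about.html")
```

Lowercasing and trimming the query is a simple normalization so that the same search typed in different ways is counted once. A real analysis would probably need more careful normalization.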

Queries uncovered might be related to categories drawn from such things as navigational information found on a site.

Traffic through the site could tell someone using this invention how effective the site was at meeting the information needs of the people using certain queries. It could also provide suggestions for:

  1. The addition of new content
  2. Changes or additions in words found in anchor text in links
  3. New links between related documents
  4. Revisions to links between unrelated documents

Information Scent

Visitors to a site will follow links whose wording gives them some confidence that the information they are looking for is on the other side of those links (“The Right Trigger Words,” as User Interface Engineering’s Jared Spool calls them). Likewise, when someone searches at a search engine and sees a page title and a snippet of text for a site in the search results, the words used in the title and snippet may persuade them to visit the page. This is true both for results from a search engine and for results from a site’s internal search.

Understanding what kind of information is being searched for with a specific query, and how words are used in search results, on web pages, and in links to other pages, may provide some insight into making those search results, those pages, and that anchor text better.

The patent application describes how pages and the queries used to reach them can be classified based upon how they are typically used by a visitor – from external searches through a search engine, from internal searches through a web site search, or through navigation on the site itself.

It also classifies queries as successful or unsuccessful, based upon things such as whether someone visited a page in response to the display of a search result showing the page, or if they followed other links on pages visited to explore a site in more depth.
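A heavily simplified version of that success test can be sketched as follows. The rule used here (a query counts as successful in a session if the visitor viewed at least one page after it) is an assumption made for illustration, and the patent filing’s actual criteria are more detailed. The session data and function name are hypothetical.

```python
from collections import defaultdict


def success_rates(sessions):
    """Compute the share of sessions in which each query was successful.

    sessions is a list of (query, pages_visited_after_query) pairs.
    Simplified rule (an assumption): a query is successful in a session
    when at least one page was visited after it.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for query, pages in sessions:
        totals[query] += 1
        if pages:
            successes[query] += 1
    return {query: successes[query] / totals[query] for query in totals}


# Hypothetical sessions, for illustration only.
sessions = [
    ("university admission test", ["/admission/test.html", "/admission/scores.html"]),
    ("university admission test", []),
    ("new student application", ["/apply.html"]),
]

rates = success_rates(sessions)
```

Queries with a low rate are candidates for a closer look: the site may lack the content, or the titles, snippets, and link text may not signal that the content is there.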

Seeing how pages are typically reached on a site in response to certain queries, and seeing which queries are successful and unsuccessful in bringing people to information that they want to find can help a site owner make positive changes to a site.


The patent application provides an example using a portal targeted at university students and future applicants.

It explores how effective the site is when searchers use the queries “university admission test” and “new student application,” both on search engines and in the site’s own search. Two initial reports evaluated how effective the site was before any changes were made. Twenty of the top suggestions generated by the model described in the patent application were then incorporated into the site’s content and structure:

The suggested improvements were made mainly to the top pages of the site and included adding Information Scent to link descriptions, adding new relevant links, and suggestions extracted from frequent query patterns, and class A and B queries.

Other improvements included broadening the content on certain topics using class C queries, and adding new content to the site using class D queries. For example the site was improved to include more admission test examples, admission test scores, and more detailed information on scholarships, because these were issues consistently showing up in class C and D queries.

The “class C” queries mentioned are ones where there was very little information available on the pages of the web site. The “class D” queries were ones for which there was no information available on the site.
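A crude way to approximate those two classes is to count how many pages on the site mention every term in a query. The sketch below is only a proxy built on that idea. The substring matching, the `sparse_threshold` cutoff, the "covered" label, and the sample pages are all assumptions, not the classification rules from the filing.

```python
def coverage_class(query, pages, sparse_threshold=2):
    """Label a query by how much of the site's content mentions it.

    pages maps a path to that page's text. Returns "D" when no page
    contains all of the query terms, "C" when only a few pages do
    (at most sparse_threshold, an assumed cutoff), and "covered" otherwise.
    """
    terms = query.lower().split()
    matching = sum(
        1
        for text in pages.values()
        if all(term in text.lower() for term in terms)
    )
    if matching == 0:
        return "D"
    if matching <= sparse_threshold:
        return "C"
    return "covered"


# Hypothetical page texts, for illustration only.
pages = {
    "/admission.html": "Admission requirements and the university admission test schedule.",
    "/programs.html": "Degree programs offered by each faculty.",
    "/contact.html": "Contact the admission office.",
}

sparse = coverage_class("admission test", pages, sparse_threshold=1)
missing = coverage_class("scholarships", pages, sparse_threshold=1)
well_covered = coverage_class("admission", pages, sparse_threshold=1)
```

Queries that land in "D" point to content that might be added, and queries in "C" to topics that might be expanded, in line with the changes described for the university portal.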

One significant result of these changes was an increase of more than 20% in traffic from external search engines, attributed to improvements in content and in link text.


It’s interesting that a search engine would apply for a patent that explores how to use data mining to improve the quality, content, and navigation of a web site. It’s difficult to tell what Yahoo might do with the method described in this patent application: whether they will use it only internally, or will offer it to others for a fee or for free.

Many of the concepts described in this patent application are ones that site owners can presently use to improve how well their site meets their objectives, and the objectives of people visiting their pages.

Understanding the terms that people will try to use to find your pages, and the words and concepts that they expect to see on the pages of your site can make a difference in how successful your site may be.

Using analytics tools to understand how visitors who use certain queries will explore your pages and navigate from one page to another can provide even more value to both searcher and site owner, by pointing out changes that can be made to improve the experience of those visitors.

And those changes may just lead to more visits from search engines.
