The Associated Press’s News Microformat

The Associated Press (AP) recently announced a semantic markup standard they’d like to see adopted online for news articles: the “hNews” microformat. The proposed microformat was announced simultaneously with their declaration of a news registry system to facilitate protection and paid licensing arrangements for quoting and using news article material. The overall announcement and news registry system were widely ridiculed in the blogosphere, in part because of a confusingly inaccurate description which stated that the microformat would serve as a “wrapper” for news articles, and in part because the overall business model and protection scheme seem both naively optimistic and out of touch with copyright “fair use” standards and actual technological constraints. Even so, the hNews microformat itself could potentially gain some traction.

So, if you’re an online marketer of a site which publishes large numbers of articles and news stories, is the hNews microformat worth adopting to improve your optimization efforts?

(AP’s diagram illustrating the “Protect, Point & Pay” system and hNews microformat)

I’ve long been a proponent of incorporating microformats within webpages as a component of overall good usability, and as potentially valuable formatting for search engine optimization purposes. Microformats can provide additional, enhanced usability for advanced users whose devices can read the information and store it for future use, and they can improve search engines’ ability to understand the content within webpages, which could lend marginally more SEO value.

Both Yahoo! and Google have been sending signals for the past few years that they consider some of the microformats to be potentially useful as well. They’ve both marked up their own local search results with hCard microformatting for end users’ benefit, and they’re both starting to make use of microformatting to give certain types of data special treatment. In the case of Google, they announced that they’d begin displaying some microformat data with slightly different listing layouts in the search results, a treatment that they’ve dubbed “Rich Snippets”. And, they say they’ll be rolling out more treatments based on microformats in the future.

With this background in mind, it’s not surprising that the AP has jumped on the microformats bandwagon, but it also appears that they’re trying to influence the development of microformats for news articles with a major agenda in mind. They wish to include some sort of web bug in each news story’s markup so that publishers of the content can be tracked more easily: it will be clearer when sites are reprinting news stories, and how frequently those stories are visited and viewed by consumers online.

Other portions of the hNews microformat appear to be more useful from both a search engine viewpoint and a publisher site aspect. Labelling items such as keyword tags, headlines, main content, and geographic locations, and including the author’s vCard info, all appear to be valuable standards.

(I could really criticize their “geo” tagging of the articles as quite inadequate, though. Merely adding a longitude and latitude to an article seems short-sighted, because there needs to be further definition of what is being geotagged. If an article is about multiple locations, it would be ideal to label each geotag to tell what item is being located. Further, it would be ideal to label the article with the geographic region the article should be expected to appeal to. Is it mainly of interest to people within a particular city, state/province, region, or nation, or is it of international interest? Still, having some geotag is better than nothing.)
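To make the machine-readability concrete, here’s a minimal sketch of pulling hNews-style fields out of a page with Python and BeautifulSoup. The markup and values are invented for illustration, and I’m only using the hAtom classes that hNews builds upon (hentry, entry-title, author vcard) plus the bare geo microformat discussed above; the actual hNews proposal layers further properties (such as source-org and dateline) on top of these.

    # Minimal sketch: extracting hNews-style fields (hAtom classes plus the
    # geo microformat). The markup below is invented for illustration.
    # Requires: pip install beautifulsoup4
    from bs4 import BeautifulSoup

    html = """
    <article class="hentry">
      <h1 class="entry-title">Town Council Approves Budget</h1>
      <p class="author vcard"><span class="fn">Jane Reporter</span></p>
      <div class="entry-content">The council voted 5-2 on Tuesday ...</div>
      <span class="geo">
        <span class="latitude">39.68</span>, <span class="longitude">-75.75</span>
      </span>
    </article>
    """

    soup = BeautifulSoup(html, "html.parser")
    entry = soup.find(class_="hentry")

    headline = entry.find(class_="entry-title").get_text(strip=True)
    author = entry.find(class_="fn").get_text(strip=True)
    geo = entry.find(class_="geo")
    latitude = geo.find(class_="latitude").get_text(strip=True)
    longitude = geo.find(class_="longitude").get_text(strip=True)

    print(headline, "by", author, "at", latitude, longitude)

Notice that the bare latitude/longitude pair tells a consuming application nothing about what the coordinates refer to, which is exactly the shortcoming noted above.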

For any marketers out there considering adopting the hNews microformat standard, I’d advise waiting until the dust settles on this one. Other microformats were developed perhaps more objectively, and there’s a lot of distrust and disaffection over the heavy news industry influence involved in this proposed standard. Currently, I’m not convinced that it will be widely enough accepted to become valuable for use. While having AP partners all adopt the standard may be sufficient to reach a tipping point where many other sites and companies make use of hNews, Google’s public response to it was unusually cold-sounding.

Blogger/reporter Matthew Goldstein quotes Google’s response on the matter: “Google welcomes all ideas for how publishers and search engines can better communicate about their content. We have had discussions with the Associated Press, as well as other publishers and organizations, about various formats for news. We look forward to continuing the conversation.” While sounding predictably neutral and noncommittal, Google is also signaling that this has not been widely accepted by everyone, even within the news industry itself. This, in combination with widespread skepticism within the developer/microformat community and blogosphere, signals that hNews may have a very long way to go before it becomes something worthwhile for optimizing articles on publisher sites.

So, for now I advise avoiding this proposed standard; sit back and see how the dust settles. If you’re already syndicating content via RSS and Atom feeds, then you’re already distributing your content in a manner that’s easily absorbed and read by search engines.

Save Yourself A Thousand Dollars On Simba Yellow Pages Report

Simba Information has released a report on the state of the yellow pages industry entitled “The RBOC Bankruptcies 2009: The Impact on the Future of the Yellow Pages Industry” and will offer a webinar this week to those who bought it. At $995, I think the report is likely overpriced, and I thought I’d save you some money if you were tempted to pay that much to find out why some of the major yellow pages publishers are filing for Chapter 11 bankruptcy protection, what this means for the industry, and where things are headed.


I guess I’m reacting to the somewhat hyperbolic language found in the press release, which I think was intended to appeal to the fears of yellow pages publishers, possibly the very people who can least afford to pay this much for an analysts’ report.

First of all, I think it’s a stretch to refer to Idearc and R.H. Donnelley as RBOCs. Since Idearc was separated from the telco function of Verizon and then spun off, I don’t believe people really consider it to be an RBOC any longer. I don’t think R.H. Donnelley could ever have been considered an RBOC, even though it acquired directory parts of old RBOC companies. “RBOC” stands for “Regional Bell Operating Company”, describing the companies which originally made up the American Telephone & Telegraph Company, earlier known as the Bell Telephone Company, and which were broken apart into separate regional companies to satisfy antitrust requirements. The main focus of the original splintering of the RBOCs was the phone services, and the general convention is to consider those telco functions the “phone company”; when non-telco portions are spun off, they are no longer referred to as “RBOCs”. This is maybe pedantic of me, but I think such loose description is inauspicious in an expert report.

Secondly, there’s not a whole lot of mystery about why Idearc and RHD got into financial straits and had to file for Chapter 11. Both were overly debt-heavy, and when the economy turned sour, they could not properly service those debts. I wrote in depth about Idearc’s case in a post on Search Engine Land originally titled “Idearc’s Chapter 11 Bankruptcy: Who’s Really Responsible?”, and you can see Bloomberg’s and other reports stating that R.H. Donnelley’s bankruptcy was due to overly high debt. Yell Group’s problems also stem from debt. Ambassador Media Group, another well-known yellow pages publisher, filed for bankruptcy protection this week as well.

So, let me save you a thousand dollars with the simple explanation. For a hundred years, the print yellow pages industry was a very profitable business. It was a very safe bet, as investments go. Such a long-standing business model, “ecologically adapted” to be interdependent with many other businesses, was simply not expected to see any major declines. However, the technological disrupters of first the internet, then pay-per-click advertising, and then the Google search engine had a largely unforeseen effect. These companies increased capital investment, expecting long-term wins, but the rapid erosion of print advertising undermined their ability to pay down their loans quickly enough. Even though their internet sides are increasingly profitable in many cases, the volume of internet profit is insufficient to both cancel out the losses in print revenue and simultaneously pay off their loans. In Idearc’s case, I further outlined how they were sandbagged from the very beginning by Verizon offloading an unreasonable debt load upon them.

What does the future hold for Yellow Pages?

Immediately, these companies which are restructuring will come out far stronger. They will be forced to further pare down their print divisions. Print will continue to see erosion in revenues, since overall usage is declining, just as it is with other print media (I solidly established that the yellow pages industry’s own statistical projections were considerably inaccurate, and that print directory usage likely continues to drop each year).

I’ve also been stating for quite some time that there appear to be too many players in the internet yellow pages (IYP) sector, and that consolidation is likely; we can expect some mergers of these companies in the near future.

From my perspective, these online directories are also weakening in terms of market share. For now, they can be profitable, but I see too much incestuous interselling among the players. It’s possible that once consolidation within this sector occurs, the resulting players left standing may be strong enough to continue competing and to grow. But there is significant cause for concern in the growing local search market share taken over by the major search engines such as Google. If the IYPs cannot improve their game well enough and rapidly enough to compete with the major search engines, then there will continue to be financial instability on the part of the yellow pages companies.

Simba’s press release mentions in passing how “…bloggers jump to their computer keyboard and pound out a call for the outright ban of books for the good of the people whether they want them or not and toss in the good of the environment as well…”, wording that plays very well to those in the YP industry who have been very defensive about the attacks on the printed books. Yet, rather than playing up to the anti-environmentalist defensiveness of the YP industry, it’d probably be more productive to face the difficult current truths squarely. People who don’t use the printed media are increasingly irritated by having books from multiple providers dropped unsolicited on their doorsteps these days, and environmental progressivism is a popular and rising trend, turning mild irritation into full-frontal attacks. It’s undeniable that in quite a number of markets throughout the U.S. there have been significant movements to restrict directory distribution. Quite simply, this trend is going to continue, and the industry’s thin band-aids in many cases are not going to perform well at resisting the attitudinal change.

Finally, why should you trust my analysis more than Simba’s (even though I’m saving you the thousand dollars)?

For one thing, I was one of the earliest analysts to state that I saw weakness in the yellow pages industry, and later that there were serious problems in store for yellow pages. Quite a number of other research firms and analysts that cater to the yellow pages industry were offended back then by my findings, but it’s now undeniable that print yellow pages have indeed experienced substantial declines. I forecast the decline, I warned of weakening in print, and I stated it out loud, even as other major analysts were dismissive and even angrily reactive. I simply observe the facts and attempt to project realistic possibilities rather than merely catering to popular notions; I’m not afraid to speak the whole truth as I see it.

Interestingly enough, AT&T’s directory division hasn’t been experiencing the same degree of problems seen by other directory publishers, but they’ve been kept “within the fold” of the overall AT&T telco corporation, which can insulate them from problems experienced by the standalone directory companies. Simba’s webinar includes Frank Jules, AT&T’s president & CEO of Advertising Solutions, but I’m not at all convinced that AT&T’s yellow pages group will be all that informative beyond speaking to their own directory products.

As I pointed out in my article showing weakening in online searches for the “yellow pages” keyword phrase, online consumers appear to be seeking out yellow pages sites less and less. It stands to reason that as Google’s blended search surfaces local businesses in response to users’ keyword searches more easily, there’s less reason for those consumers to seek out business directories. The younger generation is forgetting what a “yellow pages” is altogether, and sites like AT&T’s Yellowpages.com, which have placed all their branding around that eroding concept, stand to lose out.

Simba’s report undoubtedly will have some good information in it. But, will it really be worth a thousand dollars? I seriously doubt it. If you’ve read my blog post here, I think you can safely save yourself the cost.

Are You an Online Marketer or Just an SEO?

At SES London, Mike Grehan headed up an Orion Panel with Jill Whalen, Brett Tabke, Chris Sherman, Kevin Ryan and Rand Fishkin. The panel took a look at “SEO Where To Next”. I’m not going to rehash what went on at the panel; if you’d like a rundown, Paul Madden did a good summation of it. What I am looking to discuss is our roles: are we just SEOs, PPC practitioners or affiliate marketers, or are we online marketers?

What prompts me to ask this is how, in the past two years, the rise of “Web 2.0” (I really hate that term) has begun to affect how people consume content, media or anything on the web. Focusing on just SEO, PPC or even affiliate marketing, we tend to rely very heavily on the search engines. Heck, we live, die and cry by what Google does. Take a look at the announcement by Matt Cutts about the canonical tag: the search marketing world went nuts!

But what happens when more and more surfers on the internet stop using the typical search engines to find what they need? Confused? Let me explain.

With the advent of the iPhone and its open application system, you no longer need to go to Google to find a nearby restaurant. That’s right, iPhone users have a bevy of applications that connect them to the internet without a browser and without going to Google and getting a map with a list of restaurants. OpenTable will tell you which restaurants near you have available seating, and Urbanspoon does just about the same thing.

It’s not just the iPhone either. AccuWeather just launched a nice little widget, much better than the dreaded desktop “WeatherBug” app (which adds those dreaded tracking cookies that Norton catches). Through the slick Adobe AIR backend, AccuWeather tells me my weather without my opening a browser and typing in “Weather 19468”. There’s also a nice Adobe AIR application called TweetDeck to help you manage Twitter, never having to open a browser to hold a relevant conversation.

Facebook and MySpace both have applications for the iPhone, BlackBerry or just about any smartphone out there. It’s becoming easier and easier to connect to the internet and the sites you want, and to find the things you want, without using a browser or even a search engine.

So with that in mind, I posed this question to the panel: with the ability to connect to the internet without a browser, is it still the SEO’s job to work with these types of applications? Only one panel member answered. Bravely, Rand Fishkin said he didn’t believe this was the SEO’s job.

I agree, to a point. If you define yourself as an SEO who just optimizes web pages or websites, then yes, he’s right.

But if you have an eye on the future of marketing and are watching which new technologies are emerging and being embraced in our world, I have to disagree with Rand: that view is really limiting. Businesses are going to have to embrace moving beyond just the typical web page for an online presence. Search engines aren’t just browser based anymore; the OpenTable application demonstrates that to a “T”. As responsible online marketers, we have to look beyond just websites and Google, we have to look at the entire online presence, and we have to move beyond the thought that SEO means web-based search engines, because it doesn’t. So are we SEOs or online marketers, or perhaps both? I guess in the end it’s how you define “SEO”.

That leads me to wonder: is the holy grail of search, the “Google Killer”, just going to be the inevitable change in end-user habits? Interesting thought, isn’t it? 🙂

I’m A Social Media Goody 2 Shoes … And Proud Of It

By Liana “Li” Evans

So, yet another controversy when it comes to social media. I woke up to a plethora of IMs, private tweets and emails, to find out I’m a “Goody2Shoes”. I guess I could be upset, but I’m not. It’s par for the course in the world of search these days. I could lash out at SEOmoz because, as many have pointed out, they let a post go out on their own blog that attacks competitors (it has since been edited, but the point being they originally let it out with rather rude attacks on Matt Cutts, Lisa Barone and myself). I’ll let all the comments on the post speak for themselves. I’m sad that SEOmoz chose the path of inciting drama and discord, but in the end that’s Rand’s decision about where to take his business, not mine. The drama gets the site links and traffic, and I guess that trumps everything.

As for what Marty wrote about both Lisa Barone and myself, and his choosing to post it on SEOmoz rather than taking ownership of it on his own blog, I can only guess he really needed the larger audience for the message he wanted to convey. I read Marty’s apology (“Lii and Lisa are pillars in this community…”), and while I’d like to think it’s genuine, I was on the panel in Toronto where I heard the example of vanity baiting in his presentation, so I can’t help but think and question whether this might be another example of it.

As for my stance, I also guess that when you take the position that fake profiles on StumbleUpon, and adding lots of “fake” friends to make yourself look more popular, are not a sound strategy for entering the social media space, you’ll undoubtedly get flak from those who find no flaws in that strategy. It happens; we all have different moral compasses, and we all have different things that drive us to be what we perceive as a “great marketer”.

When I was taken aback by the tactics my co-panelists in Toronto presented and posted about it, I wanted to make sure I wasn’t off base. I asked a few people who just use social media, without any knowledge of search or marketing, what they thought of these tactics. The first person I asked was the 14-year-old son of a friend who is an avid MySpace user. I asked him what he thought about adding all these famous people as friends; his reply was just one word: “Lame”. I asked a friend I hang out at karaoke with the same question, and her reply was “that’s just stupid, why would you friend them unless you liked them?”

Next I asked a few people who I know use StumbleUpon for pure enjoyment, and who have no marketing background, what they thought about people building fake “avatars”, or “fictitious profiles”, on the service (btw, that’s a blatant violation of StumbleUpon’s TOS). My one friend from the EU said, “isn’t that illegal here?” (only illegal in the UK, sorry to say), and another said “people do that? why in the world do they do that, that’s just crazy, and wrong, can’t they be honest?”

Now if everyday people (not marketers) are saying this about these strategies, why would I advise my clients to implement them? I wouldn’t, and I wouldn’t promote doing so in a session at a major online marketing conference. I don’t see how creating fake profiles (or avatars) gains anyone any kind of ground in the end; when you are found out to be a fraud, all trust is lost.

What’s wrong with being honest? Really now, what’s wrong with starting a conversation, an honest one, with real brand representatives, not one greeted immediately by fake/automated avatars that want to be my friend?

The only reason I can see for why SEOs seem so fascinated with “gaming” social media by creating fake avatars and adding all these “non-friends” is power and links. That’s really not what social media is about, not to the people inside the communities; only to SEOs does this seem to matter.

If advocating that marketers in social media be real and engage honestly in conversations with their audience or customers is deemed “Goody 2 Shoes”, well, I’ll gladly and proudly wear that badge.

*****

Now, I don’t know about you, but with all this reference to Goody 2 Shoes, I really can’t get Adam Ant’s ’80s tune out of my head. 🙂

Photo/Comic Credit: ebbourg

Matt McGee Joins Our KeyRelevance Team

By Li Evans

We’ve got some exciting news here at the SEMClubhouse. Another great SEO mind has joined not just the clubhouse, but the KeyRelevance staff as well.

Matt McGee of the Small Business Search Marketing blog joined our team yesterday!

With companies needing to stretch their marketing dollars, adding Matt McGee, who specializes in working with clients to maximize the return on their online marketing investment, was a great expansion to the KeyRelevance team.

“Google’s Universal Search changed the rules of online marketing,” said Christine Churchill, President and CEO of KeyRelevance. “Search engine optimization still rules, but now it’s the tip of the iceberg of what we need to provide to clients. Online marketing now encompasses not only SEO and Pay Per Click, but blog and video optimization, local and mobile search, social media marketing, and much more. Matt’s specialized knowledge in these areas makes him a valuable addition to our already robust team.”

KeyRelevance’s online team includes well-known SEOs Bill Slawski (SEO By The Sea), Li Evans (Search Marketing Gurus), Jim Gilbert, and now Matt McGee.

“In addition to being a first class SEO, Matt is one of the most positive people I’ve ever known,” added Churchill. “He infuses the element of fun into the workplace.”

A seasoned marketer, Matt has been online since 1994. Matt is a regular speaker at major search industry conferences including Search Engine Strategies, Search Marketing Expo, and Small Business Marketing Unleashed. He is also a columnist at Search Engine Land. In his spare time, Matt runs the Small Business Search Marketing blog and one of the oldest and largest independent U2 sites on the Internet at @U2.com.

“KeyRelevance is one of the most respected companies in the search marketing industry, and it’s an honor to join a team with such impeccable credentials. I’ve known Christine, Li, and Bill for years as friends and peers. I’m excited to join them and the rest of the KeyRelevance crew,” says McGee.

What Is Social Media’s Purpose? Honestly, It’s Not About Links

By Li Evans

What do you use social media for?

Do you use it to gain links? How about power? Maybe to trick people into thinking you are someone else? Perhaps as leverage to con someone into doing something on another social media site for you?

At SES Toronto I was on the Social Media Success panel. I took this panel very seriously; I wanted to demonstrate how companies are using social media and creating their own success stories. The companies I chose to highlight wanted active conversation, true audience engagement and honest reviews, and because they took that approach they had incredible success. I believe with every ounce of my being that social media is about conversations and sharing. I have a huge issue with applying shady link acquisition tactics, power manipulation and common trickery to social media.

There are people in the search industry who think social media is a numbers game, a numbers game that involves links. On the panel, things were presented that made my jaw drop: basically “shady” techniques, like adding friends just for the numbers, creating multiple profiles, vanity baiting, and using your power on one social media site to gain something on another. To my colleagues on the panel, social media was all about the links and perceived power. Success in social media, to them, seemed to be about how many links you acquired, through what seemed to be cheap and fast tricks to get them.

I wasn’t alone in my dismay; Rahaf Harfoush expressed her shock at the lack of ethics presented.

People in the search industry wonder why SEO gets the stigma of being the “snake oil salesmen”. People in the search industry wonder why big companies are snubbing SEO, and don’t even look to SEO practitioners for social media assistance. Well, when you apply SEO practices to social media solely to gain links, or try to manipulate people into thinking things are true that aren’t, that’s how that reputation emerges and the snubbing occurs.

Social Media is not about links.

Social media is about conversations and the opportunity to share experiences through those conversations. Links are merely a by-product of a great social media campaign, and search engine rankings are merely a by-product as well. If you are measuring success in social media by the number of links you’ve acquired, you are really and truly missing out on what social media is all about.

What’s going to happen when Google finally devalues links from websites and puts more weight on what’s going on in social media? Social media offers so much more opportunity for the general public to voice their opinions about brands, products, and companies, and their opinion of what is really relevant, than a meager link from a website does. Think of it this way: more people on the internet today participate in social media than own a website. Guess what? These people are actively telling Google, Yahoo and MSN what they think is relevant by rating, commenting and participating in social media.

No fake profile, added friends, or use of your “perceived power” is going to be able to easily change this once it comes.

Remember, those discussions happening in social media channels happen whether you are actively engaged in the conversation or not. So wouldn’t your time be better spent actively involving yourself in those conversations? Or would it be better spent adding a ton of fake friends on MySpace, conning a top Digg user into submitting your link in exchange for Wikipedia article help, or creating fake profiles on StumbleUpon?

Use social media for true customer engagements, be transparent, be honest, be who you are. People want to interact with real people from companies, they want Truth in Marketing. They want to tell stories about how great your employees are, what kind of heart you have and how you care about your customers and audience. The audiences couldn’t give a damn about your links, or how many sock puppet accounts you have.

Maybe when the search industry stops thinking of links first with social media, it will be taken a bit more seriously in the online marketing arena.

Yahoo on Web Mining and Improving the Quality of Web Sites

by Bill Slawski

A successful web site is one that fulfills the objectives of its owners and meets the expectations of the visitors that it was created to serve.

This is true of ecommerce web sites, news and informational sites, personal web pages, and even search engines. And, it’s a topic that even the search engines are exploring more deeply. A recent patent application from Yahoo tells us that:

The Web has been characterized by its rapid growth, massive usage, and its ability to facilitate business transactions. This has created an increasing interest for improving and optimizing websites to fit better the needs of their visitors. It is more important than ever for a website to be found easily on the Web and for visitors to reach effortlessly the content for which they are searching. Failing to meet these goals can mean the difference between success and failure on the Internet.

User Query Data Mining and Related Techniques, (US Patent Application 20080065631), by Ricardo Alberto Baeza-Yates and Barbara Poblete.

The patent filing discusses how the queries people use, collected from a site’s own search box (if one exists) and from the search engines bringing people to the site, can provide helpful information about how people use that site.

The collection of this kind of information is often referred to as Web Mining, and looking closely at the words people use to find information on a site can tell us something about the actual information needs of those visitors.

Search engines have studied searchers’ queries mostly to try to make search engines work better, but looking at the words people use to find a site, and to search within it once they have found it, could help to make the web sites themselves better.

The abstract of Yahoo’s patent filing notes:

Methods and apparatus are described for mining user queries found within the access logs of a website and for relating this information to the website’s overall usage, structure, and content. Such techniques may be used to discover valuable information to improve the quality of the website, allowing the website to become more intuitive and adequate for the needs of its users.

One tool that many site owners use on their pages is an analytics program, though often analytics are looked at just to see how much traffic is coming to a site, and possibly to determine which words people are using to find it. Analytics programs can play a stronger role, helping people with web sites improve the experience of people visiting their pages, and the success of their sites.

Web Mining

The Yahoo patent is interesting in that it focuses less on how a search engine works, and more on how the owners of web sites can use the process of Web mining to discover patterns and relations in Web data. Web mining can be broken down into three main areas:

  • Content mining,
  • Structure mining, and
  • Usage mining.

These relate to three kinds of data that can be found on a web site:

  • Content — the information that a web site provides to visitors such as the text and images and possibly video and audio, that people see when they come to a site.
  • Structure data — this is information about how content is organized on a site, such as the links between pages, the organization of information on pages, the organization of the pages of the site itself, and the links to pages outside of the site.
  • Usage data — this information describes how people actually use the site, and may be reflected in the access log files of the server that the site is on, as well as data collected from specific applications on the site, such as people signing up for newsletters or registering with a site and using it in different ways.

Knowing which pages people visit and which pages people don’t can be helpful in figuring out whether there are problems with a site. Such patterns can uncover a need to rewrite pages, to reorganize links, or to make other changes.

Mining User Queries

Understanding the query terms used to find a site, and to search on the site, can help improve the overall quality of that site. Yahoo’s approach would be to create a model for understanding how people are accessing a site and navigating through it:

According to specific embodiments of the invention, a model is provided for mining user queries found within the access logs of a website, and for relating this information to the website’s overall usage, structure, and content. The aim of this model is to discover valuable information which may be used to improve the quality of the website, thereby allowing the website to become more intuitive and adequate for the needs of its users.

This model presents a methodology of analysis and classification of different types of queries registered in the usage logs of a website, including both queries submitted by users to the website’s internal search engine and queries from global search engines that lead to documents on the website. As will be shown, these queries provide useful information about topics that interest users visiting the website. In addition, the navigation patterns associated with these queries indicate whether or not the documents in the site satisfied the user’s needs.
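As a rough illustration of the two query sources the patent distinguishes, here is a small Python sketch that mines both from a web server’s access logs. It’s a toy built on my own assumptions rather than the patent’s specifics: Apache-style combined logs, an internal search page at /search, and a q= query parameter.

    # Toy query mining from access logs: external queries arrive in the
    # referrer field, internal queries in the request URL of the site search.
    # The log format, /search path, and q= parameter are all assumptions.
    import re
    from collections import Counter
    from urllib.parse import urlparse, parse_qs

    LOG_LINE = re.compile(
        r'"(?:GET|POST) (?P<url>\S+) HTTP/[\d.]+" \d{3} \S+ "(?P<referrer>[^"]*)"'
    )

    external, internal = Counter(), Counter()

    def queries_in(url, param="q"):
        """Return the values of the given query parameter in a URL."""
        return parse_qs(urlparse(url).query).get(param, [])

    with open("access.log") as log:
        for line in log:
            match = LOG_LINE.search(line)
            if not match:
                continue
            # Internal search: a request to our own site-search page.
            if urlparse(match.group("url")).path == "/search":
                for q in queries_in(match.group("url")):
                    internal[q.lower()] += 1
            # External search: a search-engine referrer carrying a query.
            referrer = match.group("referrer")
            if "google." in referrer or "yahoo." in referrer:
                for q in queries_in(referrer):
                    external[q.lower()] += 1

    print("top external queries:", external.most_common(5))
    print("top internal queries:", internal.most_common(5))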

Queries uncovered might be related to categories drawn from such things as navigational information found on a site.

Traffic through the site could tell someone using this invention how effective the site was at meeting the information needs of the people using certain queries. It could also provide suggestions for:

  1. The addition of new content
  2. Changes or additions in words found in anchor text in links
  3. New links between related documents
  4. Revisions to links between unrelated documents

Information Scent

Visitors to a site will follow links whose wording gives them some level of confidence that the information they’re looking for will be on the other side of those links (the Right Trigger Words, as User Interface Engineering’s Jared Spool calls them). Likewise, when someone searches at a search engine and sees a page title and a snippet of text for a site in the search results, the words used in the title and snippet may persuade them to visit the page. This is true both for search results from a search engine, and for search results from an internal search on a specific site.

Understanding what kind of information is being sought with a specific query, and how the words used in search results, on web pages, and in links to other pages serve that search, may provide some insight into making those search results, those pages, and that anchor text better.

The patent application describes how pages and the queries used to reach them can be classified based upon how they are typically used by a visitor – from external searches through a search engine, from internal searches through a web site search, or through navigation on the site itself.

It also classifies queries as successful or unsuccessful, based upon things such as whether someone visited a page in response to the display of a search result showing the page, or if they followed other links on pages visited to explore a site in more depth.

Seeing how pages are typically reached on a site in response to certain queries, and seeing which queries are successful and unsuccessful in bringing people to information that they want to find can help a site owner make positive changes to a site.
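As a hedged sketch of that successful/unsuccessful split: assuming the log hits have already been grouped into per-visit page trails tagged with the query that brought the visitor in, one simple proxy (my assumption, not the patent’s exact test) is whether the visitor explored beyond the landing page.

    # Toy success/failure tally per query. Treating "viewed more than one
    # page" as the success signal is an illustrative assumption.
    from collections import defaultdict

    # (query, pages viewed during the visit) -- illustrative data
    visits = [
        ("university admission test", ["/admissions", "/admissions/test", "/apply"]),
        ("university admission test", ["/admissions"]),
        ("new student application", ["/apply"]),
    ]

    outcomes = defaultdict(lambda: {"successful": 0, "unsuccessful": 0})
    for query, pages in visits:
        key = "successful" if len(pages) > 1 else "unsuccessful"
        outcomes[query][key] += 1

    for query, counts in outcomes.items():
        print(query, counts)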

Example

The patent application provides an example using a portal targeted at university students and future applicants.

It focuses upon exploring how effective the site is when searchers use the queries “university admission test” and “new student application”, both on external search engines and on the site’s own search. Two initial reports evaluated how effective the site was before making any changes. Twenty of the top suggestions generated from the model described in this patent application were then incorporated into the site’s content and structure:

The suggested improvements were made mainly to the top pages of the site and included adding Information Scent to link descriptions, adding new relevant links, and suggestions extracted from frequent query patterns, and class A and B queries.

Other improvements included broadening the content on certain topics using class C queries, and adding new content to the site using class D queries. For example the site was improved to include more admission test examples, admission test scores, and more detailed information on scholarships, because these were issues consistently showing up in class C and D queries.

The “class C” queries mentioned are ones where there was very little information available on the pages of the web site. The “class D” queries were ones for which there was no information available on the site.

One significant result of these changes was an increase in traffic from external search engines of more than 20%, due to improvements in content and link text.

Conclusion

It’s interesting that a search engine would apply for a patent that explores how to use data mining to improve the quality, content, and navigation of a web site. It’s difficult to tell what Yahoo might do with the method described in this patent application – whether they will only use it internally, offer it to others for a fee, or offer it for free.

Many of the concepts described in this patent application are ones that site owners can presently use to improve how well their site meets their objectives, and the objectives of people visiting their pages.

Understanding the terms that people will try to use to find your pages, and the words and concepts that they expect to see on the pages of your site can make a difference in how successful your site may be.

Using analytics tools to understand how visitors who use certain queries will explore your pages and navigate from one page to another can provide even more value to both searcher and site owner, by pointing out changes that can be made to improve the experience of those visitors.

And those changes may just lead to more visits from search engines.

Google Says Users Won’t be able to Tell Paid Ads from Natural

by Jim Gilbert

Scott Morrison of Dow Jones Newswires reports a top Google executive (Tim Armstrong, Google’s North American president for advertising and commerce) as saying:

“Speaking at the Bear Stearns Media Conference in Palm Beach, Fla., Armstrong said Google’s advertising platform will evolve over time so that it won’t distinguish between search and display ads.”

Anyone care to comment on what the heck that means?

Advances in Crawling the Web

By Bill Slawski

There are three major parts to what a search engine does.

The first is crawling the Web to revisit old pages and find new pages. The second is taking those crawled pages and extracting data from them to index. The third is presenting pages to searchers in response to search queries.

There’s been some interesting research published recently on the first of those parts.

Crawling Challenges

Crawling the Web to discover new pages, and to identify changes in pages that a search engine already knows about, can be a challenge for a search engine. The major issues that search engines face in crawling sites involve:

  • How many pages they can crawl without becoming bogged down,
  • How quickly they can crawl pages without overwhelming the sites that they visit, and
  • How many resources they have to use to crawl and then revisit pages.

A search engine needs to be careful about how it spends its time crawling web pages, and about choosing which pages to crawl, to keep these issues under control.

A recently published academic paper describes this important aspect of how a search engine works, the Web crawl, in more detail than most papers that have been published on the subject before.

Enter IRLBot at Texas A&M

The Department of Computer Science at Texas A&M University has been running a long-term research project known as IRLbot, which “investigates algorithms for mapping the topology of the Internet and discovering the various parts of the web.”

In April, researchers from the school will be presenting some of their recent research in Beijing, China, at the 17th International World Wide Web Conference (WWW2008).

The title of their presentation is IRLbot: Scaling to 6 Billion Pages and Beyond (pdf), and the focus of the paper is this primary function that a search engine performs – crawling the Web and finding new Web pages.

Their research describes some interesting approaches to finding new pages on the Web, handling web sites with millions of pages, while also avoiding spam pages and infinite loops that could pose problems to web crawlers.

In a recent experiment they performed that lasted 41 days, their crawler IRLbot ran on a single server and “successfully crawled 6.3 billion web pages” at an average rate of approximately 1,789 pages per second. This is a pretty impressive feat, and it’s even more impressive because of some of the obstacles faced while finding those pages.

Problems Facing Crawling Programs

Politeness

One challenge facing Web crawling programs is that they shouldn’t ask for too many pages from the same site at a time, or they could use up too many of the site’s resources and make it inoperable. Keeping that from happening is known as politeness, and search crawlers that aren’t polite often find themselves blocked by site owners, or complained about to the internet service provider hosting the crawler.
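A minimal sketch of what politeness can look like in code: a per-host rate limiter that spaces out requests to any one site. The two-second delay and the fetch callback are my own placeholders; production crawlers (IRLbot included) budget far more adaptively.

    # Per-host politeness: enforce a minimum delay between requests
    # to the same host. CRAWL_DELAY is an assumed, illustrative value.
    import time
    from urllib.parse import urlparse

    CRAWL_DELAY = 2.0  # seconds between hits to any one host
    last_hit = {}      # host -> time of our last request to it

    def polite_fetch(url, fetch):
        host = urlparse(url).netloc
        wait = last_hit.get(host, 0) + CRAWL_DELAY - time.monotonic()
        if wait > 0:
            time.sleep(wait)  # back off rather than hammer the host
        last_hit[host] = time.monotonic()
        return fetch(url)  # fetch is whatever download function you use

    # usage: polite_fetch("http://example.com/a", fetch=lambda u: u)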

URL Management

As a crawling program indexes a site, it needs to pay attention to a file on the site known as robots.txt, which provides directions about pages and directories that the crawling program shouldn’t visit. The program also needs to track which pages it has seen, so that it doesn’t try to crawl the same pages over and over again.
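Here’s a sketch of those two bookkeeping duties using only Python’s standard library. The user agent string and URLs are placeholders, and a real crawler would need disk-backed structures once the set of seen URLs outgrows memory.

    # Honor robots.txt and skip already-seen URLs before fetching.
    from urllib import robotparser

    USER_AGENT = "example-crawler"  # placeholder name

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    seen = set()

    def should_crawl(url):
        if url in seen:
            return False  # already visited or queued
        if not rp.can_fetch(USER_AGENT, url):
            return False  # disallowed by robots.txt
        seen.add(url)
        return True

    print(should_crawl("http://example.com/page"))   # True (if allowed)
    print(should_crawl("http://example.com/page"))   # False: seen already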

Avoiding Spam While Crawling

The crawling process described in the paper also tries to keep the crawling program from accessing pages that are more likely to be spam. If a crawling program spends a lot of its time on spam pages and link farms, it has less time to spend on sites that may be more valuable to people searching for good results at search engines.

One key to the method the research team used to determine how much attention a site should get from their web crawler was the number of legitimate links into the site from other sites, which is what they refer to as domain reputation.
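As a toy illustration of tying crawl effort to domain reputation (the paper’s actual budgeting mechanism is considerably more elaborate), one could grow a domain’s crawl budget with the number of distinct outside domains linking to it:

    # Toy crawl budget: more distinct external linking domains -> more
    # pages we are willing to crawl. Data and formula are illustrative.
    in_links = {
        "good-news-site.com": {"cnn.com", "bbc.co.uk", "nytimes.com"},
        "link-farm.example": {"link-farm.example"},  # links only to itself
    }

    def crawl_budget(domain, base=10):
        # Count supporting domains other than the site itself.
        supporters = {d for d in in_links.get(domain, set()) if d != domain}
        return base * (1 + len(supporters))

    for domain in in_links:
        print(domain, crawl_budget(domain))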

Why This Paper is Important

The authors of the paper tell us that there are:

…only a limited number of papers describing detailed web-crawler algorithms and offering their experimental performance.

The paper provides a lot of details on the crawling process and the steps that the Texas A&M researchers took that enabled them to index multi-million page web sites, avoid spam pages, and remain “polite” while doing so. It explores the experiments that they conducted to test out ideas on how to handle very large sites, and crawls of very large amounts of pages.

They conducted their experiment using only a single computer. The major commercial search engines have considerably more resources to spend on crawling the web, but the issues involving managing which pages they choose to index, being polite to sites that they visit, and avoiding spam pages are problems that commercial search engines face too.

Learning about how search engines may crawl pages can help us understand how a search engine might treat individual sites during that process. If you are interested in learning about the web crawling process in depth, this paper is a good one to spend some time reading.

The Web in the World – Looking for URLs Offline

By Bill Slawski

If you run a business and own a web site, it’s not a bad idea to include the address of your site on your invoices, your business cards, within the letterhead of your stationery, and on other paperwork that comes out of your office. You may even want to include that URL on shipping boxes, on your business sign, and in other places where the address might be visible.

Every few months, I like to take a walk through the small town I live in, with a pen and notepad in hand, and look for web addresses in places that I haven’t seen them before. On a normal day, I don’t think that I pay too much attention to how the Web and the world interact on a stroll through town, but I see some surprises when I start looking more closely.

My town is a University town, and most of the students are away on winter break, which made this morning quieter than it is when school is in session.

I start searching for URLs as soon as I get out of my front door, and the first ones that I see are in a nearby parking lot. There’s a Marine recruiting station close by, and a number of recruiters’ cars are in the lot. Several of them had the Web address “marines.com” and “1-800-marines” written across their sides and backs.

As I walk past them, I decide to stop for a cup of coffee at one of the local coffee shops. Next to the credit card logos on the door of the shop is a small sign advertising a University meal plan. Students can pay for a card which they can use to buy food at different eating establishments in town, and these signs let them know which ones accept that meal card. The sign also acts as an advertisement, and the URL is shown so that students can find out more about the program.

I grab the local paper while I’m getting my coffee, and start looking through it for Web addresses. A front page banner ad, below the fold, looks more like it was designed for a web portal than news print. Appropriately, it advertises a web site.

Turning through the pages of the newspaper, I’m starting to see ads that don’t carry a street address or a phone number – just a URL. I wonder how many of them are actually local businesses, and how many are located somewhere else. The advertisements are for items that could be anywhere in the world.

I finish looking through the paper and leave the coffee house onto Main Street, when a bus passes by. I expect a URL on the bus, and don’t see one. I’ve seen their schedules online, so I’m surprised that they don’t include their web address next to their name.

A sticker from a local band, pasted on a utility pole, catches my attention, and it provides a URL for their MySpace page. Another sticker, sloppily attached to a mail pickup box a little further down the street, is for the state National Guard, and shows their toll-free phone number, but not a web address.

A sign at the post office provides a list of dates that the office will be closed, but tells us that “We’re always open at usps.com.” I’ve been wondering why they didn’t choose the name “mail.gov.”

A paper company truck is stopped on Main Street to make a delivery, and the side of the truck is a billboard for their goods. Under the sentence where they tell us that they’ve been around since 1919 is the URL for their business.

As I return home, I notice that I’ve received my mail, and on the back of one of the envelopes, I see a message that I can pay my bills online, along with a URL. I’m not sure if I’ve seen an envelope with both a web address and a call to action like that before.

I think I’ve seen more URLs on this walk than I’ve seen in previous trips through town. There are a few on business signs, and on posters in store windows, and in notices posted on the community bulletin board. Next time I try this, I’m going to have to take my phone with me, and see how well those show up on a screen for handhelds.