URL Shorteners That Frame Websites Hijack Your Content

By Liana “Li” Evans

With the rise of Twitter and its limit of 140 characters (250 if you turn off JavaScript), every character counts when it comes to maximizing the space you have to get your message across. With that in mind, URL shorteners are cropping up all over the place. There are some great URL shortening services, such as Tweetburner, Bit.Ly, TinyURL and Cli.gs, that will actually track your click-throughs.

Then we have another new crop of URL shorteners appearing: services that “frame” your content underneath their own branded bar. Digg is, of course, the biggest and best-known implementer of this kind of bar, but several others do it as well; Ow.Ly and BurnURL are just two. So what’s the big deal, and why all the fuss? What could be wrong with what Digg is doing? After all, they are still sending you traffic, right? Well, to start with, some of these services have the potential to play havoc with analytics code. Then there’s the whole “hijacking” of your URL, which is likely one of the things surfers on the internet are trained to remember. This is essentially hijacking your content for their own benefit: increasing the number of uses of their service.

What’s the difference between what Cli.gs does and what Digg does? Cli.gs issues a 301 redirect straight to your content, so when people click on a Cli.gs-shortened URL they land on your page and see the true URL in the address bar. Digg instead puts your content under its bar, with its own URL, so the visitor never, ever sees your full URL. Sure, some of these services let people click out of the bar and show a truncated URL to click on, but that is certainly not the same as someone seeing your site’s URL in the address bar.
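
To make the distinction concrete, here is a minimal Python sketch (illustrative only, not any service’s actual code) of the two behaviors: a redirect-style shortener answers with an HTTP 301 pointing at the real URL, while a framing shortener serves its own page with your content boxed into a frame beneath its bar.

# Minimal sketch of the two shortener behaviors; all URLs are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer

DESTINATION = "http://example.com/your-article"  # the publisher's real URL

class RedirectShortener(BaseHTTPRequestHandler):
    def do_GET(self):
        # Cli.gs-style: the browser lands on the real URL, so the visitor
        # sees and can bookmark the publisher's address.
        self.send_response(301)
        self.send_header("Location", DESTINATION)
        self.end_headers()

class FramingShortener(BaseHTTPRequestHandler):
    def do_GET(self):
        # Framing style: the shortener's URL stays in the address bar and
        # the publisher's page is boxed into a frame beneath a branded bar.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(
            b'<html><body><div>branded bar</div>'
            b'<iframe src="http://example.com/your-article"'
            b' style="width:100%;height:90%"></iframe></body></html>'
        )

if __name__ == "__main__":
    # Swap in FramingShortener to see the other behavior.
    HTTPServer(("localhost", 8000), RedirectShortener).serve_forever()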

What happens when visitors who entered through Ow.Ly, BurnURL or Digg’s bar want to bookmark your site? Their shortened URL is what gets bookmarked, not your site’s URL, whether they are bookmarking in their browser or on a social bookmarking site like Delicious or StumbleUpon. Again, these services are hijacking your content by keeping their framed bar’s URL in the address bar instead of 301 redirecting like the other URL shortening sites do!

Sure, the URL shorteners that put frames around content can say, “Oh, we make it easy to share with our pull-down menu.” Well, here’s the thing: people are already “trained” to bookmark or stumble through the toolbars they have installed in Firefox or IE. That is where they are going to go first, not to a pull-down on a frame. It’s tough to retrain people who have been stumbling or bookmarking for well over two years to use some “framed bar” from a new service that isn’t familiar to them; they are going to go with what they trust.

Then let’s look at the whole “oh, I found this, I want to blog about it” piece of the marketing and social media puzzle. Someone who finds some great content via one of these framing URL shortening services and isn’t quite tech savvy pulls the shortened URL from the address bar. Guess what: your site doesn’t get the credit for that link, the shortened URL does. Again, this is basically hijacking your content.

These URL shorteners claim they make it easier for your content to go viral. Personally, in my honest opinion, that’s a load of bunk. It isn’t the tool that makes content go viral; it’s the perceived value of the content itself. Then stop and think: what is the sense of your content going viral if the visitors viewing it can’t even see your URL? What is the sense if they can’t share it properly with their own communities like StumbleUpon, Delicious or Magnolia? Your URL is how people remember you, and a lot of sites don’t put their URL in their graphics or headings; they rely on it always being in the address bar.

I’ve been having discussions on Twitter about this, and one person claimed I was afraid of them stealing my “Google Juice.” I had to suppress a laugh at that term. I guess because I came into the industry as an SEO, some people will assume I “want my Google Juice,” darnit! It’s not about Google Juice at all. At the end of the day, this is about who owns the content. The publisher owns the content, not these framed URL shortening services that are hijacking URLs. It’s about the content’s perceived value to the visitor, and if the visitor perceives its value to be great, shouldn’t the original publisher get that credit, not these framing URL shorteners?


Are You Blogging or Doing Social Media for SERPs & Links?

A lot of companies hear a lot about the social media space. Most of what they hear revolves around blogs, Digg and Facebook, and immediately they think, “I have to be there!” Whether it’s because it’s the newest fad, because their competition is doing it, or because they’ve been shown it can win them SERPs or, better yet, links, companies often never stop to look beyond the shiny pretty wrapper of social media at what’s really involved in heading down the social media path. Generally, that path ends with them thinking social media has failed them. Why? The major reason is entering the space for the wrong reason, like acquiring links or getting more footholds in the Search Engine Results Pages (SERPs).

Social Media Requires Resources

Just because a service costs nothing to sign up for, as with WordPress, Blogger, Facebook or Digg, thinking of it as free is a misconception. Companies need to stop and think about what it will cost them in their employees’ time and effort to manage a social media strategy. It takes time to grow a powerful account on Digg, if that’s the way you want to go. It takes not only time but planning to create a blog that will last. When working forums, employees need to take time out to respond to messages and threads and to pose new questions.

Companies looking to outsource this effort will still have to pay someone to do it, and they could also pay in bigger ways. Having someone, or some company, answer your responses for you, make friends for you and manage your social media profiles for you can literally turn into a nightmare if it’s discovered you are not being transparent about it. Trying to automate your social media efforts to make them more efficient and less time consuming can likewise point you toward a public relations nightmare with your audience. If an audience feels you aren’t being transparent (upfront about your actions, willing to listen and have a conversation), you’ve lost their trust, and it’s very tough to get it back again.

Social Media Requires Listening

There’s no way around this. In order to understand what your target market wants and how you can provide them value, companies have to take the time to stop and listen to what their audiences are saying and talking about in social media circles. Coming in and trying to slam marketing or advertising down their throats, or just starting to blog about your industry, will not get you very much, just a whole lot of crickets chirping. Audiences want to know and feel that they are being heard, that their experiences matter, and that what they share with others can somehow help, even in a small way. The true rewards in the social media spaces aren’t coupons, special discounts or freebies. People feel rewarded when they can help better a product, share a new way to use a service or help create something. Feeling like they are part of something is one of the true rewards of social media, and in order to give your audience that opportunity, you have to listen to understand what they want to be part of.

Social Media Requires Conversing

Just like with listening, there’s no way around this either, not if you want a successful venture into social media. You can’t just lurk. Hiding out in forums, seeing what people are saying about you, then issuing press releases to “correct the wrongs” or launching some other program to “fix what’s misunderstood about our company/product/brand” doesn’t work. A lot of times, by just lurking and not getting involved in the conversation, companies can totally misinterpret what the audience is really saying.

By taking the time to speak to the audience and become part of the group, you build a trust that no press release will ever garner you. You build relationships no article in the news media will ever let you create. You touch people on a more personal level, and they in turn can relate that personal story to all of their friends, and so on. Conversing in the social media realm also puts a more human touch on your message or your marketing efforts. People want to connect to people, not buildings, marketing pieces of paper, websites, systems or gadgets (although iPhone users can argue differently), and you connect through holding conversations.

Social Media Requires Providing Value

Just putting up a blog that regurgitates your press releases, articles from your site or some boring piece about another product launch doesn’t provide value to your audience. That’s all about you and what you perceive value to be. Audiences perceive value totally differently. Give them a new or interesting way to use your product or service that they might not have thought about, or better yet, ask one of them to help create the piece about the new way to use the product. Now that’s value an audience can relate to. Don’t just write about it, either; shoot photos or even a video and create even more value.

If you stop and first think, “What will my audience find valuable in this content?” rather than “How many Diggs will I get?”, your content will turn out a lot better. By focusing on the value you can provide, you put the focus squarely on your audience and off of you. In social media it’s not at all about you; the value the customer or audience gets from you is the most important factor.


Social Media Requires Passion

Considering building a blog because it will get you some “link juice”? Want to get posts out there because they’ll rank for certain long-tail keyword terms? It may seem like a great idea at first, but unless you’ve got someone who’s passionate about your blog’s subject and willing to be social in the community beyond the blog posts, your blog will go nowhere. Blogging is about sharing your passion for something with a community, whether it’s your life, your hobby, what your company does or the industry your company is in; you have to have a writer who loves to write about it and wants to talk to others about it. The same extends into other forms of social media. Participating in forums? Having a person passionate about helping people understand your company, product or industry goes a long way toward building relationships and trust. Someone who is out there just because “it’s their job” or because they were “mandated” to do it will do you more harm than good.

Outsourced blogging can also shine right through. If the company you choose to “ghost write” your blog isn’t deeply involved in your industry, a lot of your posts will come off flat, probably overly SEO’d, and read like a true marketing piece. Look at successful company blogs like Nuts About Southwest, GM’s FastLane or even Bill Marriott’s mix of podcasting and blogging. All of these are wonderful examples of companies blogging not just about the company but about their industry, their employees and their customers. Asking you to buy their products or announcing a sale or a new pricing structure from their blog is the furthest thing from their minds, unless it’s something the audience has asked for.

The Reality of Social Media With Links & SERPs

It takes a lot of time and resources to be successful in social media, especially if your only end goal is getting links or SERP placement from it; those are natural byproducts of a truly good social media effort. What you never hear about some of these “overnight successes” is the many man-hours spent creating content of value to an audience, as well as being truly social (listening and conversing with that audience). Just because you’ve gone out and bookmarked your blog post, or posted a picture on Flickr or a video on YouTube, doesn’t mean you’ll be successful. There’s an entire realm of involvement here that companies need to take into account when planning their social media strategies. None of this really works unless you are being social on some level.

Profiles don’t gain “power” unless they are out socializing with the community: making friends, commenting, rating, adding media and so on. Just because you made a profile on MySpace or a page on Facebook doesn’t necessarily mean it will take up a spot in the SERPs anymore. Two years ago, yes; now, only if you’ve got an obscure name. The search engines are looking at different signals within the profiles to understand whether people find them relevant. Sure, they still look at links, but ratings, comments and interaction factors are now also weighted into the mix. If you create the “optimized” profile and just let it sit there, it’s not going to do you a whole lot of good.

In the end, you need to plan your social media strategies around other success factors, not how many links you gain or SERP spots you take up. If you do, the links and SERPs will come naturally because your efforts were successful in other ways. The links and the SERPs are just icing on the cake of a successful venture in social media.

Twitter Uses Microformats

While using Twitter this week, I realized their programmers had incorporated Microformats in the design! I noticed that my Operator Toolbar was responding to the Microformat content in the page, and making it available for me to export.

As you can see from my Twitter profile page, Operator has found Contacts and an Address available in the page. Note the “Contacts” and “Addresses” buttons in the browser toolbar are not grayed out, but are showing as clickable.

The Contacts button is returning hCard Microformat info not just for me, but also for all 36 twitterers I follow whose icons appear on my profile page.

The Address is apparently supposed to be my personal profile’s address data, but it’s not interpreting quite right for me. I think this is because Twitter places the entire location content in the “adr” value, without breaking it out into street address, locality, region and country. Also, the hCard profile attribute isn’t included in the page’s HEAD tag.

Still, Twitter’s incorporation of the Microformats in the page code is exciting to me! Why? Well, I’ve written before about how incorporating Microformats can potentially be advantageous for the purposes of Local Search Optimization here and here. Essentially, this can help search engines to more easily interpret the address info on webpages and associate business information with webpages.
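
As a rough illustration of why this helps, here is a small Python sketch (standard library only) of how a crawler might lift hCard fields out of a page. The sample markup is hypothetical, showing the vcard/fn/adr class names the hCard format defines; it is not Twitter’s actual page source.

# Toy hCard extractor; the SAMPLE markup is invented for illustration.
from html.parser import HTMLParser

SAMPLE = """
<div class="vcard">
  <a class="fn url" href="http://twitter.com/example">Example User</a>
  <span class="adr">Philadelphia, PA</span>
</div>
"""

class HCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._pending = None  # hCard property awaiting its text

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for prop in ("fn", "adr"):
            if prop in classes:
                self._pending = prop

    def handle_data(self, data):
        if self._pending and data.strip():
            self.fields[self._pending] = data.strip()
            self._pending = None

parser = HCardParser()
parser.feed(SAMPLE)
print(parser.fields)  # {'fn': 'Example User', 'adr': 'Philadelphia, PA'}

Note how the whole location lands in a single “adr” value, which matches the interpretation problem described above.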

Yahoo! has been the fastest to adopt Microformat content, with Google following close behind. Yahoo’s SearchMonkey platform (which allows both Yahoo engineers and outside web developers to create applications that deliver special webpage listing presentations in Yahoo search results) has shown very clearly that Yahoo’s bot has been tooled to harvest Microformat data from webpages in order to make special use of it among the various signals they get from sites.

Does Google use Microformats? Yes and no. Google Maps has incorporated Microformats in the display of their search results so that users can access, export and use business and address data easily. However, it’s not yet entirely clear if they spider that same data from local web pages as part of the info they collect in categorizing and ranking pages. Google Maps engineers have told me off the record that they watch all types of data like this, and if there’s a significant number of sites using it, then they will also make use of it in their ranking “secret sauce”. With a high-profile site like Twitter incorporating Microformats, there’s yet more incentive for Google to adjust their data collection algos to incorporate hCard data if they have not already.

In the past week, I wrote an article on how small businesses can and are using Twitter for local marketing. Twitter’s incorporation of Microformats further underscores the value of the service as a component of Local SEO.

Are You an Online Marketer or Just an SEO?

At SES London, Mike Grehan headed up an Orion Panel with Jill Whalen, Brett Tabke, Chris Sherman, Kevin Ryan and Rand Fishkin. The panel took a look at “SEO Where To Next”. I’m not going to rehash what went on at the panel; if you’d like a rundown, Paul Madden did a good summation of it. What I want to discuss is our roles: are we just SEOs, PPC practitioners or affiliate marketers, or are we online marketers?

What prompts me to ask is how, in the past two years, the rise of “Web 2.0” (I really hate that term) has begun to affect how people consume content, media or anything else on the web. Focusing on just SEO, PPC or even affiliate marketing, we tend to rely very heavily on the search engines. Heck, we live, die and cry by what Google does. Take a look at the announcement by Matt Cutts about the canonical tag; the search marketing world went nuts!

But what happens when more and more surfers on the internet stop using the typical search engines to find what they need? Confused? Let me explain.

With the advent of the iPhone and its open application system, you no longer need to go to Google to find a nearby restaurant. That’s right: iPhone users have a bevy of applications that connect them to the internet without a browser and without going to Google and getting a map with a list of restaurants. OpenTable will tell you which restaurants near you have available seating; Urban Spoon does just about the same thing.

It’s not just the iPhone, either. AccuWeather just launched a nice little widget, much better than the dreaded desktop “WeatherBug” app (which adds those dreaded tracking cookies that Norton catches). Through its slick Adobe Air backend, AccuWeather tells me my weather without my opening a browser and typing in “Weather 19468”. There’s also a nice Adobe Air application called TweetDeck to help you manage Twitter, never having to open a browser to hold a relevant conversation.

Facebook and MySpace both have applications for the iPhone, BlackBerry and just about any smartphone out there. It’s becoming easier and easier to connect to the internet and the sites you want, and to find the things you want, without using a browser or even a search engine.

So with that in mind, I posed this question to the panel: with the ability to connect to the internet without a browser, is it still the SEO’s job to work with these types of applications? Only one panelist answered. Bravely, Rand Fishkin said he didn’t believe this was the SEO’s job.

I agree, to a point. If you define yourself as an SEO who just optimizes web pages or websites, then yes, he’s right.

But if you have an eye on the future of marketing and are watching which new technologies are emerging and being embraced in our world, I have to disagree with Rand: that view is really limiting. Businesses are going to have to move even beyond the typical web page for an online presence. Search engines aren’t just browser-based anymore; the OpenTable application demonstrates that to a “T”. As responsible online marketers, we have to look beyond just websites and Google at the entire online presence, and move beyond the thought that SEO means web-based search engines, because it doesn’t. So are we SEOs or online marketers, or perhaps both? I guess in the end it’s how you define “SEO”.

That leads me to wonder: is the holy grail of search, the “Google Killer”, just going to be the inevitable change in end-user habits? Interesting thought, isn’t it? 🙂

Local SEO Tip: Leveraging Categorizations To Promote Your Small Business

Many small businesses still rely upon their yellow pages listings to some degree for business referrals. A business’s listing data is not only found online in the internet yellow pages, but that same data feeds into the local search engines like Google Maps, Yahoo! Local, Live Local Search, and a myriad of other directory sites and local info sites.

A really simple way to increase your small or medium business’s (“SMB”) exposure on the internet is to ensure that your business listings are associated with as many apropos categories in online directories as possible.

While it seems a no-brainer, many SMBs neglect to check how various online sites have categorized them, and as a result they are displayed ineffectively and insufficiently throughout the internet.

Back when I worked for a major online yellow pages company, I recall how one of our data aggregators had attempted to automatically categorize a great many non-classified establishments by using words in the business name. While this worked to some degree, a bunch of businesses were put into the wrong category. For instance, they’d shoved all businesses with the word “garden” in the name into “Garden & Lawn Supply” categories, but there are a lot of restaurants with names like “China Garden” that got lumped in with them!

So, today’s Local Search Engine Optimization tip is easy: go out to the top online directory sites such as Superpages.com, Yellowpages.com, Yellowbook.com, Google Maps, Local.com, Yelp.com, DexKnows.com, Yahoo! Local, and any others you’d expect to find your business in, claim your listing(s), and check to see if they’re in the proper business categories!

Choosing Categories in Google Local Business Center

Many internet yellow pages allow you somewhere on the order of five category associations at no charge, so you should be able to update and enhance your listing in these services for free.

Try not to limit yourself to just two categories; if the directory site has a rich taxonomy, you may be able to find a number of categories that are exactly right for your business.

Imagine: if your company is listed in only one category, or in the wrong category, just fixing that categorization could increase your referral rate a few times over!

Duplication Solution Announced With Canonical Tag

Google, Yahoo! and Microsoft announced a joint agreement today at the SMX West conference in Santa Clara to support a new protocol which is intended to assist webmasters in reducing duplicate content issues on websites. All three are issuing blog postings about this, and Matt Cutts presented the new protocol in a session just a few minutes ago at SMX.

Matt Cutts explains the Canonical Tag at SMX West

This is a really exciting addition to the SEO’s toolbox! Duplicate content often occurs when webmasters accidentally create alternate URLs for the same content across their sites, and the larger the site, the more likely it is to have serious duplication issues. This was one of the most difficult issues I worked on when I was in charge of SEO for Superpages.com; nearly any site that uses dynamic URLs with querystrings to specify how content is delivered ends up with some level of duplication.

Here are just a few illustrative examples of the kinds of duplicate URLs I mean, all typically serving the very same page:
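
  • http://www.example.com/product.html
  • http://example.com/product.html
  • http://example.com/product.html?sessionid=12345
  • http://example.com/product.html?sort=price&view=grid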

The solution the search engines collaborated upon for canonical and duplicate content issues is very straightforward: a tag one can add within the HEAD section of a document:

<head>
<link rel="canonical" href="http://example.com/page.html"/>
</head>

Matt provided a number of caveats and advance clarifications about use of the tag:

  • It’s a hint to the search engines. Not a directive/mandate/requirement.
  • Far better to avoid dupes and normalize URLs in the first place.
  • If you’re a power user, exhaust alternatives first.
  • Does not work across domains.
  • DOES work across subdomains.
    (The example Matt gave was from Zappos’ new design subdomain: zeta.zappos.com vs. http://www.zappos.com)
  • Pages do not have to be identical.
  • Can one use relative / absolute URLs? Yes, but we suggest absolute!
  • Can you follow a chain of canonicals? We may, but don’t count on it.

Matt added a further disclaimer about how search engines may not be able to handle some extreme cases, so don’t push the envelope too much. He said:

  • Point to a 404?
  • Or create an infinite loop?
  • Or point to an uncrawled URL?
  • Or www/non-www conflict?
  • Search engines will do the best they can.

Then, he jokingly quoted Ghostbusters in this context: “Don’t cross the beams!”

"Don't Cross the Beams!" Ghostbusters

This whole protocol is really interesting and a great tool for webmasters to use. However, the caveats and the strong suggestion that webmasters try to fix duplicate content issues before resorting to the canonical tag make me prefer to solve such problems directly rather than rely on it. It’s good to have the option, though!


X-Men Origins: Wolverine & 20th Century Fox Miss the Online Marketing Buzz

This past weekend the internet was buzzing. What was it buzzing about? The trailer for the new Wolverine movie. The buzz wasn’t on mainstream news; it was on social networks, social news sites, video shares and forums, as well as social communication channels like Twitter.

The trailer hit theaters as a lead-in to Keanu Reeves’ movie, a re-adaptation of “The Day The Earth Stood Still”, with the first real big buzz coming Friday night. A smaller bit of buzz about the Wolverine movie came during Comic Con this year, where a slightly different trailer was shown.

So how did 20th Century Fox stumble out of the gate on this one? In several ways. As a marketer who’s well versed in online media, it frustrates me to no end that these big movie houses still just do not get online marketing in any shape or form.

What Happens When You Can’t Find The Website?

Let’s start with their website. Think you can find the official Wolverine website by typing in “Wolverine movie”? How about “Wolverine movie trailer”? How about using the official movie title, “X-Men Origins: Wolverine”? Nada, zippo, zilch. All throughout the weekend I tried; today I took screen caps. It’s nowhere in the top 10. Take a look below (click the thumbnails to get a larger view).

Wolverine Movie Google Search   Wolverine Movie Trailer Google Search   X-Men Origins Wolverine Google Search

X-Men Origins Wolverine Official Site Google Search

Their website is in Flash, totally and absolutely in Flash, with no content a search engine’s spider can read. The only thing it can read is the site’s title tag. Talk about being invisible to the search engines, and to the rabid Wolverine fans! It wasn’t until I typed in “X-Men Origins: Wolverine Official Site” that the movie site came up in Google. Now tell me, who the heck is going to type that in, other than me, bound and determined to find the official site?

Video, Video, Video… It’s Where the People Are At

Now let’s turn to the trailer itself. Talk about needing to loosen control! 20th Century Fox definitely needs to loosen its death grip if it isn’t even going to put the trailer on its own site the same day it debuts in theaters. They also need to realize that when their own site doesn’t rank for “Wolverine trailer”, they need to have it ranking elsewhere, or someone else will. On Friday, Saturday and early Sunday there was still no Wolverine trailer on the official site. What in the world is wrong with their marketing team? Granted, when I went out to look today, the trailer is now there.

People were clamoring to see this trailer without having to go see that movie. Let me tell you, as a comic book gal and an X-Men fan from my childhood years, I was clamoring too. I’ve been waiting, like the rest of the X-Men fans, since the last movie to get more. We all scour the internet for clues, tidbits and the slightest bit of information we can glean to satisfy our need.

That’s why looking for this trailer became an obsession, not just for me but for others as well, over the weekend. In Groundswell, author Charlene Li points out that 29% of the people in social media are watching videos other people have made. Google was pulling down more copies of the Wolverine trailer this weekend than you can imagine, but people kept searching for it on YouTube and any other video share they could find.

Wolverine Trailer Search on YouTube

The Fans Take Action… 20th Century Fox Misses Out

I did find it on another video share. I’m not going to say where, because I don’t want to see it taken down. I found the Comic Con trailer too, and what’s amazing about that video is that it captures people cheering during the trailer. Talk about fandom! Cheering during a trailer speaks volumes.

People were videoing the trailer with their phones during The Day The Earth Stood Still, uploading it to video shares and blogging about it. Why did they do this? 1) They love X-Men, Wolverine in particular. 2) They recognized that 20th Century Fox wasn’t filling their need or the need of others.

Nowhere on YouTube is there an official Wolverine, 20th Century Fox or Marvel channel for the movie. What 20th Century Fox doesn’t realize is that there is real buzz going on about this movie; one look at Google Insights tells the story. Just over this weekend, searches for Wolverine skyrocketed, and several terms are breakout terms with searches increasing over 1000% (I don’t get the big surge in Michigan, though). None of these terms is pushing traffic toward the official X-Men site either, and if you notice, none of them uses the long, arduous title that 20th Century Fox does.

click images for a larger view
Google Insights - Wolverine - Trend and Map Data  Google Insights - Wolverine - Search Trend Data

All of this amounts to a lesson in combining SEO and social media strategies when you are launching something big. When you understand online media and don’t keep a death grip on control of your brand, you can reap huge rewards. Unfortunately, 20th Century Fox is just making its X-Men and Wolverine fans not like it very much.

And by the way, yes, I did do a fangirl squeal when I saw Gambit. 😉 Ahhh, Remy LeBeau makes me weak!

Google Expands Details on VisualRank – PageRank for Pictures

In April of this year (2008), at the 17th International World Wide Web Conference in Beijing, China, Google researchers presented their findings on an experiment that they performed involving a new way of indexing images which relied to some degree on the actual content of the images instead of things such as text and meta data associated with those pictures.

Our First Look at VisualRank

The paper, PageRank for Product Image Search (pdf), details the results of a series of experiments involving the retrieval of images for 2,000 of the most popular queries that Google receives for products, such as the iPod and Xbox. The authors of the paper tell us that user satisfaction and relevancy of results were significantly improved in comparison to results seen from Google’s image search.

News of this “PageRank for Pictures” or VisualRank spread quickly across many blogs including TechCrunch and Google Operating System, as well as media sources such as the New York Times and The Register from the UK.

The authors of that paper tell us that it makes three contributions to the indexing of pictures:

  1. We introduce a novel, simple, algorithm to rank images based on their visual similarities.
  2. We introduce a system to re-rank current Google image search results. In particular, we demonstrate that for a large collection of queries, reliable similarity scores among images can be derived from a comparison of their local descriptors.
  3. The scale of our experiment is the largest among the published works for content-based-image ranking of which we are aware. Basing our evaluation on the most commonly searched for object categories, we significantly improve image search results for queries that are of the most interest to a large set of people.

The process behind ranking images based upon visual similarities takes into account small features within the images, while adjusting for such things as differences in scale, rotation, perspective and lighting. The paper shows an illustration of 1,000 pictures of the Mona Lisa, with the two largest at the center of the illustration being the highest ranked images in a query for “mona lisa”.

A Second Look at VisualRank

In the conclusion to PageRank for Product Image Search, the authors noted some areas that they needed to explore further, such as how effectively their system might work in real-world circumstances on the Web, where mislabeled spam images might appear, as well as many duplicate and near-duplicate versions of images.

A new paper from the authors takes a deeper look at the algorithms behind VisualRank, and provides some answers to the problems of spam and duplicate images – VisualRank: Applying PageRank to Large-Scale Image Search (pdf).

The new VisualRank paper also expands upon the experimentation described in the first paper, which focused upon queries for images of products, to include queries for 80 common landmarks such as the Eiffel Tower, Big Ben, and the Lincoln Memorial.

This VisualRank approach appears to still rely initially upon older methods of ranking images, which look at things such as text and meta data (like alt text) associated with those images, to come up with a limited set of images to compare with each other. Once it receives those pictures in response to a query, a reranking of those images takes place based upon shared features and similarities between the images.
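
As a toy illustration of that reranking step, the Python sketch below runs a PageRank-style random walk over an image similarity graph, so that images similar to many other well-connected images score highest. The similarity matrix is made up for illustration; the papers derive real similarities from local image descriptors.

import numpy as np

# Hypothetical pairwise visual similarities among four candidate images
# (symmetric, zero diagonal). Images 0-2 are near-duplicates of the same
# subject; image 3 is an outlier.
similarity = np.array([
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.1],
    [0.1, 0.2, 0.1, 0.0],
])

def visual_rank(sim, damping=0.85, iterations=50):
    n = sim.shape[0]
    # Column-normalize so each image spreads its "vote" by similarity.
    transition = sim / sim.sum(axis=0, keepdims=True)
    rank = np.full(n, 1.0 / n)
    for _ in range(iterations):
        rank = (1 - damping) / n + damping * transition @ rank
    return rank

scores = visual_rank(similarity)
print(scores.round(3))      # the outlier image gets the lowest score
print((-scores).argsort())  # most "central" images first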

Conclusion

Hopefully, if you have a website where you include images to help visitors experience what your pages are about in a visual manner, you’re now asking yourself how good a representation your picture is of what your page is about.

Being found for images on the web is another way that people can find your pages. And, the possibility that a search engine might include a picture from your page in search results next to your page title and description and URL is a very real one – Google has been doing it for News searches for a while.

How A Search Engine May Use Web Traffic Logs in Ranking Web Pages

By Bill Slawski

A newly granted patent from Yahoo describes how information collected in usage log files from toolbars, ISPs, and web servers can be used to rank web pages, discover new pages, move a page into a higher tier of a multi-tier search engine, increase the weight of links and the relevance of anchor text based upon those weights, and determine when a page was last changed or updated.

Yahoo search toolbar

When you perform a search at a search engine, and enter a query term to search with, there are a number of steps that a search engine will take before displaying a set of results to you.

One of them is to sort the results to be shown to you in an order based upon a combination of relevance and importance, or popularity.

Over the past few years, that “popularity” may have been determined by a search engine in a few different ways. One might be based upon whether or not a page is frequently selected from search results in response to a particular query.

Another might be based upon a count by a search engine crawling program of the number of links that point to a page, so that the more incoming links to a page, the more popular the page might be considered. Incoming links might even be treated differently, so that a link from a more popular page may count more than a link from a less popular page.

Problems with Click and Link Popularity

Those measures of the popularity of a page, based upon clicks in search results and links pointing to that page, are somewhat limited. It’s still possible for a page to be very popular and still be assigned a low popularity weight from a search engine.

Example

A web page is created and doesn’t have many links pointing to it from other sites. People find the site interesting and send emails to people they know about it. The site gets a lot of visitors, but few links. It becomes popular, but the search engines don’t know that; based upon the low number of links to the site and few or no clicks to the page in search results, a search engine may continue to consider the page one of little popularity.

Using Network Traffic Logs to Enhance Popularity Weights

Instead of just looking at those links and clicks, what if a search engine started paying attention to actual traffic to pages, measured by looking at traffic information from web browser plugins, web server logs, traffic server logs, and log files from other sources such as Internet Service Providers (ISPs)?

A good question, and it’s possible that at least one search engine has been using such information for a few years.

Yahoo was granted a patent today, originally filed in 2002, that describes how network traffic information could be used to create popularity weights for pages, rerank search results based upon actual traffic to those pages, and serve a number of other purposes.

Here are some of them:

  • The rank of a URL in search results might be influenced by the number of times the URL shows up in network traffic logs as a measure of popularity;
  • New URLs can be discovered by a search engine when they appear in network traffic logs;
  • More popular URLs can be placed into higher level tiers of a search index, based upon the number of times the URL appears in the network traffic logs;
  • Weights can be assigned to links, where the link weights are used to determine popularity and the indexing of pages, based upon the number of times a URL is present in network traffic logs; and,
  • Whether a page has been modified since the last time a search engine index was updated can be determined by looking at the traffic logs for a last modified date or an HTTP expiration date.

The patent granted to Yahoo is:

Using network traffic logs for search enhancement
Invented by Arkady Borkovsky, Douglas M. Cook, Jean-Marc Langlois, Tomi Poutanen, and Hongyuan Zha
Assigned to Yahoo
US Patent 7,398,271
Granted July 8, 2008
Filed April 16, 2002

Abstract

A method and apparatus for using network traffic logs for search enhancement is disclosed. According to one embodiment, network usage is tracked by generating log files. These log files among other things indicate the frequency web pages are referenced and modified. These log files or information from these log files can then be used to improve document ranking, improve web crawling, determine tiers in a multi-tiered index, determine where to insert a document in a multi-tiered index, determine link weights, and update a search engine index.

Network Usage Logs Improve Ranking Accuracy

The information contained in network usage logs can indicate how a network is actually being used, with popular web pages shown as being viewed more frequently than other web pages.

This popularity count could be used by itself to rank a page, or it could be combined with an older measure that uses such things as links pointing to the page, and clicks in search results.

Instead of looking at all traffic information for a page, visits over a fixed period of time may be counted, or new page views may be considered to be worth more than old page views.
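
The patent doesn’t publish a formula, but a speculative Python sketch of that kind of popularity weight might look like the following: count log entries for a URL, discount older views, and blend the result with an older link-based measure. The half-life and blending weight are invented for illustration.

import math
import time

def traffic_popularity(view_timestamps, half_life_days=30):
    # Sum of logged views for a URL, each discounted by its age so that
    # new page views count for more than old ones.
    now = time.time()
    decay = math.log(2) / (half_life_days * 86400)
    return sum(math.exp(-decay * (now - t)) for t in view_timestamps)

def combined_score(link_score, view_timestamps, alpha=0.5):
    # alpha balances the older link/click measure against observed traffic.
    return alpha * link_score + (1 - alpha) * traffic_popularity(view_timestamps)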

Better Web Crawling

Usually a search engine crawling program discovers new pages to index by finding links to pages on the pages that they crawl. The crawling program may not easily find sites that don’t have many links pointing to them.

But, pages that show up in log files from ISPs or toolbars could be added to the queue of pages to be crawled by a search engine spider.

Pages that don’t have many links to them, but show up frequently in log information may even be promoted for faster processing by a search crawler.

Multi-Tiered Search Indexes

It’s not unusual for a search engine to have more than one tier of indexes, with a relatively small first-tier index which includes the most popular documents. Lower tiers get relatively larger, and have relatively less popular documents included within them.

A search query would normally be run against the top level tier first, and if not enough results for a query are found in the first tier, the search engine might run the query against the next level of tiers of the index.

Network usage logs could be used to determine which tier of a multi-tier index should hold a particular page. For instance, a page in the second-tier index could be moved up to the first-tier index if its URL shows up with a high frequency in usage logs. More factors than frequency of a URL in a usage log could be used to determine which tier to assign a document.
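
A toy sketch of that promotion logic, assuming a two-tier index and an arbitrary view threshold (both invented here; the patent leaves the actual factors open):

from collections import Counter

PROMOTION_THRESHOLD = 1000  # hypothetical views per logging period

def assign_tiers(traffic_log):
    # traffic_log is an iterable of URLs, one entry per logged request.
    counts = Counter(traffic_log)
    tier1, tier2 = set(), set()
    for url, views in counts.items():
        (tier1 if views >= PROMOTION_THRESHOLD else tier2).add(url)
    return tier1, tier2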

Usage Logs for Link Weights

One use search engines have for link information is to determine the popularity of a document.

The number of incoming links to a page may be used to determine the popularity of that page.

A weight may also be assigned based upon the relationship between words used in a link and the documents being linked to with that link. If there is a strong logical tie between a page and a word, then the relationship between the word and the page is given a relatively higher weight than if there wasn’t. This is known as a “correlation weight.” The word “zebra” used in the anchor text of a link would have a high correlation weight if the article it points to is about zebras. If the article is about automobiles, it would have a much lower correlation weight.

Links could also be assigned weights (“link weights”) based on looking at usage logs to see which links were selected to request a page. As the patent’s authors tell us:

Thus, those links that are frequently selected may be given a higher link weight than those links that are less frequently selected even when the links are to the same document.

In other words, pages pointed to by frequently followed links could be assigned higher popularity values than pages with more incoming links that are rarely followed.

Link Weights Used to Determine the Relevance of Pages for Anchor Text

If a word pointing to a page is in a link (as anchor text), and the link is one that is frequently followed, then the relevance of that page for the word in the anchor text may be increased in the search engine’s index.

For example, assume that a link to a document has the word “zebra”, and another link to the same document has the word “engine”. If the “zebra” link is rarely followed, then the fact that “zebra” is in a link to the document should not significantly increase the correlation weight between the word and the document. On the other hand, if the “engine” link is frequently followed, the fact that the word “engine” is in a frequently followed link to the document may be used to significantly increase the correlation weight between the word “engine” and the document.
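
A rough Python sketch of that idea: the lift an anchor word gives to a page’s relevance is scaled by how often the link carrying it is actually followed in the logs. The scaling function is mine, purely for illustration; the patent doesn’t specify one.

def anchor_correlation_boost(base_weight, follows, total_follows):
    # Scale the anchor-text correlation weight by the link's observed
    # share of followed links pointing at the document.
    if total_follows == 0:
        return base_weight
    return base_weight * (1 + follows / total_follows)

# The frequently followed "engine" link lifts relevance far more than
# the rarely followed "zebra" link, though both appear in anchor text.
print(anchor_correlation_boost(1.0, follows=900, total_follows=1000))  # 1.9
print(anchor_correlation_boost(1.0, follows=5, total_follows=1000))    # 1.005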

Conclusion

This patent was originally filed back in 2002, and some of the processes it covers are also discussed in more recent patent filings and papers from the search engines, such as popularity information being used to determine which tier a page might occupy in a multi-tier search engine.

Some of the processes it describes have been assumed by many to be processes that a search engine uses, such as discovering new pages from information gathered by search engine toolbars.

A few of the processes described haven’t been discussed much, if at all, such as the weight of a link (and the relevance of anchor text in that link) being increased if it is a frequently used link, and decreased if it isn’t used often.

It’s possible that some of the processes described in this patent haven’t been used by a search engine, but it does appear that search engines are paying more and more attention to the user information they collect from places like toolbars and log files from different sources. This patent is one of the earliest from a major search engine to describe in a fair amount of detail how such user data could be used.

Another patent granted to Yahoo this week covers how anchor text can be used to determine the relevancy of a page for specific words. I’ve written about that over on SEO by the Sea, in Yahoo Patents Anchor Text Relevance in Search Indexing.