SEM ClubHouse

a Key Relevance blog






















Are You an Online Marketer or Just an SEO?

4:59 pm   -   February 27th, 2009

At SES London, Mike Grehan headed up an Orion Panel with Jill Whalen, Brett Tabke, Chris Sherman, Kevin Ryan and Rand Fishkin. The panel was taking a look at “SEO Where To Next”. I’m not going to rehash what went on at the panel; if you’d like a rundown, Paul Madden did a good summation of it. What I am looking to discuss is our roles: are we just SEOs, PPC practitioners or affiliate marketers, or are we online marketers?

What prompts me to ask this is how, in the past 2 years, the rise of “Web 2.0” (I really hate that term) has begun to affect how people consume content, media or anything on the web. Focusing on just SEO, PPC or even Affiliate Marketing, we tend to rely very heavily on the search engines. Heck, we live, die and cry by what Google does. Take a look at the announcement by Matt Cutts about the canonical tag: the search marketing world went nutz!

But what happens when more and more surfers on the internet stop using the typical search engines to find what they need? Confused? Let me explain.

With the advent of the iPhone and its open application system, you no longer need to go to Google to find a nearby restaurant. That’s right, iPhone users have a bevy of applications that connect them to the internet without a browser and without going to Google and getting a map with a list of restaurants. OpenTable will tell you which restaurants near you have available seating, and Urban Spoon does just about the same thing.

It’s not just the iPhone either. AccuWeather just launched a nice little widget, much better than the dreaded desktop “WeatherBug” app (the one that adds those dreaded tracking cookies that Norton catches). Through the slick Adobe Air backend, AccuWeather tells me my weather without opening a browser and typing in “Weather 19468”. There’s also a nice Adobe Air application called Tweetdeck to help you manage Twitter, so you never have to open a browser to hold a relevant conversation.

Facebook and Myspace both have phone applications for iPhone, Blackberry or just about any smart phone out there. It’s becoming easier and easier to connect to the internet and the sites you want, and to find the things you want without using a browser or even a search engine.

So with that in mind, I posed this question to the panel: with the ability to connect to the internet w/o a browser, is it still the SEO’s job to work with these types of applications? Only one panel member answered. Bravely, Rand Fishkin said he didn’t believe this was the SEO’s job.

I agree, to a point. If you define yourself as an SEO who just optimizes web pages or websites, then yes, he’s right.

But if you have an eye on the future of marketing and are seeing what new technologies are emerging and being embraced in our world, I have to disagree with Rand, because that view is really limiting. Businesses are going to have to move beyond just the typical web page for an online presence. Search engines aren’t just browser based anymore, and the OpenTable application demonstrates that to a “T”. As responsible online marketers, we have to look beyond just websites and Google. We have to look at the entire online presence, and move beyond the thought that SEO means web based search engines, because it doesn’t. So are we SEOs or Online Marketers, or perhaps both? I guess in the end it’s how you define “SEO”.

That leads me to wonder: is the holy grail of search, the “Google Killer”, just going to be the inevitable change of end user habits? Interesting thought, isn’t it? :)

Domain Moving Day the Key Relevance Way

2:21 pm   -   October 17th, 2008


by Mike Churchill

So, you’re gonna change hosting providers. In many cases, moving the content of the site is as easy as zipping up the content and unzipping it on the new server. There is another aspect of moving the domain that many people overlook: DNS.

The Domain Name System (DNS) is the translation service that converts your domain name (e.g. keyrelevance.com) to the corresponding IP address. When you move hosting companies, it’s like changing houses, if you don’t set up the Change of Address information correctly, you might have some visitors going to the old address for a while. Proper handling of the changes to DNS records makes this transition time as short as possible.

Let’s assume that you are changing hosting, and the new hosting company is going to start handling the Authoritative DNS for the domain. The first step is to configure the new hosting company as the authority. This is best done two or more days before the site moves to the new location.

What does “Authoritative DNS” mean?
There are a double-handful of servers (known as the Root DNS servers) whose purpose is to keep track of who is keeping track of the IP addresses for a domain. Rather than them handling EVERY DNS request, they only keep track of who is the authoritative publisher of the DNS information for each domain. In other words, they don’t know your address, but they tell you who does know it.

If we tell the Root level DNS servers that the authority is changing, this information may take up to 48 hours to propagate throughout the internet. If you change the authority without changing the IP addresses, then while visiting browsers are making requests during this transition, both the old authority and the new authority will agree on the address (so no traffic gets forwarded before you move).

Shortening the Transition
The authoritative DNS servers want to minimize their load, so every time they send out an answer to a request for the address of a given domain, they put an expiration date on it. This is called the “Time To Live”, or TTL. By default, most DNS servers set the domain TTL to 86400 seconds, which equals 1 day. Thus, when a visitor requests the address from the authoritative DNS, it returns the IP address and says “and don’t bother asking again for 24 hours.” This can cause problems during the actual transition, since the old address might continue to be accessed for a whole day after the address has changed.
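To make the effect of the TTL concrete, here is a minimal sketch of how a caching resolver might treat an answer. It is a toy model rather than how any particular DNS server is implemented. The class name `TtlCache`, the injectable clock, and the example address are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Toy resolver cache: remembers an answer until its TTL runs out."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def store(self, name: str, address: str, ttl_seconds: int) -> None:
        # Record the answer together with the moment it stops being valid.
        self._entries[name] = (address, self._clock() + ttl_seconds)

    def lookup(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: forget it so the caller has to ask the authority again.
            del self._entries[name]
            return None
        return address
```

With a TTL of 86,400 seconds, a cache like this keeps handing out the old address for up to a day after the records change. With 900 seconds, it asks again within 15 minutes.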

The Day Before the Move
Since the new hosting company is the authority, they can shorten the TTL to a much shorter value. We recommend that 15 minutes (900 seconds) is a good compromise TTL value during the transition time.

Moving Day
When you are ready to make the switch, have the new DNS servers change the IP information to the new address(es). Since the TTL was set to 15 minutes, very quickly the other DNS servers on the ‘net will be asking for the IP address of the domain. They will be provided with this info, and the switchover will happen much more quickly than if the authority had not changed. Once the new site is live and you have verified nothing is horribly wrong, you can change the TTL on the new DNS servers back to 1 day. If, on the other hand, something IS wrong with the new site, you can change the DNS back to the old IP address and within 15 minutes most if not all traffic should be back to the old servers. We also recommend changing the old DNS info to point to the new IP address as a precaution, but if you follow these steps, most of the traffic should have already transitioned to the new DNS servers.

A Bug in BIND
There is a bug in some versions of the BIND program (which executes the DNS translation). This bug will cause a DNS server to continue to ask the same authoritative DNS server for the info as long as it is willing to give it. To complete the transition cleanly, you need to turn the DNS records for the domain off in the old DNS servers. This will cause the old server to generate an error, which in turn will cause the requesting DNS server to ask the Root level servers for the new authority. Until you make this change, there is still a chance that some traffic will continue to visit the old servers.

Change of Address Forms
The USPS offers a Change of Address kit to help make moving your house easier. Below is the Key Relevance Change of Address Checklist that may make your site’s transition painless.

 

 

 

Key Relevance Domain Change of Address Checklist

2+ Days Pre-Move
Set up new DNS servers to serve up the OLD IP addresses

  • handle old subdomains
  • handle MX records

Once that is complete, Change Authoritative DNS records to point to new DNS servers

1 Day before move
On new DNS servers, shorten TTL to 15 min (900 sec)

Moving Day
On New DNS Servers

  • Change IP Addresses to new server
  • Change TTL to 1 day (86,400 sec), or whatever the default TTL is, once you are sure all is OK

On Old DNS Servers

  • Change IP Addresses to new server to catch DNS stragglers

2 Days Post Move (or when convenient)

  • Remove DNS records from OLD DNS servers (assuming they are still up)
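
The first checklist step (new DNS servers serving the OLD addresses, including subdomains and MX records) is easy to get subtly wrong. Below is a minimal sketch of a comparison you could run before switching authority. It assumes you have already exported the records from both servers into plain tuples. The function names and the record layout are hypothetical, and nothing here queries a real DNS server.

```python
from typing import Dict, List, Set, Tuple

Record = Tuple[str, str, str]  # (name, record type, value)


def record_differences(old: Set[Record], new: Set[Record]) -> Dict[str, List[Record]]:
    """Return the records that exist on only one of the two DNS servers."""
    return {
        "missing_on_new": sorted(old - new),
        "extra_on_new": sorted(new - old),
    }


def safe_to_switch_authority(old: Set[Record], new: Set[Record]) -> bool:
    # The checklist asks the new servers to answer everything the old ones do.
    return not (old - new)
```

If `missing_on_new` is not empty, a subdomain or MX record was probably forgotten, and mail or traffic for it could fail once the authority changes.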

I’m A Social Media Goody 2 Shoes … And Proud Of It

6:16 pm   -   July 24th, 2008

By Liana “Li” Evans

So yet another controversy when it comes to social media. I woke up to a plethora of IMs, Private Tweets and emails, to find out I’m a “Goody2Shoes”. I guess I could be upset, but I’m not. It’s par for the course in the world of Search these days. I could lash out at SEOMoz, because as many have pointed out, they let a post go to their own blog that attacks competitors (it has now been edited, but the point is that they originally let it out with the rather rude attacks on Matt Cutts, Lisa Barone and myself). I’ll let all those comments on the post speak for themselves. I’m sad that SEOMoz chose the path of inciting drama and discourse, but in the end that’s Rand’s business decision where to take his business, not mine. The drama gets the site links, and traffic, and I guess that trumps everything.

As for what Marty wrote about both Lisa Barone and myself, and his choosing to post it on SEOMoz rather than taking ownership for it on his own blog, I can only guess he really needed the larger audience for the message he wanted to convey. I read Marty’s apology, “Lii and Lisa are pillars in this community…”, and while I’d like to think it’s genuine, I was on the panel in Toronto where I heard his example of vanity baiting in his presentation, and I can’t help but wonder whether this might be another example of it.

As for my stance, I guess that when you take the position that fake profiles on StumbleUpon, and adding lots of “fake” friends to make yourself look more popular, are not a sound strategy for entering the social media space, you’ll undoubtedly get flak from those who find no flaws with this strategy. It happens. We all have different moral compasses, and we all have different things that drive us to be what we perceive as a “great marketer”.

When I was taken aback by the tactics my co-panelists in Toronto presented and posted about it, I wanted to make sure I wasn’t off base. I asked a few people who just use social media, without any knowledge of search or marketing, what they thought of these tactics. The first person I asked was the 14 year old son of a friend, who is an avid MySpace user. I asked him what he thought about adding all these famous people as friends, and his reply was just one word: “Lame”. I asked a friend I hang out at karaoke with the same question, and her reply was “that’s just stupid, why would you friend them unless you liked them?”

Next I asked a few people who I know use StumbleUpon for pure enjoyment, and who have no marketing background, what they thought about people building fake “avatars”, or “fictitious profiles”, on the service (btw, that’s a blatant violation of StumbleUpon’s TOS). My one friend from the EU said, “isn’t that illegal here?” (only illegal in the UK, sorry to say). Another said, “people do that? why in the world do they do that, that’s just crazy, and wrong, can’t they be honest?”

Now if everyday people (not marketers) are saying this about these strategies, why would I advise my clients to implement those strategies? I wouldn’t, and I wouldn’t promote doing this in a session at a major online marketing conference. I don’t see how creating fake profiles (or avatars) gains anyone any kind of ground in the end. When you are found out to be a fraud, all trust is lost.

What’s wrong with being honest? Really now, what’s wrong with starting a conversation, an honest one with real brand representatives, not one greeted immediately by fake/automated avatars that want to be my friend?

The only reason I can see why SEOs seem so fascinated with “gaming” social media by creating fake avatars and adding all these “non-friends” is power and links. That’s really not what social media is about, not to the people inside the communities. Only to SEOs does this seem to matter.

If advocating that in social media, marketers be real, engage honestly in conversations with an audience or their customer, is deemed as “Goody 2 Shoes”, well I’ll gladly, and proudly wear that badge.

*****

Now, I don’t know about you, but all this reference to Goody 2 Shoes, I really can’t get Adam Ant’s 80’s tune out of my head. :)

Photo/Comic Credit: ebbourg

Matt McGee Joins Our KeyRelevance Team

8:31 am   -   July 2nd, 2008

By Li Evans

We’ve got some exciting news here at the SEMClubhouse. Another great SEO mind has joined not just the clubhouse, but the KeyRelevance staff as well.

Matt McGee of the Small Business Search Marketing Blog, joined our team yesterday!

With companies needing to stretch their marketing dollars, adding Matt McGee, who specializes in working with clients to maximize the return on their online marketing investment, was a great expansion to the KeyRelevance team.

“Google’s Universal Search changed the rules of online marketing,” said Christine Churchill, President and CEO of KeyRelevance. “Search engine optimization still rules, but now it’s the tip of the iceberg of what we need to provide to clients. Online marketing now encompasses not only SEO and Pay Per Click, but blog and video optimization, local and mobile search, social media marketing, and much more. Matt’s specialized knowledge in these areas makes him a valuable addition to our already robust team.”

KeyRelevance’s online team includes well-known SEOs Bill Slawski (SEO By The Sea), Li Evans (Search Marketing Gurus), Jim Gilbert, and now Matt McGee.

“In addition to being a first class SEO, Matt is one of the most positive people I’ve ever known,” added Churchill. “He infuses the element of fun into the workplace.”

A seasoned marketer, Matt has been online since 1994. Matt is a regular speaker at major search industry conferences including Search Engine Strategies, Search Marketing Expo, and Small Business Marketing Unleashed. He is also a columnist at Search Engine Land. In his spare time, Matt runs the Small Business Search Marketing blog and one of the oldest and largest independent U2 sites on the Internet at @U2.com.

“KeyRelevance is one of the most respected companies in the search marketing industry, and it’s an honor to join a team with such impeccable credentials. I’ve known Christine, Li, and Bill for years as friends and peers. I’m excited to join them and the rest of the KeyRelevance crew,” says McGee.

What Is Social Media’s Purpose? Honestly, It’s Not About Links

9:27 pm   -   June 20th, 2008

By Li Evans

What do you use social media for?

Do you use it to gain links? How about power? Maybe to trick people into thinking you are someone else? Perhaps as leverage to con someone into doing something on another social media site for you?

At SES Toronto I was on the Social Media Success panel. I took this panel very seriously; I wanted to demonstrate how companies are using social media and creating their own success stories. The companies I chose to highlight wanted active conversation, true audience engagements and honest reviews, and because they took that approach they had incredible success. I believe with every ounce of my being that social media is about conversations and sharing. I have a huge issue with applying shady link acquisition tactics, power manipulation and common trickery to social media.

There are people in the search industry that think social media is a numbers game, a numbers game that involves links. On the panel there were things presented that made my jaw drop, basically “shady” techniques, things like adding friends just for the numbers, creating multiple profiles, vanity baiting, and using your power on one social media site to gain something on another. To my colleagues on the panel, social media was all about the links and perceived power. Success to them in social media seemed to be about how many links you acquired, and what seemed to be cheap and fast tricks to get them.

I wasn’t alone in my dismay, Rahaf Harfoush expressed her shock at the lack of ethics presented.

People in the search industry wonder why SEO gets the stigma of being the “snake oil salesmen”. People in the search industry wonder why big companies are snubbing SEO, and don’t even look to SEO practitioners for Social Media assistance. Well when you try to apply SEO practices to social media wherein you are using it to gain links alone, or try to manipulate people into thinking things are true that aren’t, that’s how that reputation emerges, and the snubbing occurs.

Social Media is not about links.

Social Media is about conversations and the opportunity to share experiences through those conversations. Links are merely a by-product of a great social media campaign, and search engine rankings are merely a by-product as well. If you are measuring success in social media by the number of links you’ve acquired, you are really and truly missing out on what social media is all about.

What’s going to happen when Google finally devalues links from websites and puts more weight on what’s going on in social media? Social media offers so much more opportunity for the general public to voice their opinions about brands, products, companies and what is really relevant, more so than a meager link from a website. Think of it this way: more people on the internet today participate in social media than own a website. Guess what? These people are actively telling Google, Yahoo and MSN what they think is relevant by rating, commenting and participating in social media.

No fake profile, or adding friends, or using your “perceived power” is going to be able to easily change this, once it comes.

Remember, those discussions that are happening in social media channels, happen whether you are actively engaged in that conversation or not. So wouldn’t your time be better spent involving yourself with those conversations actively? Or would it be better spent adding a ton of fake friends to MySpace, conning a top Digg user into submitting your link for exchange of Wikipedia article help, or creating fake profiles on StumbleUpon?

Use social media for true customer engagements, be transparent, be honest, be who you are. People want to interact with real people from companies, they want Truth in Marketing. They want to tell stories about how great your employees are, what kind of heart you have and how you care about your customers and audience. The audiences couldn’t give a damn about your links, or how many sock puppet accounts you have.

Maybe when the search industry stops thinking of links first with social media, they will be taken a bit more seriously in the online marketing arena.

Yahoo on Web Mining and Improving the Quality of Web Sites

8:59 am   -   March 18th, 2008

by Bill Slawski

A successful web site is one that fulfills the objectives of its owners and meets the expectations of the visitors that it was created to serve.

This is true of ecommerce web sites, news and informational sites, personal web pages, and even search engines. And, it’s a topic that even the search engines are exploring more deeply. A recent patent application from Yahoo tells us that:

The Web has been characterized by its rapid growth, massive usage, and its ability to facilitate business transactions. This has created an increasing interest for improving and optimizing websites to fit better the needs of their visitors. It is more important than ever for a website to be found easily on the Web and for visitors to reach effortlessly the content for which they are searching. Failing to meet these goals can mean the difference between success and failure on the Internet.

User Query Data Mining and Related Techniques, (US Patent Application 20080065631), by Ricardo Alberto Baeza-Yates and Barbara Poblete.

The patent filing discusses how information about queries that people use, collected from search boxes on a site (if one is used) and from search engines bringing people to a site, can provide useful and helpful information about how people use that site.

The collection of this kind of information is often referred to as Web Mining, and looking closely at the words people use to find information on a site can tell us something about the actual information needs of those visitors.

Search engines have studied searchers’ queries mostly to try to make search engines work better, but looking at the words people use to find a site, and to search within it once they have found it, could help to make the web sites themselves better.

The abstract of Yahoo’s patent filing notes:

Methods and apparatus are described for mining user queries found within the access logs of a website and for relating this information to the website’s overall usage, structure, and content. Such techniques may be used to discover valuable information to improve the quality of the website, allowing the website to become more intuitive and adequate for the needs of its users.

One kind of tool that many site owners use on their pages is an analytics program, though often analytics are looked at only to see how much traffic is coming to a site, and possibly to determine which words people are using to find a site. Analytics programs can play a stronger role in helping people with web sites improve the experience of people visiting their pages, and the success of their sites.

Web Mining

The Yahoo patent is interesting in that it focuses less on how a search engine works, and more on how the owners of web sites can use the process of Web mining to discover patterns and relations in Web data. Web mining can be broken down into three main areas:

  • Content mining,
  • Structure mining, and;
  • Usage mining.

These relate to three kinds of data that can be found on a web site:

  • Content — the information that a web site provides to visitors such as the text and images and possibly video and audio, that people see when they come to a site.
  • Structure data — this is information about how content is organized on a site, such as the links between pages, the organization of information on pages, the organization of the pages of the site itself, and the links to pages outside of the site.
  • Usage data — this information describes how people actually use the site, and may be reflected in the access log files of the server that the site is on, as well as data collected from specific applications on the site, such as people signing up for newsletters or registering with a site and using it in different ways.

Knowing which pages people visit and which pages people don’t can be helpful in figuring out if there are problems with a site. They can uncover a need to rewrite pages, or to reorganize links, or make other changes.

Mining User Queries

Understanding query terms used to find a site and to search on the site can help improve the overall quality of a site. Yahoo’s approach would be to create a model to use to understand how people are accessing a site, and navigating through it:

According to specific embodiments of the invention, a model is provided for mining user queries found within the access logs of a website, and for relating this information to the website’s overall usage, structure, and content. The aim of this model is to discover valuable information which may be used to improve the quality of the website, thereby allowing the website to become more intuitive and adequate for the needs of its users.

This model presents a methodology of analysis and classification of different types of queries registered in the usage logs of a website, including both queries submitted by users to the website’s internal search engine and queries from global search engines that lead to documents on the website. As will be shown, these queries provide useful information about topics that interest users visiting the website. In addition, the navigation patterns associated with these queries indicate whether or not the documents in the site satisfied the user’s needs.
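
As a rough illustration of the first step, here is a minimal sketch that pulls a query out of a referrer URL taken from an access log and labels it as coming from the site’s own search or from an outside search engine. This is not the method in the patent filing. The host name `www.example.edu` and the list of query parameter names are assumptions, and a real log would need a fuller list of search engines and parameters.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit

SITE_HOST = "www.example.edu"        # hypothetical site being analyzed
QUERY_PARAMS = ("q", "p", "query")   # assumed names of search query parameters


def query_from_referrer(referrer: str) -> Optional[Tuple[str, str]]:
    """Return (source, query), where source is 'internal' or 'external'."""
    parts = urlsplit(referrer)
    params = parse_qs(parts.query)
    for name in QUERY_PARAMS:
        values = params.get(name)
        if values and values[0].strip():
            # A referrer on our own host means the site's internal search.
            source = "internal" if parts.netloc == SITE_HOST else "external"
            return source, values[0].strip().lower()
    return None  # the referrer carried no recognizable search query
```

Counting the queries returned by a function like this, grouped by source, gives a first picture of the topics visitors are looking for.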

Queries uncovered might be related to categories drawn from such things as navigational information found on a site.

Traffic through the site could tell someone using this invention how effective the site was at meeting the information needs of the people using certain queries. It could also provide suggestions for:

  1. The addition of new content
  2. Changes or additions in words found in anchor text in links
  3. New links between related documents
  4. Revisions to links between unrelated documents

Information Scent

Visitors to a site will follow links that use words within the links that provide some level of confidence that the information being looked for will be upon the other side of those links (The Right Trigger Words as User Interface Engineering’s Jared Spool calls them). Likewise, when someone searches at a search engine, and sees a page title and a snippet of text for a site in search results, the words used in the title and snippet may persuade someone to visit the page. This is true both for search results from a search engine, and search results from an internal search for a specific site.

Understanding what kind of information is being searched for with a specific query, and looking at the words used in search results, on web pages, and in links to other pages, may provide some insight into making those search results, those pages, and that anchor text better.

The patent application describes how pages and the queries used to reach them can be classified based upon how they are typically used by a visitor - from external searches through a search engine, from internal searches through a web site search, or through navigation on the site itself.

It also classifies queries as successful or unsuccessful, based upon things such as whether someone visited a page in response to the display of a search result showing the page, or if they followed other links on pages visited to explore a site in more depth.

Seeing how pages are typically reached on a site in response to certain queries, and seeing which queries are successful and unsuccessful in bringing people to information that they want to find can help a site owner make positive changes to a site.

Example

The patent application provides an example using a portal targeted at university students and future applicants.

It focuses upon exploring how effective the site is when searchers use the queries “university admission test” and “new student application” in searches for the site both on search engines and on a site search for the site. Two initial reports evaluated how effective the site was without making any changes. Twenty of the top suggestions generated from reviewing the model described in this patent application were incorporated into the site’s content and structure:

The suggested improvements were made mainly to the top pages of the site and included adding Information Scent to link descriptions, adding new relevant links, and suggestions extracted from frequent query patterns, and class A and B queries.

Other improvements included broadening the content on certain topics using class C queries, and adding new content to the site using class D queries. For example the site was improved to include more admission test examples, admission test scores, and more detailed information on scholarships, because these were issues consistently showing up in class C and D queries.

The “class C” queries mentioned are ones where there was very little information available on the pages of the web site. The “class D” queries were ones for which there was no information available on the site.
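
To show the idea behind those two classes, here is a minimal sketch that labels a query by how many pages on a site mention all of its words. The patent filing’s actual classification is more involved and is not reproduced here. The function name, the `thin_limit` threshold, and the plain word matching are all assumptions made for illustration.

```python
from typing import Dict


def coverage_class(query: str, pages: Dict[str, str], thin_limit: int = 2) -> str:
    """Label a query by how many pages contain every one of its words."""
    words = query.lower().split()
    matches = sum(
        1
        for text in pages.values()
        if all(word in text.lower().split() for word in words)
    )
    if matches == 0:
        return "D"        # nothing on the site covers the query
    if matches <= thin_limit:
        return "C"        # very little coverage on the site
    return "covered"      # other classes are not modeled in this sketch
```

Queries that keep landing in “C” or “D” are candidates for broader or brand new content, which is how the example site ended up adding admission test and scholarship pages.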

One significant result of these changes showed an increase in traffic from external search engines of more than 20%, due to improvements in content, and in link text.

Conclusion

It’s interesting that a search engine would apply for a patent that explores how to use data mining to improve the quality, content, and navigation of a web site. It’s difficult to tell what Yahoo might do with the method described in this patent application - whether they will only use it internally, or will offer it to others for a fee, or for free.

Many of the concepts described in this patent application are ones that site owners can presently use to improve how well their site meets their objectives, and the objectives of people visiting their pages.

Understanding the terms that people will try to use to find your pages, and the words and concepts that they expect to see on the pages of your site can make a difference in how successful your site may be.

Using analytics tools to understand how visitors who use certain queries will explore your pages and navigate from one page to another can provide even more value to both searcher and site owner, by pointing out changes that can be made to improve the experience of those visitors.

And those changes may just lead to more visits from search engines.

Google Says Users Won’t be able to Tell Paid Ads from Natural

5:26 pm   -   March 10th, 2008

by Jim Gilbert

Scott Morrison of Dow Jones Newswires reports the following about a top Google executive (Tim Armstrong, Google’s North American president for advertising and commerce):

“Speaking at the Bear Stearns Media Conference in Palm Beach, Fla., Armstrong said Google’s advertising platform will evolve over time so that it won’t distinguish between search and display ads.”

Anyone care to comment on what the heck that means?

Advances in Crawling the Web

2:33 pm   -   February 29th, 2008

By Bill Slawski

There are 3 major parts to what a search engine does.

The first is crawling the Web to revisit old pages and find new pages. The second is taking those crawled pages, and extracting data from them to index. The third part is presenting pages to searchers in response to search queries.

There’s been some interesting research published recently on the first of those parts.

Crawling Challenges

Crawling the Web to discover new pages, and identify changes in pages that a search engine already knows about can be a challenge for a search engine. The major issues that search engines face in crawling sites involve:

  • How many pages they can crawl without becoming bogged down,
  • How quickly they can crawl pages without overwhelming the sites that they visit, and;
  • How many resources they have to use to crawl and then revisit pages.

A search engine needs to be careful about how it spends its time crawling web pages, and about choosing which pages to crawl, to keep these issues under control.

A recently published academic paper describes this important aspect of how a search engine works, the Web crawl, in more detail than most papers that have been published on the subject before.

Enter IRLBot at Texas A&M

The Department of Computer Science at Texas A&M University has been running a long term research project known as IRLBot which “investigates algorithms for mapping the topology of the Internet and discovering the various parts of the web.”

In April, researchers from the school will be presenting some of their recent research in Beijing, China, at the 17th International World Wide Web Conference (WWW2008).

The title of their presentation is IRLbot: Scaling to 6 Billion Pages and Beyond (pdf), and the focus of the paper is this primary function that a search engine performs - crawling the Web and finding new Web pages.

Their research describes some interesting approaches to finding new pages on the Web, handling web sites with millions of pages, while also avoiding spam pages and infinite loops that could pose problems to web crawlers.

In a recent experiment that lasted 41 days, their crawler “IRLbot” ran on a single server and “successfully crawled 6.3 billion web pages at an average download time of approximately 1,789 pages per second.” This is a pretty impressive feat, and it’s even more impressive because of some of the obstacles faced while finding those pages.

Problems Facing Crawling Programs

Politeness

One challenge facing Web crawling programs is that they shouldn’t ask for too many pages from the same site at once, or they could use up too many of the site’s resources and make it inoperable. Avoiding that is known as politeness, and search crawlers that aren’t polite often find themselves blocked by site owners, or reported to the internet service provider hosting the crawler.
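As a rough illustration of politeness, here is a minimal Python sketch that enforces a minimum delay between requests to the same host. It is not the scheme used in the IRLbot paper; the class name `PolitenessGate`, the two-second default delay, and the injectable `clock` parameter are all assumptions made for this example.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Hypothetical helper: keep a minimum delay between requests to one host."""

    def __init__(self, min_delay_seconds=2.0, clock=time.monotonic):
        # The delay value is an assumption; real crawlers tune it per site.
        self.min_delay_seconds = min_delay_seconds
        self.clock = clock
        self._next_allowed = {}  # host -> earliest time of the next request

    def _host(self, url):
        return urlsplit(url).netloc.lower()

    def wait_time(self, url):
        """Return how many seconds to wait before fetching this URL."""
        now = self.clock()
        return max(0.0, self._next_allowed.get(self._host(url), now) - now)

    def record_fetch(self, url):
        """Note that a request was just sent to this URL's host."""
        self._next_allowed[self._host(url)] = self.clock() + self.min_delay_seconds
```

A crawler loop would call `wait_time(url)` before each request, sleep for that long (or pick a URL from a different host instead), and call `record_fetch(url)` after sending the request.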

URL Management

As a crawling program visits a site, it needs to pay attention to a file on the site known as robots.txt, which lists the pages and directories that the crawling program shouldn’t visit, so that it doesn’t crawl pages that it isn’t supposed to see. The program also needs to track which pages it has seen, so that it doesn’t try to crawl the same pages over and over again.
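Both checks can be sketched in a few lines of Python using the standard library’s `urllib.robotparser`. This is only an illustration: the class name `UrlManager`, the user agent “ExampleBot”, and the sample robots.txt rules are made up, and the in-memory set stands in for whatever storage a large crawler would really need for billions of URLs.

```python
from urllib.parse import urldefrag
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only; a real crawler downloads each site's file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class UrlManager:
    """Hypothetical helper: obey robots.txt and skip URLs already seen."""

    def __init__(self, robots_lines, user_agent="ExampleBot"):
        self.user_agent = user_agent
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)  # parse rules without a network call
        self.seen = set()  # in-memory only; fine for a sketch, not for billions

    def should_crawl(self, url):
        url, _fragment = urldefrag(url)  # "#top" points at the same page
        if url in self.seen:
            return False  # already crawled or queued
        if not self.robots.can_fetch(self.user_agent, url):
            return False  # the site asked crawlers to stay out
        self.seen.add(url)
        return True
```

With the sample rules, `UrlManager(ROBOTS_TXT.splitlines())` would accept a page such as `http://example.com/index.html` once, refuse it on a second request, and refuse anything under `/private/`.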

Avoiding Spam While Crawling

The crawling process described in the paper also tried to keep the crawling program away from pages that are more likely to be spam. If a crawling program spends a lot of its time on spam pages and link farms, it has less time to spend on sites that may be more valuable to people searching for good results to their queries at search engines.

One key to how this research team determined how much attention a site should get from their web crawler was the number of legitimate links pointing to the site from other sites, which they refer to as domain reputation.
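To make the idea concrete, here is a small Python sketch that gives each site a crawl budget based on how many other domains link to it. This is not the paper’s algorithm: counting distinct linking domains is only a crude stand-in for “legitimate” links, and the function names and budget numbers are assumptions chosen for the example.

```python
from collections import defaultdict


def count_linking_domains(links):
    """Count distinct other domains linking to each domain.

    links: iterable of (source_domain, target_domain) pairs.
    """
    sources = defaultdict(set)
    for source, target in links:
        if source != target:  # a site's links to itself earn nothing
            sources[target].add(source)
    return {target: len(srcs) for target, srcs in sources.items()}


def crawl_budget(domain, linking_counts,
                 base_budget=10, bonus_per_domain=5, max_budget=10_000):
    """Pages the crawler may fetch from a domain (all numbers are assumptions)."""
    n = linking_counts.get(domain, 0)
    return min(max_budget, base_budget + bonus_per_domain * n)


links = [
    ("a.example", "news.example"),
    ("b.example", "news.example"),
    ("a.example", "news.example"),   # repeat links from one domain count once
    ("c.example", "news.example"),
    ("farm.example", "farm.example"),  # self-links only
]
counts = count_linking_domains(links)
print(crawl_budget("news.example", counts))  # 25
print(crawl_budget("farm.example", counts))  # 10
```

Under this sketch, a link farm that only links to itself never rises above the base budget, while a site linked from many independent domains earns more of the crawler’s time.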

Why This Paper is Important

The authors of the paper tell us that there are:

…only a limited number of papers describing detailed web-crawler algorithms and offering their experimental performance.

The paper provides a lot of details on the crawling process and the steps that the Texas A&M researchers took that enabled them to index multi-million page web sites, avoid spam pages, and remain “polite” while doing so. It explores the experiments that they conducted to test out ideas on how to handle very large sites, and crawls of very large numbers of pages.

They conducted their experiment using only a single computer. The major commercial search engines have considerably more resources to spend on crawling the web, but the issues involving managing which pages they choose to index, being polite to sites that they visit, and avoiding spam pages are problems that commercial search engines face too.

Learning about how search engines may crawl pages can help us understand how a search engine might treat individual sites during that process. If you are interested in learning about the web crawling process in depth, this paper is a good one to spend some time reading.

The Web in the World - Looking for URLs Offline

4:10 pm   -   January 31st, 2008

By Bill Slawski

If you run a business, and own a web site, it’s not a bad idea to include the address of your site on your invoices, your business cards, within the letterhead of your stationery, and other paperwork that comes out of your office. You may even want to include that URL on shipping boxes, on your business sign, and in other places where the address might be visible.

Every few months, I like to take a walk through the small town I live in, with a pen and notepad in hand, and look for web addresses in places that I haven’t seen them before. On a normal day, I don’t think that I pay too much attention to how the Web and the world interact on a stroll through town, but I see some surprises when I start looking more closely.

My town is a University town, and most of the students are away on winter break, which made this morning quieter than it is when school is in session.

I start searching for URLs as soon as I get out of my front door, and the first one that I see is in a nearby parking lot. There’s a Marine recruiting station close by, and a number of recruiters’ cars in the lot. A number of them have the Web address “marines.com” and the phone number “1-800-marines” written across their sides and backs.

As I walk past them, I decide to stop for a cup of coffee at one of the local coffee shops. Next to the credit card logos on the door of the shop is a small sign advertising a University meal plan. Students can pay for a card which they can use to buy food at different eating establishments in town, and these signs let them know which ones accept that meal card. The sign also acts as an advertisement to students, and it shows a URL where they can find out more about the program.

I grab the local paper while I’m getting my coffee, and start looking through it for Web addresses. A front page banner ad, below the fold, looks more like it was designed for a web portal than news print. Appropriately, it advertises a web site.

Turning through the pages of the newspaper, I’m starting to see ads that don’t carry a street address or a phone number - just a URL. I wonder how many of them are actually local businesses, and how many are located somewhere else. The advertisements are for items that could be anywhere in the world.

I finish looking through the paper, and leave the coffee house for Main Street, when a bus passes by. I expect a URL on the bus, and don’t see one. I’ve seen their schedules online, so I’m surprised that they don’t include their web address next to their name.

A sticker from a local band, pasted on a utility pole, catches my attention, and it provides a URL for their MySpace page. Another sticker, sloppily attached to a mail pickup box a little further down the street, is for the state National Guard, and shows their toll-free phone number, but not a web address.

A sign at the post office provides a list of dates that the office will be closed, but tells us that “We’re always open at usps.com.” I’ve been wondering why they didn’t choose the name “mail.gov.”

A paper company truck is stopped on Main Street, to make a delivery, and the side of their truck is a billboard for their goods. Under the sentence where they tell us that they’ve been around since 1919 is the URL for their business.

As I return home, I notice that I’ve received my mail, and on the back of one of the envelopes, I see a message that I can pay my bills online, along with a URL. I’m not sure if I’ve seen an envelope with both a web address and a call to action like that before.

I think I’ve seen more URLs on this walk than I’ve seen in previous trips through town. There are a few on business signs, and on posters in store windows, and in notices posted on the community bulletin board. Next time I try this, I’m going to have to take my phone with me, and see how well those show up on a screen for handhelds.

Google Maps - Now includes Terrain

8:33 am   -   November 28th, 2007

by Mike Churchill

On 27 November, 2007, Google released an update to Google Maps: they are now including terrain as an optional view. This is especially cool for high-relief areas (mountains, hiking trails, and the like). For example, here are three views of the Grand Canyon:

The Google Maps 'Map' View of the Grand Canyon

The ‘Map’ view is pretty boring, and other than showing the size of the National Park and its boundaries, does little to convey the grandeur of the location.

The Google Maps 'satellite' View of the Grand Canyon

The ‘Satellite’ view gives a better overview, but rather than helping, the various colors of the real terrain create a confusing image.

The Google Maps 'terrain' View of the Grand Canyon

The new ‘terrain’ view gives the best impression of the feel of the location: the deep rift is clearly visible and the correlation between the valley and the park boundaries is clear. In addition, there are labels identifying certain landmarks. In city locales, the terrain view shows large buildings as well.

This new feature comes at a cost, however: while the terrain view is new, the ‘hybrid’ view, which displayed the satellite imagery with the roads overlaid, is now a sub-option under the ‘satellite’ view. Choose the ‘satellite’ view, and a “Show Labels” checkbox becomes available when hovering over the satellite button. Selecting the checkbox will generate the hybrid view. The hybrid view shows vegetation and other non-geological features, so the two views offer complementary insight into certain areas.
