Advances in Crawling the Web

Posted on February 29, 2008

By Bill Slawski

There are three major parts to what a search engine does.

The first is crawling the Web to revisit old pages and find new pages. The second is taking those crawled pages, and extracting data from them to index. The third part is presenting pages to searchers in response to search queries.

There’s been some interesting research published recently on the first of those parts.

Crawling Challenges

Crawling the Web to discover new pages, and to identify changes in pages it already knows about, can be a challenge for a search engine. The major issues that search engines face in crawling sites involve:

How many pages they can crawl without becoming bogged down,
How quickly they can crawl pages without overwhelming the sites that they visit, and
How many resources they have to use to crawl and then revisit pages.

A search engine needs to be careful about how it spends its time crawling web pages, and about which pages it chooses to crawl, to keep these issues under control.

A recently published academic paper describes this important aspect of how a search engine works, the Web crawl, in more detail than most papers that have been published on the subject before.

Enter IRLBot at Texas A&M

The Department of Computer Science at Texas A&M University has been running a long-term research project known as IRLBot which “investigates algorithms for mapping the topology of the Internet and discovering the various parts of the web.”

In April, researchers from the school will be presenting some of their recent research in Beijing, China, at the 17th International World Wide Web Conference (WWW2008).

The title of their presentation is IRLbot: Scaling to 6 Billion Pages and Beyond (pdf), and the focus of the paper is this primary function that a search engine performs – crawling the Web and finding new Web pages.

Their research describes some interesting approaches to finding new pages on the Web, handling web sites with millions of pages, while also avoiding spam pages and infinite loops that could pose problems to web crawlers.

In a recent experiment that lasted 41 days, their crawler “IRLbot” ran on a single server and “successfully crawled 6.3 billion web pages at an average download time of approximately 1,789 pages per second.” This is a pretty impressive feat, and it’s even more impressive given some of the obstacles the crawler faced while finding those pages.
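The figures quoted above can be checked against each other with a little arithmetic. The short sketch below uses only the three numbers from the quote (6.3 billion pages, 1,789 pages per second, 41 days); the variable names are just for illustration.

```python
# Back-of-the-envelope check of the quoted crawl figures.
PAGES_CRAWLED = 6.3e9        # "6.3 billion web pages"
PAGES_PER_SECOND = 1789      # "approximately 1,789 pages per second"
SECONDS_PER_DAY = 24 * 60 * 60

# How long would a crawl of that size take at that average rate?
crawl_seconds = PAGES_CRAWLED / PAGES_PER_SECOND
crawl_days = crawl_seconds / SECONDS_PER_DAY

print(f"Estimated crawl length: {crawl_days:.1f} days")
```

At that average rate the crawl works out to just under 41 days, which agrees with the length of the experiment.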

Problems Facing Crawling Programs

Politeness

One challenge that faces Web crawling programs is that those programs shouldn’t ask for too many pages from the same site at a time, or they could use up too many resources of the site and make the site inoperable. Keeping that from happening is known as politeness, and search crawlers that aren’t polite often find themselves blocked by site owners, or complained about to the internet service provider hosting the crawler.
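To make the idea of politeness concrete, here is a minimal sketch of a per-host delay. It is not the method described in the paper; the class name `PolitenessGate`, the 10-second delay, and the example URLs are all assumptions chosen for illustration.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent.

    Hypothetical helper for illustration; the fixed delay is an assumption.
    """

    def __init__(self, min_delay_seconds=10.0):
        self.min_delay_seconds = min_delay_seconds
        self._next_allowed = {}  # host -> earliest allowed timestamp

    def wait_needed(self, url, now):
        """Return how many seconds to wait before fetching `url` at time `now`."""
        host = urlsplit(url).netloc.lower()
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note a fetch of `url` at `now`, pushing back that host's next slot."""
        host = urlsplit(url).netloc.lower()
        self._next_allowed[host] = now + self.min_delay_seconds


gate = PolitenessGate(min_delay_seconds=10.0)
gate.record_fetch("http://example.com/a", now=100.0)

# Three seconds later, the same host must still wait; a different host need not.
same_host_wait = gate.wait_needed("http://example.com/b", now=103.0)
other_host_wait = gate.wait_needed("http://example.org/", now=103.0)
```

Timestamps are passed in rather than read from the clock so the behavior is easy to follow. A real crawler would also have to decide what to do with the URLs that are waiting their turn.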

URL Management

As a crawling program indexes a site, it needs to pay attention to a file on the site known as a robots.txt file, which provides directions on pages and directories that the crawling program shouldn’t visit, so that it doesn’t crawl pages that it isn’t supposed to see. The program also needs to track which pages it has seen, so that it doesn’t try to crawl the same pages over and over again.
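The two bookkeeping tasks above, obeying robots.txt and not revisiting pages, can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent, and the function name `crawlable_urls` are assumptions made for this example. The in-memory set is a simplification that would not hold billions of URLs.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawlable_urls(candidates, robots_txt, user_agent="ExampleBot"):
    """Return candidate URLs that are allowed by robots.txt, each at most once."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    seen = set()     # URLs already considered during this crawl
    allowed = []
    for url in candidates:
        if url in seen:
            continue  # skip pages we have already looked at
        seen.add(url)
        if parser.can_fetch(user_agent, url):
            allowed.append(url)
    return allowed


urls = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",  # blocked by robots.txt
    "http://example.com/index.html",           # duplicate
]
result = crawlable_urls(urls, ROBOTS_TXT)
```

Only the first URL survives: the second is disallowed, and the third has already been seen.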

Avoiding Spam While Crawling

The crawling process described in the paper also tried to keep the crawling program away from pages that were more likely to be spam. If a crawling program spends a lot of its time on spam pages and link farms, it has less time to spend on sites that may be more valuable to people searching for good results to their queries at search engines.

One key to how this research team decided how much attention a site should get from their web crawler was the number of legitimate links pointing to the site from other sites, which they refer to as domain reputation.
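As a rough illustration of the idea, and not the paper's actual algorithm, the sketch below splits a crawl budget among domains in proportion to how many other domains link to them. The function name `domain_budgets`, the sample domains, and the rule of counting each linking domain once as a stand-in for a "legitimate" link are all assumptions.

```python
from collections import defaultdict


def domain_budgets(links, total_budget, minimum=1):
    """Split `total_budget` page fetches among domains by in-link count.

    `links` is an iterable of (source_domain, target_domain) pairs.
    Each linking domain counts once per target, and self-links are ignored
    (an assumed simplification of what makes a link "legitimate").
    """
    in_linkers = defaultdict(set)
    for source, target in links:
        if source != target:
            in_linkers[target].add(source)

    counts = {domain: len(sources) for domain, sources in in_linkers.items()}
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        domain: max(minimum, total_budget * count // total)
        for domain, count in counts.items()
    }


links = [
    ("a.example", "news.example"),
    ("b.example", "news.example"),
    ("c.example", "news.example"),
    ("a.example", "news.example"),      # repeated link, counted once
    ("farm1.example", "farm.example"),
    ("farm.example", "farm.example"),   # self-link, ignored
]
budgets = domain_budgets(links, total_budget=100)
```

Here the domain linked from three other domains receives three quarters of the budget, and the domain with a single outside link receives the rest.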

Why This Paper is Important

The authors of the paper tell us that there are:

…only a limited number of papers describing detailed web-crawler algorithms and offering their experimental performance.

The paper provides a lot of detail on the crawling process and the steps that the Texas A&M researchers took that enabled them to index multi-million page web sites, avoid spam pages, and remain “polite” while doing so. It explores the experiments that they conducted to test out ideas on how to handle very large sites, and crawls of very large numbers of pages.

They conducted their experiment using only a single computer. The major commercial search engines have considerably more resources to spend on crawling the web, but the issues involving managing which pages they choose to index, being polite to sites that they visit, and avoiding spam pages are problems that commercial search engines face too.

Learning about how search engines may crawl pages can help us understand how a search engine might treat individual sites during that process. If you are interested in learning about the web crawling process in depth, this paper is a good one to spend some time reading.

Retailers & Blended / Universal Search

Posted on February 28, 2008

by Liana "Li" Evans

This past week I presented at SMX West in Santa Clara, California on the Retail and Blended Search panel. It was quite interesting to be on a panel that also included representatives from both MSN and Yahoo Shopping divisions. There was a lot of information given, from making sure your images had feeds to looking beyond feeds for promotion of online retail products.

As technology progresses and broadband becomes more widely available to shoppers, searchers are looking for more than just a blue link on a search results page. Searchers are becoming more savvy, and as more and more options are provided to them, they actually WANT more than just a blue link. So where does that leave the retailers on the web who have invested so much in feeds?

Retailers need to start thinking outside of the box. As the search results become more engaging, retailers who rely only on feed links to bring traffic to their pages will eventually lose out on all those people clicking on video links, picture links, social media profiles, and reviews. So what’s a retailer to do?

Images:

Make sure your products have images.
Make sure you put captions underneath your images.
Make sure your images folder is accessible to the search engines.
Name your images properly.
Make sure your images are of good quality.

Google actually shows different images in blended search than it does in regular image search. The thing to remember with images is that shoppers are very visual and if you have the opportunity to take advantage of image search why not put your best foot forward?

Videos:
Here’s a perfect opportunity to engage consumers through social media. It’s visual, it’s interesting, and it draws a customer into finding out more information. Make use of a few of the social video sites by uploading some short videos of product demonstrations, humorous takes, or even “how to” videos. If you make them fun and interesting, there’s even a chance for them to become viral. While the videos are not directly on your site, if the description is optimized with a link to your site or the page the product is on, this can be another traffic driver beyond the search engine.

Rating & Reviews:
If you can start reviews of your products on your site, this could be a powerful resource to help raise the quality of the page. Amazon uses this very wisely and to their advantage. Rating & review sites such as Epinions and Yelp also hold a lot of value, and can give you yet another way to “indirectly” hold another position in the results.

Social Media Profiles:
Having profiles on various relevant social media sites is another way to help bring awareness to your brand, as well as your products or services. People link to social media profiles, so just like with rating and review sites, it is possible to own another spot on the SERPs in an indirect manner. Make sure that your profile on each of the social media sites you belong to is properly filled out with the right URL, emails, contact information, etc.

These are just a few ways retailers can broaden their reach, beyond the regular product feed. Starting to think beyond the feed and planning a full online marketing strategy will open a lot more opportunities for retailers in the new realm of blended / universal search.

A Chat with Analytics Guru Jim Sterne

Posted on February 15, 2008

by Editor

By Christine Churchill

SES London 2008 is nearly upon us. Looking over the agenda and speaker list I was happy to see Analytics expert and all-around good guy Jim Sterne. I’ve known Jim for a couple of years and I continue to be in awe of the man.

Jim is a prolific author of books and articles, a famed speaker, and the producer of the eMetrics Summit Conferences. Energetic and engaged in life, Jim runs Target Marketing and is the Founding President and current Chairman of the Web Analytics Association, a wonderful organization of which KeyRelevance is a proud Premiere Corporate Member.

Jim and I will both be in London next week speaking at SES London. Jim is legendary on stage and I’ve heard other speakers playfully call him the PowerPoint King. He’s one of those rare people you occasionally meet in life who exude positive energy and great ideas. I caught up with Jim the other day and asked him a few questions.

Christine: Most of us running online businesses are going in many directions and have to prioritize where we spend our time. Here’s a question for all those harried business owners who are trying to make every minute count. If I only had 15 minutes a day to spend on analytics, where should I spend my time?

Jim: That would depend on my goals. If my obsession for the day were in conversions, then I would spend my time looking over the persuasion path to see where I could improve the visitor interaction. If my goal was to sell more advertising, I would be measuring what makes people look at more pages so I can display more ads. If my goal is to bring in more qualified leads, I would watch how well my advertising money is being spent – where are people coming from and are they the sort of people I’m after?

Christine: Now here’s a related question. When evaluating a new site, what’s the first measurement you would look at?

Jim: It completely depends on the goals of the website. The first thing would be to get the basic numbers, just as a benchmark. How many visitors? How many events (what we used to call pageviews) per visit? How often do they come back? That way, we have something we can compare with tomorrow’s numbers.

Christine: Since both of us will be in London next week, let me ask a question related to analytics and geography. Key Performance Indicators (KPIs) are important measurements that companies track. Do you see any regional or national differences in KPIs?

Jim: None at all. KPIs are particular to industries and website types. Even the ability to track those KPIs is the same. There are very smart companies in every corner of the globe and some of the most advanced places harbor the least capable companies.

Christine: That is excellent news for businesses with global aspirations and your answer makes perfect sense. Jim, you are well known for your clear thinking and forward looking approach to life. Where do you see Analytics heading?

Jim: Given enough traffic for statistical significance, I think we can use the activity on a website to measure the impact of all our marketing. The web is so much a part of daily life now that an ad in the newspaper, an ad on the radio and a direct mail postcard will all have an effect on the behavior on the website. Capture that activity, sift through it and the impact of your marketing spend will be revealed. We’re not there yet, but it’s right around the corner.

Christine: In the good old days of the web, the “hit” was the unit of visitor interaction, at least until we figured out that it was a lousy metric. Then, the “impression” or “page view” became the standard. Now along come Web 2.0 sites with their richer, more interactive mode of interfacing with a visitor. Google Maps, for example, may occupy a visitor for several minutes, without the URL on the Address line changing. When a visit can no longer be accurately measured in “impressions” how do we properly quantify web site traffic?

Jim: While those better and brighter than me are working on measuring “engagement”, I am happy to break down a visit to a website into “actions”. Searching for an address is an action. Scrolling the map is an action. Zooming in on a location is an action. Commenting on a blog is an action. These actions add up and spell out the flow of individual activity and quite readily replace the pageview as a means of understanding behavior.

Christine: The search engines are offering analytics tools as a part of their offerings. I’m frequently asked this question by students and clients who invest heavily into online paid advertising. From a web advertiser’s perspective, is there any danger in letting the search engines have such a detailed view of a company’s conversions, revenue, and other business metrics?

Jim: One first has to assume that your website is interesting enough and the data about your website is valuable enough to put a $166.2 billion enterprise at risk.

Christine: Jet lag or bad food? What’s the worst part about travel?

Jim: The worst part about travel is yet to come: when they allow mobile phones on airplanes.

Christine (laughing): Thanks much Jim. I’ll see you in London!

Jim: Looking forward to it!

SEM CLUBHOUSE

Welcome to the Clubhouse, where we share our secrets

Monthly Archives: February 2008