The Art and Science of Inventory Setup

Oct 11, 2016

CAT is intended to be a self-service, set-it-and-let-it-run tool, but setting up an inventory for maximum effectiveness can be a bit of an art and a science. The internet is full of sites constructed in many different ways, and you can save time and frustration by spending a few minutes clicking through the site you plan to crawl to identify the gotchas that can make your results differ from what you expected.

If your crawl returned fewer or more pages than you expected, there are several possible reasons:

1.  Getting a one-page result: this is usually the result of setting the base URL too restrictively; that is, setting it to a specific page rather than a directory level. If you set your base URL to something like www.example.com/example.html, the crawler will assume that the entire URL, including "example.html", has to be in every URL it visits (machines are literal like that). Instead, set the base URL at the www.example.com level so the crawler will visit all sub-directories and pages below it. If you truncate the URL back this way, however, be sure that the base URL still renders a page when pasted into a browser. Also note whether the URL changes when pasted into a browser; if it does, copy and use that version. (A sketch of this scoping behavior appears after this list.)
 
2. Getting more results than you expected: if you don't have a good idea of how many pages are on your site, you may be surprised when you get more results than expected. A quick way to get an estimate is to go to Google and search for site:<your URL> to see how many pages Google has indexed. This will also help you make sure you have enough pages in your account to get a complete crawl. (A sitemap-based estimate is sketched below as well.)
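To make point 1 concrete, here is a minimal sketch of prefix-based URL scoping. It reflects our assumption about how a crawler typically decides what is in scope, not CAT's actual implementation:

```python
# A minimal sketch of prefix-based URL scoping (an assumption about
# typical crawler behavior, not CAT's actual implementation).
def in_scope(url: str, base: str) -> bool:
    """A URL is crawled only if it begins with the base URL."""
    return url.startswith(base)

# Too-specific base: only that one page is in scope.
base = "http://www.example.com/example.html"
print(in_scope("http://www.example.com/about.html", base))      # False
print(in_scope("http://www.example.com/example.html", base))    # True

# Directory-level base: the whole site is in scope.
base = "http://www.example.com/"
print(in_scope("http://www.example.com/about.html", base))      # True
print(in_scope("http://www.example.com/blog/post.html", base))  # True
```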
 
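For the page-count estimate in point 2, if the site publishes a sitemap.xml, counting the URLs it lists gives a similar rough number. This is a hedged sketch; the sitemap URL is hypothetical, and not every site exposes one:

```python
# Count the <loc> entries in a sitemap to estimate page count.
# The sitemap URL is hypothetical; substitute the site you plan to crawl.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://www.example.com/sitemap.xml"

with urllib.request.urlopen(SITEMAP_URL) as response:
    tree = ET.parse(response)

# Sitemap entries live in the sitemaps.org XML namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
locs = tree.findall(".//sm:loc", ns)
print(f"{len(locs)} URLs listed in the sitemap")
```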
In addition, there are many ways pages can be generated dynamically, and it's worth spending time before setting up the crawl to research how the site is constructed so you can avoid them. For example, calendars are notorious for creating pages, often into infinity, when they are hit. It's unlikely that having calendar pages in your inventory or audit is particularly useful, so that's an obvious one to exclude. Blogs and news areas can also cause problems if the articles are navigable by multiple URLs; for example, if there are multiple categories and tags as well as dates that can appear in URLs, a single blog article is potentially reachable via dozens of URL patterns. Looking for those and excluding the ones you don't need is another good pre-crawl practice. Another culprit is a site where every URL is accessible via both http: and https:. Both versions will be indexed because, again, to a machine those are different URLs. They are easily excluded by adding https (or vice versa) to the Excludes box.
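To show how Excludes catch the calendar, tag/category, and http/https duplicates described above, here is a minimal sketch of substring-based URL exclusion. The patterns are illustrative assumptions; CAT's actual matching rules are covered in the Includes and Excludes article linked below:

```python
# Illustrative exclude patterns; adjust to match your site's structure.
EXCLUDES = [
    "https://",    # skip the https:// duplicates of http:// pages
    "/calendar/",  # skip infinitely generated calendar pages
    "/tag/",       # skip duplicate blog URLs reachable via tags...
    "/category/",  # ...and via categories
]

def should_crawl(url: str) -> bool:
    """Skip any URL that contains an excluded substring."""
    return not any(pattern in url for pattern in EXCLUDES)

for url in [
    "http://www.example.com/blog/2016/10/setup.html",
    "http://www.example.com/blog/tag/crawling/setup.html",
    "https://www.example.com/blog/2016/10/setup.html",
]:
    print(should_crawl(url), url)  # True, False, False
```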
 
When a crawl does return extraneous pages, mine the results to see what you can exclude in a follow-on run.
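One way to mine the results is to tally URLs by their first path segment; sections with suspiciously large counts (calendars, tag archives, and the like) are often good exclusion candidates. A minimal sketch, assuming the crawled URLs have been exported as a list:

```python
# Tally crawled URLs by first path segment to spot exclusion candidates.
# The URL list is hypothetical; in practice, load your exported inventory.
from collections import Counter
from urllib.parse import urlparse

crawled_urls = [
    "http://www.example.com/calendar/2016/10/11",
    "http://www.example.com/calendar/2016/10/12",
    "http://www.example.com/tag/crawling",
    "http://www.example.com/about.html",
]

sections = Counter(
    urlparse(url).path.split("/")[1] or "(root)" for url in crawled_urls
)
for section, count in sections.most_common():
    print(f"{count:4d}  /{section}")
```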
 
Find more tips for setting up Includes and Excludes in this article.
 
And, as always, if you have questions, feel free to contact us. We are more than happy to help you get a good CAT crawl.


Category: Content Inventory

Paula Land


Paula Land is co-founder and CEO of Content Insight and author of Content Audits and Inventories: A Handbook.

