Power Tips: Job Setup

Before you start your first CAT inventory, here are a few tips that will help ensure a successful crawl and the results you expect.

How CAT Works

It may be helpful to start with an explanation of what CAT does and doesn't do. CAT is a web crawler (sometimes called a spider): when pointed at a URL, it starts there and catalogs the links from that page in the order it finds them, then catalogs the data and links from each page it subsequently crawls. It continues to crawl until it finds no more links or it reaches the page limit set in job setup or by the subscription level.

Important note: If a crawl completes, whether because the crawler has found all the links or because there aren't enough available pages in the subscription, it cannot be restarted. Please be sure that your subscription will support your crawl; if you aren't sure you have enough pages, we recommend purchasing an additional block. Blocks don't expire, so unused pages can be used later.
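To make the crawl order and the page limit concrete, here is a minimal sketch (in Python) of the kind of breadth-first crawl loop described above. It is an illustration only, not CAT's actual implementation, and the fetch_page and extract_links helpers are hypothetical stand-ins:

    from collections import deque

    def crawl(base_url, max_pages, fetch_page, extract_links):
        # Start at the base URL, catalog links in the order found,
        # and stop when no links remain or the page limit is hit.
        queue = deque([base_url])
        seen = {base_url}
        catalog = []
        while queue and len(catalog) < max_pages:
            url = queue.popleft()             # pages crawled in discovery order
            page = fetch_page(url)            # hypothetical fetch helper
            catalog.append((url, page))
            for link in extract_links(page):  # hypothetical link extractor
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return catalog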

[Screenshot: the Job Setup screen]

Setting the Base URL

The first step in setting up a job (or crawl) in CAT is setting the base URL from which CAT will start the crawl.

Before you enter the URL in CAT, enter it in a browser and make sure it's valid and does not redirect. If it immediately redirects to another URL, make sure redirects are enabled, i.e., leave the "Don't Follow Redirects" box unchecked (see Redirects below).
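If you'd rather check for a redirect from a script than a browser, a quick way (a sketch, assuming Python with the requests library installed) is to issue a request that doesn't follow redirects and inspect the status code:

    import requests

    resp = requests.get("http://mydomain.com/", allow_redirects=False)
    if resp.status_code in (301, 302, 303, 307, 308):
        print("Redirects to:", resp.headers["Location"])
    else:
        print("No redirect; this URL is safe to use as a base URL")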

CAT takes that URL pattern very literally: unless you tell it otherwise via the advanced settings, it will catalog only URLs that contain that base pattern. If your site includes sub-domains with a different pattern, you will need to add those in the Include Links box if you want them included in your crawl.

Take care that your base URL is not too specific. For example, a URL such as www.example.com/resources/page.html would result in a single-page crawl. Set the base URL to www.example.com to crawl the whole site, or to www.example.com/resources/ to crawl just that subdirectory (see below for more on limiting a crawl to a sub-directory). Note, too, that www.example.com and example.com (no www) are different domains to a crawler.

Examples:

General Base URL: http://mydomain.com/

With default settings these URLs would be crawled because they contain the base URL:

http://mydomain.com/index.html
http://www.mydomain.com
https://www.mydomain.com

Specific Base URL: http://mydomain.com/index.html

With default settings these URLs would not be crawled because they do not contain the base URL (the index.html):

http://mydomain.com/
http://www.mydomain.com/
https://www.mydomain.com/
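As a rough model of the "contains the base URL" rule (a sketch only; CAT's actual matcher may normalize schemes and www prefixes differently), simple substring containment shows why an overly specific base URL shuts out the rest of the site:

    def matches_base(url, base_pattern):
        # The rule of thumb above: a URL is in scope if it
        # contains the base pattern.
        return base_pattern in url

    # A general base pattern keeps the whole site in scope...
    print(matches_base("http://mydomain.com/index.html", "mydomain.com/"))            # True
    # ...while a too-specific one produces a single-page crawl.
    print(matches_base("http://mydomain.com/resources/", "mydomain.com/index.html"))  # False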

Note: Some website managers use technology to block web crawlers. You may wish to notify the owners of the site you are crawling before submitting your job, both to make sure the crawl won't be blocked and to let them know to expect the crawl activity.

Include and Exclude Links

If your site includes sections on a sub-domain (so those URLs vary in part from the base URL pattern), add the sub-domains in the Include Links box if you want them included in your CAT crawl.

Your setup would look like this:

Base URL:  www.foo.com
Include Links:  support.foo.com

Similarly, if you want to exclude particular directories or sub-domains, add them in the Exclude Links box. For example, if you are crawling an e-commerce site and don't want hundreds or thousands of product pages returned, add that URL pattern to the Exclude Links.

If your site includes a feature that dynamically generates pages, such as a calendar widget, or pages generated by search queries (look for a ? in the URL), be sure to add that URL pattern or fragment (you can use just the ?) to the Exclude Links box to avoid returning hundreds or thousands of non-useful results.

If your site includes both http:// and https:// versions of the same pages, by default they will both be captured. You can exclude these by entering https:// in the Exclude Links box.

Note: In processing the setup parameters, Excludes are applied first, Includes second.
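A minimal sketch of that ordering (assuming simple substring patterns; CAT's real matching may differ) shows why an Include can rescue a URL that an Exclude would otherwise drop:

    def in_scope(url, base, includes, excludes):
        allowed = base in url
        for pattern in excludes:              # Excludes applied first
            if pattern == "*" or pattern in url:
                allowed = False
        for pattern in includes:              # Includes applied second
            if pattern in url:
                allowed = True
        return allowed

    # An http:// page survives, but its https:// twin is filtered out:
    print(in_scope("https://www.foo.com/about", "www.foo.com",
                   includes=["support.foo.com"], excludes=["https://"]))  # False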

Limiting Crawls to a Specific Directory

Sometimes you may wish to crawl only a specific directory within your site. CAT makes that possible, but you do need to be careful in how you set up your job parameters. Set the directory as your Base URL, but also add it to the Include Links box and add an asterisk (*) to the Exclude Links box so no other sections are crawled. The asterisk is a special case that matches all URLs. For example, if you wanted to crawl just the Resources section of content-insight.com, your setup would look like this:

Base URL:  www.content-insight.com/resources
Include Links:  www.content-insight.com/resources
Exclude Links:  *

When using the *, note that unless overridden by an Include, all but the base URL will be ignored. Appending * or ? to a URL fragment (e.g., mydomain.com* or mydomain.com?) is not supported; to achieve that, use regular expressions. Learn more about regular expressions.
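For illustration, here is how such a pattern might look as a regular expression (a sketch using Python's re module; the regex dialect CAT accepts may differ):

    import re

    # ".*" plays the role the unsupported trailing "*" would, and
    # "\?" matches a literal question mark.
    pattern = re.compile(r"mydomain\.com/.*\?")
    print(bool(pattern.search("http://mydomain.com/search?q=cat")))  # True
    print(bool(pattern.search("http://mydomain.com/about.html")))    # False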

Avoiding Unwanted Pages

Many sites include features, such as filters and calendars, that dynamically generate pages when selected. When CAT hits these features, that generation can be triggered, potentially adding hundreds or thousands of pages to your data that are not necessarily unique or useful for your audit. To avoid these sections, review your site before setting up your job: look for URLs that include query strings or parameters (often indicated by a question mark in the URL) and add those patterns to your Exclude Links box.
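When reviewing your site, a quick way to spot such URLs (a sketch, using Python's standard urllib) is to check for a non-empty query component:

    from urllib.parse import urlparse

    def has_query_string(url):
        # Dynamically generated pages usually carry parameters after
        # a "?", which urlparse exposes as the query component.
        return bool(urlparse(url).query)

    print(has_query_string("http://mydomain.com/events?month=2024-05"))  # True
    print(has_query_string("http://mydomain.com/events/"))               # False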

Max Pages

The maximum number of pages you can crawl is limited by your subscription level. However, if you wish to do a limited crawl as a test, or don't need a full-site crawl, you can set a maximum and CAT will crawl only that number of pages. Note that CAT crawls the same way no matter how many pages you set, so if, for example, you set Max Pages to 10, it will crawl the first 10 pages found from the base URL.

Important note: A crawl that terminates because it reached Max Pages or exceeded the subscription limit will be incomplete and can't be restarted from where it left off; re-tries start over from the beginning.

Google Analytics

1.   Add CAT as a user

In order for CAT to gather the analytics data, you need to set CAT up as a user in your account profile. Follow these steps:

a.  Log in to your Google Analytics Account

b.  Click on the Admin link in the orange bar at the top of the page   

[Screenshot: the Admin button]

c.  From the Profiles menu items on the left, select User Management.

[Screenshot: adding a user in Google Analytics]

d.  In the "Add permissions for:" field, enter this email: cat-ga@cat-ga-05-2016.iam.gserviceaccount.com (Note: If you have previously enabled Google Analytics in your CAT account, you do not need to change this address.)

e.  Click the Add button

f.  Notice that the CAT User is now visible in the list of your Profile's users

2.   Get the View ID

a.  Click on the View Settings menu item

b.  Find the View ID value

[Screenshot: the View ID in View Settings]

c.  Copy the value

d.  Enter that value into the Profile ID field in Job Setup

Be sure that the Base URL of your job is exactly the same as the URL for the Google Analytics account.
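For context, the address you added in step 1 is a Google service account, and the View ID is what identifies your data to the Google Analytics Reporting API. The sketch below (using the google-api-python-client and oauth2client packages; it is not how CAT itself is implemented, and key.json and the View ID value are placeholders) shows why both the permission and the ID are needed:

    from oauth2client.service_account import ServiceAccountCredentials
    from googleapiclient.discovery import build

    SCOPES = ["https://www.googleapis.com/auth/analytics.readonly"]
    creds = ServiceAccountCredentials.from_json_keyfile_name("key.json", SCOPES)
    analytics = build("analyticsreporting", "v4", credentials=creds)

    # A service account can only read views it has been granted access to,
    # and the View ID tells the API which view to report on.
    response = analytics.reports().batchGet(body={
        "reportRequests": [{
            "viewId": "123456789",  # placeholder: your View ID from step 2
            "dateRanges": [{"startDate": "30daysAgo", "endDate": "today"}],
            "metrics": [{"expression": "ga:pageviews"}],
            "dimensions": [{"name": "ga:pagePath"}],
        }]
    }).execute()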

External Links

External links are links that fall outside the combination of the base URL and the Include and Exclude Links settings. If this box is checked, external links won't be included in the general inventory, but they will appear as outbound links from the appropriate pages. If it is not checked, external links will be included in the inventory, but no data is gathered for them (i.e., CAT will not catalog metadata, word count, etc.)
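In terms of the substring sketch from the Include and Exclude Links section above, a link is external when it fails the in-scope test (again, an illustration only, not CAT's actual logic):

    def is_external(url, base, includes, excludes):
        # External = outside the combination of base URL and the
        # Include/Exclude Links (same substring logic as the
        # in_scope sketch above).
        allowed = base in url
        for pattern in excludes:
            if pattern == "*" or pattern in url:
                allowed = False
        for pattern in includes:
            if pattern in url:
                allowed = True
        return not allowed

    print(is_external("http://twitter.com/example", "www.foo.com",
                      includes=["support.foo.com"], excludes=[]))  # True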

Redirects

By default, CAT will follow redirects, since many sites make use of them, but if you don't want CAT to follow them, check the "Don't Follow Redirects" box. Note that this may result in a crawl that doesn't return all the pages you expect. If you're not sure whether your site uses redirects, leave the box unchecked.

Screenshots

Screenshots are gathered in a separate process from the initial page crawl, so they may not appear at the same time as the crawl data. If screenshots are not immediately available when the crawl completes, check back in a few minutes.

More Information

If you have other questions about how CAT works or want more detail on how to set up crawls and manage the results, see our job setup video tutorial, Frequently Asked Questions, and User Guide.