The Content Analysis Tool (CAT) crawls web sites and returns data for further analysis, supporting a wide variety of activities, including content management, data mining, business intelligence, and snapshots in time. The content inventories created by CAT can be viewed from within the dashboard or exported as a .csv file suitable for further analysis in tools such as Excel.
CAT is a web-based software-as-a-service solution, so there is nothing to download or install. Simply go to the Pricing Plans page, set up an account, select your subscription level, and get started.
CAT allows you to set up jobs and fine-tune results by telling the crawler exactly which URL paths and patterns to follow and what data to return for each URL fetched.
The Dashboard view shows what's in your job queue and your list of completed jobs, and allows you to take a number of actions, including viewing all job data, adding custom columns and tagging files, re-running a job, or deleting it.
Key features of CAT include
In CAT, a site crawl is referred to as a Job. To set up a new job, select the Job Setup tab.
The Job Setup tab
Setting up a Project allows you to group multiple jobs, similar to files in a folder. For example, you may have a project for each web site you inventory or for each client. It is not required that you create a project for each job, but it is useful for organizing multiple crawls.
Your Project names will be retained in a project list. Once you have more than one, a dropdown will allow you to select a project to which to add any new jobs.
Each job is an individual crawl. To set up a job, give it a name, a description, and a base URL from which to start.
The first step in setting up a job (or crawl) in CAT is setting the base URL from which CAT will start the crawl.
Before you enter the URL in CAT, enter it in a browser and make sure it's valid and that it does not redirect. If it redirects to another URL immediately, you'll need to enable redirects (see below).
CAT will take that URL pattern literally: unless you tell it otherwise via the advanced settings, it will catalog only URLs that match the same base pattern. That means that if your site includes sub-domains with a different pattern, you will need to add them to the Include Links box if you want them included in your crawl.
If Follow redirects is selected, the crawler traverses redirects for the link. If not selected, the crawler records that the link was redirected but doesn't traverse and return data.
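The redirect check described above can also be run outside the browser. The following is a minimal sketch in Python and is not part of CAT: the helper names probe and describe are hypothetical, and the sketch assumes the URL is supplied with its scheme (http:// or https://).

```python
import urllib.error
import urllib.request

REDIRECT_CODES = {301, 302, 303, 307, 308}


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response
        # instead of silently fetching the redirect target.
        return None


def probe(url, timeout=10):
    """Fetch url once and return (status_code, Location header or None)."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


def describe(status, location=None):
    """Turn a probe result into a hint for the Job Setup form."""
    if status in REDIRECT_CODES:
        return f"redirects to {location}: enable Follow redirects"
    if status == 200:
        return "valid, no redirect"
    return f"unexpected status {status}: check the URL"
```

Calling describe(*probe(url)) with your intended Base URL gives a one-line hint about whether Follow redirects is likely to be needed.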
When Exclude external links is selected, any link that points outside the domain of the base URL and the Include Links you designate is out of scope and is ignored. If this box is unchecked, CAT records information about the resource the link points to, such as server status (for example, 200 OK), resource type (for example, text/html or image/png), and other data. Note: Checking this box can speed up your crawl.
External resources are never fetched.
Include Links is a list of link patterns you wish to have crawled in addition to the Base URL. Enter link patterns or fragments here, separated by spaces.
In Include Links, shorter URL strings increase the likelihood of matches and will return more results.
Exclude Links tells the crawler which paths to ignore, allowing you to fine-tune your results.
If your site includes sections that are on a different sub-domain (and therefore the URLs don't match the Base URL pattern), add those sub-domains in the Include Links box if you want them included in your CAT crawl.
For example, to crawl www.foo.com together with its support section at support.foo.com, your setup would look like this:
Base URL: www.foo.com
Include Links: support.foo.com
To exclude particular directories or sub-domains, list them in the Exclude Links box. For example, if you are crawling an e-commerce site and don't want hundreds or thousands of product pages returned, add that URL pattern to the Exclude Links.
Sometimes you may wish to crawl only a specific directory within your site. CAT makes that possible, but you do need to be careful in how you set up your job parameters. Set the directory as your Base URL, but also add it to the Include Links box and add an asterisk (*) to the Exclude Links box so no other sections are crawled. For example, if you wanted to crawl just the Resources section of content-insight.com, your setup would look like this:
Base URL: www.content-insight.com/resources
Include Links: www.content-insight.com/resources
Exclude Links: *
CAT does not support wildcard matching. Use of the asterisk is supported only when used as shown above and only to exclude everything other than what is encompassed in the Base URL + Include Links scope.
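To make the interaction of Base URL, Include Links, and Exclude Links easier to reason about, here is a minimal sketch in Python of one way such a scope check could work. It is an illustration, not CAT's actual implementation. The function names are hypothetical, and the sketch assumes that patterns are matched as plain substrings, that the Base URL matches by host name, that exclude patterns take precedence over include patterns, and that a lone asterisk limits the crawl to the Base URL path plus the Include Links.

```python
def normalize(url):
    """Lower-case a URL or pattern and drop its scheme, if any."""
    url = url.strip().lower()
    if "://" in url:
        url = url.split("://", 1)[1]
    return url


def host_of(url):
    """Return the host part of a URL, for example 'www.foo.com'."""
    return normalize(url).split("/", 1)[0]


def in_scope(url, base_url, include=(), exclude=()):
    """Decide whether url falls inside the crawl scope (assumed rules)."""
    target = normalize(url)
    included = any(normalize(p) in target for p in include)
    if "*" in exclude:
        # Asterisk: keep only the Base URL path and the Include Links.
        return included or target.startswith(normalize(base_url))
    if any(normalize(p) in target for p in exclude):
        # Assumption: an exclude pattern always wins over an include.
        return False
    return included or host_of(target) == host_of(base_url)


# Sub-domain example from the text.
print(in_scope("https://support.foo.com/faq", "www.foo.com",
               include=["support.foo.com"]))
# Directory-only example from the text.
base = "www.content-insight.com/resources"
print(in_scope("http://www.content-insight.com/about", base,
               include=[base], exclude=["*"]))
```

Under these assumptions, a shorter include pattern such as foo.com would match both support.foo.com and any other foo.com sub-domain, which is why shorter strings return more results.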
If Include Screenshots is selected, CAT will generate and store a snapshot-in-time of each HTML page. The images are viewable in the Resource Details view and can be downloaded by opening in a browser window and saving.
Including screenshots may cause the job to take longer to complete. Images will be captured as soon as possible, but may be captured after the crawl itself has completed.
Your subscription level limits the number of pages CAT will crawl within the subscription period. If you wish to set a maximum for a particular crawl, enter the page limit in the Maximum Pages field. The crawl begins at the top level of the base URL, and each link is followed the first time it is detected (in order to avoid duplicates). When the limit is reached, the crawl stops, and the Job Queue indicates that the maximum number of pages was reached.
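The page limit behavior described above can be pictured with a small simulation. This is a hedged sketch in Python that uses an in-memory link map instead of real HTTP requests. The function name crawl and the breadth-first visiting order are assumptions for illustration, since the text does not state the order in which CAT visits links.

```python
from collections import deque


def crawl(start, links, max_pages):
    """Visit pages from start, following each link once, up to max_pages.

    links maps a page to the pages it links to. Returns the list of
    fetched pages and a flag that is True when the crawl stopped with
    pages still waiting, i.e. the limit cut it short.
    """
    seen = {start}           # every URL detected so far
    queue = deque([start])   # detected but not yet fetched
    fetched = []
    while queue and len(fetched) < max_pages:
        page = queue.popleft()
        fetched.append(page)
        for target in links.get(page, []):
            if target not in seen:   # follow a link only the first time
                seen.add(target)
                queue.append(target)
    return fetched, bool(queue)


site = {"/": ["/a", "/b"], "/a": ["/", "/c"], "/b": ["/c"], "/c": []}
print(crawl("/", site, max_pages=3))   # (['/', '/a', '/b'], True)
print(crawl("/", site, max_pages=10))  # (['/', '/a', '/b', '/c'], False)
```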
You can always purchase more pages and storage to supplement your subscription level. See the Pricing page for details and options.
If there is a Google Analytics account associated with the site you are crawling, you can grant CAT access to that data to gather and display in the job details and resource details. Including this data in your CAT job data is simple, but requires a few extra steps to get set up.
1. Add CAT as a user
In order for CAT to gather the analytics data, you need to set CAT up as a user in your account profile. Follow these steps:
NOTE: If you have previously set up Google Analytics access, you do not need to change this email account. Use your existing account setup.
2. Get the View ID
Be sure that the Base URL of your job is exactly the same as the URL for the Google Analytics account.
The CAT dashboard tab is your console for reviewing and managing your in-progress and completed inventory jobs. From this tab, you can view the job queue, access completed job information, select jobs for comparison and navigate to the results, modify and re-run jobs, archive jobs, and delete completed jobs.
The CAT dashboard
The Job Queue lists jobs that are scheduled or running, shows the status of each job in progress, and allows you to cancel jobs if they have not completed.
Canceling a job means that any data that has been gathered will be deleted and no longer accessible.
When a job has finished running, it will appear in the Completed Jobs section, organized by run date (with most recent jobs at the top of the list), then project name.
The Completed Jobs list allows you to view the project a job is assigned to, the name of the job, the description, and run date, as well as select from a set of actions.
In the Completed Jobs List, you can view the results of a completed job by clicking the Open icon.
You can also select two jobs for comparison.
Cloning a job allows you to copy the job, modify parameters, and re-run the job. Selecting Clone will open the Job Setup view. Make necessary changes to the job parameters and click Submit.
Re-run is a quick way to start a new job that exactly recreates the existing one, without going through Job Setup.
Edit allows you to easily move a job to a different project, rename it, or add or modify the description. Click the Edit icon, make your changes, and click the Save icon to save your changes.
Deleting a job will remove it from the list and delete all data.
When a job has completed, it can be viewed by clicking the Open icon from the Actions column.
Job Summary and Details view
The Job Summary lists the total number of files found in the crawl, by type.
The filters affect the list of files shown in the Job Detail list. If no filters are selected, all files are shown. Check and uncheck the boxes next to the types to limit the results below.
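The summary and filter behavior can be sketched in a few lines of Python. This is an illustration only: the record layout (a dict with "url" and "type" keys) and the function names are hypothetical and do not reflect how CAT stores its data.

```python
from collections import Counter


def summarize_by_type(resources):
    """Count resources per type, as in the Job Summary."""
    return Counter(r["type"] for r in resources)


def filter_by_type(resources, selected_types):
    """Return the resources to list; no selected filters means show all."""
    if not selected_types:
        return list(resources)
    return [r for r in resources if r["type"] in selected_types]


resources = [
    {"url": "www.foo.com/", "type": "text/html"},
    {"url": "www.foo.com/about", "type": "text/html"},
    {"url": "www.foo.com/logo.png", "type": "image/png"},
]
print(summarize_by_type(resources))
print([r["url"] for r in filter_by_type(resources, {"image/png"})])
```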
From the Completed Job view, a number of actions can be taken on the data:
Selecting export allows you to download the crawl data as a comma-separated .csv file for import into another program, such as Excel, for further manipulation. See Exporting Job Data, below, for more detail.
View Job Parameters takes you back to the Job Setup view, in read-only mode, so you can review how the job was set up.
Re-run allows you to re-run the job exactly as configured.
Cloning a job allows you to copy the job, modify parameters, and re-run the job. Selecting Clone will open the Job Setup view allowing you to change any of the settings before re-running.
Deleting a job will remove it from the list and delete all data.
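Once a job has been exported, the .csv file can be processed with a script as well as with a spreadsheet. The sketch below, in Python, reads an export and lists the URLs that did not return a 200 status. The column names URL, Type, and Status are assumptions made for this example, so check the header row of your own export and adjust them. The in-memory sample stands in for a real file.

```python
import csv
import io


def load_export(csv_file):
    """Read an exported job file into a list of dicts keyed by column header."""
    return list(csv.DictReader(csv_file))


def urls_with_status_other_than(rows, ok_status="200"):
    """Return URLs whose (assumed) Status column differs from ok_status."""
    return [row["URL"] for row in rows if row["Status"] != ok_status]


# Stand-in for a real export. With a downloaded file you would use
# open("your-export.csv", newline="") instead of io.StringIO.
sample = io.StringIO(
    "URL,Type,Status\n"
    "www.foo.com/,text/html,200\n"
    "www.foo.com/old,text/html,404\n"
    "www.foo.com/logo.png,image/png,200\n"
)
rows = load_export(sample)
print(urls_with_status_other_than(rows))  # ['www.foo.com/old']
```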
To change the set of columns that appears in the Job Detail view, click Edit View from the Actions menu. Checkboxes appear next to the columns that can be hidden; uncheck the ones you wish to hide and click Save View.
Create up to three custom columns and fill them with your own tags. You can edit directly in the cells or create a set of values; values will appear in a drop-down selector in the cells.
To add custom columns and vocabularies:
To view or edit custom column values in Resource Details, see the Custom Tags and Notes section. There you can view or change the values set in Job Detail or add values if you haven’t previously.
You do not have to create a set of values. You can also edit directly within the cells of the Job Detail table.
The Job Detail list includes the following data:
In-scope resources are those that fall within the parameters set by the combination of a base URL and any include patterns, minus exclude patterns. For these resources, we download and process the HTML for metadata, images and other media, and links in and out.
Links to resources outside this path are recorded (if Exclude external links is not checked), but HTML is not downloaded or processed and screenshots are not captured. These resources are considered out-of-scope.
To view the details of a listed resource, click the green arrow at the end of the row. Resource Detail View opens.
Resource detail view
In the Resource Detail view, if you chose to include screenshots in Job Setup, you will see a snapshot-in-time of the page accompanied by all the details captured during the crawl.
Images will be captured as soon as possible, but may be captured after the crawl itself has completed.
The following data is available in this view:
A key feature of CAT is the ability to compare one completed job to another and see what has changed, been added, or deleted. Select jobs for comparison by clicking the checkboxes in the Compare column and clicking Compare selected jobs.
The Job Comparison screen will open.
The Job Summary indicates the two jobs being compared and a summary of the changed files.
The file list shows original and changed, added, or deleted files. To view changes, click the green arrow to the right of the original file to see the comparison results in detail.
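The idea behind a job comparison can be illustrated with a short Python sketch. It is not CAT's implementation: the function names are hypothetical, and the sketch assumes each crawl is reduced to a mapping from URL to a fingerprint of the page content (here a SHA-256 hash), so that a differing fingerprint marks a changed file.

```python
import hashlib


def fingerprint(content):
    """Return a stable hash of a page's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def compare_jobs(original, rerun):
    """Compare two {url: fingerprint} mappings from two crawls."""
    added = sorted(rerun.keys() - original.keys())
    deleted = sorted(original.keys() - rerun.keys())
    changed = sorted(
        url for url in original.keys() & rerun.keys()
        if original[url] != rerun[url]
    )
    return {"added": added, "deleted": deleted, "changed": changed}


first = {
    "www.foo.com/": fingerprint("Welcome"),
    "www.foo.com/about": fingerprint("About us"),
    "www.foo.com/old": fingerprint("Old page"),
}
second = {
    "www.foo.com/": fingerprint("Welcome"),
    "www.foo.com/about": fingerprint("About our team"),
    "www.foo.com/new": fingerprint("New page"),
}
print(compare_jobs(first, second))
```

In this example the about page is reported as changed, the old page as deleted, and the new page as added, while the unchanged home page does not appear in any list.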
If you wish to export job data from CAT for further manipulation in another program, such as Excel, select Export from the Job View. The .csv file that downloads contains the following data: