Frequently Asked Questions

Q:  How long should I expect a crawl to take?

A:  Crawl times cannot be predicted precisely because they depend on variables such as link density and the parameters set for an individual job. Typical approximate run times are:

  • 1000 pages—30 minutes
  • 2000 pages—60 minutes
  • 4000 pages—120 minutes
  • 8000 pages—240 minutes
  • 10000 pages—270 minutes
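
For sizes between the listed figures, a rough estimate can be interpolated from the table above. The sketch below is only an illustration: the function name is hypothetical, the proportional scaling below 1000 pages and the extrapolation above 10000 pages are assumptions, and actual crawl times will vary with the factors already mentioned.

```python
from bisect import bisect_left

# Approximate run times copied from the list above: (pages, minutes).
TYPICAL_RUN_TIMES = [(1000, 30), (2000, 60), (4000, 120), (8000, 240), (10000, 270)]


def estimate_minutes(pages: int) -> float:
    """Roughly estimate crawl time by interpolating between the listed figures."""
    if pages <= 0:
        return 0.0
    sizes = [p for p, _ in TYPICAL_RUN_TIMES]
    first_pages, first_minutes = TYPICAL_RUN_TIMES[0]
    if pages <= first_pages:
        # Assumption: below the smallest listed size, scale proportionally.
        return first_minutes * pages / first_pages
    if pages >= sizes[-1]:
        # Assumption: above the largest listed size, extend the last segment's rate.
        (p0, m0), (p1, m1) = TYPICAL_RUN_TIMES[-2], TYPICAL_RUN_TIMES[-1]
        return m1 + (pages - p1) * (m1 - m0) / (p1 - p0)
    # Linear interpolation between the two neighbouring table rows.
    i = bisect_left(sizes, pages)
    (p0, m0), (p1, m1) = TYPICAL_RUN_TIMES[i - 1], TYPICAL_RUN_TIMES[i]
    return m0 + (pages - p0) * (m1 - m0) / (p1 - p0)
```

For example, estimate_minutes(3000) returns 90.0, halfway between the 2000-page and 4000-page figures.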

Q: My crawl is running more slowly than I expected. Why?

A: CAT essentially mimics a browser. When it requests a page URL, it must wait for that page to render before it can scan it for data and additional links. If the site's pages are slow to load for any reason, CAT slows down accordingly.

Q:  The crawl didn’t return all the pages on my site. Why?

 A:  There are several reasons your crawl may not look as you expected.

  • The base URL is too restrictive—CAT works by detecting URL patterns and thus returns URLs that contain the full pattern represented by the base URL. For example, a base URL like www.example.com/page.html will only return URLs that include the /page.html. For a more expansive crawl, truncate the base URL to a less-restrictive level, such as www.example.com (but note that the URL must still render a page when entered into a browser).
  • The base URL does not render a page. Before you submit your job, paste the base URL into a browser and make sure it renders a page. CAT has to begin from an existing page in order to have a starting point to detect and follow links.
  • The CAT crawling process begins with a base URL and the job setup parameters, such as whether or not redirects are followed and external links are included. If redirects aren’t followed and your site relies on a redirect structure, CAT may not find all the pages you expect.
  • If your site includes multiple domains and you have not included those URL patterns in the Include Links field, they won’t be crawled.
  • HTML anchors are key to the crawl process. Pages without an HTML anchor within the crawl scope are not returned. Resources accessed through JavaScript or server scripts don't always expose an HTML anchor, so they won’t be returned.
  • Resources must be available to anonymous users.
  • Resources must be available by HTTP GET.

Note that jobs may also fail if the server being crawled denies access by returning an HTTP 429 or 420 status code. You may need to notify the owners of the site that you are planning to crawl it so that your crawl isn't denied.

Q: My crawl failed before starting. Why?

A: In addition to the reasons listed in the previous question, there are circumstances under which the target server may reject a CAT request, leaving the crawl unable to proceed. They are:

  • IP address validation against a blacklist (CAT requests come from an AWS IP address)
  • Validation based on the User-Agent string
  • Validation based on cookie support (CAT does not support cookies)
  • Session validation (some required information such as a session id is not included in the request)

Q: How are subscription debits calculated?

A:  All subscriptions are based on the number of pages crawled, so we only debit your subscription by the number of HTML pages CAT finds and gathers data for. Image, video, document, and code files (.js, .css) do not count against your subscription.

Q: If I delete my crawl data, do my available pages reset?

A: No. Once a crawl has completed, the pages are debited from your account. If you wish to run more crawls, whether of the same site or others, you will need to have enough additional pages in your account.  

Q:  What resources does CAT catalog?

A:  CAT parses the HTML looking for URLs in the following tags:

        a href="url"
        link href="url"
        iframe src="url"
        source src="url"
        embed src="url"
        object data="url"
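
To see which URLs on a page would be visible through these tags, the list above can be approximated with Python's standard html.parser module. This is a minimal sketch, not CAT's actual parser: the class and function names are hypothetical, and it does nothing beyond collecting the attribute values named above.

```python
from html.parser import HTMLParser

# Tag/attribute pairs taken from the list above.
URL_ATTRIBUTES = {
    "a": "href",
    "link": "href",
    "iframe": "src",
    "source": "src",
    "embed": "src",
    "object": "data",
}


class LinkCollector(HTMLParser):
    """Collects (tag, url) pairs for the tags listed above (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRIBUTES.get(tag)
        if wanted is None:
            return  # Tags outside the list, such as img or script, are skipped.
        for name, value in attrs:
            if name == wanted and value:
                self.found.append((tag, value))


def collect_urls(html: str):
    """Return the URLs found in the listed tag attributes, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.found
```

Calling collect_urls() on a page's HTML source returns only the URLs carried by those six attributes. A link that exists only inside a script is not reported, which matches the behavior described elsewhere in this FAQ.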

Q:  My site has videos but they weren't returned in the results. Why?

A:  Often, but not always, Flash video is placed on the page using an <embed> tag, and if that embed tag has a src attribute that contains a valid URL, it is discoverable by CAT. But if Flash is placed on the page using JavaScript, CAT will not be able to catalog it.

Similarly, if YouTube videos are placed on the page in an iframe or using embed tags, CAT will capture the src attribute and report it in the Links Out, but the videos will not be listed in the videos list in Resource Details.

Q:  Why don't the Google Analytics results in my crawl match the report I get from Google?

A: There are two reasons you may find discrepancies between the analytics data in your CAT report and in the Google Analytics dashboard.

1) CAT reports Google Analytics results by URL without rollup or summary by URL pattern. For example, CAT GA results for URL patterns like .../example?page=... are reported for the exact match URL and not rolled up to .../example.

2) A nuance of Google Analytics configuration can result in GA appending the host name to the URL path, which causes results not to match. See this article for instructions for configuring your analytics view to prevent that from happening.
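
If you want to compare CAT's exact-match figures with a rolled-up number, you can sum them by path yourself after exporting. The sketch below assumes you have already loaded the per-URL figures into a dictionary. The function name and the input shape are hypothetical, and dropping the query string is only one possible rollup rule.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def roll_up_by_path(pageviews_by_url):
    """Sum per-URL figures by path, ignoring the query string."""
    totals = defaultdict(int)
    for url, views in pageviews_by_url.items():
        # urlsplit separates "/example?page=2" into path "/example" and query "page=2".
        totals[urlsplit(url).path] += views
    return dict(totals)
```

With input such as {"/example?page=1": 10, "/example?page=2": 5, "/example": 3}, the function returns {"/example": 18}.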

Q: Why are some of the pages in my crawl listed as out-of-scope?

A: In-scope resources are those that fall within the parameters set by the combination of a base URL and any include patterns, minus exclude patterns. For these resources, we download and process the HTML for metadata, images and other media, and links in and out. 

Links to resources outside this path are recorded (if Ignore External Links is not checked), but HTML is not downloaded or processed and screenshots are not captured. These resources are considered out-of-scope. That means that the URL may be included in Job Details, but Resource Details will not include metadata, images, documents, links in and out, etc.

In short, for each URL we discover:

If it begins with the BaseUrl it is Internal
- Internal URLs are InScope unless they contain a string from the <Exclude Links> list

If it doesn't begin with the BaseUrl it is External
- External URLs are not InScope unless they contain a string from the <Include Links> list
- External URLs are ignored if <Exclude External Links> is selected
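
The rules above can be sketched as a small function. This is an illustration only, not CAT's implementation. The function name and return labels are hypothetical, the URLs are compared as plain strings, and two points are assumptions: an include pattern is checked before the external-links setting, and exclude patterns also apply to included external URLs.

```python
def classify(url, base_url, include=(), exclude=(), ignore_external=False):
    """Return 'in-scope', 'out-of-scope', or 'ignored' for one discovered URL."""
    excluded = any(pattern in url for pattern in exclude)
    if url.startswith(base_url):
        # Internal URL: in scope unless it contains an Exclude Links string.
        return "out-of-scope" if excluded else "in-scope"
    # External URL: in scope only if it contains an Include Links string.
    # Assumption: exclude patterns still win over include patterns here.
    if any(pattern in url for pattern in include) and not excluded:
        return "in-scope"
    # Assumption: the external-links option only affects URLs not matched above.
    return "ignored" if ignore_external else "out-of-scope"
```

For example, with a base URL of https://www.example.com and an exclude pattern of /blog/, the URL https://www.example.com/blog/post is internal but classified as out-of-scope.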

Q:  Why doesn’t CAT follow JavaScript links?

A:  JavaScript links often do not expose an HTML anchor that is discoverable.

Q:  Is there a page limit on CAT crawls?

 A:  The number of pages CAT crawls is limited by your subscription level. If you have a very large or complex site, however, it may be faster and more effective to break the site up into several smaller crawls. If you need assistance determining how to design your crawl, we offer a Crawl Concierge service, for a fee. Contact us for more information.

Q: What if my site is larger than my subscription pages remaining?

 A:  With careful use of the base URL and include and exclude link patterns, you may be able to divide your site structure into several crawlable chunks. Although the dashboard does not currently support combining multiple crawls, you can export your jobs and combine the .csv data into a single spreadsheet.

You can also add blocks of pages to your monthly subscription, or combine multiple blocks, to increase your available pages prior to running your job.

If your site is too large and/or you are unable to break it into crawlable sections, contact us. We may be able to help.
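
If you take the export-and-combine route described above, the merge can be scripted rather than done by hand. The sketch below assumes each export has a header row with a "URL" column. That column name, the function name, and the choice to keep the first row seen for each URL are all assumptions, so check them against your own export files.

```python
import csv


def combine_exports(paths, output_path, key="URL"):
    """Merge several exported .csv files into one, keeping the first row per URL."""
    fieldnames = []
    rows = {}
    for path in paths:
        # utf-8-sig reads files with or without a byte-order mark.
        with open(path, newline="", encoding="utf-8-sig") as handle:
            reader = csv.DictReader(handle)
            # Build the union of column names, preserving first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                # A URL that appears in more than one crawl is kept only once.
                rows.setdefault(row[key], row)
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows.values())
    return len(rows)
```

The function returns the number of unique URLs written. Columns that exist in only some of the exports are left blank for rows from the others.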

Q: If my crawl completes without capturing the entire site, can it be restarted?

A: Because of the way CAT works, by starting from a base URL and cataloging links as it goes, we currently can't restart a crawl where it left off. For this reason, we suggest that if you aren't sure of your site's size, you buy a level up. If you buy a block subscription, you can use any leftover pages any time; with monthly subscriptions, unused pages roll over for a month.

Q: How can I exclude a section of a site from being crawled?

A:  Use the Exclude Links field to enter the URL patterns you wish CAT to ignore during the crawl. For more on how to set up a job using Exclude Links, see Power Tips: Job Setup.

Q: I checked the box to include screenshots, but I'm not seeing them in my Resource Details. Why not?

A:  The process for creating screenshots is separate from, and follows, the initial crawl of the site links, so screenshots take longer to be returned in your data. Check your results later to see if they have arrived.

Q: What does Exceeded MaxPages mean?

A:  If the status of your job indicates Exceeded MaxPages, that means that your job reached the maximum number of pages remaining in your subscription level. If you need more pages, you can easily upgrade your subscription by going to My Account and selecting the next level up on the Pricing Plans tab. Purchased pages are cumulative, meaning that any additional blocks or upgrade quantities are added to a single pool of available pages.

Q: I got an error message reading "Head test failed" or "Job validation error." What does that mean?

A: Before CAT can crawl a site, it pings the server to see if there is a response. If the receiving server does not respond or rejects the ping, CAT returns that error. Possible causes are that the receiving site is offline, the Base URL is not a valid site URL, or software that blocks crawlers is installed on the server. If you are seeing that error, double-check the base URL and repaste it into the Base URL field just as it appears in the browser. If you are still seeing the error, you may need to contact the administrator of the site to see whether there is a crawl blocker in place and, if so, whether it can be temporarily lifted to allow the crawl. See the item "My crawl failed before starting. Why?" above for other reasons a crawl may not be able to start.
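
Before submitting a job, you can run a similar check yourself. The sketch below assumes the "Head test" is an ordinary HTTP HEAD request, which this FAQ does not confirm. The function name is hypothetical, and the opener parameter exists only so the check can be exercised without network access.

```python
import urllib.error
import urllib.request


def head_check(url, timeout=10.0, opener=urllib.request.urlopen):
    """Send a HEAD request and return (ok, detail) for the given URL."""
    try:
        # Request() raises ValueError for a URL without a scheme, e.g. "www.example.com".
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as error:
        # The server answered but rejected the request (for example 403 or 429).
        return False, f"HTTP {error.code}"
    except (urllib.error.URLError, OSError, ValueError) as error:
        # No usable response: offline host, DNS failure, timeout, or malformed URL.
        return False, str(error)
```

A result of (False, "HTTP 429") suggests the server is throttling or blocking the request. A failure with no HTTP code usually points to an offline site or a mistyped base URL.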

Q:  Can CAT be run on a site or site section that requires a login?

A:  At this time, CAT can only crawl sites that are publicly accessible.

Q: Can I run CAT on my intranet?

A:  Because CAT requires that a site be publicly accessible, it can't currently be run on an intranet without special arrangement with site management. If you would like to see whether we can work with your intranet manager to enable a crawl, contact us to discuss.

Q:  How can I organize the Job Details in the dashboard view?

A:  Click the column headers (URL, Type, Size, Date, Level, Title) to sort the list of files. You can also use the File by Type and Status Filters dropdowns to limit the items shown. Change the set of columns you wish to see by clicking Edit View and unchecking the columns you wish to hide. Note that exports reflect any filtering or column views you set in Job Details.

To enhance your crawl data in the dashboard, you can also add your own columns and tag files with custom vocabularies. See Power Tips: Managing Views and Custom Columns for more information.

Q:  When I run a comparison and see changes in the Links In and Links Out, what does that mean?

A:  That means that the links themselves have changed, not the content of the page they point to.

Q:  What is the difference between Clone and Re-Run?

A: Cloning a job allows you to copy the job, modify parameters, and re-run the job. Selecting Clone will open the Job Setup view where you can make necessary changes to the job parameters and click Submit. Re-run starts the crawl without changing any parameters and does not require a trip back to Job Setup.

Q: What is a content inventory?

A: A content inventory is a quantitative index of all the pages on a web site. An inventory also typically includes data about the pages: the URLs, the page metadata, the file type, format, date last updated, and the inbound and outbound links.

Learn more about content inventories.

Q: How does CAT work?

A: CAT crawls web sites to retrieve the URLs and their data using cutting-edge crawling and data analysis technologies. It returns a report that can be viewed from within the CAT dashboard or exported as a .csv file for further manipulation in a program such as Excel.

Q: Do I need to install anything?

A: Nope. CAT is cloud-based, accessible from any web browser. All you need to get started is a CAT account. See our pricing page for subscription options.

Q:  I registered but haven't received the validation email. Why not?

A:  If you have registered with a valid email address but haven't received a confirmation email from Content Insight, check your spam folder to make sure that it wasn't filtered out by your email client. You may wish to add the Content Insight domain to your address book to ensure that you don't miss other email communications.

Q: I really wish CAT would do ___. Can I request a new feature?

A: Sure! Just go to our Feature Request page and tell us what you want. We can’t promise we can build it, but we’ll seriously consider any reasonable requests.

Q: Oops. I found a bug. What do I do?

A: Pop on over to our Report a Bug page and let us know what you found. The more detail, the better. Thanks for letting us know!

Have a question or comment? Send it to us. We would love to hear from you.

