A: Because there are variables such as link density and the parameters set for an individual job, crawl times cannot be precisely predicted. But typical approximate run times are:
A: CAT essentially mimics a browser. When it requests a page URL, it must wait for that page to render before it can scan it for data and additional links. If the site's pages are slow to load, for any reason, CAT slows down accordingly.
A: There are several reasons your crawl may not look as you expected.
Note that jobs may also fail if the server being crawled denies access by returning a 429 or 420 HTTP status code. You may need to notify the owners of the site that you are planning a crawl so that CAT's requests aren't denied.
A: In addition to the reasons a crawl might fail listed in the previous question, there are circumstances under which the target server may reject a CAT request, leaving the crawl unable to proceed. They are:
A: All subscriptions are based on the number of pages crawled, so we only debit your subscription by the number of HTML pages CAT finds and gathers data for. Image, video, document, and code files (.js, .css) do not count against your subscription.
A: No. Once a crawl has completed, the pages are debited from your account. If you wish to run more crawls, whether of the same site or others, you will need to have enough additional pages in your account.
A: CAT parses the HTML looking for URLs in the following tags:
Similarly, if YouTube videos are placed on the page in an iFrame or using embed tags, CAT will capture the src attribute and report that in the Links Out but the videos will not be listed in the videos list in Resource Details.
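To see which embedded videos on a page would show up only as Links Out, you can collect the src attributes yourself. The following is a minimal sketch using Python's standard html.parser module. It is not CAT's actual parser, and the class name EmbedSrcCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class EmbedSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> and <embed> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in ("iframe", "embed"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical page fragment with one iframe and one embed tag.
html = (
    '<p>Intro</p>'
    '<iframe src="https://www.youtube.com/embed/VIDEO_ID"></iframe>'
    '<embed src="https://www.youtube.com/v/VIDEO_ID">'
)

collector = EmbedSrcCollector()
collector.feed(html)
print(collector.sources)
```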
A: There are two reasons you may find discrepancies between the analytics data in your CAT report and in the Google Analytics dashboard.
1) CAT reports Google Analytics results by URL without rollup or summary by URL pattern. For example, CAT GA results for URL patterns like .../example?page=... are reported for the exact match URL and not rolled up to .../example
2) A nuance in Google Analytics configuration can cause Google Analytics to append the host name to the URL path, which prevents results from matching. See this article for instructions for configuring your analytics view to prevent that from happening.
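If you want rolled-up numbers comparable to the Google Analytics dashboard, you can aggregate the per-URL figures after exporting. The sketch below assumes you have already reduced the export to (URL, pageviews) pairs. The function name roll_up_by_path and the sample figures are hypothetical, and dropping the query string is only one possible rollup rule.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def roll_up_by_path(rows):
    """Sum a metric across URLs that differ only in query string or fragment."""
    totals = defaultdict(int)
    for url, pageviews in rows:
        parts = urlsplit(url)
        # Keep scheme, host, and path; drop "?page=..." and "#..." parts.
        key = f"{parts.scheme}://{parts.netloc}{parts.path}"
        totals[key] += pageviews
    return dict(totals)


# Hypothetical exact-match rows, as they might appear in an export.
rows = [
    ("https://www.example.com/example?page=1", 120),
    ("https://www.example.com/example?page=2", 45),
    ("https://www.example.com/about", 30),
]

rolled = roll_up_by_path(rows)
for url, total in rolled.items():
    print(url, total)
```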
A: In-scope resources are those that fall within the parameters set by the combination of a base URL and any include patterns, minus exclude patterns. For these resources, we download and process the HTML for metadata, images and other media, and links in and out.
Links to resources outside this path are recorded (if Ignore External Links is not checked), but HTML is not downloaded or processed and screenshots are not captured. These resources are considered out-of-scope. That means that the URL may be included in Job Details, but Resource Details will not include metadata, images, documents, links in and out, etc.
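The scope rule can be sketched as a small function. This is an illustration only: it assumes the base URL is matched as a prefix and that include and exclude patterns are matched as plain substrings, which may not be how CAT evaluates patterns. The function name is_in_scope is hypothetical.

```python
def is_in_scope(url, base_url, include_patterns=(), exclude_patterns=()):
    """Return True if url falls under the base URL or an include pattern
    and does not match any exclude pattern.

    Assumption: patterns are plain substrings, not wildcards or regexes.
    """
    # Exclude patterns win over everything else ("minus exclude patterns").
    if any(pattern in url for pattern in exclude_patterns):
        return False
    if url.startswith(base_url):
        return True
    return any(pattern in url for pattern in include_patterns)


base = "https://www.example.com/docs/"
print(is_in_scope("https://www.example.com/docs/guide", base))
print(is_in_scope("https://www.example.com/docs/archive/old", base,
                  exclude_patterns=["/archive/"]))
print(is_in_scope("https://blog.example.com/post", base,
                  include_patterns=["blog.example.com"]))
```

URLs for which this returns False correspond to the out-of-scope resources described above: they may be listed, but their HTML is not processed.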
In short, for each URL we discover,
A: The number of pages CAT crawls is limited by your subscription level. If you have a very large or complex site, however, it may be faster and more effective to break the site up into several smaller crawls. If you need assistance determining how to design your crawl, we offer a Crawl Concierge service, for a fee. Contact us for more information.
A: With careful use of the base URL and include and exclude link patterns, you may be able to divide your site structure into several crawlable chunks. Although the dashboard does not currently support combining multiple crawls, you can export your jobs and combine the .csv data into a single spreadsheet.
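Combining the exported .csv files can be done by hand in a spreadsheet or with a short script. The sketch below uses Python's standard csv module and adds a "job" column so each row can be traced back to its crawl. The function name combine_exports, the job labels, and the column names in the sample data are assumptions, so check them against your actual export headers.

```python
import csv
import io


def combine_exports(sources, out):
    """Merge several CSV exports into one.

    sources: iterable of (job_label, file-like object) pairs.
    out: file-like object that receives the combined CSV.
    Returns the number of data rows written.
    """
    rows = []
    fieldnames = ["job"]  # extra column identifying the source crawl
    for job_label, handle in sources:
        for row in csv.DictReader(handle):
            row["job"] = job_label
            # Build the union of all columns, keeping first-seen order.
            for name in row:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.append(row)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# In-memory stand-ins for two exported files (hypothetical contents).
section_a = io.StringIO("URL,Title\nhttps://www.example.com/a,Page A\n")
section_b = io.StringIO("URL,Title,Level\nhttps://www.example.com/b,Page B,2\n")

combined = io.StringIO()
count = combine_exports(
    [("section-a", section_a), ("section-b", section_b)], combined
)
print(combined.getvalue())
```

With real files, open each export with open(path, newline="", encoding="utf-8") and pass the resulting handles in place of the StringIO objects.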
You can also add blocks of pages to your monthly subscription, or combine multiple blocks, to increase your available pages prior to running your job.
If your site is too large and/or you are unable to break it into crawlable sections, contact us. We may be able to help.
A: Because of the way CAT works, by starting from a base URL and cataloging links as it goes, we currently can't restart a crawl where it left off. For this reason, we suggest that if you aren't sure of your site's size, you buy a level up. If you buy a block subscription, you can use any leftover pages any time; with monthly subscriptions, unused pages roll over for a month.
A: Use the Exclude Links field to enter the URL patterns you wish CAT to ignore during the crawl. For more on how to set up a job using Exclude Links, see Power Tips: Job Setup.
A: Screenshots are created in a separate process that follows the initial crawl of the site's links, so they take longer to appear in your data. Check your results later to see if they have arrived.
A: If the status of your job indicates Exceeded MaxPages, that means that your job reached the maximum number of pages remaining in your subscription level. If you need more pages, you can easily upgrade your subscription by going to My Account and selecting the next level up on the Pricing Plans tab. Purchased pages are cumulative, meaning that any additional blocks or upgrade quantities are added to a single pool of available pages.
A: Before CAT can crawl a site, it pings the server to see if there is a response. If the receiving server does not respond or rejects the ping, CAT returns that error. Possible causes are that the receiving site is offline, that the Base URL is not a valid site URL, or that software installed on the server blocks crawlers. If you are seeing that error, double-check the base URL and repaste it into the Base URL field just as it appears in the browser. If you are still seeing the error, you may need to contact the administrator of the site to see whether there is a crawl blocker in place and, if so, whether it can be temporarily lifted to allow the crawl. See the item "Why did my crawl fail before starting?" above for other reasons a crawl may not be able to start.
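Before submitting a job, you can run a similar check yourself. The following is a rough sketch using Python's standard urllib. It is not the check CAT performs, and the function name check_base_url and the User-Agent string are hypothetical. Some servers reject HEAD requests, so an error here does not always mean a crawl would fail.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit

# Status codes this FAQ names as signs that the server is denying access.
DENIED_CODES = {420, 429}


def check_base_url(url, timeout=10):
    """Return a short description of how the server responds to url."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return "invalid: not a full http(s) URL"
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "preflight-check"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return f"ok: HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before its parent class URLError.
        if err.code in DENIED_CODES:
            return f"denied: HTTP {err.code}"
        return f"error: HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # Covers DNS failures, refused connections, and timeouts.
        return f"unreachable: {err}"


# A malformed base URL is caught without any network request.
print(check_base_url("www.example.com/docs"))
```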
A: At this time, CAT can only crawl sites that are publicly accessible.
A: Because CAT requires that a site be publicly accessible, it can't currently be run on an intranet without special arrangement with site management. Contact us to discuss, if you are interested in seeing whether we can work with your intranet manager to enable a crawl.
A: Click the column headers (URL, Type, Size, Date, Level, Title) to sort the list of files. You can also use the File by Type and Status Filters dropdowns to limit the items shown. Change the set of columns you wish to see by clicking Edit View and unchecking the columns you wish to hide. Note that exports reflect any filtering or column views you set in Job Details.
To enhance your crawl data in the dashboard, you can also add your own columns and tag files with custom vocabularies. See Power Tips: Managing Views and Custom Columns for more information.
A: That means that the links themselves have changed, not the content of the page they point to.
A: Cloning a job allows you to copy the job, modify parameters, and re-run the job. Selecting Clone will open the Job Setup view where you can make necessary changes to the job parameters and click Submit. Re-run starts the crawl without changing any parameters and does not require a trip back to Job Setup.
A: A content inventory is a quantitative index of all the pages on a web site. An inventory also typically includes data about the pages—the URLs, the page metadata, the file type, format, date last updated, and the inbound and outbound links.
Q: How does CAT work?
A: CAT crawls web sites to retrieve the URLs and their data using cutting-edge crawling and data analysis technologies. It returns a report that can be viewed from within the CAT dashboard or exported as a .csv file for further manipulation in a program such as Excel.
A: Nope. CAT is cloud-based, accessible from any web browser. All you need to get started is a CAT account. See our pricing page for subscription options.
A: If you have registered with a valid email address but haven't received a confirmation email from Content Insight, check your spam folder to make sure that it wasn't filtered out by your email client. You may wish to add the Content Insight domain to your address book to ensure that you don't miss other email communications.
A: Sure! Just go to our Feature Request page and tell us what you want. We can't promise we can build it, but we'll seriously consider any reasonable requests.
A: Pop on over to our Report a Bug page and let us know what you found. The more detail, the better. Thanks for letting us know!
Have a question or comment? Send it to us. We would love to hear from you.