Data Wranglers, Unite!

Aug 20, 2014

All you have to lose is your frustration

A recent article in the New York Times, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights,” discusses the challenges data scientists face in dealing with the sheer volume of data required to make smart business decisions. The article refers to this effort as “data wrangling,” “data munging,” or, our favorite, “data janitor work.”

Content strategists, content marketers, site managers, user experience architects, and SEO experts all deal with large amounts of information too and have similar challenges in extracting insights from the “unruly digital data.”

We created the Content Analysis Tool (CAT) precisely to address that issue and, although we didn’t make the New York Times (alas!), we share the goal of the tools that were discussed: “to automate the gathering, cleaning, and organizing of disparate data, which is plentiful but messy.” If you’ve ever sat in front of a giant spreadsheet full of URLs and wondered how you were going to turn that data into a meaningful strategy for a website, you’ve felt the pain.

Gathering all the information that might possibly be relevant to your content project can be a time-consuming, frustrating task. If all that janitor work doesn’t feel like part of your job description and you’d rather get on with gathering insights than gathering data, give CAT a try.

Let CAT do the wrangling

How do you start wrangling your site data? Run your CAT crawl and include your Google Analytics data (if you use Google Analytics on your site) so that it appears right in your inventory. In your dashboard, use the search, sort, and filter tools to find the information that’s most meaningful to you. Looking for all pages within a particular section of the site? Filter by URL to follow the directory structure. Looking for pages with missing titles? Sort on the meta title field. Looking for pieces of content that are very long or very short? Sort on the word count column. Want to see what’s working and what isn’t? Look at your Google Analytics columns and find the high and low points. Want to see just pages or just PDFs? Filter by file type and hide the stuff you don’t want to see right now. When you’ve found a set of files you want to mark in some way, create custom columns and tag them with your own annotations: owner, status, subject… whatever is useful for your project.
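For readers who like to see the logic spelled out, the filtering, sorting, and tagging steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how CAT works internally: the CSV layout, the column names (url, file_type, meta_title, word_count, pageviews), and the helper functions are all assumptions made up for this example.

```python
import csv
import io

# Hypothetical inventory export. The column names and rows are assumptions
# for illustration only; they are not CAT's actual export format.
SAMPLE_EXPORT = """url,file_type,meta_title,word_count,pageviews
https://example.com/,html,Home,420,1500
https://example.com/products/widget,html,,95,30
https://example.com/products/guide.pdf,pdf,Widget Guide,5200,12
https://example.com/about/team,html,Our Team,310,240
"""


def load_inventory(text):
    """Parse CSV text into a list of dicts, converting numeric columns to int."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["word_count"] = int(row["word_count"])
        row["pageviews"] = int(row["pageviews"])
    return rows


def in_section(rows, prefix):
    """Filter by URL: keep rows whose URL starts with the given prefix."""
    return [r for r in rows if r["url"].startswith(prefix)]


def missing_titles(rows):
    """Keep rows whose meta title is empty or whitespace."""
    return [r for r in rows if not r["meta_title"].strip()]


def by_word_count(rows, descending=True):
    """Sort rows by word count, longest first by default."""
    return sorted(rows, key=lambda r: r["word_count"], reverse=descending)


def of_type(rows, file_type):
    """Filter by file type, for example 'html' or 'pdf'."""
    return [r for r in rows if r["file_type"] == file_type]


def tag(rows, column, value):
    """Add a custom column to each row, like an owner or status annotation."""
    for r in rows:
        r[column] = value
    return rows


inventory = load_inventory(SAMPLE_EXPORT)

# Mark every page that is missing a title so it can be followed up later.
tag(missing_titles(inventory), "status", "needs title")

for row in by_word_count(in_section(inventory, "https://example.com/products/")):
    print(row["url"], row["word_count"], row.get("status", ""))
```

Each helper returns a plain list, so the filters can be chained in any order, much like stacking filters in a dashboard.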

If you’ve crawled this particular site before and just want to check out what’s been added, changed, or deleted, use job compare to see the differences between the two jobs. It’s super handy for finding just the new pages if you want to make sure they’ve been properly tested or for finding pages that have been deleted so you know to set up redirects.
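The idea behind comparing two jobs can be sketched roughly as follows. This is a guess at the general technique (set differences on URLs plus a per-page fingerprint), not a description of CAT's implementation. The function name, the sample URLs, and the use of a simple string fingerprint per page are all hypothetical.

```python
def compare_jobs(old, new):
    """Compare two crawls.

    old and new are dicts mapping a URL to some fingerprint of the page
    (a content hash, a word count, a last-modified date, and so on).
    The choice of fingerprint is an assumption made for this sketch.
    """
    old_urls, new_urls = set(old), set(new)
    return {
        # Present only in the newer crawl: pages to test.
        "added": sorted(new_urls - old_urls),
        # Present only in the older crawl: candidates for redirects.
        "deleted": sorted(old_urls - new_urls),
        # Present in both crawls, but the fingerprint differs.
        "changed": sorted(u for u in old_urls & new_urls if old[u] != new[u]),
    }


# Made-up sample data for two crawls of the same site.
march_crawl = {"/": "a1", "/about": "b2", "/old-promo": "c3"}
june_crawl = {"/": "a1", "/about": "b9", "/new-product": "d4"}

diff = compare_jobs(march_crawl, june_crawl)
for kind, urls in diff.items():
    print(kind, urls)
```

The "deleted" list is the one to walk through when setting up redirects, and the "added" list is the one to hand to whoever tests new pages.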

The data CAT gathers, the filters, and the other features, such as integrated analytics and custom columns, were designed to help you focus on the information most likely to be useful in a typical content project. If there are other features or data points you would find useful, please let us know. We are always adding to and improving CAT, and customer requests help us prioritize our roadmap.




Paula Land


Paula Land is co-founder and CEO of Content Insight and author of Content Audits and Inventories: A Handbook.
