Tackling Big Data

By Steven Toole October 24, 2013

Pragmatic Marketer Volume 10 Issue 1

While the term “big data” may be overused these days, the problem it defines is no less a reality. Frankly, hardware and software products developed by some of the people reading this helped create this monster. It’s become faster, easier and cheaper for our customers to churn out more and more data. This has created a problem for our customers—and an opportunity for us.

The two problems associated with all big data are:
 
Cost. It costs money to host, store, find, secure, back up, move, archive and recover data, to name just a few tasks. And by its very nature, unstructured data is hard to find, organize, prune and protect.

Risk. The risk factor lies in holding on too long to sensitive data, such as credit card numbers, social security numbers, business transactions, email conversations and anything else that could spell trouble when landing in the wrong hands.


Data typically does not increase in value with time; it grows more obsolete each day. The more unstructured (and mostly obsolete) data to weed through for discovery and review, the greater the cost and risk.


THE BIG DATA OPPORTUNITY 

The good news is that from every customer pain is born new solutions, product features and capabilities that drive even greater value for your products. And increased value means greater differentiation and revenues for your products. 

Text analytics engines, which are designed to automatically mine and analyze a large volume and variety of text-based data, can help you capture that value by turning big data into usable data.

But if they aren’t already part of your product’s capabilities, you’re not alone. In an October 2012 report, Gartner reported that 2011 IT spending driven by big data functional demands totaled $27 billion. The report also stated, “In 2016, IT spending driven by big data will reach $55 billion.”

The big data wave is still in its relative nascence, and the agile product team still has a chance to gain a first-mover advantage by bringing a more intelligent, big-data-ready product to market well ahead of the competition.

Four primary inflection points have turned big data into an opportunity for technology marketers: the doubling of memory size; advances in network speed, capacity and reduced pricing; solid state storage; and continued CPU performance increases.

To put a finer point on the opportunity text analytics provide in the world of big data, Gartner also clearly recommends that organizations engage in early adoption of machine data and text mining through 2015, and look to embedded capabilities from traditional vendors thereafter. 


TEXT ANALYTICS ENGINES EXPLORED 

First, it’s important to realize that not all text analytics engines are created equal. Let’s look at the three main types of text analytics engines:

  1. Boolean keyword
  2. Lexical techniques
  3. Concept-aware technology

Boolean keyword. Anyone who’s ever done a Google search should have a pretty good grasp of how Boolean keyword search works. Enter keywords joined by operators such as “AND,” “OR,” “NOT” or “NEAR” to narrow the results you would get by simply entering a word or phrase. This works well when the user knows exactly what to look for, and the search term or phrase is distinctive enough to yield few false positives without omitting results that should be produced. Conversely, the shortcomings of Boolean keyword search stem from synonyms, acronyms, abbreviations, misspellings and multilingual limitations.

For example, searching for “dog” will yield anything containing the term “dog,” including hot dog, Snoop Dogg, dog stocks, bird-dogging, Elvis Presley’s 1956 hit single “Hound Dog” and the Charleston RiverDogs minor league baseball team. However, an entry that talks about a German shepherd might not come up if the entire entry omits the term “dog.”
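Both failure modes above can be seen in a few lines of code. This is a minimal sketch of Boolean-style matching, with invented documents and a “dog AND NOT hot” query; real engines add indexing, stemming and proximity operators.

```python
# Toy documents for illustration (not from any real corpus).
docs = {
    1: "Adopted a German shepherd from the shelter last week",
    2: "The Charleston RiverDogs won the minor league pennant",
    3: "Our dog loves long walks, but not hot dog stands",
}

def matches(text, all_of=(), none_of=()):
    """True if text contains every term in all_of and no term in none_of."""
    lowered = text.lower()
    return all(t in lowered for t in all_of) and not any(t in lowered for t in none_of)

# Query: "dog" AND NOT "hot".
hits = [i for i, d in docs.items() if matches(d, all_of=["dog"], none_of=["hot"])]
print(hits)  # [2]
```

Note the result: the RiverDogs document matches (a false positive, because “dog” appears inside “RiverDogs”), the German shepherd document is missed entirely, and the one document genuinely about a dog is excluded by the “NOT hot” clause.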

Lexical techniques. This approach uses word lists, dictionaries and thesauruses. It can be more accurate than Boolean keyword search because it is based on predefined lists of related terms. Using the “dog” example, someone would need to build a list of all the possible representations of the word “dog,” such as Doberman, German shepherd, beagle, hound and so on. In this case, searching for “dog” references the dog library and produces results containing any of the terms in the library. The shortcoming of this technique is that libraries are labor intensive to build, and they need to be updated constantly as new terms like “labradoodle” emerge. They also need to be populated with common abbreviations (such as “Cardi” for Cardigan Welsh corgi), acronyms (such as BMD for Bernese mountain dog) and misspellings (on the off chance a user can’t correctly spell dachshund, shih tzu, schipperke or chihuahua). And if the searchable content spans more than one language, a separate table needs to be created for every search term in every language.
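A lexical engine can be sketched as a query-expansion step in front of ordinary matching. The lexicon and documents below are invented for illustration; maintaining the synonym list is exactly the burden described above.

```python
import re

# Hand-built synonym list -- the labor-intensive part of lexical search.
lexicon = {
    "dog": {"dog", "doberman", "german shepherd", "beagle", "hound",
            "labradoodle", "cardi", "bmd"},
}

docs = {
    1: "Adopted a German shepherd from the shelter last week",
    2: "My labradoodle sheds less than my old beagle did",
    3: "Snoop Dogg released a new album",
}

def lexical_search(term, documents):
    """Expand the query term via the lexicon, then match whole words/phrases."""
    terms = lexicon.get(term, {term})
    return [i for i, d in documents.items()
            if any(re.search(r"\b" + re.escape(t) + r"\b", d.lower())
                   for t in terms)]

print(lexical_search("dog", docs))  # [1, 2]
```

Unlike the plain keyword version, this finds the German shepherd and labradoodle documents without either containing the word “dog,” and the word-boundary match keeps Snoop Dogg out, but only because someone curated the list first.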

Concept-aware analytics engines. This last type is used by U.S. federal government intelligence agencies, as well as thousands of law firms, because it is the most effective approach to managing unstructured big data. It organizes unstructured content conceptually, the way humans do, without the need to build complex libraries or keyword lists.

Concept-aware analytics engines work, ironically, like a hound dog that gets trained on a scent—searching for anything containing that scent. Concept-aware engines are fed sample content (the “scent”) that conceptually represents a category, such as dogs. Sample content could be documents, emails or web content (in whole or part). Using our dogs example, a user would identify several sample documents about dogs. The engine maps the concepts in those documents, converts the text into mathematical representations and calculates the relationships between words using a high-dimensional mathematical space. It can then find conceptually related results (i.e., “find more like this”), regardless of misspellings, acronyms, abbreviations, synonyms and even language.

No passages or documents about hot dogs, Snoop Dogg or the Charleston RiverDogs would be returned unless they also referenced the animal (such as the RiverDogs’ mascot, Snoop Dogg’s canine pet or the description of a dachshund). Entries containing abbreviations (Cardi), acronyms (BMD) and misspellings (“Dockshound”) would all be included, even if they never contain the word “dog” at all. The system can be trained in any language simply by using example text in the desired language.
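The “train on examples, find more like this” workflow can be illustrated with a deliberately simplified stand-in. Commercial concept-aware engines learn term relationships from large training corpora and work in a reduced latent space; the sketch below (all sample text invented) only builds term-frequency vectors from example documents and ranks candidates by cosine similarity, which is enough to show the idea.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector for a passage."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Training examples: sample passages about the target concept (dogs).
# Note that neither sample contains the word "dog".
samples = [
    "the beagle is a friendly hound that loves to play fetch",
    "our german shepherd puppy chews every leash and collar we buy",
]
profile = vectorize(" ".join(samples))

candidates = {
    "shelter": "the shelter has a hound and a shepherd puppy up for adoption",
    "stocks":  "analysts called the airline a dog stock this quarter",
}
ranked = sorted(candidates,
                key=lambda k: cosine(profile, vectorize(candidates[k])),
                reverse=True)
print(ranked)  # ['shelter', 'stocks']
```

The shelter passage ranks first even though it never uses the word “dog,” while the “dog stock” passage ranks last despite containing it — the inverse of what keyword matching produces.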

Weeding through millions of unstructured documents and emails is a big task, but concept-aware text analytics provide a fast and effective way to get right to the pertinent conceptually related items.

Once the desired categories are defined using example text, concept-aware auto-categorization sheds light on the “dark data” comprising a large amount of big data. Using examples of what the user is looking for, the system can “find more like this” and the user can take whatever action is desired.


GET STARTED

Regardless of your product’s market, you can drive increased value for your customers by using text analytics engines to reduce the big data problem—and to increase the big data benefits. Here are eight use cases where concept-aware text analytics engines can tackle big data.

1. Archiving. Sample documents of old email newsletters and outdated marketing documents can be used as examples to find similar documents that can be considered for defensible deletion, dramatically reducing the clutter without having to manually inspect each document and email.

2. Compliance. Once the junk has been pared down, concept-aware categorization can be used to enable greater precision in determining exactly which documents and messages are required to be archived—and for how long—according to retention policies and regulatory requirements.

3. Collaboration. Auto-categorization dramatically improves the ability of users to consume and properly apply internal research assets and intellectual property that can be leveraged elsewhere in the enterprise or for external consumption. It makes documents easier to find, dramatically improving collaboration, sharing and syndication of valuable content.

4. Social media/brand management. Social media is a big part of the “big data boom.” Highly unstructured, it lends itself to organization, grouping similar content together, finding related terms and instantly identifying new terms as they evolve.

5. Content management. Web content that’s not categorized limits web visitors’ ability to find what they’re looking for, and limits the publisher’s ability to monetize valuable content (even if it’s crowdsourced content). Auto-categorizing web content as it’s produced enables web publishers to apply far richer categories to content than humans typically do and is far more cost effective.

6. Databases (ERP, CRM, etc.). Unstructured content exists even within structured databases. Concept-aware analytics engines provide users with a far more effective way to slice through mountains of unstructured content in databases and organize it for greater business value and more informed business decisions.

7. Records management. Platforms that categorize company records have been exposed to the challenges of unstructured big data for some time. Concept-aware categorization can be applied to records management as a more effective way to conceptually group or tag documents and apply the appropriate policies to those document sets. Whether the policy relates to archival, defensible deletion, retention, sharing, restricting or any other action, concept-aware categorization analytics engines can help address records management challenges in a highly effective, cost-efficient way.

8. Security, privacy and forensics. Content that either no longer has value for the organization or is not marked for retention through compliance could be an unnecessary liability. Sensitive customer data, such as medical records, Social Security numbers, credit card numbers or (worse yet) illicit materials, is a virtual time bomb. Concept-aware auto-categorization can reduce risks by enabling you to identify these materials, dispose of them in a highly defensible way and demonstrate that your company’s information-governance policies are enforceable and consistent.

Hype aside, big data poses real challenges and real opportunities for your customers. The pace at which big data is increasing is nothing short of mind boggling, creating market opportunities that are growing in size daily.

Turning unstructured big data into reduced risk, reduced cost and increased value depends on visionary product teams who have identified customer challenges and have made a commitment to addressing these challenges with innovative solutions.

Concept-based auto-categorization has proven itself as a highly effective, extremely fast and incredibly precise approach. The possibilities are endless for applying it to big data to address its major obstacles and to harvest its broad benefits.

Steven Toole

Steven Toole is vice president of marketing for Content Analyst Company, based in Reston, Va. (www.ContentAnalyst.com). Content Analyst is a leading developer of text analytics software engines used by dozens of software product companies, information services firms and systems integrators internationally. He can be reached at smtoole@contentanalyst.com.