Turning Dark Data from Cost Center into Gold

In today’s world, where virtually everything has been automated, our businesses produce and store untold amounts of data. A subset of that data is leveraged to create business value or at least insights that can help inform decision-making. Much, if not most, of the data remains untapped and has been termed “dark data” by Gartner and others.

Gartner defines dark data as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.” It includes all data objects and types that have yet to be analyzed for business or competitive intelligence or used to aid business decision-making.

Most businesses are sitting on a proverbial goldmine of dark data today…and creating even more each year as new systems, SaaS application islands, IoT devices, and automation get deployed or acquired. This is in addition to the virtual archaeological dig of legacy IT systems most businesses still run on today.

The keys to future success and business growth are right there under our noses, yet few companies go out and tap this dark data and become enlightened and enriched by it.

It’s What You Do with Dark Data that Creates Business Value

For some, dark data is stored and maintained for regulatory compliance. For others, “data hoarding” is company policy…keep it around in case it’s needed for something someday. Meanwhile, it’s taking up space and driving latent costs underneath the business.

For most, dark data is a cost center today. It’s an unused asset that’s been overlooked and is for the most part simply ignored. For others who are aware of and concerned about the ongoing storage costs, dark data gets periodically pruned and purged, which reduces its costs but destroys potential future value.

With the advances in deep learning, there are now machine learning algorithms and techniques that can turn what has historically been dark data into real business insights and value.

Why Isn’t Dark Data Being Leveraged Today?

There are many challenges that keep data in the dark, including:

  • Nobody is clearly responsible for inventorying, tracking and leveraging dark data at the company.
  • Lack of specific business objectives that can leverage dark data.
  • Technical barriers such as widely varying data formats and data sitting in physically and geographically dispersed, incompatible systems.
  • High costs and scarcity of data scientists who know how to acquire, process, clean and filter raw dark data into usable enlightened data.
  • Lack of vision by the management team to transform the company’s vast dark data troves into business value, perpetuating dark data’s ongoing fate at most companies.
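
As an illustration of the first challenge, inventorying can start very small. The sketch below is only an assumption about what a first pass might look like, not a prescribed method: it walks a file share and summarizes, per file extension, how many files (and how many bytes) have not been modified for a given number of days. The function name inventory_stale_files and the 365-day default are hypothetical choices made for this example.

```python
import os
import time
from collections import defaultdict


def inventory_stale_files(root, min_age_days=365, now=None):
    """Summarize files under `root` not modified for at least `min_age_days`.

    Returns a dict: {extension: {"count": int, "bytes": int}}.
    Hypothetical helper for illustration only.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds per day
    summary = defaultdict(lambda: {"count": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable: skip it
            if info.st_mtime <= cutoff:
                # Group by lower-cased extension; files without one go to "(none)".
                ext = os.path.splitext(name)[1].lower() or "(none)"
                summary[ext]["count"] += 1
                summary[ext]["bytes"] += info.st_size
    return dict(summary)
```

Calling inventory_stale_files("/path/to/share") returns a small report that can be handed to whoever owns the data, which at least makes the size of the problem visible.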

Why Do Something with Dark Data and Why Now?

The recent surge in Artificial Intelligence and Deep Learning advances makes it both convenient and cost-effective to deploy the tools required to automate the analysis of data to generate competitive advantage and new business value.

Deep learning models often entail processing vast amounts of data for proper training. Today there’s a plethora of proven machine learning inference engines, models and algorithms available to quickly generate insights and business value from data. There are also off-the-shelf cognitive services that make tapping into the power of AI/ML quick and cost-effective, including forecasting, image recognition, speech recognition and many others.

It’s now more cost-effective than ever to migrate, store and archive large datasets in the cloud; i.e., it no longer entails massive capital investments that must be justified to the CFO and executive team to create custom data warehouse projects in on-premises datacenters.

However, most of the dark data continues to lurk around the edges of the business or is squirrelled away within various data islands (e.g., Salesforce, accounting systems, e-commerce systems, security systems, warehouse systems, etc.), making it challenging and expensive to mine and leverage dark data quickly and easily to test and prove out its business value.

Getting to a Single Source of Truth

Today, business leaders recognize the enormous value cloud computing delivers. Once senior management also recognizes that leveraging dark data to feed AI and machine learning should be a higher priority, this strategic decision sets the table for what comes next: creating a “single source of truth” for that data.

Fortunately, there are tools that automate extracting, filtering and cleansing data and transporting it to a central location (usually the cloud these days), making the process faster, easier and more cost-effective.

These tools can extract dark data from virtually any data repository and format in which it is stored today globally, then filter and migrate the relevant data to where it can be centrally processed and enlightened. Once the dark data has been migrated into the cloud, it can also be compressed and archived cost-effectively for future use, regulatory compliance, etc.
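
The filter-then-archive step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular tool works: the source is assumed to be CSV text, the relevance rule is a caller-supplied function, and the archive format (gzip-compressed JSON Lines) is simply one convenient choice. The names filter_and_archive and read_archive are hypothetical.

```python
import csv
import gzip
import io
import json


def filter_and_archive(csv_text, keep, archive_path):
    """Keep rows where keep(row) is true and write them, one JSON object
    per line, to a gzip-compressed archive. Returns the number of rows kept."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = 0
    with gzip.open(archive_path, "wt", encoding="utf-8") as out:
        for row in reader:
            if keep(row):
                out.write(json.dumps(row) + "\n")
                kept += 1
    return kept


def read_archive(archive_path):
    """Read the archive back into a list of dicts (for later processing)."""
    with gzip.open(archive_path, "rt", encoding="utf-8") as src:
        return [json.loads(line) for line in src]
```

For example, filter_and_archive(text, lambda r: r["region"] != "", "orders.jsonl.gz") would drop rows with an empty region column and archive the rest. The resulting file could then be copied to cloud object storage with whatever transfer tool is already in use.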

Data Warehouse, Data Lake, or What?

Over the years, many attempts have been made to create the holy grail for data management and analytics: a single source of truth that houses all the company’s analytics and decision support needs. Each of these data storage and corresponding analytic approaches has demonstrated its strengths and weaknesses in practice.

Usually, the costs associated with creating and maintaining data warehouses have turned out to be too high to be sustainable, and certainly far too complex and expensive to house all that dark data as well.

Fortunately, leading cloud providers are delivering solutions to these challenges. For example, Microsoft Azure provides Azure Data Lake Storage (ADLS), a flexible data management solution that accepts data in the Common Data Model (CDM) format. The data is stored cost-effectively using underlying Azure Blob object storage for petabyte-scale files and trillions of objects that can be efficiently queried and leveraged. ADLS provides a compelling foundation for quickly creating a single source of truth in the cloud today. And because ADLS stores data using a CDM-compatible common schema, a broad range of popular tools are instantly compatible, including Power BI, Azure Machine Learning Studio, and more. Numerous Azure Cognitive Services also make it easy to put dark data to use.

AWS provides a rich set of AI, ML and analytics capabilities, with S3 typically serving as the low-cost, highly durable data lake. Once data is extracted, filtered and transformed from its source(s) and moved into S3 buckets, it is well positioned for use by most AWS services, including Amazon Forecast, Amazon Rekognition, Amazon SageMaker and Amazon Redshift, among others.

Bridging the Islands of Dark Data with the Cloud to Create Business Gold

Now that we have a cost-effective place to create a single source of truth for our data in the cloud, how can we bridge the gaps between where the dark data is created and stored today and these cloud services?

Buurst recently announced a new product named Fuusion that is coming soon. As shown below, Fuusion bridges disparate data sources from existing on-prem, SaaS, legacy and new IoT/edge sources with Cloud Services.

Fuusion uses Connectors to data sources and cloud services to quickly bridge incompatible data formats, with the right levels of filtering, routing and data cleansing.

Fuusion eliminates the need, in most cases, for a data science specialist to manually deal with all these data sources and formats. It also eliminates the need to invest in expensive, time-consuming DevOps projects to custom develop these integrations.

Instead, a Fuusion user visually drags and drops a Fuusion Template, a set of data connectors and other processing blocks, onto a web-based canvas in the Fuusion Editor and configures it to quickly create custom Fuusion data flows.

Each data flow acquires, filters, transforms and moves dark data from its present location into a suitable, centralized cloud data lake. Once the data is in the data lake, it’s available for further processing for AI/ML training, inferencing and cognitive services, analytics and/or display.
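
To make the acquire, filter, transform and move stages concrete, here is a conceptual sketch of such a flow in plain Python. It is emphatically not Fuusion’s API or implementation, which this post does not describe; every name here (acquire_csv, acquire_json, has_customer, to_common, run_flow) and the two-field target schema are hypothetical, and an in-memory list stands in for the data lake.

```python
import csv
import io
import json


def acquire_csv(text):
    """Acquire records from CSV text (e.g. an export from one system)."""
    yield from csv.DictReader(io.StringIO(text))


def acquire_json(text):
    """Acquire records from a JSON array (e.g. an export from another system)."""
    yield from json.loads(text)


def has_customer(record):
    """Filter: keep only records that name a customer in either source's field."""
    name = record.get("customer") or record.get("client_name") or ""
    return bool(name.strip())


def to_common(record):
    """Transform: map source-specific fields onto one hypothetical common schema."""
    name = record.get("customer") or record.get("client_name") or ""
    total = record.get("total", record.get("amount", 0))
    return {"customer": name.strip().title(), "amount": float(total)}


def run_flow(sources, keep, transform, sink):
    """Move: push every kept, transformed record into the sink; return the count."""
    moved = 0
    for records in sources:
        for record in records:
            if keep(record):
                sink.append(transform(record))
                moved += 1
    return moved
```

A call such as run_flow([acquire_csv(crm_export), acquire_json(erp_export)], has_customer, to_common, lake) appends normalized records from both sources to lake. Keeping each stage a separate function is a design choice that mirrors the idea of processing blocks wired together in a flow.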

Using the Azure Data Lake example introduced above, Fuusion connects data from one or more disparate data sources, which can be any type of structured or unstructured data, with ADLS by normalizing the data into CDM-formatted data. At that point, Power BI can be used to perform advanced analytics, machine learning and dashboard display, creating business value from what was previously unleveraged dark data.

Another common use case revolves around taking time-series data and creating forecasts. For example, Amazon Forecast delivers highly accurate forecasts and, as an off-the-shelf cognitive service, requires no DevOps or specialized AI/ML data science skills when connected via Fuusion. Fuusion gathers time-series data from one or more Excel spreadsheets (or CSV files), transforms the data, moves it into S3, and then executes Amazon Forecast jobs to generate the forecast, which is returned to the user as Excel-formatted spreadsheet files. Once Fuusion is deployed and configured, users can generate any number of forecast jobs with Excel spreadsheet skills, extending the power of AI and machine learning to where much of dark data lives today: on users’ filesystems and desktops.
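
The “transforms the data” step in a flow like this often amounts to reshaping a spreadsheet. The sketch below shows one plausible version under stated assumptions: the input is a “wide” CSV with a date column plus one column per item, and the output is a “long” table with one row per item and date. The output column names (item_id, timestamp, target_value), the date formats, and whether a header row is wanted are all assumptions for illustration. They should be checked against the forecasting service’s documentation, and nothing here describes Fuusion’s actual transformation.

```python
import csv
import io
from datetime import datetime


def wide_to_long(csv_text, date_column="date", date_format="%m/%d/%Y"):
    """Reshape wide rows (date, item1, item2, ...) into
    (item_id, timestamp, value) tuples, skipping blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # Normalize the spreadsheet date to ISO format (assumed target format).
        stamp = datetime.strptime(row[date_column], date_format).strftime("%Y-%m-%d")
        for item_id, value in row.items():
            if item_id is None or item_id == date_column:
                continue  # skip the date column and any overflow cells
            if value is None or value.strip() == "":
                continue  # blank cell: no observation for this item and date
            rows.append((item_id, stamp, float(value)))
    return rows


def to_csv(rows, header=True):
    """Serialize long-format rows as CSV text, optionally with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    if header:
        writer.writerow(["item_id", "timestamp", "target_value"])
    writer.writerows(rows)
    return out.getvalue()
```

The text returned by to_csv(wide_to_long(sheet_text)) is what would then be uploaded to object storage as the input to a forecast job.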

Fuusion provides an alternative to the expensive, time-consuming custom DevOps projects that are typically required to connect dark data and edge data with powerful cloud services. It also minimizes the amount of data science required to deploy and leverage dark data using off-the-shelf cloud-based cognitive services and other applications.

Buurst Fuusion will bridge off-cloud and dark data with major cloud services in 2H 2020. Stay tuned for more details on Fuusion, including some demos of cloud service integrations mentioned in this post. To learn more about Fuusion and its availability or to register for the beta program, please contact the Buurst team.