The Top 3 Business Drivers for Data Virtualization

• Data virtualization offers best-in-class data integration capabilities to accelerate your analytics, data science and BI initiatives.
• Data virtualization empowers businesses through rapid data discovery, unified data access and the efficiencies of collaborative analytics.
• Data virtualization gives business analysts and power users the self-sufficiency to create as-needed custom views that present information precisely as required for each unique business initiative.
• Data virtualization can save countless hours by eliminating typical roadblocks such as difficult-to-access data, the need to fund lengthy ETL projects, and the headaches of informal and inconsistent analytics calculations based on siloed data within organizations.
• Data virtualization provides these capabilities by abstracting away the complexity of locating, joining and filtering across multiple data sources simultaneously. Even complicated transformations, cleansing and aggregations can be performed through a visual interface without the need for advanced SQL development skills.

Introduction to Data Virtualization

Many organizations face data integration and accessibility challenges as they seek to deliver ever-increasing amounts of data into the hands of more people for exploration and analysis. Data virtualization is an approach and set of technologies and practices to address these challenges and to empower organizations with data. Though data virtualization is not new, or without its complexities, businesses stand to gain value and efficiencies through adoption. Specifically, three primary capabilities are driving businesses towards data virtualization: data unification, business agility and synergies with data governance.

• Data unification enables discovery for enterprise analytics by providing a single point of access to explore, manipulate and leverage enterprise information assets.
• Business agility in data exploration and discovery accelerates time to insight.
• Data virtualization is an effective catalyst for data governance, minimizing redundant and repetitive effort and driving standardization of KPIs, metrics and reports – improving confidence in the quality and accuracy of the underlying data.

Enabling Discovery through Data Unification – Quick and Efficient Data Access

Data virtualization provides the crucial function of unifying data sources, centralizing access through a single location. Data unification is the process whereby multiple disparate data sources are made accessible from one location without the need for physical data integration, copying or moving data. This approach quickly creates a single repository in which analysts can explore, discover, and query the entire depth and breadth of enterprise information.

By unifying data sources where they exist, rather than copying data to a central location, multiple disparate data stores can be integrated regardless of geographic location and without the delays caused by data movement. As a result, data virtualization accelerates data science, business analytics and business intelligence functions by increasing the breadth of data availability, which in turn enables self-sufficiency.
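As a rough illustration of what unified access looks like in practice, consider the sketch below. It assumes a virtual layer that exposes two physically separate systems as schemas named crm and weblogs (hypothetical names); an analyst joins across them with ordinary SQL while the virtualization engine delegates the work to each underlying source.

-- Hypothetical virtual schemas: "crm" fronts a CRM database, while
-- "weblogs" fronts a clickstream store in another location entirely.
SELECT
    c.customer_id,
    c.segment,
    COUNT(w.session_id) AS sessions_last_30_days
FROM crm.customers AS c
JOIN weblogs.sessions AS w
    ON w.customer_id = c.customer_id
WHERE w.session_date >= CURRENT_DATE - INTERVAL '30' DAY   -- date syntax varies by engine
GROUP BY c.customer_id, c.segment;

Nothing here is copied or staged; the query behaves as though both sources lived in one database.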

Data virtualization improves time to business insight by placing all enterprise data at the fingertips of users, including non-traditional data types such as unstructured, clickstream, web-originated or cloud-based data. Regardless of the existing infrastructure (e.g., a data warehouse, a data lake, or data spread across multiple isolated silos), data virtualization creates an environment that brings everything together now and in the future as new data stores and sources are added.

Business Agility & Collaborative Analytics – Reusability, Consistency, Self-Sufficiency

By reducing the analyst’s dependency on IT for data acquisition and data preparation, data virtualization enables self-sufficiency and, therefore, agility. Data virtualization makes it possible for business analysts to manipulate data on the fly, iterating through multiple perspectives in real time without the need to copy or move the data. This dynamic view creation makes it possible to rapidly prototype, experiment, and iterate to see, manipulate and use the data exactly as needed to meet each unique requirement. No time is wasted physically cleansing, remodeling, preparing, moving or copying the data. These functions are carried out in real time, as needed, and can be quickly and easily modified to meet the needs of each unique data-driven effort. This can save a tremendous amount of time, since queryable virtual joins can be created in minutes.
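To make that iteration concrete, here is a hedged sketch of how a view might evolve over a single working session. The table and column names are hypothetical, and CREATE OR REPLACE VIEW syntax differs slightly between engines; the point is that each revision is a metadata change, not a data copy.

-- First pass: a quick virtual join to answer an ad hoc question.
CREATE VIEW sales_by_region AS
SELECT o.region, SUM(o.amount) AS total_sales
FROM erp.orders AS o
GROUP BY o.region;

-- Minutes later: redefine the same view to blend in a second source and a
-- date filter. No data is cleansed, remodeled, moved or copied at any point.
CREATE OR REPLACE VIEW sales_by_region AS
SELECT o.region, SUM(o.amount) AS total_sales, AVG(s.csat_score) AS avg_csat
FROM erp.orders AS o
JOIN surveys.responses AS s ON s.order_id = o.order_id
WHERE o.order_date >= DATE '2016-01-01'
GROUP BY o.region;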

Data Virtualization as a Data Governance Catalyst

Through intelligent sharing of information, data governance greatly improves the productivity and efficiency of analytical, BI and data science initiatives. Searchable data catalogs, standardized metrics and KPIs, data quality improvements, and master data management (MDM) solutions are just a few examples of the value attainable through a well-crafted data governance plan.

Data virtualization makes data governance more efficient and streamlines administration through centralization of data policies and administrative tasks. Since data virtualization integrates data in real time, leaving data in place and eliminating the need for redundant data copies such as staging areas and operational data stores (ODS), there are fewer areas to govern and secure, meaning less administration, less complexity, and less risk. Data governance measures can be applied on the fly as data flows through the virtual layer. Governing data and access centrally through a unified data layer eliminates redundant steps, interfaces and procedures, and lessens or removes the need to examine and audit each individual data source.

Having a single security and access model to manage and maintain across all data sources greatly simplifies all facets of data security management by providing a single platform for administration rather than needing to juggle the many administrative applications corresponding to each individual data storage server. Data policies can be defined on a shared/common data model or on logical data objects for efficient sustainable management and reuse.

Summary

One or more of these drivers will generally resonate so strongly within an organization that it will pursue data virtualization to meet those specific needs. This typically leads to further leveraging the platform in pursuit of additional value from the other business drivers as the platform, team, and community mature. Data virtualization products such as those available from Red Hat JBoss, Stone Bond Technologies, and Data Virtuality stand out as some of the more innovative approaches in this space.

Dirk Garner is Principal Consultant at Garner Consulting, providing data strategy and advisory services. He can be contacted via email: dirkgarner@garnerconsulting.com or through LinkedIn: http://www.linkedin.com/in/dirkgarner

Virtual Data Building Blocks

Data virtualization provides numerous opportunities for productivity gains and effort reduction, especially through the use of virtual data building blocks.

Creating and sharing virtual data building blocks is a terrific way to save your analysts’ time by reducing duplicative and redundant effort. These building blocks also serve to standardize how KPIs and metrics are calculated within your organization and provide the means for any analyst to use prepackaged datasets to streamline their efforts and ensure consistent data quality and accuracy.

The idea of virtual data building blocks is to create different types of virtual views that can be stacked as needed to quickly serve new initiatives with minimal additional development time. There are three suggested types of virtual views in this approach: macro, filtering, and micro-level views.

At the macro level you build broad, reusable views without filtering that can be used as base layers for more specific needs. For example, for website traffic you might build a virtual view that contains all web traffic from all devices, for all purposes, on all dates. It is unlikely you would ever need to query that view directly because of its broad scope, but building it as a foundational layer provides a starting point for anyone wanting to analyze any aspect of traffic.
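A minimal sketch of such a macro view is shown below, assuming a hypothetical clickstream source exposed through the virtual layer as weblogs.page_events; the view simply passes everything through with no filters.

-- Macro layer: all web traffic, all devices, all dates, no filtering.
CREATE VIEW traffic_macro AS
SELECT
    event_timestamp,
    visitor_id,
    session_id,
    device_type,
    page_url,
    referrer,
    event_type
FROM weblogs.page_events;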


On top of the macro view layer you then create filtering virtual views as needed. In our web traffic example, a filtering view might be all product-browsing traffic. Another might be purchase conversions, by landing page, by device. Filtering views can be easily reused whenever similar data subsets are needed.
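Continuing the sketch, the filtering views below stack directly on the macro view; the event types and URL pattern are hypothetical.

-- Filtering layer: product-browsing traffic only, built on the macro view.
CREATE VIEW traffic_product_browsing AS
SELECT *
FROM traffic_macro
WHERE event_type = 'page_view'
  AND page_url LIKE '%/products/%';

-- Another filtering view: sessions that converted to a purchase, by device.
CREATE VIEW purchase_sessions AS
SELECT DISTINCT session_id, device_type
FROM traffic_macro
WHERE event_type = 'purchase';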

An initiative-specific micro view layer sits on top of the filtering layer, which in turn sits on top of the macro layer. The micro layer joins across one or more filtering views and can apply additional filters and aggregations so that only the data required for a specific initiative is presented. This micro layer precisely serves data visualization, data discovery, BI, and many other use cases. These views can be shared and reused but are less likely to have the broad audiences that the macro and filtering layers will have.
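A micro view for one hypothetical initiative, joining the two filtering views above, might look like this:

-- Micro layer: of the sessions that browsed products, how many converted,
-- broken out by device type? Built entirely from the filtering views.
CREATE VIEW product_browse_conversion_by_device AS
SELECT
    b.device_type,
    COUNT(DISTINCT b.session_id) AS browsing_sessions,
    COUNT(DISTINCT p.session_id) AS converted_sessions
FROM traffic_product_browsing AS b
LEFT JOIN purchase_sessions AS p
    ON p.session_id = b.session_id
GROUP BY b.device_type;

Because every layer is virtual, a change to the macro view’s definition propagates to every view stacked above it.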

Dirk Garner is a Principal Consultant at Garner Consulting providing data strategy consulting and full stack development. Dirk can be contacted via email: dirkgarner@garnerconsulting.com or through LinkedIn: http://www.linkedin.com/in/dirkgarner

Evolved Data Warehousing: A Hybrid Data Warehouse Overview


It seems that the future of data warehousing resides in the cloud, or at the very least will be strongly dependent on cloud capabilities. Offerings such as Google Cloud Platform, Azure SQL Data Warehouse, Amazon Redshift, and Snowflake Computing promise reliability, elasticity, scalability, and performance, and all take on the routine care and maintenance tasks that can bog down IT staffs.

But what if you have already made great strides and/or significant investments toward an in-house data warehouse and don’t want to lose the time, investment, or momentum of that effort? A strategic direction to consider in this case is a hybrid data warehouse.

A hybrid data warehouse approach can be strategic whether you are building from the ground up or evolving an existing data warehouse. A hybrid approach can accelerate the availability of cleansed, integrated, and analytics-ready data at a fraction of the cost of a traditional data warehouse, and can facilitate scaling to accommodate the vast sea of data available through streaming and message-based data sources. Part of the savings comes from a reduced need for storage and processing resources, but cost is primarily reduced through the significantly lower labor required to prepare and present data for analytics.

Data Warehouse Status Quo
Before delving into the hybrid approach, let’s baseline a definition of a traditional data warehouse. A traditional data warehouse is generally stored in a row-based RDBMS using a star or snowflake schema. The data is physically copied from source systems through ETL jobs and most likely transformed from a third-normal-form schema. These data warehouses are generally focused on reporting (black & white, row & column), monitored and measured by workload (CPU, memory, disk space, and network utilization), and may include cubes or participate in master data management (MDM).
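For readers less familiar with the pattern, a minimal star schema of the kind described might look like the sketch below (table and column names are hypothetical): a central fact table keyed to a handful of dimension tables, all populated by ETL jobs.

-- Dimension tables describe the "who/what/when" of each transaction.
CREATE TABLE dim_date     (date_key INT PRIMARY KEY, calendar_date DATE, fiscal_quarter VARCHAR(6));
CREATE TABLE dim_product  (product_key INT PRIMARY KEY, product_name VARCHAR(100), category VARCHAR(50));
CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name VARCHAR(100), region VARCHAR(50));

-- The fact table holds the measures, keyed to each dimension.
CREATE TABLE fact_sales (
    date_key     INT REFERENCES dim_date (date_key),
    product_key  INT REFERENCES dim_product (product_key),
    customer_key INT REFERENCES dim_customer (customer_key),
    quantity     INT,
    sale_amount  DECIMAL(12,2)
);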

A traditional data warehouse was typically created to serve specific reporting requirements with specific data from specific source systems. Additional data is generally onboarded through new ETL-based projects depending on available funding, requirements, and development resources. Often there is some drill-down capability for specific business needs, but row-based technology prohibits untethered exploration because of its dependence on indexing.

Issues Forcing Us to Evolve this Approach
There are often resource and security policies governing when and how queries can be run and whether data can be copied out of the enterprise data warehouse (EDW) for additional analysis or data blending.

Since it is difficult to define the value or intent of data mining, exploration, and discovery efforts, these efforts are rarely funded, leaving a critical gap in data analysis capabilities.

In some enterprises there is no central data warehouse; instead, data marts are created and used for each business function or unique purpose. Although this approach allows more flexibility for individual business functions, the business is handicapped without the ability to view any part of itself through a comprehensive 360-degree view, and these data marts are typically not reusable by other groups.

The inability to handle large data sets and/or semi-structured data blocks analytical access to large and relevant sources such as social trends and sentiment, log data, and clickstream data, closing off countless opportunities to find insight that could improve revenue, cut costs, or drive innovation. You are also likely to lose competitive advantage without the ability to analyze or act on real-time events and without self-serve analytics capabilities.

Finally, onboarding additional data into a traditional data warehouse is costly and lengthy, and you will be left with slow performance for any report or query the data was not specifically modeled and/or optimized to serve.

So What Can We Do?
Wherever you are in your data warehouse journey, the typical end goal is near-real-time access to fully integrated, de-siloed, cleansed and modeled data that best empowers and informs the business through reports, analytics capabilities, and visualizations.

So, how can we load and integrate data quickly while optimizing for data mining, analytics, and visualizations without additional delay? How can we handle multiple data types and growing data volumes and still deliver fresh data rapidly and with high performance? How can we provide a unified 360-degree view of the various aspects of our businesses?

Envisioning a Hybrid Data Warehouse
The simple answer is to assemble a complementary suite of data management capabilities: robust back-end tools to ingest, store, cleanse, and serve data as rapidly as possible, plus self-serve front-end tools that enable the business to easily explore, discover, and mine data for relevant insights.

Traditionally we were forced to choose between a distributed and a centralized data warehouse approach; with today’s technology, however, we can provide centralized data access while leaving the data stores distributed as they are. This approach allows us to leave the data stores ‘as-is’ yet still provide centralized access.

Some data warehouse data stores will likely still be necessary but should be chosen to fit the purpose. For example, there is no longer a need to include a row-based data store for reporting and analytics when affordable, performant column stores can hold the same data, modeled the same way, while performing faster and requiring less support.

By unifying data sources rather than copying data to a central location we can integrate multiple data stores, including RDBMS, columnar, NoSQL, flat files, web services, and more. Data virtualization provides this crucial function of unifying data sources in order to centralize access through a single accessible location. DV also expedites data integration, remodeling, transformation, and cleansing on the fly without costly or slow ETL work. This allows us to build a virtual or logical data warehouse quickly and easily, which can be shared, reused, and maintained with minimal effort. Further, the semantic naming capabilities of a data virtualization platform simplify data access with friendly naming and can serve data governance initiatives.
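As a small, hedged example of that semantic naming, a virtual view can front a cryptically named legacy table with business-friendly column names (all names below are hypothetical):

-- Semantic layer: friendly, governed names over a legacy source.
CREATE VIEW customer_profile AS
SELECT
    cust_nbr    AS customer_id,
    cust_nm     AS customer_name,
    seg_cd      AS market_segment,
    last_ord_dt AS last_order_date
FROM legacy_erp.cstmr_mstr;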

Finally, a hybrid data warehouse should be query-tool agnostic, allowing each individual analyst or group to choose the tools that best fit the purpose at hand and/or the tools with which they are most comfortable and productive.

There are several advantages of a hybrid approach over a traditional approach:
• Ability to ingest, process, and analyze streaming data
• Empowerment of business users to explore, discover, and self-serve
• Greatly improved performance for integrated data
• Quicker availability of currently inaccessible data
• Ability to store large data sets and semi-structured data
• A single-source gateway for access to all data

Components of a Hybrid Data Warehouse
So how do we make all of this a reality? To start with, I would prescribe a minimum of the following core capabilities for a hybrid data warehouse that will be scalable, extensible, and able to support your business’s growth for the foreseeable future.

Columnar Data Storage
The value of column stores is in delivering high-performance data retrieval with minimal human optimization. Fast retrieval can lend significant advantage to analysts performing exploration and discovery and can also lessen adoption concerns.

Prior to columnar stores, technology teams would need to index row-based data stores in order to provide adequate performance for analysts’ queries. This required knowing ahead of time what queries the business would run, so there was sufficient time to optimize the data store to answer them in an acceptable timeframe. That strategy works fine for static reporting: once the optimization work is done, the report performs predictably well into the future. Outside of static reporting, however, it causes a slow cycle of analysis. The business analyst asks a question of the data in the form of a query, gets the resulting answer, reviews the results, and generates a new query based on any number of factors such as instinct, specific business questions, or curiosity. The analyst then asks the technology team to index for the new query, which may take hours, days, or weeks depending on team bandwidth and the delivery process.

Conversely, a robust columnar store such as Vertica or ParAccel can optimize data automatically without the need for human indexing. For technology, there is no guessing what questions the business will ask. For the business, there is no waiting for technology to index the data for the next query. By leveraging columnar data stores an analyst can ask a question, get an answer, ask another question, get another answer, and so on. The analyst can pursue insight as fast as he or she can think and type, instead of as fast as technology can index. This allows analysts to have a conversation with the data rather than with the technology team.
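The contrast might look like the hedged sketch below (table and column names hypothetical): on a row store each new question tends to require an index prepared in advance, while on a columnar store the same ad hoc query is typically served without that step.

-- Row-store cycle: the technology team indexes for a known query ahead of time.
CREATE INDEX ix_orders_region_date ON orders (region, order_date);

SELECT region, SUM(amount) AS total_sales
FROM orders
WHERE order_date >= DATE '2016-01-01'
GROUP BY region;

-- Columnar cycle: the analyst's next, unanticipated question runs as-is,
-- with no request to the technology team and no wait for a new index.
SELECT sales_channel, AVG(amount) AS avg_order_value
FROM orders
WHERE region = 'EMEA'
GROUP BY sales_channel;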

Please note that I have not included any row-based RDBMS as a required core component of a data warehouse. The reasoning is that if you are building from the ground up, you really will not need row-based data stores. By leveraging columnar stores for relationally modeled data you automatically gain the performance and maintenance advantages listed above at a similar cost to investing in row-based technology. However, if you already have a row-based RDBMS there is no reason to abandon it unless that is a specific intention. You can continue to use your traditional EDW or other row-based store, albeit with the legacy performance drawbacks, and leverage the capabilities of other technologies in this list to augment the row-based technology and work around those legacy issues.

NoSQL
There are numerous NoSQL (not only SQL) data store options filling as many purposes and use cases: Hadoop, Cassandra, MongoDB, Couchbase, Neo4j, etc. The advantages of NoSQL data stores are manifold and differ with each platform, but the most common use cases include storing unstructured or semi-structured data at low cost, creating a data lake analytics environment, and providing different types of analysis such as graph analysis.

Streaming Data & Message Queues
It is becoming increasingly essential to provide access to the vast sea of data available from streaming and message-based sources such as clickstream data, social feeds, enterprise service buses (ESB), etc. The potential for finding valuable insight within these sources is just now being uncovered, and having this data available in your data warehouse gives your data scientists as many opportunities for insight as their creativity will allow. Technologies such as Flume, Storm, and Kafka can help build a solid ingestion architecture for both streaming and message-based data, which can then be landed in a data lake or transformed and stored in a relational store.

ETL
Perhaps someday all data will be available via streaming or message queueing but in 2016 we will still need to support flat file and batch data ingestion through an ETL process using products such as Informatica, Ab Initio, or Data Stage.

Data Virtualization
Data virtualization is a key element of a hybrid data warehouse, and products such as Composite, Denodo, and DataVirtuality allow analysts to join queries across physically diverse databases of all different types without the extra steps and time delay of traditional ETL. They can also virtually model data, transform data, and provide user-friendly data element naming. They operate as an intermediary access point and house only the metadata necessary to allow cross-database joins. Many of these platforms include advanced query optimization and caching capabilities for a more robust toolset, and offer functionality such as the ability to scrape web pages and ingest web service data, which can then be presented as relational tables.

Key advantages of data virtualization:
• ‘Instant’ data accessibility through a unified data layer
• Logical data marts & warehouses: built quickly and without ETL
• Automated ETL via caching functions
• Empowerment of self-guided exploration, discovery, and prototyping
• A 360-degree view of anything


[Figure: Data Unification Layer]

Bringing it All Together
Having all available data accessible from a single location, alongside traditionally warehoused data, allows deep and broad analysis and the ability to query across data sources with an immediacy only possible through data unification, eliminating the need for slow and costly data movement.

Near-real-time analyses such as client journey and behavior, social trends and sentiment, and operational systems efficiency are powerful capabilities; leveraged strategically, they yield invaluable insights, improved behavior prediction, ideal next-step recommendations, better service response, improved ROI, lower costs, and much more.

Build, Buy, or Dust Off What You Own?
Do you need to go out and buy several new products? Maybe. You don’t need to buy everything I have mentioned here in order to evolve and extend your data warehouse, and if you do want to add multiple capabilities, you don’t necessarily need to add them simultaneously. Before you look to buy anything new, take a look at your existing technology assets. Some of your existing data management tools may have functionality you are not aware of, are not currently using, or that will be added soon in an upcoming product update. In some cases you may be licensing a bundled package of products but only using part of the licensed functionality. If you do have additional capabilities in-house that you are not currently utilizing, weigh moving toward the most enabling and empowering technologies against further leveraging your existing products. Cost and timing are factors in this decision, as is choosing products best fit for your specific needs. A proof-of-technology process can help measure the value of each product, and balanced scoring could be the difference between a good decision and a poor investment. Any POC is best structured around a few real-world use cases to ensure relevant outcomes and to provide balanced comparative scoring in support of an informed decision.

Adjuncts
Beyond those core capabilities there are several additional considerations including both technology options and process improvements. Depending on your unique business environment you may want to consider some or all of the following.

A more recent capability that helps accelerate analytical data accessibility is to leverage a massively scalable platform such as MongoDB, Cassandra, or CouchDB to handle your production transactions, store your production data, and also provide analytical access to that data. Unifying these types of data stores through a data virtualization platform provides immediate access to the most recent data and can deliver an up-to-the-minute 360-degree view of anything.

A sandbox environment can support and accelerate analytical exploration and discovery and is a great interim step while working towards a hybrid data warehouse or when exploring data not yet accessible through the data warehouse, data lake, or data virtualization. A sandbox, in this context, is defined as an area in a data store that is separate from, but adjacent to, production data warehouse stores. Analytical group(s) can get full rights to load, create, update, modify, and delete schemas, tables, and data in the sandbox and are assigned read only access to the production data warehouse. This allows the analysts to join queries across sandbox data (data imported into the sandbox) and production data warehouse data without the need to wait for the data warehouse to onboard the data. This serves several use cases such as evaluating whether or not there is sufficient value to onboard the data, or getting a head start on analysis without waiting for a full onboard process to complete.
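A minimal sketch of this arrangement, with hypothetical schema and role names and PostgreSQL-style grant syntax (permission syntax varies by platform):

-- Read-only on the production warehouse schema, full rights in the sandbox.
GRANT USAGE ON SCHEMA warehouse TO analyst_team;
GRANT SELECT ON ALL TABLES IN SCHEMA warehouse TO analyst_team;
GRANT ALL PRIVILEGES ON SCHEMA sandbox TO analyst_team;

-- Analysts load candidate data into the sandbox...
CREATE TABLE sandbox.competitor_prices (
    product_sku VARCHAR(20),
    price       DECIMAL(10,2),
    captured_on DATE
);

-- ...and join it against governed warehouse data without waiting for onboarding.
SELECT w.product_sku, w.our_price, s.price AS competitor_price
FROM warehouse.product_pricing AS w
JOIN sandbox.competitor_prices AS s
    ON s.product_sku = w.product_sku;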

I have seen both Agile BI and Kanban work very well with BI and analytics projects and initiatives. Agile BI is discussed further here. Kanban is a type of agile development that focuses on a prioritized backlog of work with a funneling approach: as developers complete each story they pull the next one from the backlog and begin development. Each story is worked iteratively and is released when development and testing are complete. The advantage of Kanban over Scrum is that the overhead of scheduling iterations and allocating stories to meet specific release dates goes away. The team works at its own pace, without the pressure and mad dashes to release multiple stories simultaneously, and, in theory, each story gets released sooner. In either strategy, periodic and participative retrospectives can facilitate continuous improvement.

In-memory analytics and data stores are gaining in capability and popularity. Although it is debatable whether an in-memory data store offers sufficient value to offset its greater cost compared with a columnar store, I would suggest that in-memory analytics will become more prominent over the next few years.

The promise of temperature-based data storage is cost and capacity advantages. By storing rarely accessed ‘cold’ data on the cheapest possible platform (possibly archived to tape, optical media, etc.), storage costs can be minimized. ‘Warm’ data is used more often, but not so frequently, or with such urgency, as to justify the fastest and most expensive storage technology. ‘Hot’ data requires immediate, performant access at any time. Queries can easily be joined across the different platforms using data virtualization, or data can be materialized temporarily as needed through caching, data wrangling, parking in a data lake, etc. Products such as Talena provide GUI-based configuration and management of selective data archiving and can simplify data pruning and archiving.
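One hedged way to picture this is a virtual view that presents hot and cold tiers as a single queryable object (schema and table names are hypothetical).

-- Recent "hot" transactions live in a fast store; older "cold" history sits
-- in a cheap archive. The virtual layer exposes both as one object.
CREATE VIEW transactions_all_time AS
SELECT transaction_id, customer_id, amount, transaction_date
FROM hot_store.transactions
UNION ALL
SELECT transaction_id, customer_id, amount, transaction_date
FROM cold_archive.transactions_history;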

Ingestion and data prep tools such as Podium, Paxata, and Trifacta can simplify and accelerate the loading and preparation of data for analytics. These drag-and-drop tools are easy for non-technical analyst staff to use, enabling quicker self-serve analytics along with data quality and cleansing functionality.

Beyond the Data Warehouse: Adoption Considerations
Naturally there are several other factors leading to the success of a Hybrid Data Warehouse:

• Staff structure: centralized or distributed analytics functions
• Finding a champion(s) & a stakeholder(s), to foster buy-in
• Appropriate, necessary, and timely training
• Overcoming company cultural roadblocks
• Choosing and using reporting and visualization tools
• Data archiving and pruning

These factors are discussed in a full presentation of this material that goes into more depth on the entire topic as well as on these related considerations. The slides for this presentation are available online, and the author would gladly discuss them personally.

Dirk Garner is a Principal Consultant at Garner Consulting providing data strategy consulting and full stack development. Dirk can be contacted via email: dirkgarner@garnerconsulting.com or through LinkedIn: http://www.linkedin.com/in/dirkgarner

Shorten your Analytical Data Supply Chain for Faster Results





2016 will see increasingly widespread popularization of democratized data, serving businesses’ need for faster access to data to support mining, exploration and discovery of actionable insight, along with easier and better tools to do so. Technologies such as streaming data, data virtualization tools, data wrangling tools, and data prep & cleansing utilities will become still more mainstream and easier for business users, leaving IT staffs to focus more on stability and ‘keeping the lights on’.

Streaming data feeds will become ever more necessary for companies to compete. The barriers to entry for streaming are price and expertise, and both could prove challenging for many companies, especially smaller enterprises. Dependency on and usage of ETL, data copying and physical remodeling will begin to wane as the aforementioned tool categories make it easier and quicker to clean, model, and present data in place and/or virtually, without the overhead and time burden of physically copying and remodeling it.

With various strategies to gain access to data faster, business analysts’ ability to analyze sooner will grow progressively, and the corresponding effort will become easier as the evolution of hybrid data-blending / analytics products continues. Products such as Alteryx, Looker, and LavaStorm lessen the skill set required to perform advanced analytics and include many tools to complete the data preparation steps along with visualizing the information.

Data supply chains (the path from production data stores to where data lands for analyst activity) can be further shortened through data virtualization products such as Denodo, Rocket Software, or DataVirtuality. Products like these allow analysts to join queries across physically diverse databases of all different types without the extra steps and time delay of traditional ETL. They can also virtually model and transform data and provide user-friendly data element naming. They operate as an intermediary access point and house only the metadata necessary to allow cross-database joins. Many of these platforms include advanced query optimization and caching capabilities for a more robust toolset.

The increased prevalence of NoSQL data stores (Cassandra, Hadoop, Neo4j, etc.) serving production needs will also shorten the data supply chain. These data stores can hold data in a more analytics-friendly form than traditional relational data stores. With direct or metered access to these stores, when used as production stores, analysts can act on the most recent data in real time with fewer or no data wrangling steps to slow them down.

Because of all of these new and maturing capabilities there is (finally) a compelling business need for data governance. At the very least there is a need for data governance in the form of a data glossary or dictionary to help guide and direct analysts to understand and consume appropriate data using standard definitions of data elements, KPIs, metrics, business terms, etc. Data governance is an enabling component of fully democratized data.

We are heading towards a world in which ever greater numbers of insights will be discoverable and retrievable in near real time, since the events that feed and shape our KPIs and metrics unfold at breakneck speed. As we shorten our data supply chain, we hasten our ability to analyze and act on insights.

Dirk Garner is a Principal Consultant at Garner Consulting providing data strategy consulting and full stack development. Dirk can be contacted via email: dirkgarner@garnerconsulting.com or through LinkedIn: http://www.linkedin.com/in/dirkgarner