It seems that the future of data warehousing resides in the cloud or at the very least will be strongly dependent on cloud capabilities. Offerings such as Google Cloud Platform, Azure SQL Data Warehouse, Amazon Redshift, and Snowflake Computing promise reliability, elasticity, scalability, and performance, all take on the routine care & maintenance tasks that can bog down IT staffs.
But what if you have already made great strides and/or significant investments toward an in-house data warehouse and don’t want to lose the time, investment, or momentum of that effort? A strategic direction to consider in this case is a hybrid data warehouse.
A hybrid data warehouse approach can be strategic whether you are building from the ground up or evolving an existing data warehouse. A hybrid approach can accelerate the availability of cleansed, integrated, and analytics-ready data at a fraction of the cost of a traditional data warehouse and can facilitate scaling to accommodate the vast sea of data available through streaming and message based data sources. Partial cost savings comes through reduced need for storage and processing resources, but cost is primarily reduced through the significantly reduced labor required to prepare and present data for analytics.
Data Warehouse Status Quo
Before delving into the hybrid approach let’s baseline a definition of a traditional Data Warehouse. A traditional Data Warehouse is generally stored in a row based RDBMS technology using a star or snowflake schema. The data is physically copied from sources systems through ETL jobs and most likely transformed from a third normal form schema. These Data Warehouses are generally focused on reporting (black & white, row & column), monitored and measured by workload (CPU, memory, disk space, and network utilization), and may include cubes, or participate in Master Data Management (MDM).
A traditional Data Warehouse was typically created to serve specific reporting requirements with specific data from specific sources systems. Additional data is generally on-boarded through new ETL-based projects depending on available funding, requirements, and development resources. Often there is some drill down capability for specific business needs but the row based technology prohibits untethered exploration due to the need for indexing in a row based data store.
Issues Forcing Us to Evolve this Approach
There are often resource and security policies governing when and how queries can be run and if data can be copied out of the EDW for additional analysis or data blending.
Since it is difficult to define the value or intent of data mining, exploration, and discovery efforts, these efforts are rarely funded leaving a critical gap in data analysis capabilities.
In some enterprises there is no central data warehouse but instead there are data marts created and used for each business function or unique purpose. Although this approach allows more flexibility for the individual business functions, the business is handicapped without the ability to view any part of their business with a comprehensive 360 degree view and these data marts are typically not reusable for other groups.
The inability to handle large data sets and/or semi-structured data prohibits analytical access to some large and relevant data such as social trends & sentiment, log data, and click stream data, which prohibits countless opportunities to find insight that could improve revenue, cut costs, or drive innovation. You are likely to lose competitive advantage without the ability to analyze or act on real-time events and without self-serve analytics capabilities.
Finally, future additional data onboarding in a traditional data warehouse is costly & lengthy and you will be left with slow performance for any report or query for which the data was not specifically modeled and/or optimized to serve.
So What Can We Do?
Wherever you are in your data warehouse journey the typical end goal is near-real-time access to fully integrated, de-siloed, cleansed and modeled data to best empower and inform the business through reports, analytics capabilities, and visualizations.
So, how can we load and integrate data quickly while optimizing for data mining, analytics, and visualizations without additional delay? How can we handle multiple data types and growing data volume and still deliver fresh data rapidly and with high performance? How can we provide a unified 360 degree view of the various aspects of our businesses?
Envisioning a Hybrid Data Warehouse
The simple answer is to assemble a complimentary suite of data management capabilities including robust back end tools to ingest, store, cleanse, and serve data as rapidly as possible, and also provide self-serve front end tools to enable the business to easily explore, discover, and mine data for relevant insights.
Traditionally we were forced to choose a data warehouse approach of either distributed or centralized approach however with today’s technology we can provide centralized data access while leaving data stores distributed as they are. This approach allows us to leave the data stores ‘as-is’ but still provide centralized access.
Some Data Warehouse data stores will likely still be necessary but should be chosen to fit the purpose. For example, there is no longer a need to include a row based data store for reporting and analytics when there are affordable performant column stores that can store the same data modeled in the same way and can perform faster and that require less support.
By unifying data sources rather than copying data to a central location we can integrate multiple data stores including RDBMS, columnar, NoSql, flat files, web services, etc. Data Virtualization provides this crucial function of unifying data sources in order to centralize access through a single accessible location. DV also expedites data integration, remodeling, transformation, and cleansing on the fly without costly or slow ETL work. This allows us to build virtual or logical data warehouse quickly and easily which can be shared, reused, and maintained with minimal effort. Further, the semantic naming capabilities of a Data Virtualization platform simplifies data access with friendly naming and can serve data governance initiatives.
Finally a hybrid data warehouse should be query tool agnostic allowing each individual analyst or group the choice to use the tools that are best fit for the purpose at hand and/or the tools they are most comfortable and productive with using.
The are several advantages of a hybrid approach over a traditional approach:
• Ability to ingest, process, and analyze streaming data
• Empower business users to explore, discover, and self-serve
• Greatly improve performance of integrated data
• Quicker availability of currently inaccessible data
• Ability to store large data sets and semi structured data
• Provide single source gateway for access to all data
Components of a Hybrid Data Warehouse
So how do we do we make all of this reality? To start with, I would prescribe a minimum of the following core capabilities for a hybrid data warehouse that will be scalable, extensible, and upon which your business can grow for the foreseeable future.
Columnar Data Storage
The value of column stores is in delivering high performance data retrieval with minimal human optimization. Fast data retrieval performance can lend significant advantage to analysts performing exploration and discovery functions and can also lessen adoption concerns. Prior to columnar stores technology teams would need to index row based data stores in order to provide adequate performance for analyst’s queries. This required that technology knew ahead of time what queries the business would run to provide sufficient time to optimize the data store to respond to those queries in an acceptable timeframe. This strategy works fine for static reporting wherein the optimization work has been completed. The report performs predictably well consistently into the future. However, outside of static reporting, this causes a slow cycle of analysis wherein the business analyst would ask a question of the data in the form of a query get the resulting answer, review the results and generate a new query based on any number of factors such as instinct, specific business questions, curiosity, etc. The analyst would then make a request to the technology team to index for the new query which may take hours, days, or weeks depending on team bandwidth and the delivery process. Conversely, a robust columnar store such as Vertica or Par Accel can optimize data automatically without the need for human indexing. For technology, there is no guessing what questions the business will ask. For the business there is no waiting for technology to index the data for your next query. By leveraging columnar data stores an analyst can ask a question, get an answer, ask another question, get another answer and so on. The analyst can pursue insight as fast as (s)he can think and type, instead of as fast as technology can index. This allows analysts to have a conversation with the data rather than technology.
Please note that I have not included row based any RDBMS as a required core component of a data warehouse. The reasoning for this is that in the event you are building from the ground up, you really will not need row based data stores. By leveraging columnar stores for relationally modeled data you will be automatically delivering the performance and maintenance advantages of columnar storage as listed above at a similar cost to investing in row based technology. However, in the event that you already have row based RDBMS there is no reason to abandon it unless that is a specific intention. You can continue to use your traditional EDW, or other row base store albeit with the legacy performance drawbacks. You can leverage the capabilities of other technologies in this list to augment the row based technology and work-around the legacy issues.
There are numerous NoSql (not only SQL) data store options filling as many purposes and use cases: Hadoop, Cassandra, MongoDB, CouchBase, Neo4J, etc. The advantages of the NoSql data stores is multifold and differs with each platform but the most common use cases include storing unstructured or semi structured data at low cost, creating a data lake analytics environment, providing different types of visualization capabilities such as graph analysis, and so on.
Streaming Data & Message Queues
It is becoming increasingly essential to provide access to the vast sea of data available from streaming and message based sources such as click stream data, social feeds, enterprise service buses (ESB), etc. The potential for finding valuable insight within these sources is just now being uncovered and having this data available in your data warehouse can provide your data scientists with as many opportunities for insight as their innovative creativity will allow. Technologies such as Flume, Storm, and Kafka can help build a solid ingestion architecture for both streaming and message based data which can then be populated in a data lake of transformed and stored in a relational store.
Perhaps someday all data will be available via streaming or message queueing but in 2016 we will still need to support flat file and batch data ingestion through an ETL process using products such as Informatica, Ab Initio, or Data Stage.
Data virtualization is a key element to a hybrid data warehouse and products such as Composite, Denodo, and DataVirtuality allow analysts to join queries across physically diverse databases of all different types without the extra steps and time delay of traditional ETL. They can also virtually model data, transform data, and provide user friendly data element naming. They operate as an intermediary access point and house only the necessary metadata to allow cross database joins. Many of these platforms include advanced query optimization and caching capabilities to provide a more robust toolset and include such functionality as the ability to scrape web pages and ingest web service data which can then be presented as relational tables.
Key advantages to data virtualization:
-‘Instant’ data accessibility through a unified data layer
-Logical data mart & warehouses: build quickly and without ETL
-Automated ETL via caching functions
-Empower self-guided exploration, discovery, and prototyping
-360 view of anything
Bringing it All Together
Having all available data accessible from a single location alongside traditionally warehoused data allows deep and broad analysis and the ability to query across data sources with the immediacy only possible through data unification eliminating the need for slow and costly data movement.
Near real-time analysis such as client journey and behavior, social trends and sentiment analysis, operational systems efficiency, are powerful capabilities and if leveraged strategically will output invaluable insights, improved behavior prediction, ideal next step recommendations, better service response, improved ROI, lower costs, and much more.
Build, Buy, or Dust Off What You Own?
Do you need to go out and buy several new products? Maybe. You don’t need to buy everything I have mentioned here in order to evolve and extend your data warehouse. And if you do want to add multiple capabilities, you don’t necessarily need to add them simultaneously. But before you look to buy anything new, take a look at your existing technology assets. Some of your existing data management tools may have functionality you may not be aware of, are not currently using, or to which additional functionality will be added soon in an upcoming product update. In some cases you may be licensing a bundled package of products but only using parts of the licensed functionality. If any of this is the case and you do have additional capabilities in-house that you are not currently utilizing, consider whether to move towards the most enabling and empowering technologies versus further leveraging of existing products. Cost and timing are factors in this decision as is choosing products bet fit for your specific needs. A proof of technology process might be helpful to measure the value of each product and balanced scoring could be the difference between a good decision and a poor investment. Any POC is best structured around a few real world use cases to ensure relevancy of outcome, and the ability to provide balanced comparative scoring to support making an informed decision.
Beyond those core capabilities there are several additional considerations including both technology options and process improvements. Depending on your unique business environment you may want to consider some or all of the following.
A more recent capability that helps accelerate analytical data accessibility is to leverage a massively scalable platform such as MongoDB, Cassandra, or CouchDB to handle your production transactions, store your production data, and also provide analytical access to the data. Unifying these types of data stores through a data virtualization platform provides immediate access to the most recent data and can provide an up-to-the-minute 360 view of anything.
A sandbox environment can support and accelerate analytical exploration and discovery and is a great interim step while working towards a hybrid data warehouse or when exploring data not yet accessible through the data warehouse, data lake, or data virtualization. A sandbox, in this context, is defined as an area in a data store that is separate from, but adjacent to, production data warehouse stores. Analytical group(s) can get full rights to load, create, update, modify, and delete schemas, tables, and data in the sandbox and are assigned read only access to the production data warehouse. This allows the analysts to join queries across sandbox data (data imported into the sandbox) and production data warehouse data without the need to wait for the data warehouse to onboard the data. This serves several use cases such as evaluating whether or not there is sufficient value to onboard the data, or getting a head start on analysis without waiting for a full onboard process to complete.
I have seen both Agile BI and KanBan work very well with BI and analytics projects and initiatives. Agile BI is discussed further here. KanBan is a type of agile development that focuses on a prioritized backlog of work with a funneling approach. As developers complete each story they pull a new project from the backlog and begin development on that initiative. Each story is worked on iteratively and gets released when development and testing are complete. The advantages of KanBan over Scrum is that all of the overhead of scheduling iterations and allocating stories to try to meet specific release dates goes away. The team works at their own pace, without the pressure and mad dashes to release multiple stories simultaneously, and, in theory, each story gets released sooner. In either strategy, periodic and participative retrospectives can facilitate continuous improvement
In-Memory analytics and data stores are gaining in capability and popularity. Although it is debatable whether an in-memory data store offers sufficient value to offset the greater cost when compared to a columnar store I would suggest that in-memory analytics will become more prominent over the next few years.
The promise of temperature based data storage is to provide cost and capacity advantages. By storing the ‘cold’ data that is rarely accessed on the cheapest possible platform possibly archived on tape, CD, etc. Storage costs can be minimized. ‘Warm’ data is more frequently used but not so frequently, or with such urgency, as to justify the fastest and most expensive storage technology. ‘Hot’ data is that which requires immediate access at any time and in a performant manner. Joining queries across the different platforms can easily be performed using data virtualization or could be materialized temporarily as needed through caching, data wrangling, parking in a data lake, etc. Products such as Talena provide GUI based configuration and management of selective data archiving and can simplify data pruning and archiving.
Ingestion and Data Prep tools such as Podium, Paxata, and Trifacta can simplify and accelerate the loading & preparation of data for analytics. These drag and drop tools are easy to use for non-technical analyst staff allowing quicker self-serve analytics along with data quality and cleansing functionality.
Beyond the Data Warehouse: Adoption Considerations
Naturally there are several other factors leading to the success of a Hybrid Data Warehouse:
• Staff structure: centralized or distributed analytics functions
• Finding a champion(s) & a stakeholder(s), to foster buy-in
• Appropriate, necessary, and timely training
• Overcoming company cultural roadblocks
• Choosing and using reporting and visualization tools
• Data archiving and pruning
These factors have been discussed in a full presentation of this material that includes more depth on the entire topic as well as depth on these related considerations. The slides for this presentation are available online here and would gladly be discussed personally by contacting the author.
Dirk Garner is a Principal Consultant at Garner Consulting providing data strategy consulting and full stack development. Dirk can be contacted via email: firstname.lastname@example.org or through LinkedIn: http://www.linkedin.com/in/dirkgarner