Clarifying Data Governance: What is a Business Glossary, a Data Dictionary, and a Data Catalog?

I often see conflicting and overlapping definitions of business glossaries, data dictionaries, and data catalogs, and consensus on standard definitions of each remains elusive. Some of this confusion is easy to understand considering how data governance typically evolves within an organization. For instance, it can be efficient to start by creating a data dictionary or data catalog and subsequently build a data governance program on top of it; the same is true of a data quality initiative. This approach delivers quick wins in data governance while embracing the spirit of ‘agile’. Below I put forth suggested definitions and elements of each. My intent is to capture the joint value of these assets, provide specific definitions of each, explain how they fit into a data governance program, and provide examples of each.

Summary of Business Glossary, Data Dictionary, and Data Catalog

Business Glossary

A business glossary is business language-focused and easily understood in any business setting, from boardrooms to technology standups. Business terms aren’t meant to define data, metadata, transforms, or locations, but rather to define what each term means in a business sense. What do we mean by a conversion? A sale? A prospect? These types of questions can be answered with a business glossary. Having a business glossary brings a common understanding of the vocabulary used throughout an organization. The scope of a business glossary should be enterprise-wide, or at least division-wide in cases where different divisions have significantly different business terminology. Because of the scope and the expertise needed, the business glossary is owned by the business rather than by technology. Often a data steward or business analyst will have this as their sole responsibility.
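As a minimal sketch of what a glossary entry might hold, the structure below records only business meaning and ownership. The class name `GlossaryTerm`, its fields, the `look_up` helper, and the sample definitions are all hypothetical illustrations, not a standard schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryTerm:
    """One business glossary entry: plain-language meaning, no storage details."""
    name: str
    definition: str
    owner: str                      # accountable business role (hypothetical)
    synonyms: tuple[str, ...] = ()  # other words the business uses for the same idea


# Hypothetical entries; the definitions are examples, not recommendations.
glossary = {
    term.name: term
    for term in (
        GlossaryTerm(
            name="Prospect",
            definition="A person or company that has shown interest but has not yet purchased.",
            owner="Sales Operations",
            synonyms=("Lead",),
        ),
        GlossaryTerm(
            name="Conversion",
            definition="A prospect completing a first purchase.",
            owner="Marketing",
        ),
    )
}


def look_up(word: str) -> GlossaryTerm | None:
    """Find a term by its name or any synonym, ignoring case."""
    wanted = word.casefold()
    for term in glossary.values():
        if any(wanted == n.casefold() for n in (term.name, *term.synonyms)):
            return term
    return None


print(look_up("lead").definition)
```

Note that nothing in the entry says where the data lives or how it is typed; that separation is the point of keeping the glossary distinct from the other two assets.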

Data Dictionary

A data dictionary should be focused on the descriptions and details involved in storing data. There should be one data dictionary for each database in the enterprise. The data dictionary includes details about the data such as data type, permissible length, lineage, transformations, and so on. This metadata helps data architects, engineers, and data scientists understand how to join, query, and report on the data, and it documents the granularity of the data as well. Because of the need for technical and metadata expertise, ownership of a data dictionary lies within technology, frequently with roles such as database administrators, data engineers, data architects, and/or data stewards.
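To make the contrast with the glossary concrete, here is a minimal sketch of data dictionary entries for one database. The `ColumnEntry` class, the `sales_db_dictionary` list, the helper functions, and every table and column name are assumptions made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnEntry:
    """One data dictionary row describing how a column is stored."""
    table: str
    column: str
    data_type: str
    max_length: int | None  # None where a length limit does not apply
    nullable: bool
    lineage: str            # where the value originates
    transformation: str     # what was done to it on the way in


# Hypothetical dictionary for a single database, here called "sales_db".
sales_db_dictionary = [
    ColumnEntry("orders", "order_id", "INTEGER", None, False,
                "order-entry system", "none"),
    ColumnEntry("orders", "customer_email", "VARCHAR", 254, True,
                "web checkout form", "trimmed and lower-cased"),
]


def find_entry(table: str, column: str) -> ColumnEntry:
    """Return the documented entry for table.column, or raise KeyError."""
    for entry in sales_db_dictionary:
        if (entry.table, entry.column) == (table, column):
            return entry
    raise KeyError(f"{table}.{column} is not documented")


def fits(entry: ColumnEntry, value: object) -> bool:
    """Check a value against the documented nullability and permissible length."""
    if value is None:
        return entry.nullable
    if entry.max_length is not None:
        return len(str(value)) <= entry.max_length
    return True


email = find_entry("orders", "customer_email")
print(email.data_type, email.max_length, fits(email, "ann@example.com"))
```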

Data Catalog

The data catalog serves as a single-point directory for locating information, and it further provides the mapping between the business glossary and the data dictionaries. The data catalog is an enterprise-wide asset providing a single reference source for the location of any data set required for varying needs such as operations, BI, analytics, data science, etc. Just as with the business glossary, if one division of an enterprise is significantly different from the others, it would be reasonable for the data catalog to be exclusive to that division rather than shared across the enterprise. The data catalog would most reasonably be developed after the successful creation of both the business glossary and the data dictionaries, but it can also be assembled incrementally as the other two assets evolve over time. A data catalog may be presented in a variety of ways, such as an enterprise data marketplace. The marketplace would serve as the distribution or access point for all, or most, enterprise-certified data sets for a variety of purposes. Because the mapping work requires both business and technical expertise, assembling the data catalog is a collaborative effort.
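The mapping role of the catalog can be sketched as a lookup from a business term to the physical locations documented in the data dictionaries. Everything named here (`DataLocation`, `CatalogEntry`, `locate`, the database, table, and column names, and the `certified` flag) is a hypothetical illustration rather than the layout of any particular catalog product.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataLocation:
    """Points at a column that some database's data dictionary describes."""
    database: str
    table: str
    column: str

    def path(self) -> str:
        return f"{self.database}.{self.table}.{self.column}"


@dataclass(frozen=True)
class CatalogEntry:
    """Links one business glossary term to where its data physically lives."""
    business_term: str
    locations: tuple[DataLocation, ...]
    certified: bool  # whether the data set is approved for general use


# Hypothetical catalog contents.
catalog = {
    "Conversion": CatalogEntry(
        "Conversion",
        (
            DataLocation("sales_db", "orders", "first_order_flag"),
            DataLocation("marketing_db", "campaign_results", "converted"),
        ),
        certified=True,
    ),
    "Prospect": CatalogEntry(
        "Prospect",
        (DataLocation("crm_db", "leads", "lead_id"),),
        certified=False,
    ),
}


def locate(term: str, certified_only: bool = True) -> list[str]:
    """Return the physical paths for a business term, optionally certified only."""
    entry = catalog.get(term)
    if entry is None or (certified_only and not entry.certified):
        return []
    return [loc.path() for loc in entry.locations]


print(locate("Conversion"))
```

A marketplace-style front end would sit on top of a lookup like `locate`, exposing only certified data sets by default.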

Summary

Of course, the success you realize from the assembly and use of these data governance assets is entirely dependent on other pillars of a solid data governance program such as a data quality initiative, master data management, compliance and security concerns, etc. Please share your thoughts in the comments section or by direct message.

Dirk Garner is Principal Consultant at Garner Consulting, providing data strategy consulting and advisory services. He can be contacted via email: dirkgarner@garnerconsulting.com or through LinkedIn: http://www.linkedin.com/in/dirkgarner

See more on the Garner Consulting blog: http://www.garnerconsulting.com/blog-busglossdatadictdatacat.html

 

The Top 3 Business Drivers for Data Virtualization

• Data virtualization offers best-in-class data integration capabilities to accelerate your analytics, data science and BI initiatives.
• Data virtualization empowers businesses through rapid data discovery, unified data access and the efficiencies of collaborative analytics.
• Data virtualization unleashes the power of self-sufficiency for business analysts and power-users to create as-needed custom views that display information precisely as they’d like for each unique business initiative.
• Data virtualization can save countless hours by eliminating typical roadblocks such as difficult-to-access data, funding for lengthy ETL projects, and the headaches of informal and inconsistent analytics calculations based on siloed data within organizations.
• Data virtualization provides these capabilities by abstracting and simplifying the complexity of locating, joining and filtering multiple simultaneous data sources. Even complicated transformations, cleansing and aggregations can easily be performed through a visual interface without the need for advanced SQL development skills.

Introduction to Data Virtualization

Many organizations face data integration and accessibility challenges as they seek to deliver ever-increasing amounts of data into the hands of more people for exploration and analysis. Data virtualization is an approach, together with a set of technologies and practices, that addresses these challenges and empowers organizations with data. Though data virtualization is neither new nor without its complexities, businesses stand to gain value and efficiencies through adoption. Specifically, three primary capabilities are driving businesses towards data virtualization: data unification, business agility, and synergies with data governance.

• Data unification enables discovery for enterprise analytics by providing a single repository to access, manipulate and leverage enterprise information assets.
• Business agility in data exploration and discovery accelerates time to insight.
• Synergy with data governance makes data virtualization an effective governance catalyst: it minimizes redundant and repetitive efforts and drives standardization of KPIs, metrics and reports, improving confidence in the quality and accuracy of the underlying data.

Enabling Discovery through Data Unification – Quick and Efficient Data Access

Data virtualization provides the crucial function of unifying data sources, which centralizes access through a single location. Data unification is the process whereby multiple disparate data sources are made accessible from one location without the need for physical data integration, copying or moving data. This approach quickly creates a single repository in which analysts can explore, discover, and query the entire depth and breadth of enterprise information.

By unifying data sources where they exist (rather than copying data to a central location) multiple disparate data stores can be integrated – regardless of geographic location and without delays caused by copying data. Because of this, data virtualization accelerates and empowers data science, business analytics and business intelligence functions by increasing the breadth of data availability, which in turn empowers self-sufficiency.
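The idea of joining sources where they live can be illustrated on a very small scale with SQLite, whose `ATTACH DATABASE` statement lets one connection query several database files in place. This is only a stand-in under stated assumptions: a real data virtualization platform federates heterogeneous systems, and the file names, tables, and sample rows below are invented for the sketch.

```python
import os
import sqlite3
import tempfile


def build_source(path, ddl, insert_sql, rows):
    """Create one independent 'source system' as its own SQLite file."""
    conn = sqlite3.connect(path)
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    conn.close()


workdir = tempfile.mkdtemp()
crm_path = os.path.join(workdir, "crm.db")        # hypothetical CRM store
orders_path = os.path.join(workdir, "orders.db")  # hypothetical order store

build_source(
    crm_path,
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
build_source(
    orders_path,
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0)],
)

# The "unified layer": one connection that attaches both sources in place.
hub = sqlite3.connect(":memory:")
hub.execute("ATTACH DATABASE ? AS crm", (crm_path,))
hub.execute("ATTACH DATABASE ? AS sales", (orders_path,))

# A single query spans both sources; no rows are copied into the hub.
rows = hub.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM crm.customers AS c
    JOIN sales.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
hub.close()
```

The analyst sees one place to query, while each source keeps its own storage and ownership.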

Data virtualization improves time to business insight by placing all enterprise data at the fingertips of users, including non-traditional data types such as unstructured data, clickstream, web-originated or cloud-based data. Regardless of the existing infrastructure (for example, a data warehouse, a data lake, or data that is currently spread across multiple isolated data silos), data virtualization creates an environment that helps bring everything together, both now and in the future when new data stores and sources are added.

Business Agility & Collaborative Analytics – Reusability, Consistency, Self-Sufficiency

By reducing the analyst’s dependency on IT for data acquisition and data preparation, data virtualization enables self-sufficiency and, therefore, agility. Data virtualization makes it possible for business analysts to manipulate data on-the-fly, iterating through multiple perspectives in real time without the need to copy or move the data. This dynamic view creation makes it possible to rapidly prototype, experiment, and iterate in order to see, manipulate and use the data exactly as needed to meet each unique requirement. No time is wasted physically cleansing, remodeling, preparing, moving or copying the data when using data virtualization. These functions are carried out in real time, as needed, and can be quickly and easily modified to meet the needs of each unique data-driven effort. Creating queryable virtual joins in minutes can save a tremendous amount of time.
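A plain SQL view gives a small-scale feel for this kind of on-the-fly preparation: the cleansing logic lives in the view definition and runs at query time, so the stored rows are never rewritten. The sketch below uses an in-memory SQLite database as a stand-in for a virtual layer; the table `raw_leads`, the view `clean_leads`, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_leads (email TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO raw_leads VALUES (?, ?)",
    [("  Ann@Example.com ", "emea"), ("bob@example.com", "EMEA"), ("", "apac")],
)

# The view cleanses at read time: trims, normalizes case, drops blank emails.
conn.execute(
    """
    CREATE VIEW clean_leads AS
    SELECT lower(trim(email)) AS email, upper(region) AS region
    FROM raw_leads
    WHERE trim(email) <> ''
    """
)

before = conn.execute(
    "SELECT email, region FROM clean_leads ORDER BY email"
).fetchall()

# New source rows appear through the view immediately, with no reload step.
conn.execute("INSERT INTO raw_leads VALUES (?, ?)", ("CAT@EXAMPLE.COM", "apac"))
after = conn.execute(
    "SELECT email, region FROM clean_leads ORDER BY email"
).fetchall()

raw_count = conn.execute("SELECT COUNT(*) FROM raw_leads").fetchone()[0]
print(before)
print(after)
print(raw_count)
conn.close()
```

Changing the business rule means redefining the view, not reprocessing the data, which is what makes quick iteration practical.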

Data Virtualization as a Data Governance Catalyst

Through intelligent sharing of information, data governance greatly improves the productivity and efficiency of analytical, BI and data science initiatives. Searchable data catalogs, standardized metrics and KPIs, data quality improvements, and master data management (MDM) solutions are just a few examples of the value attainable through a well-crafted data governance plan.

Data virtualization makes data governance more efficient and streamlines administration by centralizing data policies and administrative tasks. Because data virtualization integrates data in real time, leaving data in place and eliminating the need for redundant copies such as staging areas and operational data stores (ODS), there are fewer areas to govern and secure, which means less administration, less complexity, and less risk. Data governance measures can be applied on-the-fly as data flows through the virtual layer. Governing data and access centrally through a unified data layer eliminates redundant steps, interfaces, and procedures, and it lessens or removes altogether the need to examine and audit each individual data source.

Having a single security and access model to manage and maintain across all data sources greatly simplifies all facets of data security management: it provides a single platform for administration, so there is no need to juggle the many administrative applications that correspond to each individual data storage server. Data policies can be defined on a shared/common data model or on logical data objects for efficient, sustainable management and reuse.
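One way to picture policies defined once on logical data objects is a small policy table keyed by logical field name and applied to rows from any source as they pass through a shared layer. This is a sketch under assumptions, not how any particular product implements it; the role names, the `POLICIES` table, `mask_email`, and `read_through_layer` are all hypothetical.

```python
def mask_email(value):
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"


def hide(_value):
    """Suppress the value entirely."""
    return None


# One shared policy table: logical field -> (roles allowed raw access, mask).
POLICIES = {
    "email": ({"data_steward"}, mask_email),
    "salary": ({"data_steward", "hr_analyst"}, hide),
}


def read_through_layer(rows, role):
    """Apply the shared policies to rows from any source as they pass through."""
    governed_rows = []
    for row in rows:
        governed = {}
        for name, value in row.items():
            allowed_roles, mask = POLICIES.get(name, (None, None))
            if allowed_roles is None or role in allowed_roles:
                governed[name] = value        # ungoverned field or permitted role
            else:
                governed[name] = mask(value)  # policy applied on the fly
        governed_rows.append(governed)
    return governed_rows


# Two hypothetical sources that share logical field names.
crm_rows = [{"name": "Ann", "email": "ann@example.com"}]
hr_rows = [{"name": "Ann", "email": "ann@example.com", "salary": 90000}]

print(read_through_layer(crm_rows, "bi_analyst"))
print(read_through_layer(hr_rows, "hr_analyst"))
```

Because the rule for `email` is written once, it applies identically to both sources, which is the reuse the shared data model is meant to deliver.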

Summary

One or more of these drivers will generally resonate so strongly within an organization that it will pursue data virtualization to meet those specific needs. This generally leads the organization to leverage data virtualization further, in pursuit of additional value through the other business drivers, as the platform, team, and community mature. Data virtualization products such as those available from Red Hat JBoss, Stone Bond Technologies, and Data Virtuality stand out among the crowd as some of the more innovative approaches to data virtualization.
