Metadata Standards in the Marketplace -
Why Do I Care? (And Where Does Godzilla Fit In?)
Reed and D. Geller
Metadata ("data about data") is essential for data warehousing. In order
to populate a database, extract data, or run a report, more is required
than simply raw data. The tools involved must also "understand" the context,
or meaning, of the data. This is one of the purposes of metadata. Consider
2.5. This could be an amount in dollars, yen, or euros. It could be a
mail stop at a company. It could be the height in meters of a member of
the Boston Celtics. Nothing about the number itself can tell you what
it means. To interpret an item like this, you need metadata.
Metadata is a description of data that tells you how to interpret it.
Sometimes the metadata description can be deduced from the structure or
the column names in a database; sometimes it is found in the programs
that use it, and sometimes it resides only in comments or text. Frequently
it is a combination of these.
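The point about the ambiguous value 2.5 can be made concrete with a small sketch. The field names and the metadata structure below are invented for illustration; they do not come from any standard:

```python
# A minimal sketch (hypothetical field names): the raw value 2.5 is
# meaningless on its own; attaching metadata tells a tool how to read it.

raw_value = 2.5  # Dollars? Meters? A mail stop? The number cannot say.

# Metadata describing the column this value came from.
column_metadata = {
    "name": "height",
    "datatype": "float",
    "unit": "meters",
    "description": "Height of a player on the roster",
}

def interpret(value, metadata):
    """Render a value using its metadata description."""
    return f"{metadata['name']} = {value} {metadata['unit']} ({metadata['datatype']})"

print(interpret(raw_value, column_metadata))
# prints "height = 2.5 meters (float)"
```

Swap in a different metadata record (unit "USD", datatype "currency") and the same 2.5 reads entirely differently, which is exactly the interpretation problem metadata solves.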
When dealing with a single database, the use of metadata is as natural as swimming
is to a fish - and just as invisible as the water. However, when dealing
with heterogeneous environments, things get much more opaque. As one TEC
analyst put it, once you've begun building and maintaining a data warehouse,
metadata problems begin to surface "like Godzilla lurching out of Tokyo
harbor." Metadata becomes important in a data warehousing context because
the value of a data warehouse derives from the possibility of having application
suites that tie the source databases with the target data warehouse and
produce the reports that give a data warehouse value.
If you write your own suite of programs or can find a single vendor solution
that handles all of your needs, you might still be able to get away with
ignoring the metadata. But in most cases you'll need to tie together applications
from different vendors, possibly including some homegrown ones, to make
your data warehouse work. It's at this point that you start paying attention
to metadata and, more specifically, metadata standards.
A simplified definition of metadata: One type of metadata defines
the meaning of data. It is made up of entities, attributes, and relationships.
For example, a table is an entity. It has attributes such as a name. A
column is also an entity. It also has attributes such as a datatype. A
column is related to a specific table. This is an example of a relationship.
Any database has hundreds or thousands of entities, each entity having
many attributes and relationships. It is this information that is necessary
for any useful manipulation of data.
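The entity/attribute/relationship model described above can be sketched in a few lines of code. The class and field names here are illustrative only, not taken from any metadata standard:

```python
# A sketch of the entity/attribute/relationship model: tables and columns
# are entities, their names and datatypes are attributes, and a column's
# membership in a table is a relationship.
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str        # attribute: the column's name
    datatype: str    # attribute: the column's datatype

@dataclass
class Table:
    name: str                                    # attribute: the table's name
    columns: list = field(default_factory=list)  # relationship: columns belong to this table

    def add_column(self, name, datatype):
        self.columns.append(Column(name, datatype))

orders = Table("ORDERS")
orders.add_column("ORDER_ID", "INTEGER")
orders.add_column("AMOUNT", "DECIMAL(10,2)")

# A tool holding this metadata can now answer structural questions:
print([(c.name, c.datatype) for c in orders.columns])
```

A real database holds hundreds or thousands of such entities; a metadata repository is essentially this structure at scale, persisted and shared.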
This note gives a high-level overview of metadata from the viewpoint of warehouse
tools and repositories, and the standards that have been promulgated to
support its use and interchange.
Relevance to the Customer
Why should an Information Technology manager be interested in standards
for metadata? The data warehousing industry is rapidly building "suites"
of applications that will allow for:
- The population of a data warehouse or data mart
- Intelligent reporting and analysis of that data.
In order to do this, the tools must be provided the metadata information
about the source systems and the target data warehouse. Most suites on
the market are incomplete, so it is common for customers to purchase different
tools from different vendors. If each vendor tool has its own metadata
store (often referred to as a "repository"), the customer has to supply
metadata information individually to each tool, repetitively.
Current metadata repositories often use static metadata with batch, file-based
exchange mechanisms. When tables change, each tool must have its metadata
definitions refreshed. As the number of tools and the volatility of the
data warehouse increase, this becomes extremely cumbersome.
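The maintenance burden described above can be sketched as follows. The tool names and data structures are hypothetical; the point is only that every schema change must be pushed into every tool's private metadata store separately:

```python
# Hypothetical sketch: each tool keeps its own static copy of the
# warehouse metadata, so one schema change means N separate refreshes.

warehouse_schema = {"CUSTOMER": ["ID", "NAME", "REGION"]}

# Three tools, three private metadata stores (all just copies).
tool_repositories = {
    "etl_tool": dict(warehouse_schema),
    "report_tool": dict(warehouse_schema),
    "olap_tool": dict(warehouse_schema),
}

def refresh_all(table, columns):
    """Batch-style refresh: re-supply the same definition to every tool."""
    for repo in tool_repositories.values():
        repo[table] = list(columns)  # the same work, repeated once per tool

# A single table change (a new SEGMENT column) triggers an update per tool.
refresh_all("CUSTOMER", ["ID", "NAME", "REGION", "SEGMENT"])
print(all("SEGMENT" in repo["CUSTOMER"] for repo in tool_repositories.values()))
```

With five tools and weekly schema changes, the refresh work grows multiplicatively, which is the "extremely cumbersome" scenario the text describes.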
Ideally, distributed metadata repositories could dynamically exchange
information and keep each other synchronized. This would greatly reduce
the customer's workload and ensure correct metadata across domains and
life cycles. (Note: currently, even tools from the same vendor may not
be able to share metadata).
In order for metadata to be useful, it must be represented in a format
the data warehouse tool can understand. It must also be accessible to
the tool. Over the years, vendors have used many different methods of metadata
representation and storage. Recently, they have formed standards bodies
and reduced the number of standards to two. As Richard Soley, Chairman
and CEO of the Object Management Group described it to TEC, "Better two
standards than ten". It certainly could be argued that one would be even
better than two, but that does not appear likely at this time.
The OMG is the primary standards body in this scenario, and has presented a widely
supported standard called CWMI (Common Warehouse Metadata Interchange).
The second standards body involved is the MetaData Coalition (MDC), to
which Microsoft handed over its standard, OIM (Open Information Model).
Much work has been done to bring the standards closer together, and an
XMI (XML Metadata Interchange) "bridge" has been written to allow OIM-compliant
products to interact with CWMI-compliant products. However, the standards
are substantially different in their implementations. The major difference
in the implementations is that OIM is based strictly on Microsoft standards
and products, and CWMI is an open standard that will also work on UNIX,
mainframes, and other systems.
OMG: Object Management Group. An international organization founded
in 1989 to endorse technologies as open standards for object-oriented
applications. The consortium now includes over 800 members.
MDC: Meta Data Coalition. A consortium founded in 1995 with close to 50 vendors
and end-users whose goal is to provide a tactical solution for metadata interchange.
CORBA: Common Object Request Broker Architecture. A standard from the OMG for
communicating between distributed objects (objects are self-contained
software modules). CORBA provides a way to execute programs (objects)
written in different programming languages running on different platforms
(e.g., UNIX, mainframe, Windows), no matter where they reside in the network.
The CORBA standard competes to some degree with DCOM, although through
COM-CORBA bridging products, both can be used in cooperation.
IIOP: Internet Inter-ORB Protocol. The CORBA messaging protocol used on a TCP/IP
network. It allows programs (objects) to be run remotely in a network.
IIOP links TCP/IP to CORBA's General Inter-ORB protocol (GIOP), which
specifies how CORBA's Object Request Brokers (ORBs) communicate with each
other. When a user accesses a Web page that uses a CORBA object, a small
Java applet is downloaded into the web browser, which invokes the ORB
to pass data to the object, execute the object and get the results back.
DCOM: Distributed Component Object Model. Microsoft's technology for distributed
objects. DCOM defines the remote procedure call, which allows those objects
to be run remotely over the network. DCOM only functions in a Microsoft environment.
UML: Unified Modeling Language. An object-oriented design language from the
OMG. Many design methodologies for describing object-oriented systems
were developed in the late 1980s. UML "unifies" the popular methods into
a single standard, including Grady Booch's work at Rational Software,
James Rumbaugh's Object Modeling Technique and Ivar Jacobson's work on
use cases. In only four years, UML has become the software industry's
dominant modeling language. UML 1.3 was ratified at OMG's meeting in November
1999. Until June of 1999, Microsoft had their own "flavor" of UML, but
they now adhere to the standard.
MOF: Meta Object Facility. This OMG specification
provides a set of CORBA interfaces that can be used to define and manipulate
a set of interoperable metamodels. MOF is a key to the integration of metamodels.
XML: Extensible Markup Language. The World Wide Web Consortium's document format
for the Web that is more flexible than the standard HTML (HyperText Markup
Language) format. While HTML uses only predefined tags to describe elements
within the page, XML allows tags to be defined by the developer of the page.
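The difference between HTML's fixed tags and XML's developer-defined tags is easy to show. The tag and attribute names in this toy metadata document are made up for illustration; they do not come from any standard:

```python
# Unlike HTML's predefined tag set, XML lets the developer invent tags
# that carry domain meaning. Here the tags describe database metadata.
import xml.etree.ElementTree as ET

doc = """
<table name="ORDERS">
  <column name="ORDER_ID" datatype="INTEGER"/>
  <column name="AMOUNT" datatype="DECIMAL"/>
</table>
"""

root = ET.fromstring(doc)
for col in root.findall("column"):
    print(col.get("name"), col.get("datatype"))
```

Any tool that agrees on these tag names can read the document, which is why XML became the favored carrier for metadata interchange.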
IDL: Interface Definition Language. A language used to describe the interface
to a routine or function. For example, objects in the CORBA distributed
object environment are defined by an IDL, which describes the services
performed by the object and how the data is to be passed to it.
SGML: Standard Generalized Markup Language. An ISO standard for defining the
format of a text document. An SGML document uses a separate Document Type
Definition (DTD) file that defines the format codes, or tags, embedded in the document.
DTD: Document Type Definition. A language that describes the contents of an SGML
document. The DTD is also used with XML, and the DTD definitions may be
embedded within an XML document or in a separate file.
XMI: XML Metadata Interchange. A method for two MOF-compliant repositories
to exchange information. When the OMG was questioned by TEC about Microsoft's
position on support for XMI, Grady Booch, Chief Scientist of Rational
Software and one of the developers of UML, stated that they "could not
comment publicly". TEC received the same "no comment" response from every
member of the OMG that we questioned.
CWMI: Common Warehouse Metadata Interchange. A universal data format for interchange
of metadata between data warehouse and business intelligence products.
Developed by the Object Management Group in conjunction with a consortium
of over 700 vendors.
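What an XML-based interchange format buys the customer can be sketched as a round trip: tool A serializes its column definitions to XML, and tool B parses them into its own store with no manual re-entry. The element names below are invented; CWMI and XMI define their own, far richer, formats:

```python
# Hedged sketch of XML metadata interchange between two hypothetical tools.
# (Element names are illustrative, not the actual CWMI/XMI vocabulary.)
import xml.etree.ElementTree as ET

def export_metadata(tables):
    """Tool A: write table/column metadata as an XML interchange document."""
    root = ET.Element("warehouse")
    for tname, cols in tables.items():
        t = ET.SubElement(root, "table", name=tname)
        for cname, dtype in cols:
            ET.SubElement(t, "column", name=cname, datatype=dtype)
    return ET.tostring(root, encoding="unicode")

def import_metadata(xml_text):
    """Tool B: read the interchange document back into its own store."""
    root = ET.fromstring(xml_text)
    return {
        t.get("name"): [(c.get("name"), c.get("datatype")) for c in t.findall("column")]
        for t in root.findall("table")
    }

tool_a = {"ORDERS": [("ORDER_ID", "INTEGER"), ("AMOUNT", "DECIMAL")]}
wire = export_metadata(tool_a)   # tool A writes the interchange file
tool_b = import_metadata(wire)   # tool B reads it; the repositories now agree
print(tool_b == tool_a)
```

A shared standard means every vendor writes one exporter and one importer against the common format, instead of a custom bridge per tool pair.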
OIM: Open Information Model. A competing standard to CWMI. Written by Microsoft and turned over to
the MDC. Co-developed by over 20 companies. Based on standards such as
SQL and COM, and used for metadata interchange.
We believe that the variance between the MDC and the OMG standards will
continue to shrink, due to market pressure from major customers who are
growing tired of having to hand-craft integration strategies between business
intelligence and data warehousing products (probability 80%). A single
standard for "plug and play" metadata interchange will be a powerful market
force in the future, especially for the vendor who implements it first.
Customers are becoming increasingly frustrated at the level of effort required to
integrate different vendors' "best of breed" tools. There will come a time
very soon when they will refuse and vendors will be forced to work together,
whether they get along with each other or not.
Ensure compliance with the standards promulgated by the Object Management
Group. Much has been done to ensure interoperability between the two standards
(i.e., the Microsoft XML Interchange Initiative), but it is not yet a reality.
Some vendors have created proprietary application programming interfaces
to allow other vendors to interoperate (for example the Informatica MX2
API), but this limits customers to the use of tools whose vendors have
developed the "plug-ins" to the Informatica product. Other vendors have
taken a pure XML approach, which is a more open course of action.
Customers must make their data warehousing vendors carefully articulate
what standards have been used in the creation of their product, or whether
the product is completely proprietary. As the various tools are chosen,
care must be taken to ensure that each additional tool interoperates with
the others in the required manner.
In a heterogeneous environment, this becomes even more important, since the
Microsoft solution is a Windows-only one unless bridging products are used.
It is almost unheard of for any major company to have a completely homogeneous
environment, since many legacy and mainframe systems are still in use (e.g.,
IBM MVS, DEC VMS, Adabas, MUMPS). Metadata is therefore extremely important
to the customer and must be kept in mind at all times.