Open Source Business Intelligence: The Quiet Evolution


Originally Published - January 19, 2009

In the current economic climate, organizations have to review and rationalize expenses associated with all enterprise software. As a direct consequence, open source business intelligence (BI) is emerging as an important choice for new as well as existing BI implementations. Even though most analyst research indicates that its evolution may have been understated thus far, open source BI is growing rapidly. Open source BI solutions have already been proved to complement and integrate well with traditional BI environments. In their own right, open source BI vendors offer competitive technologies and present the irrefutable advantage of cost savings.

The term "open source software" is often assumed to mean "free access to source code." However, the scope of open source software has widened considerably. The open source software license (often referred to as copyleft) is subject to regulations defined by the Open Source Initiative (OSI)—which dictates that open source software cannot discriminate against groups or technology and must be free to distribute.

Such software can be modified, but redistribution of modified licenses may be conditional (for instance, the license may require that changes be patch files rather than be integrated with the original code) to protect the integrity of the original author's work. The best-known open source package is the LAMP solution stack, which is comprised of the Linux operating system, the Apache hypertext transfer protocol (HTTP) Server, the MySQL database management system, and programming languages including PHP, Perl, and Python.

Open Source and Business Intelligence: The Common Thread

"Open source applications" is the term that describes systems built using open source software in the form of frameworks or libraries. Although copyleft licenses do not permit organizations to resell software developed using open software, mechanisms such as dual-license models have arisen, whereby commercial vendors can deliver their software under a community license that follows the open source license regulations and offers a commercial license with an attached fee. Vendors may charge users for services such as support, training, consulting, and advanced features.

In the past two years, commercial open source vendors have been working actively towards establishing a long-term position in the enterprise applications space. In February 2007, the Open Solutions Alliance (OSA) was formed to bring together commercial open source software businesses; its main purpose is to broaden the horizon of open source applications and most importantly, foster interoperability between them. JasperSoft, one of the pioneers of open source BI is among the founding members of this alliance. Pentaho, another open source BI vendor, has set itself apart by leading and sponsoring all of its core projects, implementing open industry standards and establishing partnerships with vendors of data warehouse technology, such as InfoBright and ParAccel.

BI has some of the most challenging technology problems among all enterprise software applications. These challenges include the design of very large databases; complex data integration between disparate and multiple data sources; the ability to search across a surfeit of information; and some of the most stringent performance and latency requirements. Even with proprietary solutions, organizations need a team of experienced professionals—including database administrators, business analysts, and programmers—to implement and support a data warehouse and BI environment.

Open source BI goes one step further: it encourages organizations to use and modify the software as needed and share advances with the rest of the community. It seems only natural that open source and BI technologies have converged. A crucial factor to consider when adopting an open source BI solution is that underlying technologies are often, if not always, open source themselves; although not mandatory, it is prudent to have technical teams acquire the necessary skills. For instance, most open source BI software is built on the LAMP stack. In order to adopt and maintain the applications, technical teams need to have development and administration skills using the LAMP stack.

Pentaho: A Case in Point

That open source BI can provide a full-fledged solution to an organization's BI needs can be demonstrated by looking at how Pentaho's platform addresses the principal requirements of BI—data integration, reporting, and analysis.

ETTL with Kettle

Pentaho's BI platform implements the Common Warehouse Metamodel (CWM). The CWM, which has been implemented by proprietary vendors such as Informatica, is a specification that proposes using XML metadata interchange (XMI) to interchange data warehouse metadata. This entails that mappings can be migrated between tools that implement this interface. Pentaho's extract, transform, and load (ETL) system is based on its Kettle project. Kettle stands for "Kettle ETTL Environment," where ETTL is Pentaho's acronym for "extraction, transformation, transportation, and loading" of data. The ETL system supports: a variety of steps (a step represents the smallest unit in a transformation and contains either predefined or custom logic that is applied to each row as it makes its way from the source to the target); slowly changing dimensions (SCDs); connectors for a multitude of data sources (access to proprietary databases such as Microsoft SQL Server and Oracle is via Java Database Connectivity [JDBC]); and the ability to execute and schedule jobs both locally and remotely. Scripting in Javascript as well as pure Java allows developers to add custom code in any step of the transformation.

Two challenging issues that organizations face are data volume and latency requirements. In order to support high data volume environments, Pentaho has a clustering solution (a solution that uses more than one node or computing entity in order to achieve high performance and availability) that works alongside database partitioning; by using slave servers (a group of servers that perform specific tasks using the data sent by the master server) to distribute central processing unit (CPU) and input/output (I/O) load, performance can be improved by way of this parallelism. However, change data capture, which is based on a data integration technique that triggers data transfer by listening for changes in data sources, is not supported. Changes in data sources are detected by reading transaction logs; with the exception of open source databases, transaction log readers are seldom open source.

Although Pentaho's data integration still lacks a data quality and data cleansing solution, the development of a profiling server (a server dedicated to performing profiling tasks that help discover aberrations in data; see Distilling Data: The Importance of Data Quality in Business Intelligence) seems to be on the list of imminent improvements. In such situations, where the vendor does not support a specific functionality, organizations can look to complementary open source solutions; the DataCleaner project from, for instance, provides functionality to help profile data and monitor data quality. It also points to a significant advantage with open source applications: the fact that software is developed by the community and for the community makes it much simpler to share innovative solutions quickly and seamlessly.

Analysis with Mondrian

The structure of enterprise business activities is almost always multidimensional. This is because the content of a business is defined in terms of quantifiable or measurable properties (e.g., sales, inventory, or donations) and qualitative attributes (e.g., students, customers, or products). Each business activity can involve a combination of these quantitative and qualitative entities. Although enterprise systems may actually store incoming activities in a relational format, a highly responsive, multidimensional environment is required to analyze and gain insight into the entire business.

Online analytical processing (OLAP), still a growing field in terms of research and development, refers to a manner of storing and querying very large volumes of data across multiple dimensions. The particulars of multidimensional OLAP (MOLAP) versus relational OLAP (ROLAP) still evoke a vigorous debate. But the choice depends entirely on the nature of data, latency, and resources (both hardware and software). For instance, ROLAP may provide a better solution for data that is dimension-intensive or in situations where latency needs to be very low or close to real time. On the other hand, MOLAP may be better suited for large sets of aggregations and more lenient latency requirements. In either case, adherence to sound design principles is essential for creating a successful OLAP solution.

Pentaho's answer to the question of multidimensional analysis is a ROLAP engine called Mondrian. The most important aspect of OLAP is how and where aggregations are stored. In a ROLAP environment, as with Mondrian, data and aggregations are stored in a relational database. Precomputed aggregates are stored in tables alongside the base fact tables. Such aggregate structures are necessary to avoid calculations over millions of fact records for each query. These tables are not part of the analytical engine; they have to be built using an ETL-style process. Pentaho offers a tool called Aggregation Designer that helps create and maintain aggregate tables. Mondrian includes an in-memory aggregate cache that saves multidimensional result-sets on first access for use in subsequent calculations. The extensive CacheControl application programming interface (API) is included for granular access to Mondrian's cache.

Organizations can choose from several approaches to provide a client tool for multidimensional analysis. A complementary open source project called JPivot offers a pivot-table client tool, written in Java Server Pages (JSP), to browse cubes created using Mondrian. Mondrian also provides a multidimensional expressions (MDX) interface (note that this is not entirely the same as Microsoft's implementation of MDX). Developers can write in-house applications using olap4j (or, OLAP for Java), an open specification being developed by several open source companies including Pentaho, JasperSoft, and LucidEra.

All that Jazz—Reports and Dashboards

All BI environments need to provide a complete suite of tools to create, publish, distribute, and schedule rich report content. Pentaho's Classic Engine is based on the banded reporting design. Banded layouts divide the report into sections, and the reporting engine traverses the data and fits the data into the predefined bands. In the classic banded engine, the data being sent to the report determines how the report appears. This paradigm has been—and continues to be—used successfully by many reporting tools. However, a relatively new model that is gaining popularity is based on the output rather than the data driving the processing of a report. The Flow Engine, still in development, will work on a report definition built using the Document Object Model (DOM); the final output will be rendered by combining the definition and incoming data. The Report Designer includes support for several data sources, a variety of formatting options, and the ability to render multilingual reports in hypertext markup language (HTML), portable document format (PDF), and Excel (XLS), among other output formats. An AJAX-based, thin-client, ad hoc reporting tool is also available in Pentaho's reporting suite. Reports in the ad hoc tool can be designed using the metadata layer, while the report designer, in addition to the metadata layer, can connect directly to data sources.

Pentaho Dashboards provide the ability to define metrics that are important to an enterprise and deploy them using a variety of user interface features: spreadsheet-style grids, integration with Google Maps, cross-tab reports, and drilldown to reports or multidimensional analysis. Integration to other web content through frames or AJAX components is also possible. The Community Dashboard Framework (CDF), developed by senior members of Pentaho's community, makes it simpler to develop new dashboards by defining the various components using a fairly straightforward syntax, without programming the interface. Forum discussions indicate that Pentaho may integrate the CDF into its product.

The Closing Statement

Although open source BI solutions may not yet have the longevity or maturity of traditional BI solutions, the evolution of open source BI is gaining momentum as its credibility and relevance increase. With minimal risk, organizations can discover whether open source BI will work for them by building application prototypes. Components of open source BI software can also be integrated into existing BI implementations for additional functionality. There is significant transparency in terms of technologies and product roadmaps. Collaborations and partnerships between open source vendors are constantly being established. Committed user communities make it possible to benefit from experience across several companies and platforms.

To learn more about open source software, visit

To learn more about business intelligence solutions, visit

About the Author

Anna Mallikarjunan is a member of TEC's research & development team. She is responsible for the analysis and development of TEC's decision support software as well as tools for business intelligence (BI). She has over four years of business analysis, design, and development experience in several areas of BI, including data warehousing; extract, transform, and load (ETL); online analytical processing (OLAP); reporting; and custom application development.

Past positions Mallikarjunan has held include technical lead and applications development manager of a team of .NET, data warehousing, and BI professionals for a fashion retail company. In this role, she was responsible for the development, maintenance, and support of Windows and Web-based applications, as well as an operational data store, data marts, and BI applications.

Mallikarjunan holds a BSc in computer science from the University of Madras (India), and an MSc in computer science from Anna University in Madras, India.

comments powered by Disqus