Paradoxes of Software Estimation
Written By: Murali Chemuturi
Published On: August 18 2006
Software development has spawned an independent industry, with organizations offering software development services exclusively. As it is perhaps in the nascent stages, the processes of asking for service, offering a service, and pricing are all somewhat haphazard. Software development falls into the category of the services industry as opposed to the product industry—that is, a service is offered, and not a product. Many parallels can be drawn with similar service industries. The major difference between the software service industry and other service industries, however, is that software is much more highly priced and complex. Where there is complexity and money, academics step in; research is conducted; jargon is developed; concepts are proposed; and a new branch of science or engineering comes to life.
There are many paradoxes inherent to the software development industry. This paper discusses only some of them. Why not all? The reason is simple: as of yet there is no comprehensive document on the sum total of paradoxes, and work is still ongoing.
Paradox of Estimation
Why do we carry out software estimations? Typically, for three reasons: to price software development and maintenance contracts; to estimate resources; and to manage delivery commitments.
When we estimate for resource estimation or delivery commitments, we can always estimate for each of the activities—say, for coding, for code walk-through, for testing, and so on—to arrive at the resource requirements for each of these activities. And by summing up the individual requirements, we can arrive at the total resource requirements for the project.
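The bottom-up summation described above can be sketched in a few lines. The activity names follow the text; the person-day figures and the `total_effort` helper are illustrative assumptions, not data from any real project.

```python
# Hypothetical per-activity effort estimates, in person-days (illustrative values only).
activity_estimates = {
    "coding": 40.0,
    "code walk-through": 6.0,
    "testing": 18.0,
}

def total_effort(estimates: dict[str, float]) -> float:
    """Sum the individual activity estimates into a project total."""
    return sum(estimates.values())

total = total_effort(activity_estimates)
for activity, person_days in activity_estimates.items():
    share = person_days / total
    print(f"{activity:<18} {person_days:6.1f} PD  ({share:.0%})")
print(f"{'total':<18} {total:6.1f} PD")
```

Keeping the per-activity figures alongside the total is what makes this form of estimate usable for resource planning: each activity can be staffed separately.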
Software estimation becomes contentious, however, when we estimate for software pricing. We need to arrive at an estimate that is understood by the client's purchaser, who is assumed not to be a software developer and who is expected to be unfamiliar with software estimation. Typically, the scenario is as follows:
- There are multiple bidders.
- The capability levels of the bidders vary, and in some cases they vary vastly.
- There will be techno-commercial negotiations—which are mostly commercial in nature.
- The technical question is something like "how did you arrive at this price?"
- The answer is expected to be in non-technical terms, and to facilitate comparison with other bids.
This is the scenario: the negotiators want a universal norm that can be applied across platforms, across organizations, across technologies, and across the board. This is the crux of the software estimation concern.
Paradox of Software Size
There is ample literature on sizing software. Attempts have been made to measure software the way distance or weight is measured, or to find a unit of measure that is acceptable to all. The result is that we have many measures of software size: examples include lines of code, function points, use case points, object points, feature points, internet points, test points, and there may even be more. True, there are multiple measures for distance (miles and kilometers) and for weight (pounds and kilograms). But you can convert pounds into kilograms and miles into kilometers! There is no formula that says 1 use case point is equal to 1.2 function points, or anything like that!
Everyone agrees that some things are not amenable to measurement, such as beauty and love. Each person is beautiful in a unique way, and everyone loves in one's own way. We don't attempt to measure these things in beauty points or love points—do we?
There are many examples in industry too. We do not measure a car—do we have car points to say that a BMW has twenty-five car points and a Toyota has fifteen car points, and that therefore BMW is superior by ten car points? How does one compare different cars? We don't have a measure for cars that permits that comparison to be made.
We don't attempt to measure software products either. What is the size of SQL Server or Oracle? Both are multi-user relational database management systems (RDBMS), but when buying, do we ask their sizes to get a fair comparison? We also do not have a fair measure for computer hardware. How do we compare an AS/400 with an RS/6000? Are there computer points to measure their sizes?
Is there any measure for buildings? A gym, a theatre, and a home may all be ten thousand square feet in area, but do they all have the same measure? Do we have a size measure to compare them? And lastly, let us take catering service. The same menu served at the same place gets vastly different pricing, depending on the caterer and other specifications. Can we ask them the size of their meals, say, in meal points?
The fact of the matter is that not everything is amenable to measurement. To be able to measure, the following criteria should be satisfied:
- The thing being measured must be homogeneous.
- It needs to be physical and tangible.
- It should be monolithic and not an assembly of multiple parts (physical or metaphysical).
- It should not have any qualitative features.
- The measure must be physical and tangible.
- When there are multiple units of measure, it must be possible to use a conversion factor to convert one measurement into the other units of measure.
The examples we looked at, however, are sized up using lists of features, qualitatively described.
Paradox of Software Productivity
Productivity is roughly defined as "X units of output per unit of time." The definition of standard time (productivity) goes thus: "Standard time is the unit of time taken to accomplish a unit of defined work carried out by a qualified worker after adjustment using a given method in specified working conditions at a pace that can be maintained day after day without any physical harmful effects." This definition is specified by the American Institute of Industrial Engineers (AIIE).
Thus, in the manufacturing industry, productivity cannot be stated in a stand-alone mode: it has to be accompanied by specification of the defined unit of work, the work environment, the working methods, the tools and technologies used, and the qualified workers. Needless to say, productivity varies from organization to organization, even for well-established measures of productivity.
We have universally accepted measures of time, such as person-hours, person-days (PDs), person-months, and person-years. However, we are yet to see a universally accepted unit of measure for software output.
We could see software productivity as lines of code per PD, function points per PD, use case points per PD, object points per PD, and so on. In the manufacturing or traditional service industries, productivity is measured for one activity at a time (for example, for turning activities, milling, brick-laying, waiting tables, soldering, and so forth).
Productivity of inspection activities and functional testing is measured only in mass or batch production industries; it is not attempted in the job-order (tailor-made to customer specifications) industry. Productivity measurement of design and repair (bug-fixing) activities is likewise not attempted, as they are considered to contain a creative component.
We have not replicated this manufacturing industry model in the software development industry, and we have not defined what software productivity is. These are the questions that normally crop up: Does productivity mean coding only, or does it also include code walk-throughs, independent unit testing, and debugging? Does productivity include systems analysis and design work too? What about the inclusion of project management overhead?
In most cases I have witnessed, software productivity is specified for the entire development life cycle, with no agreement, even a tacit one, as to what constitutes a "development life cycle."
In the manufacturing industry, productivity is specified for an activity, and overall throughput is called capacity. The capacity of a plant or an organization takes into account all operations, all departments, and all activities, and specifies one figure—say, 300 cars per day, or 1 million tons per year, and so on.
Does this sound familiar? It ought to, as we frequently hear phrases like "fifty lines of Visual Basic code per person per day," or "two days per screen"! We seem to be confusing capacity with productivity.
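The difference between per-activity productivity and overall capacity can be made concrete with a small sketch. It assumes that every line of code passes through each activity in turn, so the effort per line adds up across activities; the rates and the `overall_throughput` helper are hypothetical.

```python
# Assumed productivity per activity, in lines of code per person-day (illustrative only).
activity_rates = {
    "coding": 200.0,
    "code walk-through": 1000.0,
    "unit testing": 250.0,
}

def overall_throughput(rates: dict[str, float]) -> float:
    """Lines delivered per person-day when every line passes through every activity."""
    effort_per_line = sum(1.0 / rate for rate in rates.values())
    return 1.0 / effort_per_line

capacity = overall_throughput(activity_rates)
print(f"overall: {capacity:.0f} lines per person-day")
for activity, rate in activity_rates.items():
    print(f"{activity}: {rate:.0f} lines per person-day")
```

Under these assumed rates the single overall figure is far lower than any of the activity rates, and many different mixes of activity rates would yield the same overall number. That is why a lone "lines per person per day" figure behaves like a capacity statement rather than a productivity measure.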
The software industry has not so far engaged an industrial engineer to study and come up with possible measures of software productivity. Incidentally, the industry is bereft of unions and the resulting negotiations thereof. Perhaps that is the reason why no attempts have been made to carry out scientific studies in the field of software productivity.
Thus, while there are concerns and issues, there are also solutions, so long as we do not look for one single measure of productivity for the entire workflow of software development. What needs to be accomplished is a definition of a taxonomy of software productivity, and the publication of an industry standard. This will facilitate further work.
Paradox of Offering Fixed Bids
Many services offer fixed bids—there's nothing peculiar about that. Architects offer a fixed bid after receiving complete building specs. The made-to-order industry offers a fixed bid only after receiving complete specs, and the quote would include a high-level design drawing too. A caterer would not offer a fixed bid until receiving the menu and the number of guests.
A builder offers a fixed bid, with an escalation clause, after receiving the building plans. In the construction industry, unit rates are mostly offered against a detailed tender document that gives great detail about each of the items. The total cost of the building depends on the actual quantity of the different components of the building. Software is more like the construction industry! Here's why:
- It is difficult for the users to visualize the final deliverable from the design documents (drawings).
- Users continuously ask for changes.
- There are a lot of qualitative features.
- It is very difficult to ascertain the quality of the end product simply through inspection, as destructive testing damages the product and renders it unusable.
- The variety of available components is huge, with huge differences in their quality.
- Acceptance testing is often conducted in hours or days for what is built in months or years.
- More often than not, the user feels that a better deliverable should have been possible for the price paid, or that a better vendor should have been chosen.
Paradox of Actual versus Estimated
Estimation data in other areas comes not from the people who do the job, but from work-study engineers (industrial engineers) who specialize in work measurement. In the software industry, estimation data comes from programmers or project managers—and it is derived from actual historical data.
Why don't the other industries use historical data for arriving at estimation data? Because the actual amount of time taken for a piece of work varies and depends on multiple factors:
- the skill level of the person doing the work (super, good, average, fair, or poor);
- the level of effort put in by the person (super, good, average, fair, or poor);
- the motivation level of the person;
- the environment in which the work is carried out;
- the methods of working; and
- the clarity of instructions.
We can bring in some uniformity for the last four factors, but the factors of skill and effort vary even within the same organization.
To my mind, the paradox of actual times can be best described by an analogy to an Olympic marathon: the distance to be run is known; the actual times for earlier marathons are known; all the participants are well-trained for the event; and the marathon conditions are well controlled, with no unexpected changes. Yet still the participants do not take the same amount of time to complete the race! If even the best conditions still produce variations in actual results, then how can a software project, with its myriad uncertainties, meet the estimated times?
Industries other than the software development industry (such as manufacturing, mining, and so forth) follow the concept of "a fair day's work for a fair day's pay": an estimation is based on the average effort put in by a person of average skill in specified conditions. These industries have carried out work-studies in their respective organizations, and come up with standard times for most of the activities that take place. They have also developed a number of techniques for effort estimation, including time study, micro-motion analysis, analytical estimation and synthesis, and the like. Everyone's work is measured. If a person happens to be more skilled and puts in more effort, more money is paid by way of incentive.
The actual time differs from the estimated time because of variances in the skill of the person and the effort put in. This is recognized in the industry, and estimates are never revised simply because the actual time has a variance with the estimated time. Estimates and the norms for estimation are changed only when there is a change in working environment, tools, methods, or the work itself.
The software industry has never undertaken work-studies for software development, taking refuge in the idea that software development is a creative activity. But that is not really the whole truth: the creative component is obviously present in software design, but not in coding. The concept of a fair day's work for a fair day's pay is also unheard of in the software development industry, maybe because some software engineers are paid far better than others.
Norms for software development estimation are drawn from earlier projects, and are continuously updated based on cues from completed projects. However, there are really only a few possible scenarios:
- The "Eager Beaver" Project Manager Scenario
An estimate is given to the eager beaver (EB) project manager. In order to please the boss, the EB beats the estimate. The project postmortem thus concludes that the estimate was an overestimate, and estimation norms are tightened (instead of the EB being rewarded). The next estimate is made with the new norms. As this cycle repeats, the EB is frustrated at the lack of recognition, and either delays the project or resigns.
- The "Smart-alec" Project Manager Scenario
An estimate is given to the smart-alec (SA) project manager. The SA weighs the situation and delays the project until a point of penalty avoidance. The project postmortem concludes that the estimate was an underestimate, and estimation norms are loosened (instead of the SA being punished). The next estimate is made with the new norms. As this cycle repeats, the SA continues to follow this pattern of behavior, knowing that it is successful. The project office keeps loosening the estimation norms until the marketing department complains of high quotes.
- The Purely Pragmatic Project Manager Scenario
An estimate is given to the purely pragmatic (PP) project manager. The PP plans the work to meet the estimate. The project postmortem concludes that the estimate was the right estimate, and estimation norms are retained. In this scenario, it is in fact never known whether or not the estimation norms were right in the first place.
We see that it is not clear that any of the estimates in any of the iterations are the right ones! The validation of estimates through comparison with actuality does not produce the right norms.
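The three scenarios can be played out as a small simulation. This is a minimal sketch under strong assumptions: each manager behaves the same way on every project (the text notes that the eager beaver eventually gives up), the postmortem simply resets the norm to the last actual, and the behaviour factors of 0.9, 1.1, and 1.0 are invented for illustration.

```python
# Assumed behaviour factors: actual effort as a multiple of the estimate handed over.
BEHAVIOUR = {
    "eager beaver": 0.9,      # beats the estimate
    "smart-alec": 1.1,        # stretches to just short of a penalty
    "purely pragmatic": 1.0,  # plans the work to meet the estimate
}

def run_postmortems(initial_norm: float, factor: float, projects: int) -> list[float]:
    """Return the norm after each project when postmortems reset it to the actual."""
    norms = [initial_norm]
    for _ in range(projects):
        estimate = norms[-1]
        actual = estimate * factor  # the manager's behaviour, not the work, sets the actual
        norms.append(actual)        # postmortem: the norm follows the actual
    return norms

for manager, factor in BEHAVIOUR.items():
    history = run_postmortems(initial_norm=10.0, factor=factor, projects=5)
    print(f"{manager:<17}", " -> ".join(f"{n:.2f}" for n in history))
```

Starting from the same norm, the three histories drift apart purely because of who ran the projects. Nothing in the loop refers to the true effort the work requires, which is the point of the paradox.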
Paradox of Uncertainty
Uncertainty is inherent in any human activity, with very few exceptions. That is why Murphy's Law ("if something can go wrong, it will") is generally accepted. And of course any project faces some fundamental uncertainties, including technical, timeline, and repeat business variables. Also needing to be considered are variables having to do with cost (or profit, or budget), and quality and reliability.
Planning a project also contains uncertainty, with respect to completeness and stability of requirements; soundness of design; development platform reliability; team attrition; attrition of key personnel at the client; uncertain productivity; and unspecified or under-specified customer expectations.
The following are some of the uncertainties we face when estimating effort and duration (in other words, the schedule): productivity, size definition, team skill level, team effort level, and the paradox of "averages." How is an average defined? There is a plethora of ways, including a simple average, a weighted average (simple or complex), a moving average, a statistical mean, a statistical mode, and a statistical median.
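To see how much the choice of "average" matters, the sketch below applies several of the definitions listed above to one set of actuals. The data, the recency weighting scheme, and the three-project window are all assumptions chosen for illustration.

```python
import statistics

# Hypothetical actual efforts (person-days) for the same kind of task, in completion order.
actuals = [5, 4, 7, 5, 20, 6, 9]

def weighted_average(values, weights):
    """Weighted average; here the weights favour more recent projects."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def moving_average(values, window):
    """Average of the most recent `window` observations."""
    recent = values[-window:]
    return sum(recent) / len(recent)

averages = {
    "simple mean": statistics.mean(actuals),
    "median": statistics.median(actuals),
    "mode": statistics.mode(actuals),
    "recency-weighted": weighted_average(actuals, range(1, len(actuals) + 1)),
    "moving (last 3)": moving_average(actuals, 3),
}
for name, value in averages.items():
    print(f"{name:<17} {value:6.2f}")
```

With a single outlier in the data, the candidate norms already range from 5 to more than 11 person-days, so any estimate built on "the average" inherits whichever definition was chosen.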
How do we compute the probability of success (or risk of failure)? Which of the probability distributions do we use? Normal, binomial, Poisson, beta, gamma, or t-distribution? The general practice is to use normal or beta distributions, but there are concerns about their suitability.
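As one illustration of the general practice mentioned above, the sketch below uses the widely known three-point (PERT) approximation of a beta distribution and then, as a further assumption, treats the result as normally distributed to compute the chance of finishing within a target. The effort figures and the target are hypothetical, and the suitability concerns raised above apply to this sketch as well.

```python
from statistics import NormalDist

def three_point(optimistic: float, most_likely: float, pessimistic: float) -> tuple[float, float]:
    """Classic PERT approximation of a beta-distributed estimate: (mean, std dev)."""
    mean = (optimistic + 4 * most_likely + pessimistic) / 6
    std_dev = (pessimistic - optimistic) / 6
    return mean, std_dev

# Hypothetical three-point effort estimate, in person-days.
expected, spread = three_point(optimistic=20, most_likely=30, pessimistic=52)

# Assumption: treat total effort as normally distributed around the PERT mean.
target = 40
probability = NormalDist(mu=expected, sigma=spread).cdf(target)
print(f"expected {expected:.1f} PD, std dev {spread:.2f} PD")
print(f"P(effort <= {target} PD) = {probability:.1%}")
```

Swapping in a different distribution changes the probability for the same three inputs, which is exactly the uncertainty the questions above point to.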
The purpose of this discussion is to bring to the surface all the paradoxes we face in the software development industry. It is also to focus efforts on resolving these paradoxes and bringing about industry standards. The time is ripe: there are now enough well-understood software size measures. There is also an adequate number of metrics being generated which are available to researchers to analyze.
About the Author
Murali Chemuturi is a fellow of industrial engineering with the Indian Institution of Industrial Engineering. His career has spanned over thirty years of experience with professional organizations, including ECIL, TCS, Metamor, and Satyam. He worked initially in manufacturing, and then in IT. He is presently leading Chemuturi Consultants, focusing on software products for the software development industry. He has conducted a number of in-company training programs for software project management and software estimation. He can be reached at email@example.com.