Introduction

The software industry is one of the largest and most successful industries in history. But software applications are among the most expensive and error-prone manufactured objects in history. Software needs accurate estimates of schedules, costs, and quality. These are hard to achieve if the fundamental metrics are wrong and distort reality. As of 2014 the software industry labors under a variety of non-standard and highly inaccurate measures and metrics, compounded by very sloppy measurement practices. The errors in software measures and metrics cause errors in software estimates. Following are descriptions of the more troubling software metrics topics, in alphabetical order, from my point of view. Two of them are both widespread and troublesome:

  • Cost per defect penalizes quality
  • Lines of code penalize high-level languages and make requirements and design invisible

Troublesome Metrics

Cost per defect metrics penalize quality and make the buggiest software look cheapest. There are no ISO or other standards for calculating cost per defect. Cost per defect does not measure the economic value of software quality. The urban legend that it costs 100 times as much to fix post-release defects as early defects is not true, and is based on ignoring fixed costs. Due to the fixed costs of writing and running test cases, cost per defect rises steadily as fewer and fewer defects are found. This is caused by a standard rule of manufacturing economics: “if a process has a high percentage of fixed costs and there is a reduction in the units produced, the cost per unit will go up.” This explains why cost per defect seems to go up over time even though actual defect repair costs are flat and do not change very much. There are of course very troubling defects that are expensive and time consuming to repair, but these are comparatively rare.
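To make the fixed-cost effect concrete, here is a minimal sketch in Python. The dollar figures and defect counts are purely illustrative assumptions, not measured data:

    def cost_per_defect(fixed_test_cost, repair_cost_each, defects_found):
        # Total cost of a test stage divided by the defects it finds.
        total = fixed_test_cost + repair_cost_each * defects_found
        return total / defects_found

    # Assume writing and running the test cases is a fixed $10,000 per
    # stage, and every defect repair costs a flat $500 (hypothetical).
    stages = [("unit test", 50), ("system test", 10), ("post-release", 2)]
    for stage, defects in stages:
        print(f"{stage}: ${cost_per_defect(10_000, 500, defects):,.0f} per defect")

    # unit test: $700 per defect
    # system test: $1,500 per defect
    # post-release: $5,500 per defect

The repair cost per defect never changed, yet “cost per defect” rose almost eightfold purely because fewer defects were found at each later stage.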

Defect density metrics measure the number of bugs released to clients. There are no ISO or other standards for calculating defect density. One method counts only code defects released. A more complete method includes bugs originating in requirements and design as well as code defects, and also includes “bad fixes,” or bugs in defect repairs themselves. There can be more than a 300% variation between counting only code bugs and counting bugs from all sources.
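A short sketch shows how much the counting rules matter; all of the defect counts below are hypothetical:

    function_points = 1000
    code_defects = 150           # bugs originating in the code itself
    requirements_defects = 90    # bugs originating in requirements
    design_defects = 120         # bugs originating in design
    bad_fixes = 25               # bugs introduced by defect repairs

    narrow = code_defects / function_points
    complete = (code_defects + requirements_defects +
                design_defects + bad_fixes) / function_points

    print(f"code-only density:   {narrow:.3f} defects per function point")
    print(f"all-sources density: {complete:.3f} defects per function point")
    # 0.150 vs. 0.385: the same project reports a density more than 2.5
    # times higher depending solely on which bugs are counted.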

Function point metrics were invented by IBM circa 1975 and placed in the public domain circa 1978. Function point metrics do measure economic productivity using both “work hours per function point” and “function points per month”. They are also useful for normalizing quality data, such as “defects per function point”. However, there are numerous function point variants and they all produce different results: automatic, backfired, COSMIC, fast, FISMA, IFPUG, Mark II, NESMA, unadjusted, etc. There are ISO standards for COSMIC, FISMA, IFPUG, Mark II, and NESMA. However, in spite of the ISO standards, all five produce different counts. Adherents of each function point variant claim “accuracy” as a virtue, but there is no cesium atom or other independent way to ascertain accuracy, so these claims are false. For example COSMIC function points produce higher counts than IFPUG function points for many applications, but that does not indicate “accuracy” since there is no objective way to know accuracy.

Goal/Question metrics (GQM) were invented by Dr. Victor Basili of the University of Maryland. The concept is appealing. The idea is to specify some kind of tangible goal or target, and then think of questions that must be answered to achieve the goal. This is a good concept for all science and engineering, not just software. However, since every company and project tends to specify unique goals, the GQM method does not lend itself to either parametric estimation tools or benchmark data collection. It would not be difficult to meld GQM with function point metrics and other effective software metrics such as defect removal efficiency. For example, several useful goals might be “How can we achieve defect potentials of less than 1.0 per function point?” or “How can we achieve productivity rates of 100 function points per month?” Another good goal, which should actually be a target for every company and every software project in the world, would be “How can we achieve more than 99% in defect removal efficiency?”
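Defect removal efficiency, mentioned in the last goal, is straightforward to compute once defects are tracked. A minimal sketch, assuming the common convention of counting post-release defects for the first 90 days of use; the counts are hypothetical:

    def defect_removal_efficiency(found_before_release, found_after_release):
        # Fraction of total defects removed before the software shipped.
        total = found_before_release + found_after_release
        return found_before_release / total

    # Hypothetical project: 950 defects removed during development and
    # 50 more reported by users in the first 90 days after release.
    print(f"DRE = {defect_removal_efficiency(950, 50):.1%}")  # DRE = 95.0%

A project at 95% DRE would still fall well short of the 99% goal above.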

Lines of code (LOC) metrics penalize high-level languages and make low-level languages look better than they are. LOC metrics also make requirements and design invisible, and they ignore requirements and design defects, which outnumber code defects. There are no ISO or other standards for counting LOC. About half of the papers and journal articles use physical LOC and half use logical LOC; the difference between counts of physical and logical LOC can top 500%. Although there are benchmarks based on LOC, the intrinsic errors of LOC metrics make them unreliable. Due to the lack of standards for counting LOC, benchmarks for the same application produced by different vendors can contain drastically different results.

Story point metrics are widely used for agile projects with “user stories.” Story points have no ISO or other standard for counting. They are highly ambiguous and can vary by as much as 400% from company to company and project to project. There are few useful benchmarks based on story points. Obviously story points can’t be used for projects that don’t utilize user stories, so they are worthless for comparisons against other design methods.

Technical debt is a new metric that is spreading rapidly. It is based on a brilliant metaphor developed by Ward Cunningham. The concept of “technical debt” is that topics deferred during development in the interest of schedule speed will cost more to fix after release than they would have cost initially. However, there are no ISO standards for technical debt and the concept is highly ambiguous; it can vary by over 500% from company to company and project to project. Worse, technical debt does not include all of the costs associated with poor quality and development shortcuts. It omits canceled projects, consequential damages or harm to users, and the costs of litigation for poor quality.

Use case points are used by projects with designs based on “use cases”, which often utilize IBM’s Rational Unified Process (RUP). There are no ISO standards for counting use case points. Use case points are ambiguous and can vary by over 200% from company to company and project to project. Obviously use case points are worthless for measuring projects that don’t utilize use cases, so they have very little benchmark data.

Defining Software Productivity

For more than 200 years the standard economic definition of productivity has been, “Goods or services produced per unit of labor or expense.” This definition is used in all industries, but has been hard to use in the software industry. For software there is ambiguity in what constitutes our “goods or services.”

The oldest unit for software “goods” was a “line of code” or LOC. More recently software goods have been defined as “function points”. Even more recent definitions of goods include “story points” and “use case points”. The pros and cons of these units have been discussed at length in the literature.

Another important topic taken from manufacturing economics has a big impact on software productivity, yet is not well understood even in 2014: fixed costs.

A basic law of manufacturing economics that is valid for all industries including software is the following: “When a development process has a high percentage of fixed costs, and there is a decline in the number of units produced, the cost per unit will go up.”

When a “line of code” is selected as the manufacturing unit and there is a switch from a low-level language such as assembly to a high-level language such as Java, there will be a reduction in the number of units developed.

But the non-code tasks of requirements and design act like fixed costs. Therefore the cost per line of code will go up for high-level languages. This means that LOC is not a valid metric for measuring economic productivity.
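A small sketch makes the distortion visible. The effort and size figures below are illustrative assumptions, but the pattern holds for any fixed non-code cost:

    # Requirements, design, and documentation behave like fixed costs:
    # they do not shrink when a more powerful language is used.
    noncode_hours = 1000

    projects = {
        # language: (lines of code, coding hours) -- hypothetical values
        "assembly": (100_000, 3000),
        "Java": (20_000, 600),
    }

    for language, (loc, coding_hours) in projects.items():
        total_hours = noncode_hours + coding_hours
        print(f"{language}: {total_hours} total hours, "
              f"{total_hours / loc:.3f} hours per LOC")

    # assembly: 4000 total hours, 0.040 hours per LOC
    # Java:     1600 total hours, 0.080 hours per LOC

Java delivers the same application in 40% of the total effort, yet cost per LOC makes it look twice as expensive as assembly.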

For software there are two definitions of productivity that match standard economic concepts:

  1. Producing a specific quantity of deliverable units for the lowest number of work hours.
  2. Producing the largest number of deliverable units in a standard work period such as an hour, month, or year.

In definition 1 deliverable goods are constant and work hours are variable.

In definition 2 deliverable goods are variable and work periods are constant.
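Both definitions can be expressed directly. A minimal sketch with hypothetical project numbers, assuming roughly 132 net work hours per staff month:

    function_points = 500     # the deliverable "goods" (hypothetical)
    work_hours = 4000         # effort expended (hypothetical)
    HOURS_PER_MONTH = 132     # assumed net work hours per staff month

    # Definition 1: goods held constant, work hours vary.
    hours_per_fp = work_hours / function_points
    print(f"{hours_per_fp:.1f} work hours per function point")   # 8.0

    # Definition 2: the work period held constant, goods vary.
    staff_months = work_hours / HOURS_PER_MONTH
    print(f"{function_points / staff_months:.1f} function points per month")  # 16.5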

Defining Software Quality

As we all know, the topic of “quality” is somewhat ambiguous in every industry. Definitions of quality can encompass subjective aesthetic quality as well as precise quantitative units such as numbers of defects and their severity levels.

Over the years the software industry has tried a number of alternative definitions for quality that are not actually useful. For example, one definition of software quality has been “conformance to requirements.”

Requirements themselves are filled with bugs or errors that comprise about 20% of the overall defects found in software applications. Defining quality as conformance to a major source of errors is circular reasoning and clearly invalid. We need to include requirements errors in our definition of quality.

Another definition for quality has been “fitness for use.” But this definition is ambiguous and cannot be predicted before the software is released, or even measured well after release.

Another definition of software quality has been a string of words ending in “…ility”, such as reliability and maintainability. However laudable these attributes may be, they are all ambiguous and difficult to measure. Further, they are hard to predict before applications are built.

An effective definition for software quality that can be both predicted before applications are built and then measured after applications are delivered is: “Software quality is the absence of defects which would either cause the application to stop working, or cause it to produce incorrect results.”

Because delivered defects impact reliability, maintainability, usability, fitness for use, conformance to requirements, and customer satisfaction, any effective definition of software quality must recognize the central importance of achieving low volumes of delivered defects. Software quality is impossible without low levels of delivered defects, no matter what definition is used.
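This definition can also be estimated before release. A minimal sketch combining “defect potential” (total defects from all sources per function point) with defect removal efficiency; the input values are illustrative assumptions:

    function_points = 1000
    defect_potential_per_fp = 4.0   # assumed defects from all sources per FP
    dre = 0.99                      # assumed defect removal efficiency

    total_defects = function_points * defect_potential_per_fp   # 4000
    delivered = total_defects * (1 - dre)                       # 40

    print(f"delivered defects: {delivered:.0f} "
          f"({delivered / function_points:.2f} per function point)")

Under these assumptions, raising DRE from 99% to 99.9% would cut delivered defects from 40 to 4, which is why removal efficiency dominates delivered quality.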

 

About the author:

Capers Jones is CTO of Namcook Analytics, a company that builds advanced risk, quality, and cost estimation tools. This blog post was originally published on the Namcook Analytics blog.

Blog posts represent the personal views of the author and do not necessarily reflect official NESMA policy.

1 Comment

  1. Jean-Pierre Fayolle says:

    Always nice to read something from Capers.

    Just one comment about Technical Debt: the fact that it is not an ISO standard does not mean it is not usable, and some people have built a good methodology around it: http://www.sqale.org/

    I have been using it to define refactoring plans and to estimate the effort of refactoring a legacy C application (http://qualilogy.com/blog/legacy-application-refactoring-sqale-plugin-2/).

    It’s true that everybody has their own way to measure technical debt, so you will get different results from different software vendors. In my opinion, this is a practical tool for a project team to verify there is no big drift with each new release. It can also be helpful for management when properly used and explained. Don’t use the numbers out of the box.
