I've had some interesting conversations recently about software development metrics, in particular how they can be used in a reasonably large organisation to help development teams work better. I know there have been Stack Overflow questions about which metrics are good to use - like this one, but my question is more about which metrics are useful to which stakeholders, and at what level of aggregation.
As an example, my view is that code coverage is a useful metric in the following ways (and maybe others):
But I don't think it's useful for senior management to see this on a team-by-team basis, as this encourages artifical attempts to bolster coverage with tests that merely exercise, rather than test, code.
I'm in an organisation with a couple of levels in its management hierarchy, but where the vast majority of managers are technically minded and able (with many still getting their hands dirty). Some of the development teams are leading the way in driving towards agile development practices, but others lag, and there is now a serious mandate from the top for this to be the way the organisation works. A couple of us are starting a programme to encourage this. In this sort of an organisation, what sort of metrics do you think are useful, to whom, why, and at what level of aggregation?
I don't want people to feel their performance is being assessed based on a metric that they can artificially influence; at the same time, the senior management are going to want some sort of evidence that progress is being made. What advice or caveats can you provide based on experience in your own organisations?
EDIT
We are definitely wanting to use metrics as a tool for organisational improvement not as a tool for individual performance measurement.
What are Software Metrics? Software development metrics are quantitative measurements of a software product or project, which can help management understand software performance, quality, or the productivity and efficiency of software teams.
The researchers have determined that only four key metrics differentiate between low, medium and high performers: lead time, deployment frequency, mean time to restore (MTTR) and change fail percentage.
Software metrics are related to the four functions of management: Planning, Organization, Control, or Improvement.
Metric Classification Software metrics can be divided into two categories; product metrics and process metrics.
A tale from personal experience. Apologies for the length.
A few years ago our development group tried setting "proper" measurable objectives for individuals and team leaders. The experiment lasted for just one year, because hard metrics didn't really work very well for individual objectives (see my question on the subject for some links and further discussion).
Note that I was a team leader, and involved in planning it all with my technical boss and the other team leaders, so the objectives weren't something dictated from on high by clueless upper management -- at the time we really wanted them to work. It is also worth noting that the bonus structure inadvertently encouraged competition between developers. Here are my observations on the things we tried.
Customer-visible issues
In our case, we counted outages on the service we provided to customers. In a shrink-wrapped product it might be the number of bugs reported by customers.
Advantages: This was the only real measure that was visible to upper management. It was also the most objective, being measured outside the development group.
Disadvantages: There weren't that many outages -- just around one per developer for the whole year -- which meant that failing or exceeding the objective was a matter of "pinning blame" for the few outages that did occur in each team. This led to bad feeling and loss of morale.
Amount of work completed
Advantages: This was the only positive measure. Everything else was "we notice when bad things happen," which was demoralising. Its inclusion was also necessary because, without it, a developer who did nothing all year would exceed all the other objectives, which clearly wouldn't be in the interests of the company. Measuring the amount of work completed checked the natural optimism of developers when estimating task size, which was useful.
Disadvantages: The measure of "work completed" was based on estimates provided by the developers themselves (usually a good thing), but making it part of their objectives encouraged gaming of the system to inflate estimates. We had no other viable measure of work completed: I think the only possible valuable way of measuring productivity is "impact on the company bottom line," but most developers are so far removed from direct sales that this is rarely practical at an individual level.
Defects found in new production code
We measured defects introduced into new production code during the year, as it was felt that bugs from previous years should not count against any individual in this year's objectives. Defects spotted by internal quality teams were included in the count even if they didn't impact customers.
Advantages: Surprisingly few. The time lag between the introduction of a defect and its discovery meant that there was really no immediate feedback mechanism to improve code quality. Macro trends at a team level were more useful.
Disadvantages: There was a heavy focus on the negative, since this objective was only invoked when a defect was found and we needed someone to blame for it. Developers were reluctant to record defects they found themselves, and a simple count meant that minor bugs were as bad as severe problems. Since the number of defects per individual was still quite low, the number of minor and severe defects didn't even out as it might with a larger sample. Old defects were not included, so the group's reputation for code quality (based on all bugs found) did not always match the measurable introduced-this-year count.
Timeliness of project delivery
We measured timeliness as the percentage of work delivered to internal QA teams by the stated deadline.
Advantages: Unlike counting defects, this was a measure that was under immediate, direct control of the developers, as they effectively decided when the work was complete. The presence of the objective focused the mind on completing tasks. This helped the team commit to realistic amounts of work, and improved the perception by internal customers of the development group's ability to deliver on promises.
Disadvantages: As the only objective directly under the developers' control, it was maximised at the expense of code quality: on the day of a deadline, given the choice between saying a task is complete or doing further testing to improve confidence in its quality, the developer would choose to mark it complete and hope any resulting bugs never come to the surface.
Complaints from internal customers
To gauge how well developers communicated with internal customers during development and subsequent support of their software, we decided that the number of complaints received about each individual would be recorded. The complaints would be validated by the manager, to avoid any possible vindictiveness.
Advantages: Really nothing I can recall. Measured at a sufficiently large group level it becomes a more useful "customer satisfaction" score.
Disadvantages: Not only highly negative, but also a subjective measure. As with other objectives, the numbers for each individual were around the zero mark, which meant that a single comment about someone could mean the difference between "infinitely exceeded" and "did not meet".
General comments
Bureaucracy: While our task management tools held much of the data for these metrics, there was still quite a lot of manual effort involved to collate it all. The time spent obtaining all the numbers was not enjoyable, generally focused on negative aspects of our work and may not even have been reclaimed by increased productivity.
Morale: For the measures where individuals were blamed for problems, not only did those with "bad" scores feel demotivated, but so did those with "good" scores, as they didn't like the loss in team morale and sometimes felt they were ranked higher not because they were better but because they were luckier.
Summary
So what did we learn from the episode? In later years we tried to re-use some of the ideas but in a "softer" way, where there was less emphasis on individual blame and more on team improvement.
All of this doesn't help when you are required to set measurable objectives for individual developers, but hopefully the ideas will be more useful for team improvement.
The key thing about metrics is knowing what you are using them for. Are you using them as a tool for improvement, a tool for reward, a tool for punishment, etc. It sounds like you're planning to use them as a tool for improvement.
The number one principle when setting metrics is to keep the information relevant so that the person receiving it can use it to make a decision. Most likely a senior manager cannot dictate the micro level of whether you need more tests, less complexity, etc. But a team leader can do that.
Therefore, I don't believe a measure of code coverage is going to be useful to management beyond the individual team. At the macro level, the organisation is probably interested in:
Internal quality won't be high on their list of things to cover off. It's a development team's mission to make it clear that internal quality (maintainability, test coverage, self-documenting code, etc) is a key factor in achieving the other three.
Therefore you should target metrics to more senior managers which cover off those three such as:
And measure things like code coverage, code complexity, cut 'n' paste score (code repetition using flay or similar), method length, etc at a team level where the recipients of the information can really make a difference.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With