Thursday, June 17, 2010

Science Metrics

Nature has a very interesting News Feature on metrics for scientific achievement, titled Metrics: Do metrics matter? The use of scientific metrics is a recurring theme on this blog. I wrote about it most recently in my post Against Measure.

My main criticism of science metrics is that they divert researchers' interests. It is what I refer to as a deviation from primary goals to secondary criteria. Here, the primary goal is good research. The secondary criteria are measures that, for whatever reason, are thought to be relevant quantifiers for the primary goal. The problem is that, even if the secondary criteria initially had some relevance, their implementation inevitably affects researchers' own assessment of what success means and leads them to strive for the secondary criteria rather than the primary goal. With that, the secondary criteria become less and less useful since they are being pursued as ends in themselves. Typical example: number of publications. In principle it is not a completely useless criterion to assess a researcher's productivity. But it becomes increasingly less useful the more tricks scientists pull to increase the number of publications instead of focusing on the quality of their research.

Note that for a deviation of interests to happen it is not necessary that the measures are actually used! It is only relevant that researchers believe they are used. It's a sociological effect. You can cause such beliefs simply by talking a lot about science metrics. The better known a measure is, the more likely people are to believe it has some relevance. It is a well known fact about human psychology that people pay attention to what they hear repeatedly.

Now Nature did a little poll asking readers how much they believe science metrics are used at their institution for various purposes. 150 readers responded; the results are available here. They then contacted scientists in administrative positions at nearly 30 research institutions around the world and asked them what metrics are being used, and how heavily they are relied on. In a nutshell the administrators claim that metrics are being used much less than scientists believe they are.

"The results suggest that there may be a disconnect between the way researchers and administrators see the value of metrics."
While this is an interesting suggestion, it is not much more than a suggestion. It is entirely unclear whether the sample of people who replied to the poll had a good overlap with the sample of administrators being asked. With such a small sample size, the distribution of people in both groups over countries matters significantly. It remained unclear to me from the article whether, in contacting institutes, they made sure that the representation of countries is the same as that of the poll's participants, and also whether the distribution of research fields is the same. If not, the mismatch between the administrators and the researchers might simply show national differences or differences between fields of research. Also, it is conceivable that people who filled out the questionnaire had some concerns about the topic to begin with, while this would not have been the case for the people contacted. It did not become clear to me how the poll was publicized.

In any case, given what I said earlier, we should of course appreciate the suggestion of these results. Please do not believe that science metrics matter for your career!

19 comments:

Jorge Pullin said...

As a friend who is a professor of music told me: "I discovered my h is zero, what will happen to me?"

Bee said...

Well, with h=0, there's no uncertainty ;-)

Daniel Lemire said...

Similarly, we tend to want to earn more money, more and more. Even though it has been shown, repeatedly, that excess money does not bring happiness.

Bee said...

Hi Daniel,

Yes, that's a similar divergence between primary goals (happiness) and secondary criteria (wealth). Though in this case there's an additional effect: research has also shown that happiness does depend on relative wealth, ie it matters how much you have compared to other people. If you combine that with the increasingly better global connectivity allowing you to compare yourself to literally everybody, together with there being very few very rich people, you create a situation where everybody constantly strives to outperform most other people.

Come to think of it, now I'm wondering if maybe a similar effect plays a role with scientific metrics? In the sense that people's own sense of achievements would depend on what others around them have achieved, ie the relative performance rather than the absolute one? It would mean that it's not actually funding agencies or administrative procedures that cause the running in the hamster wheel, but instead it would be a self-created peer pressure effect. Best,

B.

Kay zum Felde said...

Hi Bee,

Who will say that some work is really good if it is still a hypothesis? And that is what should be counted: good work.

Best, Kay

Uncle Al said...

Management is process not product. Management is a stomach: it has no brain, it knows it is hungry, and the inevitable results are somebody else's problem. If you want to find the bottleneck, the first place to look is at the top of the bottle. The harder management squeezes the less juice comes out. Managers make decisions, workers make mistakes.

http://www.mazepath.com/uncleal/bp.jpg

"A large windmill requires about 538 pounds of neodymium to make the formable permanent magnets." "Each windmill magnet is about the size of a car engine and uses 560 pounds of neodymium." "a single 3-megawatt windmill requires more than 700 pounds of the Rare Earth metal neodymium." Let's save the Earth with legislation and Federal subsidies while resetting fossil fuel combustion to 1960s' levels during global economic recession. Then legislate the magic appearance of neodymium from other than Chinese mines. We can manage our way out of any challenge. Victory in Afghanistan next business quarter!

Bee said...

Hi Kay,

My point of view is simply that every scientist has an opinion on what good research is and the collection of all these (experts') opinions is the only relevant criterion to assess the quality of a researcher or his/her work. Everything else is a distortion. (This is basically what I wrote in more detail in my post We have only ourselves to judge each other.) The quality of a work can be assessed (and very often is assessed) even if it is still on the level of a hypothesis. That is one of the main purposes of peer review to begin with. A scientific hypothesis or a prediction can be first of all interesting, well argued, based on a consistent model, etc. Also relevant is a thorough scan of the literature to see whether known problems with such an idea have been discussed before. These are all points, btw, that most people with their home-made theories are completely oblivious to (the most common problem is that their great idea was tried a century ago and failed, but they never read a textbook past Einstein). There are arguably big and very apparent differences in research quality. Just putting forward some hypothesis isn't going to impress anybody; you need to come up with a good justification why this is a reasonable or even probable thing to expect and thus worth looking into. Best,

B.

jay said...

Bee, I don't know what the answer is to the issue you raised here. I guess you would feel this trend is inevitable in some sense. In first-tier institutions like MIT, Stanford, or Kyoto you don't need to look at quantitative data to make decisions on hiring or promotion since it is very easy to identify a small number of big fish. The situation is not quite so in 2nd- or 3rd-tier institutions. When you have a lot of fish of seemingly similar sizes, what would you do? Just throw dice? This is where the apparently reasonable, but deep-down meaningless, criteria for making decisions come in. You make a spreadsheet file, and quantify everything. Then we say we have a nice evaluation system. This data-driven evaluation system keeps us busy with not much meaningful work.

In my opinion it would not make much difference what kind of evaluation system we have. The important thing is that we do not lose some potentially big fish (talented mavericks) while constructing this petty evaluation system. Though I was a bit cynical in the above, personally I really hate this trend of spreadsheet evaluation systems!

Pmer said...

Belief in these metrics contributes to groupthink.

Steven Colyer said...

I find the word "metric" to be strange. When did it come into vogue? I heard it for the first time 2 months ago, indeed I first heard it used here, by Bee. And now ... I hear it everywhere! Argh.

It means "measurement", right? Is measurement such a bad word? Why not just say "measurement?" Oh right, one extra syllable. God forbid we offend the Americans, with their less-is-more on-going destruction of Jolly Olde England's wonderful language.

It sounds like a nouveau-business buzzword, like that most hated of all business buzzwords: "synergy", with "global" being a close second. God, I hate those words. I'm starting to feel the same way about "metric." Pfft.

More on topic, "metrics" means that computers can fire people, possibly for the wrong reasons, depending on the person who set up the system. Is that good?

Bee said...

Steven:

The word is very appropriate. A metric is not a measurement, it is a device that tells you how to make a measurement. On a manifold, it's a symmetric bilinear form (that's a thing with two indices where you can exchange the indices) that makes a space a metric space. What is the length of a vector x_a? Most people I guess would say it's |x| = \sqrt{\delta^{ab} x_a x_b}. That however is only true in flat Euclidean space. More precisely, you have no clue what the length of a vector is unless you've been given a metric. In Minkowski space the length would be \sqrt{\eta^{ab} x_a x_b}, and in curved space more generally \sqrt{g^{ab} x_a x_b}. Here, \delta^{ab}, \eta^{ab} and g^{ab} respectively are the metric. (I know that blogger doesn't compile LaTeX, I hope you can read it, otherwise look it up in a LaTeX manual). Best,

B.
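To make the point about metrics concrete, here is a minimal sketch in Python of how the same vector gets a different squared length depending on which metric you are handed. Everything in it is an illustrative assumption rather than anything from the comment above: the function name `squared_length`, the restriction to two dimensions, and the (-, +) sign convention for the Minkowski metric. It returns the squared length rather than the square root, since with a Minkowski metric that quantity can be negative.

```python
def squared_length(metric, x):
    """Return the sum over a, b of metric[a][b] * x[a] * x[b]."""
    n = len(x)
    return sum(metric[a][b] * x[a] * x[b] for a in range(n) for b in range(n))


# Flat Euclidean metric (the Kronecker delta) in 2 dimensions.
delta = [[1, 0],
         [0, 1]]

# Minkowski metric in 1+1 dimensions; the (-, +) signature is an assumed convention.
eta = [[-1, 0],
       [0, 1]]

x = [3, 4]

euclidean_sq = squared_length(delta, x)  # 3*3 + 4*4
minkowski_sq = squared_length(eta, x)    # -(3*3) + 4*4
```

The components of `x` are the same in both calls; only the metric changes, and with it the answer to "how long is this vector?".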

Steven Colyer said...

Mmm, "Minkowski Space." I think that's my favorite "space" (after Phase Space, of course).

Thanks for reminding us, Bee.

:-)

Bee said...

Jay,

The problem with evaluating all the fish with one fast and easy evaluation system is centralization and streamlining. The more people use the same system, the more likely it becomes that everybody will do the same research with the same methods. The obvious way to work against this is localization and deliberate heterogeneity. Clearly, the few top institutions will snap up talented people worldwide because they can quite literally afford to buy them. But then, as you correctly pointed out, for the few really top people you don't need any metric anyway, it's obvious. The question is what the bunch of high quality places that are not exactly the top few should do.

I think if they all fish in the same pond with the same rod, that's the dumbest thing they can do. They will all end up doing the same, which doesn't only hinder progress, it also wastes resources on unnecessarily high competition. What they should do is confine their pond to a small sample and do a personal evaluation. The question is of course how to make the pond smaller. The most obvious way would be for candidates not to apply for several hundred places at once; then these places wouldn't be faced with too many applications to actually read. The less obvious way is for different places to show some character and put an emphasis on slightly different criteria. Take PI as an example. They (try to) put an emphasis on independent work rather than a lot of work. One can debate how useful that criterion is (I have my reservations) but that's not the point. The point is that they are doing something different and thus preserve diversity. The problem that I see arises when everybody applies the same criteria. This corresponds to an overconfidence in the accuracy of the measures one has used and can have quite disastrous long-term consequences. Best,

B.

Phil Warnell said...

Hi Bee,

I see once again we are discussing the merits of metrics when it comes to evaluating quality in scientific research. The strange thing of course is that the problem doesn’t rest with what represents good science, as that’s decided by the method itself. The problem is what represents a good scientist, which is more a matter of discovering intent, as outcome can never be guaranteed. I equate this with the difference in industry between what’s known as quality control and quality assurance. Quality control serves to give one standards that things are expected to meet, only so as to consider whether they are good enough for the end user’s purpose, while quality assurance is a more general commitment to accomplishing this, one that is only successful if those involved actually care.

So then it comes down to how we decide who cares or not and, beyond this, whether there is any way we might get more people to care. If looked at this way, science has more semblance to religion than it would often be comfortable to admit. The irony of course is that religion has similar problems, with many believing metrics alone can accomplish the goal, like having morality and conduct codified as it is with the Ten Commandments in the Judaeo-Christian tradition.

The thing is these at best serve as metrics like those found in quality control, which addresses only how quality might be recognized, yet has little utility in having it assured. My point then is that goodness (quality) in science, as in religion, can only be assured by its practitioners when it is appreciated and understood that it is beneficial not only as self-enlightenment, but that its practice also benefits all, even if many of those it serves have found no reason to care; as truth in and of itself has intrinsic value, found not only in its discovery but also in its pursuit.

Best,

Phil

Tim van Beek said...

All companies have the same problem: they have to evaluate their employees on a regular basis. Some tried metrics in the sense that the managers had to fill out a form for every employee that they were responsible for. This was not very successful, because one cannot characterize the performance of a human being with a fixed set of numbers or degrees. People have different strengths and weaknesses and different approaches to problems. As far as I know nobody uses a system like that anymore (of course I know only about some of the millions of companies :-). Some use a system where boss and employee write down a target agreement for a year and that becomes a part of the evaluation.

Also, metrics are used for software, e.g. the ratio of comment lines to code lines (the higher the better). The consensus today is that some of these metrics give useful additional information, but that overall code quality can only be judged by programmers who review the code.
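As an illustration of the kind of software metric mentioned here, a rough sketch of a comment-to-code ratio in Python follows. It is a deliberately crude assumption-laden version, not any particular tool's definition: the name `comment_ratio` is made up, only lines starting with `#` count as comments, blank lines are ignored, and docstrings or trailing comments are not handled.

```python
def comment_ratio(source):
    """Ratio of comment lines to code lines; blank lines are ignored."""
    comment_lines = 0
    code_lines = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines entirely
        if stripped.startswith("#"):
            comment_lines += 1
        else:
            code_lines += 1
    if code_lines == 0:
        # No code at all: the ratio is undefined, so report infinity or zero.
        return float("inf") if comment_lines else 0.0
    return comment_lines / code_lines


sample = """# add two numbers
def add(a, b):
    return a + b
"""

ratio = comment_ratio(sample)  # one comment line, two code lines
```

A number like this is easy to compute and easy to game (just add filler comments), which is one reason a review by people who read the code is still needed.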

bmaher.sciwriter said...

Hi,
As one of the editors in charge of developing the poll and writing up our analysis of it, I really appreciate you taking notice. You are absolutely right to be cautious about the results of our poll as it was a small sample of largely self-selecting respondents. Still, I think there are important suggestions within. I hope your blog continues the discussion.