Sunday, August 25, 2013

Can we measure scientific success? Should we?

My new paper.
Measures for scientific success have become a hot topic in the community. Many scientists have spoken out about the increasingly widespread use of these measures. They largely agree that the attempt to quantify, even predict, scientific success is undesirable if not outright flawed. In this blog’s archive, you’ll find me banging the same drum.

Scientific quality assessment, so the argument goes, can’t be left to software crunching data. An individual’s promise can’t be summarized in a number. Success can’t be predicted from past achievements; just look at all the historical counterexamples. Already Einstein said. I’m sure he said something.

I’ve had a change of mind lately. I think science needs measures. Let me explain.

The problem with measures for scientific success has two aspects. One is that measures are used by people outside the community to rank institutions or even individuals for justification and accountability. That’s problematic because it’s questionable whether this leads to smart research investments, but I don’t think it’s the root of the problem.

The aspect that concerns me more, and that I think is the root of all evil, is that any measure for success feeds back into the system and affects the way science is conducted. The measure will be taken on by the researchers themselves. Rather than defining success individually, scientists are then encouraged to work towards an external definition of scientific achievement. They will compare themselves and others on these artificially created scales. So even if a quantifiable marker of scientific output was once an indicator for success, its predictive power will inevitably change as scientists work specifically towards it. What was meant to be a measure instead becomes a goal.

This has already happened in several cases. The most obvious examples are the number of publications or the number of research grants obtained. On average, both are plausibly correlated with scientific success. And yet a scientist who increases her paper output doesn’t necessarily increase the quality of her research, and employing more people to work on a certain project doesn’t necessarily mean its scientific relevance increases.

A correlation is not a causation. If Einstein didn’t say that, he should have. And another truth that comes courtesy of my grandma is that too much of a good thing can be a bad thing. My daughter reminds me we’re not born with that wisdom. If sunlight falls on my screen and I close the blinds, she’ll declare that mommy is tired. Yesterday she poured a whole bottle of body lotion over herself.

Another example comes from Lee Smolin’s book “The Trouble with Physics”. Smolin argued that the number of single authored papers is a good indicator for a young researcher’s promise. He’s not alone in this belief. Most young researchers are very aware that a single authored paper will put a sparkle on their publication list. But maybe a researcher with many single authored papers is just a bad collaborator.

Simple measures, too simple measures, are being used in the community. And this use affects what researchers strive for, distracting them from their actual task of doing good research.

So, yes, I too dislike attempts to measure scientific success. But if we all agree that it stinks, why are we breathing the stink? Why are not only funding agencies and other assessment ‘exercises’ using these measures, but scientists themselves as well?

Ask any scientist if they think the number of papers shows a candidate’s promise and they’ll probably say no. Ask if they think publications in high impact journals are indicators for scientific quality and they’ll probably say no. Look at what they do, however, and the length of the publication list and the occurrence of high impact journals on that list are suddenly remarkably predictive of their opinion. And then somebody will ask for the h-index. The very reason that politically savvy researchers tune their score on these scales is that, sadly, it does matter. Analogies to natural selection are not coincidental. Both are examples of complex adaptive systems.
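The h-index asked for here is at least well defined: it is the largest number h such that the author has h papers each cited at least h times. A minimal sketch of the computation, with made-up citation counts:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(cites, start=1):
        if count >= rank:
            h = rank  # the rank-th most cited paper still has >= rank citations
        else:
            break
    return h

# Five papers with 10, 8, 5, 4, 3 citations: four papers have >= 4 citations.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

Note how crude the measure is: the first example would keep its h-index of 4 even if the top paper had a thousand citations instead of ten.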

The reason for the widespread use of oversimplified measures is that they’ve become necessary. They stink, all right, but they’re the smallest evil among the options we presently have. They’re the least stinky option.

The world has changed and the scientific community with it. Two decades ago you’d apply for jobs by carrying letters to the post office, grateful for the sponge so you wouldn’t have to lick all those stamps. Today you apply by uploading application documents within seconds all over the globe, and I'm not sure they still sell lickable stamps. This, together with increasing mobility and connectivity, has greatly inflated the number of places researchers apply to. And with that, the number of applications every place gets has skyrocketed.

Simplified measures are being used because it has become impossible to actually do the careful, individual assessment that everybody agrees would be optimal. And that has led me to think that instead of outright rejecting the idea of scientific measures, we have to accept them, improve them, and make them useful to our needs, not to those of bean counters.

Scientists, in hiring committees or on some funding agency’s review panel, have needs that presently just aren’t addressed by existing measures. Maybe one would like to know the overlap of a candidate’s research topics with those represented at a department. How often have they been named in acknowledgements? Do you share common collaborators? What administrative skills does the candidate bring? Is there somebody in my network who knows this person and could give me a firsthand assessment? Do they have experience with conference organization? What’s their h-index relative to the typical h-index in their field? What would you like to know?
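The last of these questions, the field-relative h-index, is easy to sketch: divide a candidate's h-index by the median h-index of their field, so that a theorist and an experimentalist in very differently cited fields become roughly comparable. The function name and the example numbers are hypothetical:

```python
from statistics import median

def relative_h_index(h, field_h_values):
    """A candidate's h-index divided by the median h-index in their field."""
    return h / median(field_h_values)

# A candidate with h = 12 in a field where sampled h-indices are
# [4, 8, 10, 20, 30] (median 10) scores 1.2, i.e. 20% above typical.
print(relative_h_index(12, [4, 8, 10, 20, 30]))  # 1.2
```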

You might complain these are not measures for scientific quality, and that’s correct. But science is done by humans. These aren’t measures for scientific quality; they’re indicators for how well a candidate might fit an open position and a new environment. And that, in turn, is relevant for both their success and that of the institution.

Today, personal relations are highly relevant for successful applications. That is a criterion being used in the absence of better alternatives. We can improve on that by offering possibilities to quantify, for example, the vicinity of research areas. This can provide a fast way to identify interesting candidates one might not have heard of before.
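One simple possibility for quantifying the vicinity of research areas, purely as an illustration, is the Jaccard overlap between the keyword sets of a candidate and a department; the keywords below are made up:

```python
def topic_overlap(topics_a, topics_b):
    """Jaccard similarity between two sets of research keywords: 0 (disjoint) to 1 (identical)."""
    a, b = set(topics_a), set(topics_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

candidate = {"quantum gravity", "phenomenology", "cosmology"}
department = {"cosmology", "astrophysics", "quantum gravity", "strings"}
# 2 shared topics out of 5 distinct ones:
print(round(topic_overlap(candidate, department), 2))  # 0.4
```

A real implementation would of course need a controlled vocabulary (for example, arXiv categories) rather than free-form keywords, but even this crude version would let a committee rank hundreds of applicants by topical proximity in seconds.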

And so I think “Can we measure scientific success?” is the wrong question to ask. We should ask instead what measures serve scientists in their profession. I’m aware there are by now several alt-metrics on offer, but they don’t address the issue; they merely take into account more data sources to measure essentially the same thing.

That concerns the second aspect of the problem, the use of measures within the community. As for the first aspect, the use of measures by accountants who are not scientists themselves: The reason they use certain measures for success or impact is that they believe scientists themselves regard them as useful. Administrators use these measures simply because they exist and because scientists, lacking better alternatives, draw upon them to justify and account for their success or that of their institution. If you have argued that the value of your institute lies in the number of papers produced or conferences held, in the number of visitors pushed through or distinguished furniture bought, you’ve contributed to that problem. Yes, I’m talking about you. Yes, I know not using these numbers would just make matters worse. That’s my point: They’re a bad option, but still the best available one.

So what to do?

Feedback in complex systems and network dynamics have been studied extensively during the last decade. Dirk Helbing recently had a very readable brief review in Nature (pdf here) and I’ve tried to extract some lessons from it.
  1. No universal measures.
    Nobody has a recipe for scientific success. Picking a single measure bears a great risk of failure. We need a variety so that the pool remains heterogeneous. There is a trend towards standardized measures because people love ordered lists. But we should have a large number of different performance indicators.
  2. Individualizable measures.
    It must be possible to individualize measures, so that they can take into account local and cultural differences as well as individual opinions and different purposes. You might want to give importance to the number of single authored papers. I might want to give importance to science blogging. You might think patents are of central relevance. I might think a long-term vision is. Maybe your department needs somebody who is skilled in public outreach. Somebody once told me he wouldn’t hire a postdoc who doesn’t like Jazz. One size doesn’t fit all.
  3. Self-organized and networked solutions.
    Measures should take into account locations and connections in the various scientific networks, be it social networks, coauthor networks, or networks based on research topics. If you’re not familiar with somebody’s research, can you find somebody you trust to give you a frank assessment? Can I find a link to this person’s research plans?
  4. No measure is ever final.
    Since the use of measures feeds back into the system, they need to be constantly adapted and updated. This should be a design feature and not an afterthought.
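Point 2, individualizable measures, can be made concrete with a tiny sketch: each committee supplies its own weights over whatever indicators it happens to value, and the same candidate profile then scores differently depending on who is looking. All indicator names and numbers below are hypothetical:

```python
def weighted_score(indicators, weights):
    """Combine performance indicators with user-chosen weights.

    Weights need not cover every indicator; unweighted ones count zero,
    so each committee can emphasize exactly what it values.
    """
    return sum(weights.get(name, 0.0) * value
               for name, value in indicators.items())

candidate = {"single_author_papers": 3, "outreach_talks": 5, "patents": 1}
# Two hiring committees, two different priorities:
dept_a = {"single_author_papers": 2.0, "patents": 1.0}   # values independence
dept_b = {"outreach_talks": 1.5}                          # values outreach
print(weighted_score(candidate, dept_a))  # 7.0
print(weighted_score(candidate, dept_b))  # 7.5
```

The point of the sketch is that the ranking of candidates is not fixed by the data alone; it follows from the weights, which is exactly why no single universal ordering should be imposed.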
Some time between Pythagoras and Feynman, scientists had to realize that it had become impossible to check the accuracy of all experimental and theoretical knowledge that their own work depended upon. Instead they adopted a distributed approach in which scientists rely on the judgment of specialists for topics in which they are not specialists themselves; they rely on the integrity of their colleagues and the shared goal of understanding nature.

If humans lived forever and were infinitely patient, then every scientist could track down and fact-check every detail that their work makes use of. But that’s not our reality. The use of measures to assess scientists and institutions represents a similar change towards a networked solution. Done the right way, I think that measures can make science fairer and more efficient.

20 comments:

Captain InterStellar said...

Hi Sabine,

Interesting post. I think there can be a problem with measuring scientific success if one goes by the number of papers published; quality is not measured by quantity. Let's say an institution published 100 papers which can't be verified by experiment. Should this institution get taxpayer funding? What benefit is there to science if none of the theoretical work can be verified?

I would think that scientific success is measured by how much real positive impact the research has had, or can have, on our understanding of the Universe, or how much real life-changing benefit it has brought, or can bring, to people or how we do things. If the theoretical research is of high quality but can't be verified by experiment because of a technology issue, then it needs to be shelved and its scientific success determined later, when the time is right.


Cheers, Paul.

Uncle Al said...

Hectares of scholarly analysis paper the Sistine Chapel ceiling. Sepia light and shadow, characters' allegorical political intrigues are exhaustively explored. Soot removal exposed comic book-bright pigmentation. Global gender misassignments hid under post-painting censorship. The analyses are wrong.

Physics observes vacuum mirror-symmetry toward photons. Theory assumes vacuum mirror-symmetry toward matter. Theory suffers parity violations and chiral anomalies unendingly patched with symmetry breakings. An axiomatic system must be externally falsified, even if incomplete (Euclid vs. Bolyai) or unphysical (Newton vs. relativity and quantum mechanics). Theory that contradicts observation is wrong.

Opposite shoes violate the Equivalence Principle. Five distinct classes of experiment are exquisitely sensitive to trace vacuum chiral anisotropy toward matter, where physics never looked. A microwave chiral rotational spectrum experiment is one day. String/M-theory, SUSY, and dark matter cannot endure a tenth part-per-trillion divergence. How does one rate epicycles? Workmanship?

Phil Warnell said...
This comment has been removed by the author.
Phil Warnell said...

Hi Bee,

There is a lot I find good in what you say here and with which I agree. However, I think one of the things you seem to be skirting around, and which is also the most difficult (if not impossible) of all things to distinguish by way of measure, is what has people become scientists to begin with. I would say that the scrutiny of one’s work on its own, especially when a scientist is just beginning, is a completely unreliable measure, and thus why what it is that has brought someone to science should be given greater weight than it currently is. That is, I think the character of the researcher to be important, as when that is found to be right it speaks as well to the things many of the metrics attempt to measure and are yet found wanting in capturing. Perhaps interestingly in this regard, Einstein did have something to say, and yet he too was left somewhat baffled as to how to define exactly what it was or imagine how it could be quantified, and yet still found he recognized it whenever it was present.


” In the temple of science are many mansions, and various indeed are they that dwell therein and the motives that have led them thither. Many take to science out of a joyful sense of superior intellectual power; science is their own special sport to which they look for vivid experience and the satisfaction of ambition; many others are to be found in the temple who have offered the products of their brains on this altar for purely utilitarian purposes. Were an angel of the Lord to come and drive all the people belonging to these two categories out of the temple, the assemblage would be seriously depleted, but there would still be some men, of both present and past times, left inside. Our Planck is one of them, and that is why we love him.

I am quite aware that we have just now lightheartedly expelled in imagination many excellent men who are largely, perhaps chiefly, responsible for the buildings of the temple of science; and in many cases our angel would find it a pretty ticklish job to decide. But of one thing I feel sure: if the types we have just expelled were the only types there were, the temple would never have come to be, any more than a forest can grow which consists of nothing but creepers. For these people any sphere of human activity will do, if it comes to a point; whether they become engineers, officers, tradesmen, or scientists depends on circumstances.

Now let us have another look at those who have found favor with the angel. Most of them are somewhat odd, uncommunicative, solitary fellows, really less like each other, in spite of these common characteristics, than the hosts of the rejected. What has brought them to the temple? That is a difficult question and no single answer will cover it.”


-Albert Einstein, “Principles of Research”, Address at the Physical Society, Berlin, for Max Planck's 60th birthday (1918)


Regards,

Phil

scimom said...

The most crucial observation, which is however not addressed in this blogpost, is that there is no evidence the measure matters.

I come from HEP, where I had a chance to meet hundreds of people, many of whom I got to know well in person. I have never seen a case in which the hiring/grants system failed. Every person that I have ever deemed worthy of a job or of a grant indeed got it, plus or minus small fluctuations. Then there are people in the gray zone, for whom there may be bigger upwards and downwards fluctuations, but it does not matter anyway since such people do not end up making a mark on science.

Therefore, I don't see why we need to optimize these measures or chase our tails with feedback loops.

I would go as far as to say that I have not witnessed real injustice even in the postdoc market, just small fluctuations here and there.

Daniel Lemire said...

I'm with scimom. It is all good and well to ask for better and more thorough measures, and claim that this will make things better... but I also think that we need to assess these claims empirically.

Is it true that "better" measures allow you to recruit better scientists than you would otherwise?

Peter Turney said...

"Is it true that "better" measures allow you to recruit better scientists than you would otherwise?"

How can you even answer that question, except by introducing another measure, and comparing the given measure to the new measure?

B Yen/Getty Images [ iTunes demo ] said...

"I'm Locally Pessimistic, Globally Optimistic"
-- Dr Jordan Pollack, Brandeis Univ Computer Science Prof, former grad-school colleague of mine (UIUC)

He made this statement during an interesting Slashdot.org interview. The current state of the research sector is based on an INCOMPLETE data-set, leading to temporary theories. "Ill-conditioned problem": there are multiple theories that can explain the partial data-set. Successively accumulated data always asks "more questions", leading to modified theories (or new ones altogether).

How is it possible to "measure" progress, if one doesn't know the "global data-set"? The analogy of successively peeling onion layers is used to describe Science process, you can't figure out "where you are", by the current "layers of development". I suppose you can extrapolate given the previous "conquered layers"..but still there's a paradox.

"Predicting the Future..is PREDICTABLE UNPREDICTABLE"

Look at how any sector has developed, especially Technology. Some of the stuff was completely "out of the blue", serendipitous/random.

See next post for THREE Mass-Consumer market sectors, that came from US Military "seeding". NO WAY was that predictable. How could this "future development" been predicted back then (60's), or "progress measured"? I think the model is Evolution,

"It's not the Smartest, the Strongest, but the Species that is most ADAPTABLE TO CHANGE, that survives"
-- Charles Darwin

B Yen/Getty Images [ iTunes demo ] said...

More interesting comments by Dr Pollack from Edge.org

http://edge.org/response-detail/10553

"A measurement of innovation rate.

There is no measure of the rate at which processes like art, evolution, companies, and computer programs innovate.

Consider a black box that takes in energy and produces bit-strings. The complexity of a bit-string is not simply its length, because a long string of all 1's or all 0's is quite simple. Kolmogorov measures complexity by the size of the smallest program listing that can generate a string, and Bennet's Logical Depth also accounts for the cost of running the program. But these fail on the Mandelbrot Set, a very beautiful set of patterns arising from a one-line program listing. What of life itself, the result of a simple non-equilibrium chemical process baking for quite a long time? Different algorithmic processes (including fractals, natural evolution, and the human mind) "create" by operating as a "Platonic Scoop," instantiating "ideals" into physical arrangements or memory states.

So to measure innovation rate (in POLLACKS) we divide the P=Product novelty (assigned by an observer with memory) by the L=program listing size and the C= Cost of runtime/space/energy.

Platonic Density = P / LC"

"Knowing is not enough, we must Apply"
"Willing is not enough, we must Do"
-- Bruce Lee

Uncle Al said...

Fabricate fashionable rating paradigms. Hire only quantified safely qualified, countably productive ink pushers. 1935 Wallace Carothers discovered nylon, abominating proper knowledge. He also committed suicide. Ban nylon.

Eldritch talents are intellectual terrorists. The stuff that was coming out of him consisted of words, but it was not speech in the true sense: it was a noise uttered in unconsciousness, like the quacking of a duck..., 1984. Management says, "Quack, damn you."

Arun said...

W. Edwards Deming, "Out of the Crisis", ~1985,

Measures of productivity are like statistics on accidents: they tell you all about the number of accidents in the home, on the road and at the workplace, but they do not tell you how to reduce the frequency of accidents.

...
Some leaders forget an important mathematical theorem that if 20 people are engaged on a job, 2 will fall at the bottom 10 per cent, no matter what. It is difficult to overthrow the law of gravitation and laws of nature. The important problem is not the bottom 10 per cent, but who is statistically out of line and in need of help.

...
...
No one can put in his best performance unless he feels secure.

Phillip Helbig said...

I agree with most of your criticism of various measures, but at the end of the day you need to decide whom to allocate funding to. Giving it to everyone who says "I'm a scientist; give me money" won't work.

Sabine Hossenfelder said...

Hi Phil,

I agree with you insofar as that almost all of these measures are useless if you're dealing with young people (where by 'young' I don't necessarily mean young by age, but new to the field). But while I agree that motives and motivation are relevant for success, I don't think they make a good criterion. I've known a bunch of very motivated students who were just, well, not very good at physics. One could say motivation is necessary but not sufficient, and since it's necessary it doesn't seem to make much sense 'measuring' it. Which is to say, essentially, if somebody is applying for a job it makes sense to assume that they want to work in the field.

Having said that, the motivation to work at any particular place is another thing and in fact something that is dealt with in unclear ways. For example, we get applications from basically all over the globe but, not so surprisingly, the ones with ties to North Europe are more likely to accept an offer. And this again touches on a much bigger problem which is the annual dance of offers and deadlines which, if you ask me, is a totally archaic procedure.

Best,

B.

Sabine Hossenfelder said...

Hi Scimom,

I hear you but can't say the same is true for me. I know many people who I deem 'worthy' of a grant or a job and who didn't get those, and on the other hand I know a bunch of dull people who easily got them. As far as I am concerned, there is clearly something rotten in the procedure. Note that I'm doing what the NSF calls 'transformative' research, which might explain the difference. Best,

B.

Sabine Hossenfelder said...

Daniel,

I totally agree that these claims must be assessed empirically. The problem though is that the relevant people, those hiring, firing, and handing out grants, don't see the need to evaluate and possibly rethink their procedures.

I don't know if better measures would allow you to recruit 'better' scientists. What I've tried to say in my blogpost is that that's the wrong question to ask. We should ask what measures can do to make science run more efficiently, by saving researchers time and effort with what they are already doing anyway. I also think that using measures can prevent sloppiness that comes from the attempt of researchers to save time and effort. That's how I think using measures will help science. Best,

B.

Phillip Helbig said...

"Smolin argued that the number of single authored papers is a good indicator for a young researcher’s promise. He’s not alone in this belief. Most young researchers are very aware that a single authored paper will put a sparkle on their publication list. But maybe a researcher with many single authored papers is just a bad collaborator."

I would say that single-author papers from a young researcher are an indication of quality. However, the converse is not true. There are fields in which almost all papers must have multiple authors, and there are the "Lennon and McCartney types" who always have more than one name on the paper, for various reasons.

Sabine Hossenfelder said...

Phillip,

Yes, that matches very well with my experience.

Phil Warnell said...
This comment has been removed by the author.
Phil Warnell said...

Hi Bee,

I wasn’t contending one should place motive and/or motivation as the overriding factor, more just to suggest that with all else being (more or less) equal, it should be what tips the scale. Moreover, I do (and did) agree that such cannot be assessed (at least at present) by some recognized test or method, but rather is something to be explored in the final stages of selection, with respect to personal interviews of the whittled-down list of candidates and, if possible, with those who have lent them their personal recommendation. I also think this was the message Einstein was attempting to communicate, in recognizing science not simply as an occupation but rather more so as a calling, with its base formed as a philosophy. The contention being that it’s then best represented by those who find its purpose in serving to benefit humanity as a whole, rather than simply the individual, or the ambitions of the particular employer (institution) for that matter.

”Anybody who has been seriously engaged in scientific work of any kind realizes that over the entrance to the gates of the temple of science are written the words: Ye must have faith. It is a quality which the scientist cannot dispense with.”

-Max Planck, “Where Is Science Going?”

Regards,

Phil

amused said...

Hi Bee, a few belated comments on this:

"Most young researchers are very aware that a single authored paper will put a sparkle on their publication list"

Having been such a young researcher I think it is more accurate to say:
"A few naive young researchers think that single authored papers will add sparkle to their publication list, but then after they publish the single authored paper(s) they discover that the reality is not what they expected..."

One of my favorite rhetorical questions to ask Sean Carroll and others of his ilk back in the good old days on Cosmicvariance was:
"In the competition for jobs, how many single-author publications in PRL on a non-string theory topic does it take to balance out a single publication in, say, PRD by a young person as junior collaborator of a senior prominent string theorist and with a bunch of other coauthors on the paper?"
(Answer: no number of single-author PRLs will be sufficient.)

Most young researchers are clued on enough to instinctively realize that it will be much better for their careers to be working as junior collaborator and coauthor with prominent senior researchers rather than writing papers on their own. So they don't bother to write single-author papers, even though some of them probably could if they wanted to.

Regarding measures in general, one thing I think it is important to recognize is that you can get pretty much any outcome you want by choosing the measures in a suitable way. In fact I have the impression that quite often decision makers informally decide the outcome they want (based on their gut feelings mostly) and then choose a collection of measures that produces the desired outcome. It would be more efficient to just put them all together in a room and tell them to make the decision based on their collective gut instinct, without the bother of having to cook up measures to formally justify it.

Another thing I think it's important to recognize is that when you or me or anyone else makes proposals for how to measure performance, it is inevitable that we are going to favor measures that make us look good personally.
For example, I am convinced deep down in my gut that the number of single-author publications in PRL should count for more than anything else when assessing theoretical physicists, especially junior ones. That also just happens to be the measure that would favor me most compared to my peers, but I'm sure that's just a coincidence ;-)
On the other hand, I shuddered when I read the measures you were proposing in the post because I know that I would score terribly on them. But maybe you would score very well on them? ;-)