
How to select junior (or other) researchers, and why not to use Impact Factors

[ UPDATE: a commentary based on this blog post has now been published in the Journal of Informetrics at http://www.sciencedirect.com/science/article/pii/S1751157717302365 ]

Recently, a preprint was posted on arXiv exploring the question “Can the Journal Impact Factor Be Used as a Criterion for the Selection of Junior Researchers?”. The abstract concludes as follows:

The results of the study indicate that the JIF (in its normalized variant) is able to discriminate between researchers who published papers later on with a citation impact above or below average in a field and publication year - not only in the short term, but also in the long term. However, the low to medium effect sizes of the results also indicate that the JIF (in its normalized variant) should not be used as the sole criterion for identifying later success: other criteria, such as the novelty and significance of the specific research, academic distinctions, and the reputation of previous institutions, should also be considered.

In this post, I aim to explain why this is wrong (and, worse, how following this recommendation may retard scientific progress), and I have a go at establishing a common-sense framework for researcher selection that might work.

First, let me emphasize that it’s commendable that Bornmann and Williams posted their paper on a preprint server. Psychology now has a dedicated preprint server, too: PsyArXiv, run by the Center for Open Science, who also run the Open Science Framework. Consistently posting papers on preprint servers enables discussions while they can still be of some use for the final paper. It also, of course, considerably increases the attention your work receives. So well done them!

At the same time, it exposes you to more scrutiny than you’ll receive just from the peer reviewers, which may take some getting used to. So, in that vein, let me get back to the problems with Bornmann and Williams’ paper.

Journal Impact Factor is no proxy for scientific quality

First, while they acknowledge some criticism of the Journal Impact Factor (JIF), they only mention two problems:

  • JIF is only computed for five years (and therefore, only captures a paper’s “citation lifetime” for very fast-moving disciplines)

  • JIF is not representative for most papers in a journal (they mention that the JIF is an average computed for a skewed distribution; but the huge within-journal variation in citations also suffices to make this point)

They miss another important point: the JIF is not indicative of the quality of scientific research; if anything, quite the opposite. An excellent overview by Brembs and Munafo (2013), “Deep impact: unintended consequences of journal rank”, nicely illustrates this. The plot I find most illustrative is this one:

The Y axis shows statistical power; the X axis, the JIF. There is no association whatsoever between study quality and JIF. In addition, a pattern has been observed where initial findings are often published in high-JIF (hi-JIF) journals, while subsequent studies, typically published in lower-JIF (lo-JIF) journals, find smaller effect sizes (or fail to replicate the original effect). Hi-JIF journals also retract papers more frequently. In other words: peer review at hi-JIF journals seems biased towards novelty/sensation (or perhaps citability/“mediability”), which automatically means it’s biased against methodological quality. After all, rigorous, high-powered studies are much less likely to find large effect sizes than underpowered studies are (because the effect sizes’ sampling distributions are narrower; see e.g. Peters & Crutzen, 2017; also see Christley, 2010). There seems to be little reason to assume the JIF is indicative of scientific quality. This, of course, is the biggest problem of the Bornmann and Williams paper: they assume that hi-JIF publications are somehow superior to lo-JIF publications. A common misconception, but still just a misconception.
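
To illustrate that last point concretely, here is a minimal simulation sketch (my own, not taken from the paper or from Brembs and Munafo), assuming a true standardized effect of 0.3: among results that reach statistical significance, small underpowered studies overestimate the effect far more than large, high-powered studies do.

```python
# Minimal simulation sketch (assumed true effect d = 0.3, not real data):
# among significant results, small studies report inflated effect sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d = 0.3  # assumed true standardized effect size


def mean_significant_d(n_per_group, n_sims=5000, alpha=0.05):
    """Average observed Cohen's d among simulations reaching p < alpha."""
    significant_ds = []
    for _ in range(n_sims):
        a = rng.normal(true_d, 1.0, n_per_group)
        b = rng.normal(0.0, 1.0, n_per_group)
        t, p = stats.ttest_ind(a, b)
        if p < alpha and t > 0:
            pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
            significant_ds.append((a.mean() - b.mean()) / pooled_sd)
    return np.mean(significant_ds)


print("n = 20 per group :", round(mean_significant_d(20), 2))   # heavily inflated
print("n = 200 per group:", round(mean_significant_d(200), 2))  # close to 0.3
```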

Note that among bibliometricians (and scientometricians, perhaps), criticism of the JIF seems to focus on the statistical properties of the JIF. The points raised by Brembs and Munafo appear neither in the Bornmann and Williams paper nor in the Waltman and Traag preprint they cite, from which they quote that the JIF “is a more accurate indicator of the value of an article than the number of citations the article has received”. Waltman and Traag also write that they “recognize that there is a significant amount of evidence of the undesirable consequences of the prominent role played by the IF in many fields of science”, but again, this doesn’t seem to relate to the inability of the JIF to select for high-quality scientific work. Is this perhaps taken to be a given in the field of bibliometrics? That would make some sense; after all, bibliometrics is a field in itself, so for a bibliometrician, establishing scientific quality in another field is probably not straightforward. I can even imagine that they automatically think “we can just use JIF as a proxy for that!” :-)

Correlation does not imply causation (not even with a disclaimer in your discussion section)

Even if you have a longitudinal design, correlation doesn’t imply causation. Bornmann and Williams took data from two periods: 1998-2002 (the ‘junior period’) and 2003-2012 (the ‘senior period’).

For the junior period, they computed the percentage of papers each researcher published in Q1 journals. For some reason, they didn’t use this percentage as a predictor, but instead created four equally sized groups. I have no idea why - all arguments against median splits hold for this ‘double median split’ approach as well: you throw away data, introduce unjustifiable ‘bright line’ cut-offs, and so may distort your findings (see e.g. Maxwell et al., 1993, Altman et al., 2006, and/or MacCallum et al., 2002). They also look at the number of published papers - but here, too, they create four groups, without justifying why they don’t just use the variable in its continuous form.
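
To illustrate what such categorization costs, here is a minimal sketch with simulated (entirely hypothetical) data: the same underlying association looks weaker once the continuous predictor is reduced to quartiles, and weaker still after a median split.

```python
# Minimal sketch with simulated data (not the authors' data): coarsening a
# continuous predictor into quartiles or halves attenuates the association.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)            # e.g. a standardized 'proportion of Q1 papers'
y = 0.3 * x + rng.normal(size=n)  # later 'success'; true r is about .29

quartiles = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))  # four equal groups
median_split = (x > np.median(x)).astype(int)

for name, predictor in [("continuous", x),
                        ("quartiles", quartiles),
                        ("median split", median_split)]:
    r = stats.pearsonr(predictor, y)[0]
    print(f"{name:>12}: r = {r:.2f}, r^2 = {r ** 2:.3f}")
```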

Then, they use these two variables to predict their operationalisation of scientific success. They have two (closely related) operationalisations, both based on normalized citation scores (NCS). For each paper, this is the number of citations it received, corrected for discipline and publication year (using the mean citations per paper in the relevant Web of Science category and year). These NCS scores are then aggregated into a mean (MNCS) and a total (TNCS) to obtain two indicators of scientific success.
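
For readers who want to see the mechanics, here is a hypothetical sketch of this normalization logic as I read it; the data and column names are made up, and the field/year baselines would of course come from the full database rather than from a five-row toy sample.

```python
# Hypothetical sketch of NCS/MNCS/TNCS as I read the description; toy data.
import pandas as pd

papers = pd.DataFrame({
    "researcher": ["A", "A", "B", "B", "B"],
    "field":      ["psych", "psych", "psych", "stats", "stats"],
    "year":       [2000, 2001, 2000, 2000, 2001],
    "citations":  [10, 4, 30, 2, 8],
})

# Expected citations per field and publication year (here computed from the
# toy sample itself; in reality this baseline comes from the whole database).
baseline = papers.groupby(["field", "year"])["citations"].transform("mean")
papers["ncs"] = papers["citations"] / baseline

# Aggregate per researcher: mean (MNCS) and total (TNCS) normalized citations.
summary = papers.groupby("researcher")["ncs"].agg(MNCS="mean", TNCS="sum")
print(summary)
```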

I agree with Bornmann and Williams that this probably gets pretty close to the most feasible indicator of scientific success you can get if you want to keep it ‘quick and dirty’ (read: automated processing of data from a database). One big problem is that they use Web of Science (WoS) data - WoS is owned by a commercial organisation (Clarivate Analytics) and does not include all journals, and will therefore omit many relevant journals and publications, probably even in ways biased against certain topics and fields (also see Deep Impact for an amusing albeit unsettling anecdote). This considerably undermines the findings, unless you assume that the journals in WoS are a random selection of all journals, which seems unlikely. A second problem is that papers don’t always get cited frequently because they represent excellent science; one example that illustrates this point is Daniel Bem’s controversial “Feeling the Future” paper, which has been extensively criticized for its methodological (or rather, statistical) shortcomings (see e.g. the response by Wagenmakers et al., 2011), but is at almost 600 citations (Google Scholar count; the WoS count will be lower because of the aforementioned omitted journals).

Anyway, these are their variables. They examine these for three cohorts to get some idea of any time anomalies, which I again think is a great idea. So in effect, they do all analyses three times (given their sample size of thousands of papers, multiple testing isn’t really an issue, even with very conservative corrections).

They use analysis of variance to explore whether their independent and dependent variables are associated. They report p values of 0 (wow, that’s really low :-)); with sample sizes in the thousands, even trivial associations obtain low p values. Their effect size measures are more interesting. They report confidence intervals for η² (which is great, actually - everybody should report these!). They also report Spearman’s r, although it’s not clear whether this is based on the categorized independent measures or the continuous versions (if continuous, why not just use Pearson’s r? So I fear it’s based on the categorized versions). Also, because they don’t report confidence intervals for this latter measure, it’s not of much use, so I’ll leave it out here. The three confidence intervals they find for each of the four associations are:

  • Junior Q1 quartile predicting MNCS: [.03, .06] [.05, .08] [.04, .06]

  • Junior Q1 quartile predicting TNCS: [.02, .04] [.02, .04] [.02, .04]

  • Junior number of papers predicting MNCS: [.003, .014] [.005, .017] [.001, .009]

  • Junior number of papers predicting TNCS: [.07, .10] [.05, .08] [.05, .08]

The ‘tentative qualitative thresholds’ usually applied to η² take values around .01 to signify a small association, around .06 a moderate association, and around .14 a large association. Using these thresholds, the evidence from this dataset is consistent with trivial to weak associations in the population, although moderate associations might also be possible. However, even when looking at the upper bounds, the associations are quite weak: η² can be interpreted as the proportion of explained variance, and this is consistently well below 10%.
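
For concreteness, here is a toy sketch of what η² expresses: the between-group sum of squares as a proportion of the total sum of squares, i.e. the share of variance in the outcome that group membership explains. The group means and sizes are made up.

```python
# Toy sketch of eta squared: between-group variance / total variance.
import numpy as np

rng = np.random.default_rng(3)
# Four hypothetical groups whose means differ only slightly (sd = 1 within).
groups = [rng.normal(loc=m, scale=1.0, size=500) for m in (0.0, 0.1, 0.2, 0.3)]

grand_mean = np.mean(np.concatenate(groups))
ss_total = sum(((g - grand_mean) ** 2).sum() for g in groups)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

eta_squared = ss_between / ss_total
print(f"eta squared = {eta_squared:.3f}")  # around .01 here: a 'small' effect
```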

They then look at NCS over time. I’ll just show the first figure; the others are quite comparable:

The low predictive value of the number of papers for MNCS is clearly visible here. The higher predictive value of the proportion of papers in Q1 can be seen in the two left-most panels (as can the way this effect disappears over time). And finally, the number of papers as a junior consistently has some predictive value for the total number of citations as a senior. Understandably, this effect becomes larger over time: after all, more papers collect more and more citations.

So, there are weak effects. The authors clearly take these to mean that selecting junior researchers using e.g. proportion of Q1 papers is a good idea (although they say that other sources should be used, too). After all, they conclude in their abstract:

The results of the study indicate that the JIF (in its normalized variant) is able to discriminate between researchers who published papers later on with a citation impact above or below average in a field and publication year - not only in the short term, but also in the long term. However, the low to medium effect sizes of the results also indicate that the JIF (in its normalized variant) should not be used as the sole criterion for identifying later success: other criteria, such as the novelty and significance of the specific research, academic distinctions, and the reputation of previous institutions, should also be considered.

There are two problems with this conclusion. First, these data imply that the JIF is not a useful indicator of later success: it explains very little variation in later success. Second, and more importantly, the authors here ignore a consideration they mention in their own discussion: because the JIF is currently often used as an indicator of scientific quality, researchers who publish mainly in Q1 journals as juniors are likely to have more grant opportunities, get better jobs, and perhaps even get more research time. As long as there’s no way to exclude this, the conclusion that the JIF is useful is not justified; it wouldn’t even be justified if the effect sizes were much larger. The potential third variables I listed here are just some I could think of off the top of my head; many more may of course exist.

This is especially problematic if you’re looking for criteria to select researchers. If a researcher has hi-JIF publications, and those publications facilitate future resources and opportunities, and those resources and opportunities are in fact the causal antecedents of future success, then using “early JIF performance” as an indicator actually creates the causal link, rather than leveraging causality that exists outside of your decision as a ‘selector’. If “junior success” is not indicative of some trait, characteristic or competence, but of the likelihood of securing future resources, the problem is that you, the ‘selector’, are the resource- and opportunity-providing organisation! You won’t be reaping any benefits - you will simply create a self-fulfilling prophecy. And in doing so, you forgo opportunities to select on variables that are indicative of relevant characteristics.

The conclusion the authors draw basically promotes Campbell’s law:

The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.

This is also one of the reasons why Brembs and Munafo argue against any journal ranking.

Another problem, and the reason why promoting reliance on the JIF for researcher selection can retard scientific progress, is that hi-JIF publications are not of higher scientific quality, and may in fact be of lower quality, so you’re selecting on the wrong variables. You’re selecting people who can, for example, sell their research well - but who may do relatively low-quality research. You’re perpetuating, strengthening even, reliance on an indicator that seems negatively correlated with research quality.

But there is one more fundamental problem with Bornmann and Williams’ recommendation.

The importance of looking at the data

The authors conclude that the JIF can be a useful aid in selecting junior researchers. However, the low effect sizes imply that there is a lot of variation between researchers. Bornmann and Williams don’t show this variation, however, which makes it very hard to get an idea of what the data are really telling us. They did not publish their data along with the preprint; I’ll ask them for it when I send them this blog post (after all, maybe it’s of some use to them). With the raw data, it would be possible to look at how justifiable it is to base decisions about individual researchers on these results.

After all, these results are computed for (huge) groups of researchers. Selection of juniors, however, involves selecting one or at most a few researchers. In that sense, this subject matter is comparable to diagnosis in the medical and psychopathological domains. And similarly, any indicator of ‘performance’, or more accurately, of the degree to which some unobservable variable ‘exists’ in an individual, should perform as well as we expect such indicators to perform in those domains. After all, detecting “scientific excellence” is in essence the same as detecting e.g. “HIV” or “narcissism” (although especially the last one might be uncomfortably comparable).

When dealing with the selection of individuals, or with the predictive value of indicators for individual performance with the goal of applying them at the individual level, the variation between individuals is important, and you need to inspect it closely. This gives a much better idea of how useful a measure can be for predicting the performance of any given individual.
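
To make this concrete, here is a simulation sketch (assuming a simple bivariate-normal world, which the real data need not resemble) of what an association in the range reported above means for an individual selection decision: how often does picking a junior from the top quartile of the indicator actually get you an above-median senior? Under these assumptions, the hit rate comes out only modestly above the 50% you would get by selecting at random.

```python
# Simulation sketch under an assumed bivariate-normal model (not the
# authors' data): individual-level hit rate for a weak group-level effect.
import numpy as np

rng = np.random.default_rng(7)
n = 200000
eta_sq = 0.05                 # roughly the size of the larger associations reported
r = np.sqrt(eta_sq)           # implied correlation under this simple model

junior_indicator = rng.normal(size=n)
senior_success = r * junior_indicator + np.sqrt(1 - r ** 2) * rng.normal(size=n)

top_quartile = junior_indicator > np.quantile(junior_indicator, 0.75)
above_median = senior_success > np.median(senior_success)

hit_rate = above_median[top_quartile].mean()
print(f"P(above-median senior | top-quartile junior) = {hit_rate:.2f}")
# Baseline for comparison: selecting at random gives 0.50.
```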

Conclusion

As it stands, I don’t see how the data justify the recommendation to involve the JIF in researcher selection procedures. The evidence is simply too weak and shaky. Note that this does not mean that Bornmann and Williams’ work is not very interesting and useful. It’s their practical recommendation that does not seem justified, and that may be harmful because it perpetuates the reliance on the JIF as a proxy for scientific competence. The insistence that it is useful and desirable to express something like individual scientific competence using a proprietary journal-level index distracts from the actual problem at hand: how to select researchers in academia.

How to select researchers in academia

So, what do you do if you want (or need) to select researchers? First, it would seem a good idea to stop focusing on the convenience of counting (some aspect of) the publication record and instead think about what you actually want. In the Netherlands, at least, academics’ day-to-day activities fall into roughly four categories:

  1. Research (e.g. planning studies, collecting data, writing papers and grant proposals, and/or supervising these)

  2. Teaching (e.g. crafting courses, teaching, supervising bachelor and master thesis students)

  3. Admin (e.g. supervising colleagues, chairing departments, committee membership)

  4. Societal responsibility (e.g. knowledge translation to professional organisations, media interviews, public lectures)

The first thing to do is to establish whether these four are what your university values. Most universities have strategic plans - and most universities explicitly value 1, 2 and 4, while 3 can be a necessary task depending on one’s position. For some universities, getting media attention, collaborating with professional organisations, or investing in societal responsibility (‘valorisation’ is the buzz-word currently in vogue in the Netherlands) is more important than for others.

Second, look at the career options for the researcher(s) you intend to hire. Will they be teaching? What percentage of the time? Will they craft courses or maybe oversee curricula, or only give lectures and tutor courses? Do you expect them to interact with the media, and if so, how much media attention do you want for your university? Should they write grant proposals? Which balance do you want between writing grant proposals and writing publications? Is the research mainly solitary, or is teamwork very important?

For example, most universities in the Netherlands currently only explicitly divide faculty time between teaching and research. If those are each at 50%, then which of them would you like admin and societal responsibilities to eat into? Does your university value media appearances? If so, how many hours do you want your new researcher to invest in securing them and preparing for the performances? Should the researcher take up admin roles (e.g. department chair), and if so, should they do this in their research time or in their teaching time?

Once you’ve determined the type of academic you want to hire, make sure you craft your selection criteria to be consistent with that profile. If you expect your new researcher to actually do research for only 35% of the time (because they also need to take part in the library committee, the art committee, and the institutional review board, and you’d also like regular media appearances), make sure you don’t focus your selection process on their research skills; it is unwise to select somebody based on their performance in one-third of the job. Instead, pay a lot of attention to their teaching competence as well, in addition to, for example, social skills. If, on the other hand, you hire a researcher mainly to obtain grant funding, and they’ll have negligible teaching and admin responsibilities, focus on their grant proposal writing skills: look at whether they have already successfully applied for grants, and ask them to send you one or two grant proposals they wrote, or one or two papers that they think best illustrate their grant-writing competences.

Of all the competences that can be required of a faculty member, the factors Bornmann and Williams recommend are only useful for assessing one facet - and even then, it will often be better to closely inspect one or two papers that applicants themselves consider particularly illustrative of their current thinking and academic competences than to try to reduce their performance to one quantitative indicator derived from journal ranks.

It is understandable that funders and employers desire a simple answer to the question “who are the best persons for the job?”. However, as with other complex jobs, no simple answers exist. In other domains, comprehensive assessments are customary. I’m not sure these would necessarily be a good idea; but the stubborn commitment to simplicity evidenced by the preference for one quantitative indicator seems naive.

So, the common-sense framework is as follows. First, accept that, although it’s nice and convenient, you’ll have to stop treating research as if it’s the primary task of all faculty members, and instead make the effort to draw up a realistic profile of what the staff member you’re hiring will be spending her or his time on. Second, determine which competences each of these tasks requires. Third, determine how you can establish whether an applicant has those competences.

In other words: don’t give in to the temptation formed by the convenience and apparent objectivity of quantitative indicators. Your university is your staff members. Investing in thinking about what you want to select for pays off. Make sure your selection profile matches your goals, and don’t base your selection procedures on “what everybody else does”. Other universities have other priorities, other research and teaching foci, and other cultures. In addition, the recruitment procedures of most universities are still based on research performance alone, so a selection policy that better reflects actual staff activities may be an easy route to competitive advantage. Selecting researchers is a responsibility: take it, and take it seriously, instead of relegating it to e.g. JIFs.

Final thoughts

First, I actually think this study is very interesting, and well done. My exceptions are that the data and analysis scripts weren’t Fully Disclosed, that I still have no idea why the authors decided to categorize their predictors (an obsession with normal distributions, perhaps?), and that I increasingly consider visualising only aggregated variables (and not the raw data points) problematic. But for the rest, the fact that they visualise a lot and consistently report effect sizes and confidence intervals is great. So mostly, I think this paper is very interesting. It’s just their practical recommendations that don’t seem backed up by the data they report, mostly because their design simply cannot provide the evidence needed to make such recommendations. That’s of course an unfortunate fact of life; these data are what they are. Still, that doesn’t justify making recommendations based on weak data. If weak data are all you have, practical recommendations should be avoided, and the research should be considered agenda-setting rather than practice-informing.

And second, more generally, this blog post may give the impression that I’m ‘against’ bibliometric and scientometric research. I’m not. But as it stands, the field is very, very far removed from having practical applicability for e.g. applicant selection - and so it should refrain from making recommendations in that vein, lest unsuspecting researchers or policymakers read the paper and decide to follow them.