Caroline Spann
University of Natural Resources and Life Sciences, Vienna
During the EGU2016 Assembly, the chair of the statistics advisory panel of the European Journal of Soil Science offered a short course called “Secrets from the statistics panel: common statistical problems in soil science papers”.
The motivation for the short course was to reduce the statistical problems occurring in submitted papers, because the Editors and the Statistical Advisory Panel of the journal often find the same problems again and again. Similar problems can be seen in papers submitted to all soil science journals.
The first important lesson of the course is to read the statistical guidelines that every journal offers its authors before writing a paper. These can be found on the website of each journal, but are probably the least-read items there.
Another common problem in soil science papers is a lack of clarity about the sampling design and its appropriateness. Often it is simply unclear how the sample sites were selected. Depending on the objective of a study (e.g. estimation of a global mean, local prediction, detection, etc.), a sampling design must be chosen, which can be either probabilistic or non-probabilistic (opportunistic, systematic, purposive). Even though randomized sampling has a poor reputation, it is often the best way to design the sampling for a study. A paper needs to make clear how the sampling spots were chosen. If random sampling was performed, the paper needs to describe how, because ‘random sampling’ means random and independent sampling. Often samples are just taken ad hoc, which is not random! Random sampling involves generating the coordinates of the sample points with a random number generator. If other approaches are applied (e.g. rolling dice, throwing a quadrat, …), they need to be controlled and described as well.
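To make the point about random number generators concrete, here is a minimal sketch in Python of how sample coordinates could be drawn. It is my own illustration, not taken from the course material: the function name `random_sample_points`, the rectangular 100 m by 50 m study area and the seed value are all assumptions made for the example.

```python
import random


def random_sample_points(n, x_min, x_max, y_min, y_max, seed=None):
    """Draw n sample points independently and uniformly at random
    inside a rectangular study area (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the design reproducible
    return [
        (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
        for _ in range(n)
    ]


# Assumed example: 10 sites in a 100 m x 50 m field, coordinates in metres.
points = random_sample_points(10, 0.0, 100.0, 0.0, 50.0, seed=2016)
for i, (x, y) in enumerate(points, start=1):
    print(f"site {i:2d}: x = {x:6.2f} m, y = {y:6.2f} m")
```

Reporting the generator, the seed and the boundaries of the study area in the paper is one way to let a reader reconstruct exactly how the sites were chosen. A real study area is rarely a rectangle, so in practice points falling outside the actual boundary would have to be discarded and redrawn.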
Independence is a further source of confusion: how can valid assumptions of independence be made when studying the soil, where spatial dependence is the norm? Two variables are defined as independent if nothing can be deduced about one from information about the other. Independence is also something a scientist has under her or his control. “If and only if we sample a variable independently and at random, can we assume independence,” Murray Lark made clear.
Designs need to be described with enough information that any reader could reproduce the study design. This may sound self-evident, but it is still a deficiency in many submitted papers.
A final thing pointed out by Murray Lark was that statistics are meant to help scientists test their hypotheses, not to spit out the right results!
The short course was well attended! To all those who could not make it, I recommend downloading the course slides provided by the British Geological Survey to get a closer look at what the session was about. The package also contains some helpful R code that you are free to use: http://www.bgs.ac.uk/training/eguShortCourse.html
A take-home message of the course is to go and ask a statistician to resolve uncertainties! They don’t bite!