Numerical modelling is not always a walk in the park. In fact, many of us occasionally encounter problems that we cannot directly solve ourselves, and thus rely on help from others. In this month’s Wit & Wisdom post, Patrick Sanan, postdoctoral researcher at the Geophysical Fluid Dynamics group at ETH Zurich, will talk about asking the right questions about scientific software. As an experienced scientific software developer, Patrick has often been at the “receiving end” of questions regarding numerical modelling and hopes to guide you through some important points that could make life for you as well as your ‘helper’ a lot easier.
Numerical modelling is essential for geodynamics; since we cannot directly measure relevant phenomena, we partake in the magic of making a set of reasonable assumptions, setting up a model, and letting a system evolve to produce insight. It’s beautiful. We gain an understanding of the subtle-yet-fundamental processes which shaped our apparently-so-special Earth. We turn our eye beyond, to other planets.
I’m not here to talk about that, though. I’m here to discuss an ugly part of the job, which can bring all the profundity to a screeching halt: what to do when the code doesn’t work.
You know the situation. You’re stuck. There’s no output. Segmentation fault. Error Code -123. You didn’t sign up for this…
You don’t know where to start looking. Is it your model parameters? Your physical assumptions? Are you using the code as intended? Is it your compiler? Can it be the cluster? Your .bashrc? Is your keyboard plugged in?
You’re frustrated. No one around you has a good suggestion. Why are these cruel computers doing this to you?
What to do? Ask for help, in the right way. In this blog post, I’ll point out some facts about scientific software, use them to formulate an effective email asking for help, and then extract some guiding principles.
Some Points on Scientific Software
First, I’ll list some observations I, as a developer, have made about scientific software:
- Scientific software projects are usually short on maintainers and time.
- Software designers are not psychic, but they are often experts at deductive problem-solving.
- Reproducible, bisectable problems are surprisingly easy to solve. Other problems are surprisingly difficult.
- Working reference cases are very valuable.
- Sufficiently-complex computing tasks must be treated like lab experiments: one must document the setup and control as many factors as possible.
- The solutions to most problems are obvious, once found.
- Numerical software is harder to test than generic software, because floating point arithmetic and parallel computing lead to acceptable differences in numerical output. Higher-level understanding of physics or (parallel) numerical methods is often required to know if something is a “real problem”.
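To make the last point concrete, here is a minimal Python sketch of an "acceptable difference". It adds the same three numbers in two groupings, as a parallel reduction might on different core counts, and then compares the results with a relative tolerance instead of exact equality. The helper name `matches_reference` and the tolerance of 1e-12 are assumptions chosen for illustration, not part of any particular code.

```python
import math

# The same three numbers, added in two different groupings. A parallel
# reduction may group terms differently depending on the number of cores.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)


def matches_reference(value, reference, rel_tol=1e-12):
    """Return True if value agrees with reference to within a relative tolerance.

    The default tolerance is an illustrative assumption; a sensible value
    depends on the problem and the numerical method.
    """
    return math.isclose(value, reference, rel_tol=rel_tol)


print(left_to_right == regrouped)                   # False: the last bits differ
print(matches_reference(left_to_right, regrouped))  # True: an acceptable difference
```

Deciding what tolerance counts as "acceptable" is exactly where knowledge of the physics and the numerical method comes in.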
With these as a guide, let’s consider how one might try to resolve a confusing issue.
No pretty pictures here, just error messages (if you’re lucky).
Asking for Help: The Bad Way and the Good Way
Email is a common way to ask for help, since the person who can best help you is often across the world. Let’s say that I’m using a regional lithospheric dynamics code called Rifter3D. I’ve come across an error I have no explanation for. I can’t figure it out, so I write an email to the developers of the code. You might also write to a dedicated help address.
Hello - I'm using Rifter3D but on our local cluster I get this error:
/cluster/shadow/.lsbatch/1559505025.92314771: line 8: 951 Segmentation fault (core dumped) ./rifter3d -options_file options.opts
Do you know what I'm doing wrong?
The person on the other end wants to help, but has no information to work with and doesn’t know how much of their time you’re asking for. There is a better way:
Hello - I'm having some trouble diagnosing a problem using Rifter3D and was hoping you could give me some pointers.
I've been working with Prof. XYZ and have modified a rifting scenario from XYZ et al. 2017 to study the effects of varying A on B. When I run a small case on my laptop, the simulation finishes as expected. I use the attached options_small.opts and run mpiexec -np 4 ./rifter3d -options_file options_small.opts. I'd like to run a bigger case (512 x 256 x 64) for 300 million simulated years, on 64 cores on our cluster.
However, my job fails, producing a segmentation fault almost immediately. I attached the job submission script (job.lsf), and output from my run (lsf.o92314771). I am using the cluster's existing PETSc 3.11 module, and version 1.0.3 of Rifter3D. I can successfully run a simple isoviscous setup on this cluster - see attached job_small.lsf, options2.opts, and lsf.o92314681.
Do you have any insight as to how I might be able to run this simulation?
Attachments: options_small.opts options.opts job.lsf job_small.lsf options2.opts lsf.o92314771 lsf.o92314681
The recipient will likely respond with more questions as you work towards resolving the issue. Perhaps they’ll ask you for more information about how you built Rifter3D, or point out some unusual settings in your options file.
Why is this second email better? Not simply because it is longer:
- It clearly describes the problem. Just trying to precisely describe a problem has an almost-magical clarifying effect, and the solution will often appear. Software engineers call this rubber ducking.
- It explains the true objective and how the problem is to be resolved. This is important to avoid the XY problem: describing a problem with a method to achieve a goal, without mentioning the goal itself.
- It gives concrete information to reproduce the problem: the version of the code, input files, and launch commands/scripts.
- It provides enough output to allow deductive reasoning, more than just a copy-and-pasted error message.
- It’s polite.
- It shows some effort has already been put into investigating.
- It notes similar, working cases.
- It’s not too long, but it is detailed enough to allow for quick, intelligent follow-up questions. Supporting data are included as attachments or links.
- It doesn’t make too many assumptions about the cause of the problem.
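Gathering the concrete information from this list can be partly automated. The following Python sketch assembles a plain-text summary to paste into such an email. The function `build_report` and its fields are hypothetical, and the command and file names are taken from the example email above; adapt them to whatever your code and cluster actually use.

```python
import platform
import shlex
from pathlib import Path


def build_report(command, attachments, code_version):
    """Assemble a plain-text summary of a failing run for a help request.

    command      -- the launch command as a list of arguments
    attachments  -- names of the files you intend to attach
    code_version -- version string of the code, as reported by you
    """
    lines = [
        f"Code version: {code_version}",
        f"Launch command: {shlex.join(command)}",
        f"System: {platform.system()} {platform.machine()}",
        "Attachments:",
    ]
    for name in attachments:
        # Flag files that are missing, so the email does not promise
        # attachments that are not actually there.
        status = "found" if Path(name).is_file() else "MISSING"
        lines.append(f"  {name} ({status})")
    return "\n".join(lines)


command = ["mpiexec", "-np", "4", "./rifter3d", "-options_file", "options_small.opts"]
print(build_report(command, ["options_small.opts", "job.lsf"], "1.0.3"))
```

A summary like this does not replace the description of your goal and of what already works, but it makes the "how to reproduce" part harder to forget.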
Boiling it Down: 3 Questions to Ask Yourself
When asking for help, consider these three questions. They will help with the central objective: clearly describing the problem.
- Why do you need it to work?
  What is the context? What is the goal?
- How do you show that it doesn’t work?
  What are the steps to reproduce your problem?
  How will you know that the problem is resolved?
- What does work?
  How far are you from a working state? What similar cases work?
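The last question can sometimes be answered systematically. If one setting works and a larger one fails, bisecting between them narrows down where things break. Below is a minimal Python sketch of that idea. It assumes the setting is an integer and that every value above the first failing one also fails; `hypothetical_run_fails` is a stand-in for "run the case and check whether it crashed", and the threshold of 300 is invented purely for the demonstration.

```python
def first_failing(good, bad, fails):
    """Find the smallest failing setting between a known-good and a known-bad one.

    Assumes fails(good) is False, fails(bad) is True, and that every
    setting above the first failing one also fails.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if fails(mid):
            bad = mid   # the failure is at mid or below
        else:
            good = mid  # mid works, so the failure is above it
    return bad


def hypothetical_run_fails(n_cells_x):
    """Stand-in for launching a run; pretends runs wider than 300 cells crash."""
    return n_cells_x > 300


print(first_failing(64, 512, hypothetical_run_fails))  # 301
```

Each step halves the gap, so even a wide range needs only a handful of runs, and "it works at 300 cells but fails at 301" is a far more useful report than "the big case fails".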
Here is a one-page PDF with some of this advice, which you can print out:
PDF and LaTeX source on GitHub
The “Real” Answer
To conclude, I will try to avoid an “XY problem” of my own. The most efficient way to resolve bewildering problems is to avoid them. To make an alpine analogy, the most important topic in avalanche safety is not how to dig someone out, it’s how to avoid risky terrain.
First, borrow techniques from software engineering. Version control (e.g. git) will encourage you to save working states, amongst many other benefits. Next, leverage your intuition and experience as a geodynamicist. Always be able to quickly run and verify small, quick, simple cases, and test and visualize often. Look out for simple cases where you “know the answer ahead of time”: established benchmarks and analytical solutions.
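As an illustration of a case where you "know the answer ahead of time", here is a small self-contained Python sketch. It is not taken from any geodynamics code: it solves steady 1D heat conduction with uniform heat production, -T'' = H on [0, 1] with T = 0 at both ends, using second-order finite differences, and compares the result with the analytical solution T(x) = H x (1 - x) / 2. The function names, grid size, and tolerance are assumptions for the example.

```python
def solve_steady_heat(n, heat_production):
    """Solve -T'' = H on [0, 1] with T(0) = T(1) = 0 on n interior points.

    Uses the Thomas algorithm on the tridiagonal system with stencil
    (-1, 2, -1) and right-hand side H * h**2.
    """
    h = 1.0 / (n + 1)
    rhs = [heat_production * h * h] * n
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = -1.0 / 2.0
    d_prime[0] = rhs[0] / 2.0
    # Forward elimination
    for i in range(1, n):
        denom = 2.0 + c_prime[i - 1]
        c_prime[i] = -1.0 / denom
        d_prime[i] = (rhs[i] + d_prime[i - 1]) / denom
    # Back substitution
    temperature = [0.0] * n
    temperature[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        temperature[i] = d_prime[i] - c_prime[i] * temperature[i + 1]
    return h, temperature


def max_error(n, heat_production=2.0):
    """Largest difference between the numerical and analytical solutions."""
    h, temperature = solve_steady_heat(n, heat_production)
    errors = []
    for i, value in enumerate(temperature):
        x = (i + 1) * h
        exact = 0.5 * heat_production * x * (1.0 - x)
        errors.append(abs(value - exact))
    return max(errors)


# The stencil is exact for quadratic solutions, so only rounding error remains.
assert max_error(50) < 1e-10
```

A check like this runs in a fraction of a second, so it can be rerun after every change, and committing it alongside the code gives you a known working state to return to.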
The points in this article can help everyone save time (and not just running geodynamical models!): problems will be resolved more quickly, bugs will get fixed faster, and more time can be spent exploring more interesting questions than “Why doesn’t it work?”.