Maria, a member of our ECS team, recently interviewed Dr Eric Daub from The Alan Turing Institute, London, UK. Here, the Seismology ECS Team wants to know how we can code better. Together.
This is the first in a series of interviews with software engineers on the importance of good practices in software development.
Dr Eric Daub received his PhD from the University of California, Santa Barbara, in computational physics, where he studied numerical models of earthquake rupture and failure of amorphous materials.
Eric is a Principal Research Data Scientist (RDS) at the Alan Turing Institute (ATI), the UK’s national institute for data science and artificial intelligence. Research Software Engineers and Research Data Scientists are generally expert collaborators: they have technical expertise in software engineering, data science, and computational analysis, but generally don’t work in the same field as their original training.
I’ve learned all my programming essentially on the streets.
In this series of ECS-seismology blogs, we’d like to hear from experts like you on software development in seismology. Before we go into details, how did you learn to code and what is your favourite programming language, and why?
I’ve learned all my programming essentially on the streets [Laughing]. I am very much self-taught. Probably most of the work I do these days is in Python but a lot of the high-performance computing software I’ve written over the years is in Fortran and C++.
I had terrible practices when I was a student. That has gotten better over time to the point that I feel I can respectably call myself a software engineer.
What was the first piece of code that you wrote?
My first experience was in primary school. It was a summer enrichment program where I learned programming in BASIC and I remember very vividly there was this cookbook of different programs. One pretty cool one was a password program. It wasn’t a program that generated a password itself, you had to put the password at the beginning of your program! So, if you want to run the program you have to put in the password. I thought it was very clever that I could write a program that my friends couldn’t run.
But then I realized in the process of doing it that my friends could simply look at the program and see the password stored there in plain text. I didn’t know much about computer security in primary school, but it occurred to me that this was not a great way to do things.
I basically taught myself everything else that I’ve known since that particular computer class. I learned a bit as a physics major in undergraduate and in grad school I was constantly teaching myself things.
Over time I got more sophisticated and looked for resources on robust techniques such as version control and unit testing, things that professional software developers use very heavily. I think these practices are less common if you’re a self-trained academic. I spent a lot of time over the last five years becoming familiar with them and trying to put them into practice in my own software to ensure quality, because I have increasingly realized how important that is for doing good computational science.
If you find yourself copying and pasting code over and over again you are doing something wrong.
What was the most important thing you learned about coding in science?
The most valuable lesson was and remains: “Don’t repeat yourself!”
I come back to it all the time. When I was an academic, I taught graduate students programming, and my thing was to tell them that if they found themselves copying and pasting code, they were doing something wrong!
If you’re copying and pasting code that means that you should write a function to do that correctly so that you only have to do it once. It makes it easier to maintain the code because there’s only one place where that operation ever takes place.
In the olden days, I copied code from one program to another, not realizing that I should write a library if I used that operation over and over again. That is a lesson I still come back to: when I catch myself copying and pasting something, I say to myself, “wait a minute, that’s a sign that I’m not doing it right”.
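As a small sketch of the “don’t repeat yourself” rule in practice (the operation and names here are invented for illustration, not taken from Eric’s code): the same computation, copied and pasted for every data series, collapses into one function that lives in a single place.

```python
# Before: the same de-meaning step copied and pasted for every trace.
#   mean_a = sum(trace_a) / len(trace_a)
#   norm_a = [x - mean_a for x in trace_a]
#   mean_b = sum(trace_b) / len(trace_b)
#   norm_b = [x - mean_b for x in trace_b]

# After: one function holds the operation, so a fix only happens once.
def demean(trace):
    """Remove the mean from a sequence of samples."""
    mean = sum(trace) / len(trace)
    return [x - mean for x in trace]

norm_a = demean([1.0, 2.0, 3.0])   # [-1.0, 0.0, 1.0]
norm_b = demean([10.0, 20.0])      # [-5.0, 5.0]
```

If the de-meaning step ever needs to change, there is now exactly one place to change it.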
Writing a function or library helps to organize your thoughts in a very structured way. I also think that seeing other people’s software and seeing how they’ve done these things well and learning from examples is a very effective way of improving your coding skills.
This is a lesson that more professional developers should take to heart. It is actually a flaw in your program if you have something that looks very similar to something else.
It took me the longest time to really internalize this and it is still a thing I am thinking about when doing my job.
You get better at things by deliberately practising them.
Most of our readers have not had formal training in software development. They are mostly self-taught, like you. In your experience as a professor who has supervised many students, what are the best ways for students and early-career scientists in seismology to learn a new programming language?
I think a lot of the Software Carpentry lessons are publicly available, and they teach short courses which I think are excellent and very high quality. They emphasize important lessons that scientists need to learn and understand. I would certainly highly recommend those.
One thing that I probably didn’t appreciate was the value of code review. You get better at things by deliberately practising them. If you have an expert coach [a more experienced programmer] you can work with, they can give you explicit feedback that can improve your expertise. A lot of universities have at least a few people like that. If you know someone that can give you helpful feedback on the style of your code, you should take this chance.
When I taught a programming class, I was very explicit that style is actually important in my course. A lot of people when they are learning to program say they don’t care about style as long as it works.
To improve my programming, I focused a lot on style, making my code more readable and organized, re-writing things several times if needed. In other words, when someone else looks at your software, will they be able to understand your work?
At the ATI we try to have every piece of code reviewed by somebody, and it’s hard at first if you are not used to this level of scrutiny! Showing what you’ve done to other people can be hard the first few times, and for that reason, people don’t publish their code as much as they should. The fear that someone is going to point out that you did something wrong is on most occasions unfounded, because people are much more likely to be very supportive and helpful!
Open-source communities tend to have a code of conduct saying: we’re here to improve things, we’re here to help people, to be respectful, to value people’s contributions, and we want them to feel welcome!
At the same time, open-source communities want to ensure high-quality software. Hence, if you don’t have someone who can review your code, then contributing to open-source projects, where there are lots of volunteers who have a lot of experience and are very good at what they do, is an excellent way to get feedback.
You can also look at how projects that you admire organize things. There are a lot of great examples of open-source projects with well-structured communities, and there are always things for new contributors to take part in. I do think that contributing to open-source software is a great way to become more experienced as a programmer, get some really helpful feedback, and understand how to do the job better.
…you don’t really know what some of the problems are going to be until you actually try something!
How do you approach a new project with your team?
We use GitHub a lot to collaborate. It’s nice and can be a good way to keep records of everything. When we have meetings, we put the minutes up on the wiki; we can use “issues” and “boards” to arrange tasks and give collaborators some clarity as to who is doing what. Especially now, working remotely, it gives us a shared understanding of what we’re doing.
In a new project, we often start with a “phase zero”. It is the simplest implementation you can possibly come up with, and the goal is to make mistakes and learn from them. Oftentimes in professional software development, the client gives the specs and then the engineers work out what the whole system should look like. But research isn’t really like that: you don’t really know what some of the problems are going to be until you actually try something!
In the end, we try to learn something from phase zero and then we throw it all away. Then we actually try to do it right.
Version control has bailed me out so many times in my life I can’t remember the time before I used version control!
We hear the term “best practices” in software development. What are those best practices?
Don’t repeat yourself. Write modular code. Think about tests for your code, and make it easy to run these tests – ideas like Continuous Integration, where the tests are run automatically every time you change the code, are great. Document understandably and clearly! Use version control and take advantage of peer review of code.
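As a sketch of what Continuous Integration can look like in practice (the file path and job names here are illustrative assumptions, not part of the interview), a minimal GitHub Actions workflow that runs a Python test suite on every push might be:

```yaml
# .github/workflows/tests.yml -- run the test suite on every push and pull request
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pytest
      - run: pytest
```

With a file like this in the repository, the tests run automatically on every change, so a broken commit is flagged immediately.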
For testing, the most important thing is to structure your code in a way that makes it testable. Some people are very strong advocates of test-driven development, where you write automated tests before you write the code itself. If your code is written to be testable, it’s easier to debug because it’s easier to test it automatically.
To modularise your code, you separate certain tasks into different functions and classes.
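A minimal sketch of what “testable” can mean in practice (the function and its purpose are invented for illustration): a small, pure function is easy to check automatically, and the test itself is just another piece of code that a runner such as pytest can execute on every change.

```python
def sample_rate(n_samples, duration):
    """Samples per second for a record of n_samples spanning duration seconds."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return n_samples / duration

# The test exercises the function and states the expected answer explicitly.
def test_sample_rate():
    assert sample_rate(1000, 10.0) == 100.0

test_sample_rate()
```

Because `sample_rate` takes plain inputs and returns a plain output, with no hidden state, it can be tested in isolation from the rest of the program.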
Documentation is an interesting one since there are a lot of different types of documentation. People often think of documentation as API [application programming interface] documentation that describes all functions, classes, etc., which I think is very important for libraries. But for a lot of scientific software that’s less important than things like “what is this software useful for?” Here, a tutorial example showing how you use the software and illustrating typical use is hugely beneficial.
Version control has bailed me out so many times in my life I can’t remember the time before I used version control. When I was a PhD student, I did not use it, and that is a problem because I can’t easily reproduce any of my PhD work anymore. I can go back and look at the whole Fortran code that I wrote (I think I have a copy of it somewhere), but I don’t have access to the original computer system that I used anymore. I don’t have the same compilers and libraries, so there are all kinds of little hidden things, and even though I have the code, I have lost the context of how I used it because I didn’t write it all down formally.
There is a much wider sphere of things that are important when you’re doing science. One thing that I’ve come to appreciate over the years is that there’s a lot of context missing from just having access to the code.
Finally, in my experience, writing software for other people is really a helpful thing to do. Also, having people use your code helps you find bugs a lot faster than you can find them yourself!
I think that’s everything that I evangelize about. In the end, with good practices, it’s about recognizing that in the long run, it is easier to do it than it is to not do it! I think conditioning yourself to use best practices enough so that you can see the times that it helps you recover from a mistake (and we all make mistakes!), can be very powerful.
When I was in Graduate School I could not even imagine that I would be able to type something on the internet and pull down the entire revision history of a software project into an isolated Linux environment where I can have everything pre-installed.
Why do you think reproducible research is important? How can we ensure reproducibility in seismology?
If you absolutely want everything to be reproducible, then you have to go through a lot of work to sort out and automate every step. It’s a lot of work, and if you are a busy PhD student trying to graduate, or a postdoc who is on the job market and worried about finding the next position… That’s time you could be spending on new research rather than setting up your current work to be reproducible.
It’s really hard to capture all these auxiliary pieces that make up the context of how a result was obtained. When we’re interviewing people for our team, we ask how they ensure reproducibility and a common answer we get is that sharing your code is all that one needs to do, but it’s not even close to the full story of what you need to do!
I like to think of there being three components to any analysis. There’s code that you can write and run. There is the data that you use. Finally, there is the context in which your computations were run.
First, the code you develop and run is the easiest thing to share. The days of authors making the codes only available upon request are starting to fade into the distance. Now it’s more standard to make your code public either as a stipulation for funding agencies or from the journal you publish in.
Then there’s the question of the data. Most journals will require a data availability statement. Seismology is probably better than a lot of other fields, with things like the IRIS data centre and other networks that are publicly available. The downside is that sometimes people’s data is so big that the only way to get the whole data set is to ship it on a hard drive, so that can be a barrier.
Finally, an oftentimes under-appreciated aspect of reproducible research is the context in which you ran your analysis. This could just be the operating system and the versions of the packages installed on your computer. If you’re doing a lot of simulations, it could be your HPC environment, which is usually more difficult for someone to reproduce: they don’t have access to the actual hardware, not to mention the specific configuration. Also, what if I don’t have £100,000 to spend to run simulations on the computer cluster you used in your paper? That’s not going to be easily reproducible.
There are ways to help ensure reproducibility, and there’s been a lot of progress: tools like Docker let you run code in an isolated container on your system. The Docker image contains the particular packages and libraries needed to execute a program. This is still not always enough to ensure exact reproducibility, but it is a really positive step.
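A minimal sketch of what such a container recipe can look like (the script name, package choices, and pinned versions here are illustrative assumptions, not from the interview):

```dockerfile
# Dockerfile: pin the OS image and library versions so others can rerun the analysis
FROM python:3.12-slim

# Pinning exact versions captures part of the computational context.
RUN pip install --no-cache-dir numpy==1.26.4

# Copy the analysis script into the image and make it the default command.
COPY analysis.py /app/analysis.py
WORKDIR /app
CMD ["python", "analysis.py"]
```

Anyone with Docker can then build and run the image, getting the same interpreter and library versions regardless of what is installed on their own machine.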
I think there are a lot of challenges in reproducible research. But compared to where reproducibility was 15 years ago, we have made a lot of progress. When I was in Graduate School I could not even imagine that I would be able to type something on the internet and pull down the entire revision history of a software project into an isolated Linux environment where I have everything I need pre-installed.
I think people should work hard, set standards for reproducibility, and actually value and understand that there is a lot to be gained from transparency. Git and Docker are a good step in the right direction, but there are still other things we can do to make papers and methods more accessible and understandable, so that not only can I reproduce your work easily, but I can also change and modify it without too much effort, building on top of what you have done well.
We need a culture where feedback helps improve things.
In one of our previous blog post called “Git or Perish”, we talked about sharing codes with the community on platforms like GitHub. A lot of our readers agreed on that; however, it seems there is little incentive to do this in practice. What do you think the reasons are and how can we overcome those barriers?
At the ATI we work with academics all the time, and at first we were not very pushy about whether our code repositories would be private or public. There is a bias toward keeping code private until papers are published, because of the worry about being scooped.
My feeling is that this is the wrong approach because oftentimes if I make something public someone will use it and then find mistakes and bugs and actually make it better. My stance now is that any repository should be public unless I have a reason not to make it public, rather than it is private until we have reason to make it public (i.e. opt-out, not opt-in).
I think that saying all of my repositories are going to be public forces me to think more carefully about what I’m doing. If people all over the world are going to see this, I had better make it good. It is similar to presenting a paper to a bunch of friends versus presenting work at a conference: the stakes are different when there is a different level of scrutiny. Saying to myself that all of my code is going to be public from here on out actually helps me become a better software engineer.
We need a culture where feedback helps improve things. Our team holds drop-in sessions for students and when they come to us we don’t criticize them but rather help them improve their codes.
I think right now software development doesn’t get the same sort of recognition as peer-reviewed publications.
How can we teach better coding/software practice? It seems we are still lightyears away from implementing coding practice in schools and undergraduate programmes. Coding is still mostly self-taught, and the incentive to become better is quite low because of the “publish or perish” attitude in science. Software development is “nice” but does not lead to a high h-factor.
I think that anytime you try to measure something, that will automatically create incentives to game that metric.
A classic example is a city trying to reduce air pollution. A new rule is established: you can only drive your car on Monday, Wednesday, and Friday if you have an odd-numbered license plate, and on the other days only even-numbered plates can drive. That seems like a reasonable incentive. But people would buy a second car with the opposite type of license plate number, and since it is a second car it is usually purchased at the lowest price, which usually means its emissions standards are below those of their main car… [see also Cobra effect]
I think right now software development doesn’t get the same sort of recognition as peer-reviewed publications. If we want citation metrics for software that are equivalent to that, we have to be careful about how we design them.
You should not measure productivity by lines of code, which we know from extensive experience is a horrible metric, because people will just publish bloated code.
If you’re judged on the number of libraries, people will just split one library into two. If we want a proxy measure like the h-index to capture the contributions of software engineers, we need a system that captures the equivalent of citations, but I am not sure what that should be.
I think there are things like the Journal of Open Source Software. JOSS papers are nice because the model is more dynamic than a static journal article: it can evolve with the contributors rather than just being a definitive list of authors. For instance, suppose I join a project a week after they submit their paper and then spend the next three years making tons of contributions to the project; everyone just cites the paper, and I don’t get any credit for that. A contributor model is a better way to measure what people produced and how they contributed to a project.
I found it useful to try to do fewer things better.
What is the most important tip you would like to give to the programming community?
I found it useful to try to do fewer things better. It is hard, but I have found it tends to be rewarding! When I shifted my focus to getting better at specific topics, it made a bigger difference in my career. It has also affected my overall happiness: how I do research, what I do daily, and what I enjoy doing. And it has increased the impact I have had on the scientific community in general, knowing that more people are using and building on what I have done.