EGU Blogs

Big Data

Diving too deep?

A new initiative has just been announced that could help to revolutionise palaeontology. PaleoDeepDive is essentially an automated version of the Paleobiology Database, which is an online, professionally crowd-sourced and curated database of fossil occurrences pulled from the literature.

They have also released a launch video on YouTube.

I have a few reservations about this. Firstly, how do they expect to mine data, at least legally, from articles that are mostly still locked behind paywalls?

I’m also a little concerned about the precision of their algorithms. Towards the end of the video, they mention that in a sample of 500 articles they get 15,000 species names, whereas the PaleobioDB only picks up 1,100. Well, in the latter, these names are occurrences: explicit records of fossils in time and space. What the 15,000 represent is not clear. Are they just names that are mentioned somewhere in the text, and therefore of little real use, or are palaeontologists really missing out on more than 90% of the data when extracting manually?
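For what it's worth, the arithmetic behind that "90%" can be checked in a couple of lines of Python. This is only a sketch using the two figures quoted in the video; the variable names are mine, and the comparison assumes that a species name and an occurrence are the same kind of thing, which is exactly what I doubt.

```python
# Figures quoted in the launch video for a sample of 500 articles.
machine_names = 15_000   # species names reported for PaleoDeepDive
manual_records = 1_100   # occurrences reported for the PaleobioDB

# Share of the machine-extracted names that the manual database holds,
# ASSUMING names and occurrences are directly comparable.
share_captured = manual_records / machine_names
share_missing = 1 - share_captured

print(f"captured manually: {share_captured:.1%}")
print(f"not captured:      {share_missing:.1%}")
```

If the two counts really were like for like, manual extraction would be capturing only around 7% of what the machine finds.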

Additionally, I am concerned about the linking of metadata, such as the location and age of fossils, as well as data about the geology, environment of deposition, taphonomy etc. When extraction is manual, all of this has to be sifted out from a host of other information in each article. I’m not sure a machine will be able to distinguish, for example, the age of a fossil from a related geological date that appears in the text but is not directly the age of that fossil.
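To make that worry concrete, here is a minimal sketch in Python of a deliberately naive approach: a regular expression that pulls geological ages out of a sentence. It is not how PaleoDeepDive works (I don't know what their method is), and the sentence, the period list and the pattern are all invented by me for illustration.

```python
import re

# Hypothetical sentence, made up for this example.
sentence = (
    "The specimen was collected from Late Jurassic mudstones that "
    "unconformably overlie Middle Triassic sandstones."
)

# Naive pattern: an optional Early/Middle/Late qualifier plus a period name.
# The period list is a tiny illustrative subset, not a real timescale.
PERIODS = "Triassic|Jurassic|Cretaceous"
pattern = re.compile(rf"\b(?:(?:Early|Middle|Late)\s+)?(?:{PERIODS})\b")

# Every age mentioned is returned, with nothing to say which one
# actually belongs to the fossil.
matches = pattern.findall(sentence)
print(matches)
```

Both "Late Jurassic" and "Middle Triassic" come back as matches, but only the first is the age of the specimen. A human reader resolves that from context without thinking; an extraction system has to be taught to.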

Anyway, these are just preliminary thoughts, and I am sure they have crossed the developers’ minds at some point. I look forward to seeing how this progresses, and how it undermines a lot of my work! 😉


Also, I’d love to hear any thoughts or comments you have about it!