At the EGU ESSI division meeting, Christoph Stasch was elected as the new representative of the early career scientists in the ESSI division, succeeding Jennifer Roelens. Christoph works as a research associate and consultant at 52°North, a non-profit research organisation in the field of applied geoinformatics. His focus is on simplifying the integration of sensors and processing modules (e.g. environmental simulation models) into spatial information infrastructures and GIS applications. As ESSI ECS representative, Christoph hopes to strengthen the network of ECS in the ESSI division. Interested in participating? Then get in touch with the network or with Christoph directly.
As many people within the ESSI division have at least once used GIS software, we would like to wish you a happy GIS day!
Every day, millions of decisions are powered by Geographic Information Systems (GIS) in education, government, non-profit organizations and businesses. ESSI deals with community-driven and multidisciplinary challenges, and GIS plays an important role in developing data-driven solutions that help many organizations visualize, analyze, interpret and present data.
The amount of digital data per person has been growing geometrically since 2009. According to the latest report of Oyster IMS, the digital universe will grow by a factor of 300 between 2005 and 2020: from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every person in 2020). The Earth sciences are one of the domains where huge volumes of data are collected.
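The figures quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the exabyte values come from the report cited above, while the 2020 world population of roughly 7.6 billion is our own assumption for the per-person estimate.

```python
# Rough check of the Oyster IMS figures quoted above. The exabyte values
# are from the report; the 2020 world population is an assumed figure.
eb_2005 = 130          # digital universe in 2005, in exabytes
eb_2020 = 40_000       # projected digital universe in 2020, in exabytes
population_2020 = 7.6e9  # assumed world population in 2020

growth_factor = eb_2020 / eb_2005          # ~308, i.e. "a factor of 300"
gb_total = eb_2020 * 1e9                   # 1 EB = 10^9 GB -> 40 trillion GB
gb_per_person = gb_total / population_2020 # ~5,263 GB per person

print(f"Growth 2005-2020: ~{growth_factor:.0f}x")
print(f"Per person in 2020: ~{gb_per_person:.0f} GB")
```

The result is consistent with the "more than 5,200 gigabytes for every person" quoted in the report.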
Let’s take the database of geological samples of the Institute of Geology, Taras Shevchenko National University as an example of Big Data (BD). This database consists of tables containing geochemical, petrological, mineralogical and petrophysical information on 11,800 samples of granitoids from the Ukrainian Shield (crystalline massif). The tables are stored in MS Access and fall into two groups: general tables (sample number, geographical coordinates of the sampling location, mineralogical and chemical composition, photos of thin sections of the sample, petrophysical information) and additional tables (characteristics of the region and geological structures from which the sample was taken). Each data entry is assigned a unique identifier that links the tables to one another. This structure allows users to request information about a sample by entering its unique identification number, the type of rock or characteristics of its composition, which saves time and effort.
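The linked-table design described above can be sketched with a small relational example. This is only an illustration of the principle, a unique sample identifier joining a general and an additional table; the table and column names here are hypothetical and are not those of the actual MS Access database.

```python
import sqlite3

# Minimal sketch of the linked-table design described above; table and
# column names are illustrative, not those of the real database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE general (
    sample_id INTEGER PRIMARY KEY,  -- unique identifier linking the tables
    lat REAL, lon REAL,             -- coordinates of the sampling location
    rock_type TEXT
);
CREATE TABLE additional (
    sample_id INTEGER REFERENCES general(sample_id),
    region TEXT,
    geological_structure TEXT
);
""")
con.execute("INSERT INTO general VALUES (1, 49.4, 32.1, 'granite')")
con.execute("INSERT INTO additional VALUES (1, 'Ukrainian Shield', 'massif')")

# Request information about a sample by its unique identification number,
# joining the general and additional tables on the shared identifier.
row = con.execute("""
    SELECT g.rock_type, a.region
    FROM general g JOIN additional a USING (sample_id)
    WHERE g.sample_id = ?
""", (1,)).fetchone()
print(row)  # ('granite', 'Ukrainian Shield')
```

The unique identifier plays the same role here as in the database described above: one lookup key retrieves related records from every table.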
The biggest volumes of information in geophysics are primarily represented by 3D seismic data. Such huge amounts of data are stored because of the large survey areas and the high density and high resolution of the acquired information. For example, “an area of 200 km² of 3D off-shore seismic acquisition yielded 30-40 GB of data in 1999-2001; in 2004, with 968 channels per block, 100-130 GB of data was acquired, and 220-250 GB was obtained with 1,280 channels in 2010,” says seismic processor and interpreter P. Kuzmenko. “But nowadays, to investigate the structure of a reservoir precisely, 3-component wide-azimuth and full-azimuth seismic acquisition is applied with 7,600-51,200 channels, and as a result the amount of digital data rises to 1.5-10 TB. On land, this can take up a volume of up to 100 TB. Modern acquired data is stored compactly in electronic databases as digital data. In addition, there is a great amount of legacy geophysical information on paper (maps, well logs, reports), which should also be stored in electronic databases to prevent its loss. This means that the amount of data for interpretation will grow geometrically as paper materials are converted into digital form.”
You might ask: “Why do geoscientists need such amounts of raw data? Can’t they analyze it and then delete it?” However, things are not that simple. Data from previous investigations may be useful for later stages of oil-field exploitation, for scientific research and for comparison with nearby territories.
To sum up, great resources are needed to store and analyze huge amounts of data, but thanks to BD storage and analysis techniques, important decisions can be taken, and as a result technologies have developed very rapidly, especially in the past 10-20 years. Zettabytes of scientific data contain important information, which can help us develop sustainable lifestyles and predict, and sometimes even prevent, dangerous events.
The team encourages you to send us your own thoughts about Big Data or other ESSI-related topics! We invite students, scientists, professionals and anyone else interested in the geosciences to answer several questions:
- What is the boon and what is the bane of your research with Big Earth Science Data?
- What challenges do you face in your daily grind of data processing?
- What challenges of Big Earth Science data do you address with your research / current work?
We thank P. Kuzmenko (Ph.D. in geophysics) and O. Shabatura (Ph.D. in geophysics) for providing information for this article.
We are in the era of Big Data. Big Data is a ‘hot’ topic, a popular term often associated with an increase in the volume, variety and velocity of data. The Copernicus programme, for example, the European Union’s flagship programme for monitoring the Earth’s environment using satellite and in-situ observations, anticipates a massive increase in satellite data volume. It is estimated that the Sentinel missions alone, Copernicus’ space component, will produce 4 TB of processed data each day (FDC 2016).
The European Centre for Medium-Range Weather Forecasts (ECMWF) hosts the Meteorological Archival and Retrieval System (MARS), the world’s largest archive of meteorological data. The archive currently holds more than 90 PB of data and continues to grow by an additional 3 PB every month.
Big Data and an increase in data volume come along with an increase in computing and processing power. Or is it the other way around? When Gordon Moore, co-founder of Intel, introduced his observation in 1965 that the number of transistors on an integrated circuit doubles every two years, he did not have an associated exponential growth of data in mind. Moore’s law has since been adjusted to a doubling every 18 months, and it is equally applicable to data growth.
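To get a feel for what an 18-month doubling period means, a quick calculation helps. A quantity that doubles every 1.5 years grows by a factor of 2^(t / 1.5) over t years; this small sketch simply evaluates that formula.

```python
# A quantity doubling every 18 months grows by 2**(t / 1.5) over t years;
# a small illustration of the exponential pace referred to above.
def doublings(years, period_years=1.5):
    """Growth factor after `years` with one doubling per `period_years`."""
    return 2 ** (years / period_years)

# After one period the quantity has exactly doubled.
print(f"Growth after 1.5 years: {doublings(1.5):.0f}x")
# Over a decade, the 18-month doubling compounds to roughly 100x.
print(f"Growth over 10 years: ~{doublings(10):.0f}x")
```

At that pace, a decade turns terabytes into petabytes, which is exactly the jump the archives mentioned above are experiencing.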
The increase in data volume is partly due to new sensor technologies and new kinds of data. The variety and type of data have never been more diverse. Sensors and satellites continuously collect data and monitor the state of the Earth. The Internet of Things brings a constant flow of unstructured data. The speed at which data is generated and moved around has increased tremendously. In 2013, IBM reported that 90% of all the world’s data had been generated in the preceding two years, and this share has most likely grown since.
To recapitulate: at first sight, more data, new data sources and a constant data flow sound like a true boon for every data scientist. However, when talking about Big Data, it is vital to differentiate between raw, unstructured data and value-added information. Information is extracted from raw data, and information and insight are what is really needed. The challenge is to turn petabytes of raw, unstructured data into kilo- or megabytes of refined information (Rowley 2007). Decision-makers need refined and actionable information on which to base decisions, policies and recommended actions. The question is whether we can expect information to increase at the same speed as Big Data is generated, and this is where the bane of Big Data comes into play.
The mere presence of Big Data is not enough. Turning Big Data into information brings new challenges along the entire data value chain. We face challenges in data generation, where we have new data sources and types from social media, citizen-empowered science, crowdsourcing and unmanned aerial vehicles. We face challenges in data storage and management, where questions related to high-performance computing architectures, interoperability of data management systems and cloud computing have to be addressed. Data governance, data licensing and metadata are further essential areas that have to be dealt with.
We also face challenges in data analysis, where data mining and machine learning are ‘hot’ and popular topics. And we face challenges in data insights, especially related to data communication and visualization. The best research and findings are valueless if not communicated properly.
These challenges along the entire data value chain are well reflected in the four official subprogrammes of the Earth and Space Science Informatics (ESSI) division of the European Geosciences Union, which are: (i) Community-driven challenges and solutions dealing with Informatics, (ii) Infrastructures across the Earth and Space Sciences, (iii) Open Science 2.0 Informatics for Earth and Space Sciences and (iv) Visualization for scientific discovery and communication.
ESSI is a highly interdisciplinary field and, compared to other geoscientific disciplines, a rather young but important one. The keen interest of the geoscientific community in ESSI was evident at this year’s EGU, where the rooms allocated to ESSI sessions were often too small for the number of interested attendees.
Coming back to the question in the headline: “Big Earth Data – boon or bane?” Perhaps we can find a compromise. What about considering Big Earth Science Data a boon, as a start? The amount of freely available, high-resolution data products and current processing capabilities give us opportunities that have never existed before, and we gain hidden insights into the state of our Earth. Key to the great potential Big Data holds is open data access. On the importance of open data policies, I recommend Barbara Ryan’s TEDx talk on unleashing the power of Earth observations (TED 2014). Barbara Ryan is the Secretariat Director of the Group on Earth Observations (GEO) and vividly explains the positive outcomes of opening up the entire Landsat archive in 2008.
The era of Big Data forces us to rethink and disrupt our common data-processing approach. Currently, a data scientist spends 80% of their time managing and pre-processing data and only 20% on the actual data evaluation. Every stakeholder along the data value chain, from data generators through data providers to data users, has to work on innovative approaches to tackle these concurrent challenges and to leverage the full potential of Big Earth Science Data. The bane comes into play if we continue generating and storing massive amounts of data and fail to turn them into value-added content.
What is the boon and what is the bane of your research with Big Earth Science Data? What challenges do you face in your daily grind of data processing? What challenges of Big Earth Science data do you address with your research / current work?
We would like to hear about it. This blog post is the first of what we hope will become monthly contributions from the ESSI division, and we welcome contributions from anyone in the ESSI community.
Related to this blog post, the movie Big Earth Data is highly recommended.
FDC (2016): Data Volume | Copernicus. – http://newsletter.copernicus.eu/article/data-volume (last access: 2016-06-29)
Rowley, J (2007): The wisdom hierarchy: representations of the DIKW hierarchy. Journal of Information Science 33/2: 163-180.