Stories of data sharing
Ryan A. Hill
Publishing scientific data is increasingly the standard as journals, universities, and governments seek to adhere to the principles of open and repeatable science. This critical development is accelerating a new form of science, often called eScience, in which researchers leverage or combine previously published digital data to gain new scientific insights. While publishing data can be challenging, the benefits to science, and to society more generally, from the reuse and novel combination of existing data are substantial.
The challenges to publishing data depend on data size and characteristics. If data come from a single study, there are several online publishing platforms, such as datadryad.org or hydroshare.org, that may be appropriate. Very large datasets, or datasets that will be regularly updated, may require custom solutions, adding significant difficulty. Factors such as storage location, size, cost, update frequency, database type, and the use of application programming interfaces (APIs) for public access must all be considered. It is important to address even small details, such as variable naming conventions, especially for temporal variables that will be updated regularly. Failure to manage such details may require major revamping of a dataset after publication, thereby disrupting users who have built downstream applications on those data.
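To make the naming point concrete, the minimal sketch below (in Python, with hypothetical column names rather than any real StreamCat fields) contrasts a table that embeds the update period in its variable names with one that records it in a separate field, so that regular updates append new rows rather than rename columns.

```python
import pandas as pd

# Hypothetical watershed metrics. Encoding the data vintage in the column
# name ("pct_urban_2019") forces every downstream user to rewrite queries
# after each refresh.
brittle = pd.DataFrame({
    "watershed_id": [101, 102],
    "pct_urban_2019": [12.4, 3.1],
})

# Keeping the variable name stable and recording the vintage in its own
# column lets new years be appended without breaking existing code.
durable = pd.DataFrame({
    "watershed_id": [101, 101, 102, 102],
    "year": [2019, 2021, 2019, 2021],
    "pct_urban": [12.4, 13.0, 3.1, 3.3],
})

# Downstream applications simply select the vintage they need.
latest = durable[durable["year"] == durable["year"].max()]
print(latest)
```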
Researchers rarely have formal training in data management and dissemination techniques, such as database or API development. Early consultation with a database professional or developer can be critical and reduce future pitfalls. Further, numerous resources exist that can help researchers make better decisions with their data and ensure they comply with the FAIR1 principles of data publishing: that data are findable, accessible, interoperable, and reusable. Even researchers who publish smaller datasets from single studies will benefit from familiarizing themselves with, and implementing, the FAIR guidelines. Indeed, many journals have developed data publishing guidelines based on these principles. Taking the time and care to publish your data in line with these principles ensures that future researchers can make use of the hard-earned understanding of water resources we are developing today.
Despite the challenges, it is worth learning and adhering to good data management principles and publishing your data. In 2023, my collaborators and I published a large study characterizing the connectivity of millions of wetlands across the conterminous US (CONUS)2. This work built directly on the challenges and lessons we encountered when publishing two large datasets of geospatial watershed information, StreamCat3 and LakeCat4, which characterize hundreds of natural and anthropogenic features for millions of stream and lake watersheds across CONUS. Since publication, StreamCat and LakeCat data have been used in numerous unanticipated ways by researchers and practitioners across academia, government, and non-profits to understand landscape drivers of the physical, chemical, and biological conditions and quality of inland waters. As a researcher, it is deeply satisfying to see your data find wider utility in the water research and management communities. Ensuring that data are developed, shared, and curated in a way that follows best practices will facilitate this use and benefit future research and society.
Shihong Lin
When we published our paper in Nature Water5, we chose to share our source data and code in public repositories to simplify access for readers interested in reproducing our figures for any purpose6. Data-sharing platforms like Figshare, which integrate seamlessly with the Nature Portfolio submission system, make the process convenient and efficient. In our case, the shared data primarily comprise tables of values used to generate plots in the main text and supplementary information. While sharing such source data may not be as crucial as sharing complex, multi-dimensional datasets (for example, genomics data, molecular dynamics simulations, or sensor data), we still believe in making all types of data accessible for readers interested in further analysis or in reproducing figures for review articles.
Although the importance of sharing source data may vary depending on the dataset type, sharing source code is always essential when simulations play a key role in the research. Today, source code is mostly shared via public repositories like GitHub. Access to code is critical for enabling readers to reproduce results or extend models to explore new dimensions of research. In our recent Analysis paper published in Nature Water5, we introduced a framework for evaluating the performance of nanofiltration for selective ion separation. We shared the source code on GitHub (https://github.com/ruoyuwang16/NATWATER-22-0394-Data-and-Codes) and the datasets on Figshare6, with the hope that future researchers could readily use this package to assess the performance of nanofiltration-based selective solute separation.
Three months after publishing this paper, we were contacted by Professor Davies from the University of Birmingham, who had encountered difficulties in reproducing our results. Through several email exchanges, we resolved some points of confusion but also discovered critical errors in our code and the paper, such as missing files and inconsistencies between figures and source data. We addressed these issues by updating our GitHub repository and submitting an Author Correction7 to provide the necessary clarifications. Had we not made our source code so accessible, only a few readers would have attempted to use our proposed framework, and we might never have identified these errors.
Erik Svensson Grape
As essentially all research data in the physical sciences are acquired, stored, and analysed in an electronic format, the idea of making those data available, not only to reviewers but to a larger community, seems a reasonable ambition. Doing so would permit further analysis, for example by experts in a subfield, while also promoting openness and transparency in research.
Making data available is actively encouraged by most scientific publishers, yet this commonly amounts to no more than a statement along the lines of “data is available from the corresponding author upon request”, and is rarely enforced. How available such data actually are likely varies, and access may be further complicated as affiliations and email addresses change over time. From a general perspective, other questions also come to mind: how may those data be processed, and must all data be made available?
Coming from a background in inorganic chemistry, where most measurement techniques generate output in a spreadsheet-like format, I find that little effort is required to curate data and make them available. Still, doing so remains rare even in my specific field of research. Overall, I believe that all sciences stand to benefit from greater openness, and that this inevitably starts with a shift in attitude and expectations within the communities themselves — data availability should be the norm.
In my experience, the use of online repositories greatly lowers the barrier to making data available, which is why my co-authors and I chose to deposit the raw data of our recent study, as published in Nature Water8, in a general repository9. We did so wanting to showcase the techniques used and to be fully transparent with our results. Considering the ease of this process, I will continue to do so at every opportunity, and greatly appreciate seeing others do the same.
Outside of general repositories, the curation of databases for specific types of data can be incredibly valuable. Depositing data in a standardized format allows for systematic validation and efficient cataloguing of what may otherwise be an incomprehensibly vast body of results. Databases such as the Protein Data Bank for macromolecules and crystallographic databases for small molecules and materials have had an incredible impact in structural chemistry and biology, allowing for large-scale assessments and easy validation of results. As an example, the 2024 Nobel Prize in Chemistry, awarded for computational protein design and protein structure prediction, was greatly facilitated by the Protein Data Bank, which is notably open access, that is, free for anyone to use. Again, this ties back to the concerted efforts of a scientific community, as the establishment of standardized formats and new norms of data availability is no small feat. It shows that curating data and making them publicly available can lead to highly impactful scientific innovation. It also raises questions regarding what such data may be used for, including further analysis by other researchers or as training sets for machine learning. Challenges aside, I believe that repositories will play a critical role in the move towards open and transparent science.
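As a minimal sketch of what this standardization enables in practice, the Python snippet below fetches an arbitrary Protein Data Bank entry (1CRN, crambin, chosen purely as an example) from the archive's public download URL and counts its atom records; because every deposited entry uses the same record layout, the same few lines work for any structure in the archive.

```python
import urllib.request

# Any PDB entry can be retrieved programmatically from the open archive.
pdb_id = "1CRN"  # crambin, used here only as an example entry
url = f"https://files.rcsb.org/download/{pdb_id}.pdb"

with urllib.request.urlopen(url) as response:
    pdb_text = response.read().decode("utf-8")

# Count atom records -- a trivial "large-scale assessment" made possible by
# the standardized record format shared by every deposited structure.
n_atoms = sum(1 for line in pdb_text.splitlines() if line.startswith("ATOM"))
print(f"{pdb_id}: {n_atoms} atom records")
```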
Yongqiang Zhang
In the rapidly evolving field of hydrology, sharing data through public repositories has become essential for fostering transparency, collaboration, and innovation. My experience with the dataset associated with my and my co-authors’ paper published in Nature Water in 202310 underscores the significance of making research data accessible to the broader community.
The decision to share the dataset was driven by several key considerations. First, journal requirements for transparency and reproducibility are increasingly recognized as standard practice in high-impact publications. Additionally, by sharing data, we aimed to maximize the impact of our research, enabling other researchers to build upon our findings and extend the knowledge base in the field. This openness enhances collaboration and fosters a culture of shared learning.
While sharing the streamflow elasticity dataset derived from the original daily flow and forcing data was relatively straightforward, we encountered challenges with the original daily flow datasets, which were compiled from various state water agencies acting as custodians. We also noticed that in developing countries, obtaining streamflow data — especially for transboundary rivers — can be significantly more difficult than in developed countries such as the USA, Australia, and those in Europe. The lack of available streamflow data in these regions introduces uncertainty when extrapolating modelling results, highlighting the need for smoother processes for accessing and sharing governmental datasets.
Since uploading the data to Zenodo11, we have observed significant engagement, with 831 downloads to date. This response indicates a growing interest in the research and emphasizes the value of data sharing. We have also received inquiries from students and colleagues seeking clarification on the dataset and its applications, reinforcing the importance of providing detailed documentation for users.
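For readers who want to track this kind of engagement themselves, Zenodo exposes each record's metadata and usage statistics through its public REST API. The sketch below uses a placeholder record ID, and the exact response fields are an assumption based on the publicly documented API.

```python
import json
import urllib.request

RECORD_ID = "1234567"  # placeholder -- substitute the deposit's numeric ID
url = f"https://zenodo.org/api/records/{RECORD_ID}"

with urllib.request.urlopen(url) as response:
    record = json.load(response)

# The record's "stats" block reports views and downloads, which is how
# engagement figures such as download counts can be followed over time.
title = record.get("metadata", {}).get("title", "unknown")
downloads = record.get("stats", {}).get("downloads", "n/a")
print(f"{title}: {downloads} downloads")
```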
In addition to this specific dataset, we are sharing a 500-m resolution global PML-V2 evapotranspiration dataset covering the period from 2000 to 2023 through Google Earth Engine. This platform has attracted even wider usage, further demonstrating the demand for accessible, high-quality datasets in hydrological research. Similarly, we are also sharing our gap-filled GRACE satellite data.
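As a hedged sketch of how such a catalogued dataset is typically used, the snippet below loads the PML_V2 collection through the Earth Engine Python client and reduces one band to an annual mean; the asset ID and band name are taken from the public Earth Engine Data Catalog as I understand it and may differ between dataset versions.

```python
import ee

ee.Initialize()  # requires an authenticated Earth Engine account

# PML_V2 coupled evapotranspiration/GPP collection; check the Earth Engine
# Data Catalog for the current asset ID and version.
pml = ee.ImageCollection("CAS/IGSNRR/PML/V2_v017")

# Mean vegetation transpiration (assumed band "Ec") for 2020 -- a typical
# first step before regional aggregation or export.
mean_ec_2020 = (
    pml.filterDate("2020-01-01", "2021-01-01")
       .select("Ec")
       .mean()
)
print(mean_ec_2020.bandNames().getInfo())
```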
In conclusion, my experience with data sharing has been overwhelmingly positive. Not only did it enhance the visibility and impact of my and my co-authors’ research, it also showed me that data sharing contributes to a collaborative environment where researchers can innovate and tackle pressing water-related challenges. As we advance our understanding of hydrological processes, embracing streamflow data sharing as a fundamental practice will be crucial in shaping a more sustainable future for water resources.
Oliver S. Schilling
Although water is featured explicitly in only one of the 17 United Nations Sustainable Development Goals (SDGs), it comes as no surprise that water is considered vital to reaching most, if not all, of them. Water is of such importance that studies analysing its role in relation to the SDGs have even described it as the ‘common currency’ linking all 17. Unfortunately, while good progress is being made on roughly a third to half of the SDGs, none of the sub-targets of the water SDG (SDG 6) are currently on track to be met by the target year. It is thus critical that we make every effort to contribute to improving the state of water on our planet and make fast progress towards the goals of SDG 6.
What could be the ‘common currency’ for developing more sustainable and resilient water management plans and guaranteeing access to safe drinking water? It is, of course, data on water. This is particularly true for groundwater, which is the largest accessible freshwater reservoir on Earth and at the same time a mostly invisible one. A clear assessment of the temporal and spatial distribution of groundwater quantity, quality, and renewal rates, and its interactions with surface water bodies and the cryosphere, is critical for designing resilient and sustainable water resource management plans12. This task can only succeed if we obtain, curate, analyse and distribute high-quality data on groundwater according to FAIR principles. This is the avenue that we’ve decided to take for our research, and particularly for our study on the groundwater origins in the geologically complex, volcanic watershed of iconic Mount Fuji in Japan13.
The Mount Fuji watershed can be considered the birthplace of Japanese hydrogeology, with many of the springs and groundwaters around Mt Fuji having now been studied for almost 100 years. To better understand the watershed, and to develop fundamental concepts for the management of volcanic island aquifer systems in the face of climate change, we made it our mission not only to obtain new data but also to gather, curate, analyse, and review all the existing hydrological tracer data produced over the last century that we could find. It was only in combination with this vast amount of existing data that our own data could be fully understood and interpreted, allowing new light to be shed on the origins of Mt Fuji’s groundwater. Publishing the entire curated hydrogeological database as a FAIR dataset, and thereby hopefully sparking more resilient and sustainable management of the water resources of volcanic island aquifer systems, was a goal from the very beginning. We chose to publish our database on the water-focused FAIR repository HydroShare14. Not only do we now see a renewed general interest in Mt Fuji’s hydrogeology, owing to the publication of our study and FAIR database, we have also been able to launch many new collaborations and projects — and make many new friends along the way.
Damien Voiry
The reclamation of wastewater demands innovative and scalable solutions to ensure clean water access for growing populations. Nanotechnologies offer new perspectives in the field of water purification via the development of novel catalysts and membranes that leverage the unique properties of nanostructured materials. Twenty years after the rise of graphene and other two-dimensional (2D) materials, 2D-based membranes have emerged as a mature field of research for applications ranging from energy generation to gas separation and water purification.
Nanolaminate membranes — created by vertically stacking exfoliated nanosheets — feature ultra-thin, layered architectures with precise control over molecular sieving properties. While much research has focused on reducing membrane thickness, engineering the stacking architecture via covalent functionalization presents a new frontier. This approach allows fine-tuning of surface chemistry and stacking order, resulting in exceptional separation performance, including high rejection rates for contaminants and improved permeability compared to conventional membranes15. Such capabilities make nanolaminate membranes highly attractive for applications such as desalination, industrial wastewater treatment, and the removal of emerging contaminants like pharmaceutical residues and microplastics.
The pressing challenge of clean water availability is unlikely to be solved by a single technology, material, or research group. Progress in this field requires a collaborative, multidisciplinary approach. To move forward, we believe it is crucial to make data both accessible and understandable, with public repositories playing a key role in fostering transparency and innovation. Our group has been committed to this effort since 2019, following the initiation of a European Commission-funded project. The emergence of high-throughput experimental strategies is further revolutionizing the field, generating vast datasets that make sharing even more essential. In the age of artificial intelligence (AI), these shared datasets are invaluable for training predictive models that can guide the design of future membranes16. This aligns closely with our research focus, as we are now leveraging machine learning to predict the most suitable functional groups for attachment to nanosheet surfaces. By optimizing interlayer distances, stacking order, and mechanical properties, these predictions aim to refine the design of nanolaminate membranes and accelerate their development.
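To illustrate the kind of model this involves (with entirely synthetic data and hypothetical descriptors, not our actual pipeline), a regression from functional-group descriptors to a target property such as interlayer distance might look like the following sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Entirely illustrative: hypothetical descriptors for candidate functional
# groups (for example, size, polarity, grafting density) and a synthetic
# target (interlayer distance). Real descriptors and labels would come from
# high-throughput experiments or simulations.
rng = np.random.default_rng(0)
X = rng.random((60, 3))                             # 60 candidates x 3 descriptors
y = 3.0 + 4.0 * X[:, 0] + rng.normal(0.0, 0.2, 60)  # synthetic target values

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.2f}")

# A model of this kind can rank untested functional groups before
# committing to synthesis.
```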
Data sharing, combined with advancements in AI and high-throughput methods, represents a transformative pathway towards the creation of scalable, high-performance membranes capable of addressing global water challenges. Bridging the gap from lab-scale breakthroughs to large-scale deployment will require concerted efforts in scaling fabrication processes and refining membrane structures. In this context, the community will find a solid foundation in data sharing to advance the field. Success in these areas could unlock the potential of nanolaminate membranes, positioning them as a viable technology for sustainable water purification.