Stories of data sharing

Ryan A. Hill

Publishing scientific data is increasingly the standard as journals, universities, and governments seek to adhere to the principles of open and repeatable science. This critical development is accelerating a new form of science, often called eScience, in which researchers leverage or combine previously published digital data to gain new scientific insights. While publishing data can be challenging, the broader benefits to science and society from the reuse and novel combination of existing data are substantial.

The challenges of publishing data depend on the data's size and characteristics. If data come from a single study, several online publishing platforms, such as datadryad.org or hydroshare.org, may be appropriate. Very large datasets, or datasets that will be updated regularly, may require custom solutions, adding significant difficulty. Factors such as storage location, size, cost, update frequency, database type, and the use of application programming interfaces (APIs) for public access must all be considered. It is important to address even small details, such as variable naming conventions, especially for temporal variables that will be updated regularly. Failure to manage such details may require major revamping of a dataset after publication, disrupting users who have built downstream applications on those data.
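
One way to guard against such revamping is to settle on and enforce a naming convention before publication. The sketch below is a minimal, hypothetical example in Python; the pattern and variable names are illustrative only, not the convention of any particular dataset.

```python
import re

# Hypothetical naming convention for regularly updated temporal variables:
# <metric>_<statistic>_<year>, for example "precip_mean_2024". Encoding the
# year as a suffix lets new releases append variables without renaming the
# ones downstream users already depend on.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*_(mean|min|max|sum)_\d{4}$")

def invalid_variable_names(names):
    """Return the names that violate the convention."""
    return [name for name in names if not NAME_PATTERN.match(name)]

print(invalid_variable_names(["precip_mean_2024", "Temp2024", "runoff_sum_2023"]))
# ['Temp2024'] -- caught before publication, not after users depend on it
```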

Researchers rarely have formal training in data management and dissemination techniques, such as database or API development. Early consultation with a database professional or developer can be critical and reduce future pitfalls. Further, numerous resources exist that can help researchers make better decisions with their data and ensure they comply with the FAIR1 principles of data publishing: that data are findable, accessible, interoperable, and reusable. Even researchers who publish smaller datasets from single studies will benefit from familiarizing themselves with, and implementing, the FAIR guidelines. Indeed, many journals have developed data publishing guidelines based on these principles. Taking the time and care to publish your data in line with these principles ensures that future researchers can make use of the hard-earned understanding of water resources we are developing today.
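
As a concrete illustration, a minimal machine-readable metadata record might map onto the four principles roughly as follows; the field names loosely follow DataCite-style metadata, and every value is a placeholder rather than a real deposit.

```python
# Illustrative only: placeholder values, loosely DataCite-style fields.
dataset_metadata = {
    "identifier": "10.5281/zenodo.0000000",      # findable: persistent DOI
    "title": "Example watershed characteristics dataset",
    "creators": ["Doe, Jane"],
    "keywords": ["hydrology", "watersheds"],      # findable: indexed search terms
    "access_url": "https://example.org/dataset",  # accessible: standard protocol
    "format": "text/csv",                         # interoperable: open format
    "license": "CC-BY-4.0",                       # reusable: explicit terms
}
```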

Despite the challenges, it is worth learning and adhering to good data management principles and publishing your data. In 2023, my collaborators and I published a large study characterizing the connectivity of millions of wetlands across the conterminous US (CONUS)2. This work built directly on the challenges and lessons we learned while publishing two large datasets of geospatial watershed information, StreamCat3 and LakeCat4, which characterize hundreds of natural and anthropogenic features for millions of stream and lake watersheds across CONUS. Since publication, StreamCat and LakeCat data have been used in numerous unanticipated ways by researchers and practitioners across academia, government, and non-profits to understand landscape drivers of the physical, chemical, and biological conditions and quality of inland waters. As a researcher, it is deeply satisfying to see your data find wider utility in the water research and management communities. Ensuring that data are developed, shared, and curated in ways that follow best practices will facilitate this use and benefit future research and society.

Shihong Lin

When we published our paper in Nature Water5, we chose to share our source data and code in public repositories to simplify access for readers interested in reproducing our figures for any purpose6. Data-sharing platforms like Figshare, which integrate seamlessly with the Nature Portfolio submission system, make the process convenient and efficient. In our case, the shared data primarily includes tables of values used to generate plots in the main text and supplementary information. While sharing such source data may not be as crucial as sharing complex, multi-dimensional datasets (for example, genomics data, molecular dynamics simulations, or sensor data), we still believe in making all types of data accessible for readers interested in further analysis or reproducing figures for review articles.

Although the importance of sharing source data may vary depending on the dataset type, sharing source code is always essential when simulations play a key role in the research. Today, source code is mostly shared via public repositories like GitHub. Access to code is critical for enabling readers to reproduce results or extend models to explore new dimensions of research. In our recent Analysis paper published in Nature Water5, we introduced a framework for evaluating the performance of nanofiltration for selective ion separation. We shared the source code on GitHub (https://github.com/ruoyuwang16/NATWATER-22-0394-Data-and-Codes) and the datasets on Figshare6, with the hope that future researchers could readily use this package to assess the performance of nanofiltration-based selective solute separation.
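
For readers who have not worked with such deposits, the general retrieval pattern is simple. The sketch below uses Figshare's public v2 API; the article ID and column names are hypothetical placeholders, not the identifiers of our actual deposit.

```python
import requests
import pandas as pd
import matplotlib.pyplot as plt

ARTICLE_ID = 12345678  # hypothetical Figshare article ID, not our deposit

# Figshare's public v2 API lists the files attached to an article
meta = requests.get(f"https://api.figshare.com/v2/articles/{ARTICLE_ID}").json()
csv_file = next(f for f in meta["files"] if f["name"].endswith(".csv"))

# Load the source data and redraw a plot (placeholder column names)
df = pd.read_csv(csv_file["download_url"])
df.plot(x="flux", y="selectivity")
plt.savefig("reproduced_figure.png")
```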

Three months after publishing this paper, we were contacted by Professor Davies from the University of Birmingham, who had encountered difficulties in reproducing our results. Through several email exchanges, we clarified several points of confusion but also discovered critical errors in our code and the paper, such as missing files and inconsistencies between figures and source data. We addressed these issues by updating our GitHub repository and submitting an Author Correction7 to provide the necessary clarifications. Had we not made our source code so accessible, only a few readers would have attempted to use our proposed framework, and we might never have identified these errors.

Erik Svensson Grape

As essentially all research data in the physical sciences are acquired, stored, and analysed in an electronic format, the idea of making those data available, not only to reviewers but to a larger community, appears a reasonable ambition. Doing so would permit further analysis, for example by experts in a subfield, while also promoting openness and transparency in research.

Making data available is actively encouraged by most scientific publishers, yet in practice this is commonly limited to a statement along the lines of “data is available from the corresponding author upon request”, which is rarely enforced. The actual availability of that data likely varies, and access may be further complicated as affiliations and email addresses change over time. From a general perspective, other questions also come to mind: how may that data be processed, and must all data be made available?

Coming from a background in inorganic chemistry, where most measurement techniques generate output in a spreadsheet-like format, I find that little effort is required to curate and share data. Still, doing so remains rare even in my specific field of research. Overall, I believe that all sciences stand to benefit from greater openness, and that this inevitably starts with a shift in attitude and expectations within the communities themselves — data availability should be the norm.

In my experience, the use of online repositories greatly lowers the barrier to making data available, which is why my co-authors and I chose to deposit the raw data of our recent study, as published in Nature Water8, in a general repository9. We did so to showcase the techniques used and to be fully transparent about our results. Considering the ease of this process, I will continue to do so at every opportunity, and greatly appreciate seeing others do the same.

Outside of general repositories, the curation of databases for specific types of data can be incredibly valuable. Depositing data in a standardized format allows for systematic validation and efficient cataloguing of what may otherwise be an incomprehensibly vast body of results. Databases such as the Protein Data Bank for macromolecules and crystallographic databases for small molecules and materials have had an incredible impact on structural chemistry and biology, allowing for large-scale assessments and easy validation of results. As an example, the 2024 Nobel Prize in Chemistry, awarded for computational protein design and protein structure prediction, was greatly facilitated by the use of the Protein Data Bank, which is notably open access, that is, free for anyone to use. Again, this ties back to the concerted efforts of a scientific community, as the establishment of standardized formats and new norms of data availability is no small feat. This shows that curating data and making it publicly available can lead to highly impactful scientific innovation. It also raises questions about what such data may be used for, including further analysis by other researchers or as training sets for machine learning. Challenges aside, I believe that repositories will play a critical role in the move towards open and transparent science.
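
Because every entry follows a standardized format and is openly accessible, retrieving a deposited structure takes only a few lines. The sketch below fetches the classic haemoglobin entry 4HHB from the RCSB Protein Data Bank, chosen purely as a well-known example.

```python
import requests

# Any Protein Data Bank entry can be retrieved by its four-character ID;
# 4HHB (haemoglobin) is used here purely as a well-known example.
pdb_id = "4HHB"
response = requests.get(f"https://files.rcsb.org/download/{pdb_id}.pdb", timeout=30)
response.raise_for_status()

with open(f"{pdb_id}.pdb", "w") as handle:
    handle.write(response.text)

# The standardized format is what enables large-scale validation:
# every entry can be parsed with the same tooling.
```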

Yongqiang Zhang

In the rapidly evolving field of hydrology, sharing data through public repositories has become essential for fostering transparency, collaboration, and innovation. My experience with the dataset associated with my and my co-authors’ paper published in Nature Water in 202310 underscores the significance of making research data accessible to the broader community.

The decision to share the dataset was driven by several key considerations. First, transparency and reproducibility are increasingly recognized as standard practice in high-impact publications, and journals increasingly require them. Additionally, by sharing data we aimed to maximize the impact of our research, enabling other researchers to build upon our findings and extend the knowledge base in the field. This openness enhances collaboration and fosters a culture of shared learning.

While sharing the streamflow elasticity dataset derived from original daily flow and forcing data was relatively straightforward, we encountered challenges with the original daily flow datasets, which were compiled from various state water agencies acting as custodians. We also noticed that in developing countries, obtaining streamflow data — especially for transboundary rivers — can be significantly more difficult than in developed countries such as the USA, Australia, and those in Europe. The lack of available streamflow data in these regions introduces uncertainty when extrapolating modelling results, highlighting the need for smoother processes for accessing and sharing governmental datasets.

Since uploading the data to Zenodo11, we have observed significant engagement, with over 831 downloads to date. This response indicates a growing interest in our research and emphasizes the value of data sharing. We have also received inquiries from students and colleagues seeking clarification on the dataset and its applications, reinforcing the importance of providing detailed documentation for users.

In addition to this specific dataset, we are sharing a 500-m resolution global PML-V2 evapotranspiration dataset covering the period from 2000 to 2023 through Google Earth Engine. This platform has attracted even wider usage, further demonstrating the demand for accessible, high-quality datasets in hydrological research. Similarly, we are also sharing our gap-filled GRACE satellite data.
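
As a sketch of how such platform-hosted data can be pulled into an analysis, the snippet below queries a PML-V2 collection through the Earth Engine Python API. The asset ID and band names follow the public Earth Engine data catalogue at the time of writing, but version suffixes change between releases, so treat them as assumptions to verify.

```python
import ee

ee.Initialize()  # assumes prior ee.Authenticate()

# Asset ID per the public Earth Engine catalogue; the version suffix
# may differ for newer releases, so verify before use.
pml = ee.ImageCollection("CAS/IGSNRR/PML/V2_v017")

# Mean annual components for 2020: transpiration (Ec), soil evaporation
# (Es), and interception (Ei), following the dataset's band definitions.
annual = (
    pml.filterDate("2020-01-01", "2021-01-01")
       .select(["Ec", "Es", "Ei"])
       .mean()
)

# Total evapotranspiration as the sum of the three components
total_et = annual.expression("b('Ec') + b('Es') + b('Ei')")
```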

In conclusion, my experience with data sharing has been overwhelmingly positive. Not only did it enhance the visibility and impact of my and my co-authors’ research, but it also showed me that data sharing contributes to a collaborative environment where researchers can innovate and tackle pressing water-related challenges. As we advance our understanding of hydrological processes, embracing streamflow data sharing as a fundamental practice will be crucial in shaping a more sustainable future for water resources.

Oliver S. Schilling

Although water features explicitly in only 1 of the 17 United Nations Sustainable Development Goals (SDGs), it comes as no surprise that water is considered vital to reaching most, if not all, of them. Water is of such importance that studies analysing its role in relation to the SDGs have even described it as the ‘common currency’ linking all 17 goals. Unfortunately, while good progress is being made on roughly a third to half of the SDGs, none of the sub-targets of the water SDG (SDG 6) is currently on track to be met by the target year. It is thus critical that we make every effort to improve the state of water on our planet and make fast progress towards the goals of SDG 6.

What could be the ‘common currency’ for developing more sustainable and resilient water management plans and guaranteeing access to safe drinking water? It is, of course, data on water. This is particularly true for groundwater, which is the largest accessible freshwater reservoir on Earth and at the same time a mostly invisible one. A clear assessment of the temporal and spatial distribution of groundwater quantity, quality, and renewal rates, and its interactions with surface water bodies and the cryosphere, is critical for designing resilient and sustainable water resource management plans12. This task can only succeed if we obtain, curate, analyse, and distribute high-quality data on groundwater according to FAIR principles. This is the avenue we have decided to take for our research, and particularly for our study on the origins of groundwater in the geologically complex, volcanic watershed of iconic Mount Fuji in Japan13.

The Mount Fuji watershed can be considered the birthplace of Japanese hydrogeology, with many springs and groundwaters around Mt Fuji having now been studied for almost 100 years. To better understand the watershed, and to develop fundamental concepts for managing volcanic island aquifer systems in the face of climate change, we made it our mission not only to obtain new data but also to gather, curate, analyse, and review all the existing hydrological tracer data we could find, produced over the past century. Only in combination with this vast body of existing data could our own data be fully understood and interpreted, shedding new light on the origins of Mt Fuji’s groundwater. Publishing the entire curated hydrogeological database as a FAIR dataset, and thereby hopefully sparking more resilient and sustainable management of the water resources of volcanic island aquifer systems, was a goal from the very beginning. We chose to publish our database on the water-focused FAIR repository HydroShare14. Not only do we now see a renewed general interest in Mt Fuji’s hydrogeology, owing to the publication of our study and FAIR database, but we have also been able to launch many new collaborations and projects — and make many new friends along the way.
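
Programmatic access is part of what makes a HydroShare deposit FAIR in practice. The sketch below lists the files of a public resource through HydroShare's REST API; the resource ID is a placeholder rather than the identifier of our Mt Fuji database, and the endpoint path and response fields follow our reading of the API documentation, so treat them as assumptions.

```python
import requests

RESOURCE_ID = "0123456789abcdef0123456789abcdef"  # placeholder, not the real ID

# Endpoint path and response fields per HydroShare's REST API docs
# (an assumption to verify against the current documentation).
url = f"https://www.hydroshare.org/hsapi/resource/{RESOURCE_ID}/files/"
listing = requests.get(url, timeout=30).json()

for entry in listing.get("results", []):
    print(entry.get("file_name"), entry.get("url"))
```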

Damien Voiry

The reclamation of wastewater demands innovative and scalable solutions to ensure clean water access for growing populations. Nanotechnologies offer new perspectives in the field of water purification through the development of novel catalysts and membranes that leverage the unique properties of nanostructured materials. Twenty years after the rise of graphene and other two-dimensional (2D) materials, membranes based on 2D materials have matured into an active field of research, with applications ranging from energy generation to gas separation and water purification.

Nanolaminate membranes — created by vertically stacking exfoliated nanosheets — feature ultra-thin, layered architectures with precise control over molecular sieving properties. While much research has focused on reducing membrane thickness, engineering the stacking architecture via covalent functionalization presents a new frontier. This approach allows fine-tuning of surface chemistry and stacking order, resulting in exceptional separation performance, including high rejection rates for contaminants and improved permeability compared to conventional membranes15. Such capabilities make nanolaminate membranes highly attractive for applications such as desalination, industrial wastewater treatment, and the removal of emerging contaminants like pharmaceutical residues and microplastics.

The pressing challenge of clean water availability is unlikely to be solved by a single technology, material, or research group. Progress in this field requires a collaborative, multidisciplinary approach. To move forward, we believe it is crucial to make data both accessible and understandable, with public repositories playing a key role in fostering transparency and innovation. Our group has been committed to this effort since 2019, following the initiation of a European Commission-funded project. The emergence of high-throughput experimental strategies is further revolutionizing the field, generating vast datasets that make sharing even more essential. In the age of artificial intelligence (AI), these shared datasets are invaluable for training predictive models that can guide the design of future membranes16. This aligns closely with our research focus, as we are now leveraging machine learning to predict the most suitable functional groups for attachment to nanosheet surfaces. By optimizing interlayer distances, stacking order, and mechanical properties, these predictions aim to refine the design of nanolaminate membranes and accelerate their development.
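
To make the idea concrete, the sketch below trains a regressor on synthetic functional-group descriptors to predict a membrane property such as interlayer spacing. The descriptors, data, and target dependence are invented placeholders, not our actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_candidates = 200

# Hypothetical descriptors per functional group: size, polarity,
# grafting density (all synthetic placeholders)
X = rng.uniform(0.0, 1.0, size=(n_candidates, 3))

# Synthetic target: interlayer spacing in nm, with an invented dependence
y = 0.8 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + 0.05 * rng.standard_normal(n_candidates)

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.2f}")

# In practice a model like this would rank candidate functional groups
# by predicted spacing before committing to synthesis.
```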

Data sharing, combined with advancements in AI and high-throughput methods, represents a transformative pathway towards the creation of scalable, high-performance membranes capable of addressing global water challenges. Bridging the gap from lab-scale breakthroughs to large-scale deployment will require concerted efforts in scaling fabrication processes and refining membrane structures. In this context, the community will find a solid foundation in data sharing to advance the field. Success in these areas could unlock the potential of nanolaminate membranes, positioning them as a viable technology for sustainable water purification.
