Data Sharing Q&A
Learn about data sharing from our expert editors
Your questions answered on all aspects of data sharing and open research
We held a Q&A on Twitter with a panel of expert editors to answer your questions about data sharing and open research. Read the full interview to learn more about data sharing and how it can benefit your research.
- Professor Amar Abderrahmani (Lille University, France), Editor-in-Chief of All Life
- Dr Kevin Tyler (University of East Anglia, UK), Editor-in-Chief of Virulence
- Dr May Yuan (University of Texas at Dallas, USA), Editor-in-Chief of International Journal of Geographical Information Science (IJGIS)
- Dr Urska Demsar (University of St Andrews, UK), Associate Editor of IJGIS
- James Barker, Publishing Executive at F1000Research
See the original interview from 30th March 2021 on Twitter.
Q: It's time to talk data sharing! Joining us are a panel of expert Editors; would you like to introduce yourselves?
Amar: I am Amar Abderrahmani, Professor at Lille University and the Editor-in-Chief of All Life, a broad-scope multidisciplinary open access journal in the life sciences. All Life was born in 2020 to meet the needs of authors and society. Subject-led sections are linked to the United Nations’ Sustainable Development Goals, providing a platform for reaching policymakers and professionals tackling today’s problems.
Kevin: I’m Kevin Tyler, the Editor-in-Chief at Virulence, an open access elite science journal focused on microbial pathogenesis and pathogenicity. I am based at the Norwich Medical School, which is part of the University of East Anglia and the Norwich Research Park in the UK.
May: Hello, I’m May Yuan from Univ of Texas at Dallas. I am the chief editor of IJGIS, which publishes research on GIScience applications in natural resources and social systems and developments in computer science and cartography. Happy to discuss sharing data and codes with IJGIS publications.
Urska: Hi all, Dr Urska Demsar here as Associate Editor of IJGIS. Today I will be tweeting here about Open Data, along with our EIC Prof May Yuan.
James: Hi, it’s James Barker here, Publishing Executive at F1000Research. F1000Research is an Open Research platform with reproducibility and transparency built into our rapid publication model.
Q: Welcome everyone, and thanks for joining us! Researchers often have many questions on data sharing, so let’s start with the basics: what is research data?
James: Research Data describes all the info & materials associated, collected or produced as part of a research project. This can be raw underlying data (e.g. data tables) and any additional, extended data (e.g. surveys & questionnaires).
Kevin: Research Data is a catchall term that describes any information which is used in answering research questions. In general, research papers report the answers to research questions, and the evidence they provide is the research data which has been analysed and interpreted by the authors. In the case of Virulence this data often includes large experimental datasets from platform technologies such as genomics, transcriptomics, proteomics and metabolomics, but may also rely on films, photographs, assay results and clinical datasets, which may themselves be subject to restrictions on their usage. We pride ourselves on being a multidisciplinary journal and so flexibility in terms of the research data we can accept is important.
Amar: Research data is any information that has been collected, observed, generated, or created to contribute to the original research findings. Research data can confirm or invalidate a scientific hypothesis.
May: Research data in the context of publications are data and codes sufficient to reproduce findings reported in the paper and to serve as the basis for replication consideration.
Urska: In my opinion, research data are any data that are generated in research projects. This can be primary data, but also derivations of primary data (model results) or code, which is particularly important for GIScience research which develops new methods.
James: I completely agree that code and Research Software count as Research Data - it's so important that these are shared openly!
Urska: As a GIScientist, a lot of my responses today will actually be about Research Software and Open Code, as this is an important part of research in my discipline.
Q: How important is data sharing and open research in your community? Have you seen attitudes changing in recent years?
Kevin: A cornerstone of research is reproducibility, and this necessitates transparency and accessibility when it comes to research data. There is and has been a consensus within life-science publication that all the data required to validate a claim and reproduce reported findings should be provided by authors in order to secure publication, and editors’ policing of this policy has been key to ensuring research datasets have been made freely available by authors to their communities.
Often, however, the data presented has already been compiled by the authors to facilitate interpretation by the readers. What has changed in recent years is the expectation that not just compiled data, but actually complete raw datasets should be provided, either directly associated and co-published with manuscripts or made freely and publicly available through public databases in advance of publication.
Amar: All of us know that data sharing and open research science lead to reliable, reproducible and impactful science, and trustable scientists who will have scientific reputations with greater citation and stronger collaboration.
James: We have always advocated for Open Research & Open Data. If the original datasets aren't available for review, readers have to assume that data collection & analysis are correct. Data sharing aids open peer review as these assumptions don’t have to be made! Open Data is now widely accepted in the Life Sciences and in many cases is also required by funders. We’re seeing data sharing spreading to other disciplines such as Social Sciences too.
May: IJGIS started data sharing requirements in August 2019 and now fully implements the policy. Some resistance arose at the beginning. Attitudes have changed in the last 12 months, and now all our manuscripts, except for review papers, share data and codes.
Urska: I work in movement analytics and develop methods for movement ecology (which studies animal movement) and human mobility. In ecology, open data are a well-established tradition and it is now a default to place your data in open repositories, e.g. Movebank or Motus. Ecology journals and other journals where ecologists publish (general ones such as PLoS One) require publication of data on submission & this is then published either in the journal or in the repository parts of the portals. Ecologists also develop their methods in Free and Open Source Software (mostly R) and publish them online.
On the human mobility side, data sharing is IME less developed, primarily because of problems with geoprivacy and the commercial nature of the data, which prohibits public publication (in contrast, ecology data are usually collected by academics themselves, who are the owners and can decide if they want to publish them openly or not).
This has changed somewhat during the COVID-19 pandemic, where companies which collect human mobility data (e.g. mobile phone providers, big IT companies, such as Facebook, Google, etc.) have offered them in various “Data for good” schemes (e.g. Google's COVID-19 Community Mobility Reports). These however don’t come in raw form, but are aggregated to larger spatial/temporal scales, so they are not quite the same type of open data as the ones that are collected by researchers (not companies).
With respect to open code, there is also less of that in human mobility, since researchers often come from disciplines like physics and computer science, which often publish their methods as pseudocode in the papers, but not as open code. Some areas of human mobility are however much better at opening code, for example transportation (follow @robinlovelace for open R tools for transportation) and GIScience itself (@underdarkgis does open movement analytics in Python).
Q: What is the data sharing policy for your journal? Why was this type of data policy adopted?
Kevin: Virulence remains committed to ensuring that authors provide all the research data, means and methods required to validate and reproduce their reported findings, and that these are available to reviewers and readers alike.
Amar: In All Life, data and materials supporting the results or analyses, respectful of ethics, privacy and security, are required to be freely available. We encourage submissions in recognised public repositories that issue datasets with or without DOIs. Authors may, however, choose to apply a licence that limits re-use. The data policy we adopted is clear, transparent and useful for the scientific community.
May: IJGIS requires authors to anonymize data and codes with instructions in a cloud-based repository during peer reviews. Upon provisional acceptance of a manuscript, authors de-anonymize the data and codes, obtain a DOI to make the data and codes citable, and make the data and codes freely accessible in a public repository. Every submission will go through checking on codes and data for research reproducibility and replicability.
James: On F1000Research *all* data associated with an article must be provided & made openly available via an online repository prior to publication. This includes all data underlying results & all supporting materials (e.g. questionnaires, code). This Policy helps to ensure research is fully reproducible, because open data allows readers & reviewers to verify & replicate results in conjunction with the reported methods.
Urska: IJGIS follows the FAIR principles of open science, that is, data and code need to be Findable, Accessible, Interoperable and Reusable. In practice this most often means that if a paper creates data/code, they need to be placed in a repository where they get a stable DOI and can be shared (e.g. Zenodo, Figshare). For the code, I find the best practice is to have an evolving version on GitHub, where the current version can be downloaded and used directly, and a timestamped version with a DOI on Zenodo (e.g. the version when you submitted the paper). The policy was adopted to follow the principles of open science and allow scientific reproducibility.
James: The FAIR Principles underpin our Open Data Policy too - we've actually got a great introductory guide for researchers who haven't heard of FAIR Data before.
Urska: That is a nice explanation for anyone who doesn't know what FAIR stands for!
Q: How much do you think authors understand your data sharing policy? What challenges are there around data sharing?
Kevin: Authors are members of the community and read each other’s research as well as communicating their own. In general, major funders are often ahead of publishing houses in thinking about how the research data that they have commissioned on behalf of their stakeholders should be communicated in the most fair and open way. The biggest funders of academic science actually tend to stipulate open data as a precondition of funding, and so authors, who tend to receive funds only when they are able to demonstrate this awareness, are also highly aware themselves. Moreover, because science policy is largely driven from within the community, it is often our authors who themselves drive funder open data policy.
James: Data Sharing is a requirement at F1000Research so in general our Data Sharing policy is well understood! Researchers in fields we have been publishing in for a while (bioinformatics & genetics) generally have a strong understanding of FAIR Data and understanding of Open Data is growing in fields like physical & Social Science. The good news is we work with submitting authors across all fields to help them ensure their data meets FAIR Data standards during the pre-publication process.
Amar: The number of submissions to All Life is tremendously increasing, indicating that authors totally adhere to our data sharing policy. I think most reservations about sharing data are more related to either the non-approval of the author’s institution, or inertia with lack of understanding, rather than actual conflict with the idea of openness. So, our job is to continue our communication campaign about the opportunities to share the data.
May: We work with individual authors to get their data and codes ready for peer reviews and production. Data subject to privacy, confidentiality, or licensing issues cannot be shared publicly. In those cases, we consult the authors to create simulated data to demonstrate how their codes work. We inspect the codes and inform authors of any missing data or data processing issues that authors need to address before sending the manuscript to peer reviews.
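Editor's note: the idea of simulated stand-in data can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the procedure IJGIS uses: the column names, the values and the `simulate_table` helper are all hypothetical, and each numeric column is simply resampled from a normal distribution fitted to the confidential values. Columns are drawn independently, so correlations between them are not preserved.

```python
import random
import statistics


def simulate_column(values, n, rng):
    """Draw n synthetic values from a normal distribution fitted to `values`."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    return [rng.gauss(mean, stdev) for _ in range(n)]


def simulate_table(confidential, n, seed=0):
    """Build a synthetic table with the same column names as `confidential`.

    Hypothetical helper: every column is simulated independently, so the
    result only demonstrates that analysis code runs; it does not
    reproduce the original findings.
    """
    rng = random.Random(seed)  # fixed seed so reviewers get the same table
    return {name: simulate_column(col, n, rng) for name, col in confidential.items()}


# Made-up stand-in for a table that cannot be shared publicly.
confidential = {
    "trip_km": [2.1, 5.4, 3.3, 8.0, 4.7],
    "trip_min": [9.0, 21.5, 14.0, 30.2, 18.8],
}

synthetic = simulate_table(confidential, n=100, seed=42)
print(sorted(synthetic), len(synthetic["trip_km"]))
```

A table like this can sit in the public repository next to the code, with the Data availability statement explaining that the real data are restricted.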
Urska: Once papers come to me as the Associate Editor, I find that the authors have a good idea of what they are required to do. Personally, I have had the following queries from authors: 1) what if we use data that can’t be shared for whatever reason? No problem, T&F's open and fair data sharing policy allows for that. 2) What if our paper doesn’t generate any data/code and/or we use GIS software for analysis rather than write our own code? No problem, that is of course also still ok, just specify that in your Data availability statement.
James: We see similar queries from authors submitting to F1000Research! It always helps if authors just get in touch with our Editorial team so we can clear up any confusion around Open Data and our policies. Communication is key!
Q: What are the benefits of data sharing for researchers?
Urska: As I said, I work with movement ecologists where data sharing is the norm and I have found this extremely useful, both for my own research and for the research of my students. This allowed us to both identify new data science research problems and establish new interdisciplinary collaborations that can then solve problems in a better way than if the result was limited to only one discipline. I think this is a general phenomenon across GIScience, which often works with other disciplines.
Kevin: Science moves forward most successfully in a collegiate, collaborative manner. Most authors are keen to share and promote their datasets so most feedback relates to how we can best enable authors to share their data most effectively.
May: Research reproducibility and replicability are essential to scientific progress, going beyond individual case studies and allowing model transfer and comparison. It heightens authors’ vigilance on data processing, calculation, and workflows.
James: 1) Transparent, open data adds to the credibility of your research and contributes to public trust in science. 2) Choosing open data supports the wider research community and can even lead to new collaborations & opportunities. 3) Funders are also increasingly requiring that research data produced from their grants is made open, so practically it can help researchers meet their funder requirements.
Amar: As a researcher, sharing my data is proof of transparency and of reliable, reproducible research findings, honouring true science.
Q: Have you seen an impact of your data sharing policy on the peer review process? What feedback have you had from reviewers?
Kevin: Very little feedback from reviewers on this issue per se, but the reviewers and editors tend to be of one mind when it comes to policing of open data. Normally authors know what is expected and do it without being reminded. Sometimes they find correct provisioning a little tricky to accomplish and we try to facilitate this with them. Reviewers do sometimes struggle with the potentially large number of complex analyses supplied with first-pass analyses of some of the large datasets. These reviews are often large and complex tasks to do well, and we may as a result need to use additional reviewers or grant more time for reviewers in order to have such reviews undertaken well. There is an ongoing general debate on how we reward our reviewers for the work they do for us, which is particularly salient in such cases, where each review can take a really large amount of reviewer time but the most competent reviewers are in short supply.
May: Reviewing data and codes is optional for reviewers; many of our reviewers do look at the data and have raised questions about data manipulations, inappropriate use of functions, or missing normalization.
Urska: For us, finding reviewers is not just about being competent to check the data, but having the skills in the right software. E.g. you can't ask someone working w Python to check reproducibility of R code. The question is then how to not overload the same people all the time. Since I handle papers across a variety of topics and also invite specialised reviewers from other disciplines if necessary, I have found a range of opinions on this. Specifically, going back to movement analytics, if I get a reviewer from ecology, they will assume data are available and will comment if they are not. Alternatively, there is a large community in transportation who produces open software tools and they will ask for that and further try to run the code to test reproducibility (and advise what to do if it’s not). I recently accepted a paper that introduced a new R package for human mobility and I (and the reviewers) thought it was a really good initiative that IJGIS is now receiving and publishing papers like this, where the focus is on development of reproducible solutions and open tools for GIScience problems.
Amar: In All Life, the impact is translated in the growing number of manuscripts that are not rejected before the reviewing process. Instead, the number of manuscripts undergoing peer-review is increased, indicating the total adherence of the authors to our policy. We find that our sharing data policy makes the reviewer’s work easier and more appealing as the data behind the research are transparent.
Q: What are the benefits of a data sharing policy for the research community in general?
Kevin: Benefits of data sharing are increasingly recognized and there is substantial innovation towards sharing new types of datasets. Feedback is that authors expect publishing must keep up and find mechanisms to communicate new datasets in new formats.
Amar: There are several benefits: For the scientific community, it elevates the scientific rigor of discussions between researchers. Openness gathers more people from different disciplines, leading to interdisciplinarity and speeding up original scientific discovery. In Life Science, unfortunately, unreliable manuscripts are too numerous. Data sharing operates a change in the mind and practice of science, particularly in young scientists. As seniors, we have a duty to show exemplary science of excellence.
Urska: As I said, I find data and code sharing extremely useful as it allows stepping across disciplinary boundaries and in the case of GIScientists, bringing together us as spatial data specialists with researchers in other disciplines who collect large quantities of spatio-temporal data, but who may not have the data science expertise to analyse them. Joining the two groups together via open data and open tools/code, allows us to solve problems that would otherwise be too difficult to address.
James: Author David Mobley says it best: “When we make our work available in a truly reproducible manner, we find that other researchers build on, reuse, and extend the work in ways far beyond what we might have originally imagined. It increases the visibility of our work, and helps science progress better and faster. Everyone wins.” There you have it - with Open Data everyone wins! There is even evidence that linking Research Data to publications increases citations - a study in 2020 found articles with a data statement linking to a repository had up to 25% MORE citations than those without.
Q: What do you think the future holds for data sharing and open research in your field? How do you see attitudes evolving over the next few years?
Kevin: Well there is certainly no going back. All data should be shared once it is obtained; it’s not ethically acceptable to use public funds to obtain important data and then hoard it or sit on it for any length of time whilst selecting bits for publication. Publications in which extensive expert analysis is conducted and answers to important questions are obtained do take a considerable amount of time for authors to prepare and for the rigor of peer review to be applied, but the community wants the unprocessed data available at the soonest opportunity. So I think there will be increased facility to access datasets early, and preprint-style facilities will continue to facilitate that. Opportunities to access collaborative projects and project documents may increase in parallel, as there is wider access available and more flexibility for amendment of these preprints before they enter the scientific record. Some of the concerns about how to effectively review really extensive analyses of large datasets may begin to be approached through extension of the traditional anonymous peer review to incorporate elements of pre-submission and post-publication peer review, and I think this is really being well demonstrated in how some of the preprints associated with the COVID pandemic are being managed and refined.
James: We’re really excited for the future of Open Data – seeing more fields embracing data sharing practices is a really positive step for the whole research community. Hopefully we also start to see Research Software and code included within data policies as standard – something we’re already doing at F1000Research as we know how important code is to reproducibility.
May: Better cyberinfrastructure with easy tools to share data and codes and run data and codes for individual tables and figures inside publications. In addition, the cyberinfrastructure can facilitate queries of the shared data and codes. With AI capabilities, the cyberinfrastructure has the potential to help authors fix issues with data and codes, learn and structure knowledge networks of data and codes, and suggest synergistic opportunities to advance research.
Urska: I think it will become the norm across the board. As I said, there are some disciplines that are ahead, like ecology, and some that are not quite there yet, for whatever reason. For example, as I mentioned, geoprivacy is a big issue for human mobility data, because it is easy to find the place of home or work of an individual from GPS trajectories or from mobile phone data (just think how many phones commute daily between the places of your home and your work, it’s probably just your phone, so you could be easily identified). However, there is now on-going research on new approaches on how to anonymise human mobility data. We also discussed geoprivacy, open data and open code in our recent review of movement analytics, where we aim to bring together methods from animal movement and human mobility.
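Editor's note: the home/work point can be made concrete with a small sketch. It assumes, purely for illustration, that each person's trajectory has already been reduced to a (home cell, work cell) pair on some spatial grid; the user IDs, cell names and the `unique_commuters` helper are hypothetical. Anyone whose pair is shared by nobody else can be singled out even though no names are stored.

```python
from collections import Counter

# Hypothetical anonymised records: user ID -> (home cell, work cell).
commutes = {
    "u1": ("cell_A", "cell_X"),
    "u2": ("cell_A", "cell_X"),
    "u3": ("cell_B", "cell_X"),
    "u4": ("cell_C", "cell_Y"),
}


def unique_commuters(commutes):
    """Return the users whose home/work pair is shared by no one else."""
    pair_counts = Counter(commutes.values())
    return sorted(user for user, pair in commutes.items() if pair_counts[pair] == 1)


exposed = unique_commuters(commutes)
print(exposed)  # ['u3', 'u4']
```

Coarsening the grid so that more people share each pair is one simple way to shrink this list, at the cost of spatial detail.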
Amar: I am optimistic about the adherence to our data sharing policy in all disciplinary sections. Pragmatically, the idea of a data sharing policy is a young attitude, which still needs to mature in all scientific minds to be totally accepted in a few years.
Q: And finally, what’s your top tip for researchers who are new to the concept of data sharing?
Kevin: Some people see the acquisition of data as an end in itself and the associated publication as simply an advert for the new dataset. However, publication is actually driven by the questions that it answers – sometimes these answers are representative of the kind of questions that a powerful dataset, appropriately mined, will be able to answer – sometimes they are the main justification for acquiring the dataset but do not exhaust the possible questions the data can answer. Modern publishing needs to reconcile the competing demands for rapid communication of open data to a community hungry to use the information for their own purposes with the requirement to provide high quality communication of robust, reliable and validated resolutions of important questions in the scientific record. I am happy to say that for Virulence we are committed to Open Access, Open Data, and Open Research as our cornerstones for the foreseeable future.
Amar: Top tips for sharing reliable and strong data are 1) Curate data contributions. Provide enough data in the right and valuable formats. 2) Take care about the data collection and clarity. Provide the raw data, with the protocols and information of how and where the data was collected. 3) Be helped and advised. For points 1 and 2, you may need expertise in statistics and bioinformatics. 4) Also, check with the institution’s oversight or ethics committees regarding the research protocol and questions about integrity.
James: Sorry, I can't pick just one - so you're getting three top tips from me! 1) Don’t make it complicated! It’s actually pretty easy to work out what data you need to provide from your methods & results, and anything supporting these like code/surveys. Prepare the files and upload them to a repository, it’s that simple! 2) Be sure that you are providing the raw data. If you made any changes to the data, like taking an average or adjusting the contrast on an image, it’s not your raw data! 3) If you are unsure about F1000’s Open Data Policy, just email us with your draft submission and we can help you identify the data we require! You can visit F1000’s Open Data page for some handy resources.
May: Prepare sharing data and codes at the onset of any research project, so they will be ready when you are ready to submit the manuscript.
Urska: Just do it. Publish your own data & be open to new collaborations when others re-use it. Explore what data exist, which might give you new ideas and bring you together with others who are interested in the same things as you are.
Q: We've come to the end of today's chat. Thank you to our participants! How can researchers find out more?
Kevin: Thanks for hosting us for this important discussion. Lots of details about the journal on the website. The Editorial team at Virulence look forward to receiving and reviewing your exciting new and openly accessible data! Find out more at http://bit.ly/KVIRTalkDataSharing.
James: Thanks for having us, this has been fun! Head to our Open Data resource hub for more information on all aspects of Data Sharing, including how to prepare your data for submission to F1000Research.
May: Thank you. To learn more about the International Journal of Geographical Information Science visit bit.ly/IJGISTalkDataSharing.
Urska: You can find us on our website or on Twitter (@ijgis), where we tweet recently published papers. We also have a brand new IJGIS webinar series where we invite authors of popular papers to present their work. Webinars are recorded and available on our YouTube channel. Thanks for having both me and our EIC as guests today, research data are an important topic and I am glad that we could contribute to this discussion.