Collaborative platforms for streamlining workflows in Open Science

From Species-ID
Revision as of 02:32, 9 May 2011 by Daniel Mietchen (Talk | contribs) (Licensing)


About

This page hosts the draft of a contribution to OKCon 2011. Its content is derived from a contribution proposal (CC-BY-SA by Konrad U. Förstner, Gregor Hagedorn, Claudia Koltzenburg, M Fabiana Kubke, Daniel Mietchen) to the New Zealand eResearch Symposium 2011.

Submission Text

Title

Collaborative platforms for streamlining scientific workflows

Alternative with stronger emphasis on openness: Collaborative platforms for streamlining workflows in open science

Authors

(Please add your name, affiliation) Authors listed in alphabetical order.

  • Konrad U. Förstner, Institute for Molecular Infection Biology, University of Würzburg, D-97080 Würzburg, Germany and Research Centre for Infectious Diseases, University of Würzburg, D-97080 Würzburg, Germany
  • Gregor Hagedorn, Julius Kühn-Institute, Federal Research Center for Cultivated Plants, Berlin, Germany
  • Claudia Koltzenburg, Managing editor of Cellular Therapy and Transplantation (CTT)
  • M Fabiana Kubke, Department of Anatomy with Radiology, University of Auckland
  • Daniel Mietchen, Science 3.0

Abstract

Despite the internet's dynamic and collaborative nature, scientists continue to produce grant proposals, lab notebooks, data files, conclusions etc. that stay in static formats and/or are not published online and are therefore not accessible to the interested public. Because of limited adoption of tools that seamlessly integrate all aspects of a research project (conception, data generation, data evaluation, peer-reviewing and publishing of conclusions), much effort is later spent on reproducing or reformatting individual entities before they can be repurposed as parts of articles or independently.

We propose that workflows - both individually and collaboratively performed - could be made more efficient if all steps within the research cycle were coherently represented online and the underlying data were formatted, annotated and licensed for reuse. Such a system would accelerate the process of taking projects from the conception phase to the publication stage and allow for continuous updating of the data sets and their interpretation as well as their integration into other independent projects.

Another advantage of such workflows is that the process can be made transparent, both with respect to the scientific process and to the contribution of each participant. The latter point is important from a perspective of motivation, as it enables the allocation of reputation which creates incentives for scientists to contribute to projects. Such workflow platforms offering possibilities to fine-tune the accessibility of their content could gradually pave the path from the current static mode of research presentation into a more coherent practice of open science.

Introduction

Like most areas of today's life, science has changed dramatically since the advent of the internet. However, the transformation that has taken place until now is just the tip of the iceberg. In the following, we discuss the largely underutilized potential of representing all aspects of science in collaboratively used online workflow platforms. As such platforms could help to realize Open Science, the practice of granting transparency and access to all data in the research process, we will shed light on this particular aspect and make recommendations regarding implementations.

While there are numerous projects developing and applying so-called Virtual Research Environments (VRE) - also known as Collaboratories - which cover selected stages of the scientific process, a platform covering every phase is missing so far [1]. Technically overcoming such gaps and creating a seamless transition from bench to publication could speed up research and with it the generation and reuse of knowledge.

The scientific workflow

Conception and Project Planning

Independent of the nature of a research endeavor - hypothesis-driven or data-driven, performed by a single person or with many parties involved - a solid conception phase is the crucial basis for each project. In contrast to today's common practice, in which this is done by a small group of people, harvesting collective intelligence could help to avoid redundant research and to improve the design of the study. As the complexity and scope of scientific projects are increasing, the application of project management tools can be useful for managing the processes and parties involved.

Experiments and data generation

Today, data generation in academic research relies comparatively strongly on manual labor. While this is mostly due to the low price of labor resulting from the academic system and the limited interdisciplinary education in science and engineering, the potential of automation is extremely high but mostly neglected. Not only could the efficiency of invested labor be improved, but reproducibility could also be significantly increased. To make this affordable for the broader research community, a shift from siloed proprietary devices to well-documented pieces of standardized, open-source hardware developed by the scientific community itself is needed and could take place in cooperation with potential vendors. Open hardware platforms like Arduino could be starting points for such a development. The devices could and should enrich the primary data with further metadata, convert them into semantified formats and directly upload the output into online repositories.
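
The metadata enrichment described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the field names, the device and operator identifiers and the license default are hypothetical, and plain JSON stands in for a semantically enriched format.

```python
import json
from datetime import datetime, timezone


def enrich_reading(value, unit, quantity, device_id, operator, timestamp=None):
    """Wrap a raw sensor value in a self-describing record."""
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)
    return {
        "quantity": quantity,                  # what was measured
        "value": value,
        "unit": unit,
        "measured_at": timestamp.isoformat(),  # ISO 8601 timestamp with UTC offset
        "device": device_id,                   # hypothetical instrument identifier
        "operator": operator,                  # hypothetical contributor identifier
    }


def to_repository_payload(records, license_id="CC0-1.0"):
    """Serialize records plus a license statement as JSON, ready for upload."""
    return json.dumps({"license": license_id, "records": records},
                      sort_keys=True, indent=2)


reading = enrich_reading(
    21.5, "degC", "temperature", "incubator-01", "example-operator",
    timestamp=datetime(2011, 5, 9, 2, 32, tzinfo=timezone.utc),
)
payload = to_repository_payload([reading])
print(payload)
```

The upload step itself is left out on purpose, since it depends on the repository chosen.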

One promising example that illustrates the potential of automating usually quite labor-intensive research is the robot scientist ADAM [2]. The streamlining of mechanical steps and the evaluation of results would benefit from formal languages that describe the necessary procedures and make the design and exchange of experimental setups easy [3]. As a long-term goal, scientists would mostly engage in programming experiments and engineering the system to automate those steps that have been performed manually so far. The motto "work on the system, not in the system" should guide this development.

Data release

The online release of experimentally generated data should happen shortly after their generation, potentially in real time. Downstream analysis within the research project as well as reuse by other parties should be kept in mind when selecting data formats. These should, as far as possible, be non-proprietary, machine-readable (semantically enriched) and common in the respective domain of research. If no format fulfills all these requirements, conversion into alternative formats should be permitted. Access to the data could take place via a web interface or domain-specific clients. Especially for large or highly accessed data sets, distribution via peer-to-peer networks is recommended.

Data analysis

Since every step in the data analysis should be transparent and easily reproducible, it should preferably take place in the proposed platform, too. Systems like the analysis workflow tool Taverna [4] could be used for such processing. Already today, many research institutions offer grid computing infrastructure for such purposes. Analyses using external tools, especially GUI tools that do not offer any possibility to log the performed actions, should be avoided if possible, as otherwise documentation has to be created manually. A potential side effect of running computationally intensive analyses on shared systems is a more economical usage of the needed infrastructure.

As with the raw experimental data, the results of the data processing should be documented and stored in repositories so that they are accessible.
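
As a rough illustration of analysis steps that document themselves, the following Python sketch records each processing step together with its parameters and fingerprints of its input and output. The decorator, the log structure and the two example steps are hypothetical and are not taken from Taverna or any other existing tool.

```python
import functools
import hashlib
import json

provenance_log = []  # a real platform would persist this alongside the data


def fingerprint(obj):
    """Return a short, stable hash of a JSON-serializable object."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def logged_step(func):
    """Record name, parameters and input/output fingerprints of each call."""
    @functools.wraps(func)
    def wrapper(data, **params):
        result = func(data, **params)
        provenance_log.append({
            "step": func.__name__,
            "params": params,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        return result
    return wrapper


@logged_step
def remove_outliers(values, limit):
    # Keep only values whose magnitude does not exceed the limit.
    return [v for v in values if abs(v) <= limit]


@logged_step
def mean(values):
    return sum(values) / len(values)


raw = [1.0, 2.0, 3.0, 250.0]
result = mean(remove_outliers(raw, limit=100.0))
print(result)
print(json.dumps(provenance_log, indent=2))
```

Because the output fingerprint of one step equals the input fingerprint of the next, the chain from raw data to result can be reconstructed from the log alone.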

Knowledge generation

The results of the processing as well as the raw data can be used by scientists - or robots [5] - to draw conclusions and to generate knowledge from the available information in a well-documented way. The platform should help make this happen collaboratively by offering commenting and rating of statements. Discussions - text-, audio- and/or video-based - should be recorded so that the path to each finding can be reconstructed.

Final Publication

As documentation of every step is an inherent feature of the workflow, the final publications resulting from a study can be short reports linking to the major outcomes and putting them into the scientific context. The platform should offer functionalities to perform open peer-review of this final report.

Implementation

Technology

As shown above, the many building blocks of a complete scientific workflow already exist and only need to be connected seamlessly. The development of open standards defining the required interfaces of these parts could enable different parties to assemble the pieces into a consistent workflow and to add further needed parts. This would offer the possibility to implement a platform either as one monolithic application or as separate interacting and exchangeable units.

Funding

Of similar importance to the technical realization is the adaptation of scientific culture and funding policies. While research funders like the National Institutes of Health (US) or the Wellcome Trust (UK) already require open access for final peer-reviewed manuscripts that result from research they funded [6] [7], the regulations are much weaker for the underlying data, and almost nonexistent for proper annotation.

Licensing

As the default copyright restrictions in most jurisdictions hamper the reuse of data, it is highly desirable that, with very few exceptions, each entity generated in the research process is explicitly published under a less restrictive license, e.g. one of those offered by Creative Commons [8], or is released into the public domain (as the public domain is not a concept in every country, CC0 [9] could be applied equivalently).

Reputation

The gain of reputation is the most important incentive for scientists and is currently mostly expressed through publication in scientific journals. As every contribution to a research project can be attributed to a distinct person and could be rated by others, the allocation of reputation is an inherent element of the proposed platform. The connection to researcher identifiers like ORCID [10] and the analysis of such microcontributions could assemble a precise image of a scientist's skills and achievements.
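
How rated microcontributions might be condensed into a per-person profile is sketched below in Python. The contribution kinds, the 1-5 rating scale and the contributor labels are assumptions made for illustration; a real platform would key the records on a persistent identifier such as an ORCID iD.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    contributor: str  # placeholder label; in practice a persistent identifier
    kind: str         # hypothetical categories: "data", "analysis", "review"
    rating: int       # hypothetical peer rating on a 1-5 scale


def summarize(contributions):
    """Return, per contributor, the count and mean rating for each kind."""
    grouped = defaultdict(lambda: defaultdict(list))
    for c in contributions:
        grouped[c.contributor][c.kind].append(c.rating)
    return {
        person: {
            kind: {"count": len(ratings), "mean_rating": sum(ratings) / len(ratings)}
            for kind, ratings in kinds.items()
        }
        for person, kinds in grouped.items()
    }


log = [
    Contribution("contributor-A", "data", 4),
    Contribution("contributor-A", "data", 5),
    Contribution("contributor-A", "review", 3),
    Contribution("contributor-B", "analysis", 5),
]
profile = summarize(log)
print(profile["contributor-A"])
```

Keeping the kinds separate, rather than collapsing everything into one score, preserves the picture of which skills a person actually contributed.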

Challenges

As stated above, considering the allocation of reputation and funding in science is crucial when redesigning scientific processes. To bridge the transitional phase until the suggested policy changes have taken place, fine-grained access control in the research workflow platform could allow scientists to adopt the technology despite concerns about losing reputation. With such control in place, the full process could be opened up after the final publication or at any other desired time.
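
A minimal Python sketch of such access control, assuming a simple model with owners, collaborators and an optional public release date, is given below. The class, its fields and the user names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Item:
    name: str
    owners: set
    collaborators: set = field(default_factory=set)
    public_from: Optional[date] = None  # None means no release is scheduled yet


def can_read(item, user, today):
    """Owners and collaborators can always read; others only after release."""
    if user in item.owners or user in item.collaborators:
        return True
    return item.public_from is not None and today >= item.public_from


notebook = Item("lab-notebook", owners={"alice"}, collaborators={"bob"},
                public_from=date(2012, 1, 1))

before = can_read(notebook, "carol", date(2011, 6, 1))  # outsider, before release
after = can_read(notebook, "carol", date(2012, 1, 1))   # outsider, on release day
print(before, after)
```

Opening the item up earlier would only require setting `public_from` to an earlier date, which matches the idea of switching from closed to open at any desired time.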

It is very unlikely that there will be one single platform that can fulfill the requirements of every scientific domain. Instead, each community might need its own platform, which could be derived from, and inherit features of, a more general solution.

References


Brainstorming

Introduction

  • Seamless transition from bench to publication
  • "The unspoken rule is that at least 50% of the studies published even in top tier academic journals - Science, Nature, Cell, PNAS, etc… - can’t be repeated with the same conclusions by an industrial lab." (source: Derek Lowe quoting Bruce Booth). Derek Lowe points to slightly different types of bias on the academic and the pharma side and argues that both have to be taken into account. For an argument that large multi-center clinical trials are too expensive to reproduce and hence are "one-shot science" that, by its irreproducibility, does not conform to the principles of science, see James Robert Brown. One-Shot Science. In: The Commodification of Academic Research. Science and the Modern University. Edited by Hans Radder. Pittsburgh: University of Pittsburgh Press, 2010, 90-109. Publisher's page with free TOC and OA Chapter 1

The scientific workflow

Conception

Experiments and data generation

  • maybe not possible for all subfields but too often neglected
  • increases/decreases
    • Reproducibility
    • Efficiency
    • Documentation (Metadata to be generated automatically; e.g., temperature, pressure)
    • Semantification
  • needed:
    • formal language to describe experiments
    • open standards / interface definitions for software and hardware (e.g., Arduino-based approaches)


Data release

Raw data are made accessible in online repositories directly after creation in machine-readable formats (could be real-time if produced in automated fashion).

Upload and access:

  • web interface
  • domain-specific clients
  • distribution via P2P networks - especially for large datasets

Data analysis / Knowledge generation

  • Collaborative analysis and conclusions
  • myexperiment.org, Taverna [1], Galaxy
  • Processing of data, e.g., combination with other data or visualization, should be included in the workflow. E.g., R or Python scripts could be uploaded to the platform and executed. Output created in GUI-based tools that do not record the processing steps should be avoided but could also be uploaded if properly documented.
  • Possibility to comment and rate raw data and conclusions, and other comments.

=> Full transparency about the path from the start to the conclusions

Final Publication

  • Less elaborate than done in the whole process - just summarizing wrap-ups with links to selected conclusions
  • Peer-review is an integral part of open collaboration, and, to boot, open peer review.

Implementation

  • full and feature-rich WYSIWYG editing

Technology

Funding

  • Open standards / APIs needed
  • Common FLOSS CMSs like Plone and Drupal could be used as basis for such platforms

Reputation

Licensing

Possible hurdles

  • sensitive medical data
  • Diversity of research fields and their specific requirements
    • fragmentation should be avoided
    • top down approach - ontology / inheritance
    • collaboration habits need to be retuned to fit multicultural setting
  • Acceptance
    • radically open might repel
    • closed and open in parallel possibly good enticement
    • easy switch from closed to open essential
    • fine-grained governance models help authors adapt to open collaboration
  • Platforms need granularity in licensing and visibility to be flexible during different phases and also to be able to adapt to institutional requirements in different jurisdictions (e.g. NZGOAL does not have a CC0 option) and researchers' personal choices (easier to get collaborators to agree to a CC-BY than a CC0)
  • D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M. Pocock, P. Li, and T. Oinn, "Taverna: a tool for building and running workflows of services," Nucleic Acids Research, vol. 34, iss. Web Server issue, pp. 729-732, 2006. http://www.ncbi.nlm.nih.gov/pubmed/16845108