Why Computer Scientists Should Adopt Open Notebook Science

I wrote this essay using MediaWiki (formerly at the address http://stargrads.net/wiki) as a proof-of-concept to see whether I could write an essay or a paper online. Since I’m removing MediaWiki from this web site, I have copied its contents below.


What is “Open Notebook Science”?

The term “open notebook science” was coined by Jean-Claude Bradley in a blog post[1] to clarify and distinguish between several related concepts in the open access movement in science. Bradley defines the term to mean that

there is a URL to a laboratory notebook… that is freely available and indexed on common search engines. It does not necessarily have to look like a paper notebook but it is essential that all of the information available to the researchers to make their conclusions is equally available to the rest of the world. Basically, no insider information.

Open notebook science is thus analogous to open source software, but applies to all types of data and not just source code. It is also distinct from but related to open access publication, which denotes the public availability of preprints of journal articles. Bradley coined the term in order to properly describe the philosophy behind the UsefulChem[2] project.

The principle behind open notebook science is that anyone can access the primary research record of a project, including preliminary results of experiments, and even raw data as it is gathered. By this definition, open source software projects in computer science, such as those found on SourceForge[3], are also open notebook science projects if the “notebooks” are taken to mean not just the source code but also the web pages, documentation, forums, and other materials associated with these projects.

For the purposes of open notebook science, a “notebook” is thus quite broadly defined: it might be anything from scribbles which appear to be random gibberish to outsiders to somewhat polished half-baked thoughts. However, it should be noted that traditional notebooks are probably similarly broad in their diversity.

For computer science theorists, an open notebook might consist of a research diary where one records one’s thoughts on open problems and questions which one is working on.

I believe that computer scientists should lead the charge to adopt open notebook science, and in this essay I will explain why I hold this belief.

Examples of Open Notebook Science Projects

Perhaps the best way to illustrate the concept of open notebook science is by way of example. Many of the large and well-known open notebook science projects are in chemistry (cheminformatics) and biology (bioinformatics), disciplines in which it is typical to generate and work with large amounts of data. This data may be costly to produce in terms of money and/or time, and there may be a great deal of redundancy and duplication between the efforts of different laboratories or research groups. Two examples of large open notebook science projects are OpenWetWare[4] and UsefulChem[5].

The goals of the OpenWetWare project are to “support open sharing of research, education, publication, and discussion in biological sciences and engineering”, according to their mission statement[6]. To this end, their web site provides blogs and wikis to users and research groups. The wikis are specialised for certain tasks, such as lab notebooks and hosted courses.

The UsefulChem project is an open notebook science project in chemistry originated by the Bradley Laboratory at Drexel University. It is hosted on Wikispaces, a web site that allows users to set up free wikis, and has an associated blog[7] hosted on Blogger.com, another free service. The UsefulChem project illustrates how researchers in fields outside of computer science are using readily available tools to create their open notebook science projects.

I was unable to locate any open notebook science projects of a similar focus and scope for physics and computer science. As noted above, projects to develop open source software, hosted on publicly accessible code repositories such as SourceForge[8] or Google Code[9], might be considered examples of open notebook science projects if all documentation and other materials related to the projects are also accessible to the public. The physics community, for its part, was an early adopter of open access publishing, embracing the arXiv[10] repository of electronic preprints (or e-prints) in 1991.

The quantum computing community in particular has two wikis, Quantiki[11] and Qwiki[12]. However, these wikis are intended to be references for researchers rather than notebooks for recording ongoing research.

Many individual researchers keep openly accessible notebooks, or have blogs on which they discuss their research semi-regularly. These are far too numerous to list, but there are a number of portals such as BlogScholar[13] in which research-related blogs are organised by category.

For researchers outside of computer science, technologies such as blogs and wikis are only tools to assist in their research. For computer scientists, however, these technologies and their impact on society are in and of themselves objects of study. It is therefore a little disappointing that open notebook science is not more widely practised in the computer science community.

Advantages of Open Notebook Science

The movement towards openness in science is based on the belief that sharing and cooperation lead to swifter progress than hoarding and competition.

Even leaving aside for the moment the advantages of openness, there are many benefits to using the latest note and record keeping technologies, such as wikis and blogs, whether or not they are made public. These include advantages common to all derivatives of digital text formats, such as the legibility of typed text as opposed to handwriting, the ability to copy, share, search, and archive notes, and the portability and extensibility of the notes unconstrained by the limitations of physical notebooks (or loose sheets of paper), which may be difficult to carry around and may be misplaced.

Technologies such as blogs and wikis also perform the role of versioning software, keeping a timestamped history of all edits, arbitrating between conflicting ones, and maintaining a record of all contributions and their contributors. Furthermore, these technologies also offer advanced search and organisational capabilities, such as the labeling of notes with tags, the arrangement of notes into a hierarchy of categories, and the inclusion of semantics through metadata. These search and record keeping capabilities would be very useful to scientists when the time comes to write grants and progress reports.
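As a rough illustration of the record keeping described above, here is a minimal sketch of a notebook page that keeps a timestamped history of edits, a record of contributors, and tags for searching. The names `Revision`, `NotebookPage`, and `pages_with_tag` are hypothetical and are not taken from MediaWiki, WordPress, or any other real package; arbitration between conflicting edits is deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Revision:
    """One saved edit: who made it, what the text was, and when."""
    author: str
    text: str
    timestamp: datetime


@dataclass
class NotebookPage:
    """A hypothetical notebook page with tags and an append-only history."""
    title: str
    tags: set = field(default_factory=set)
    revisions: list = field(default_factory=list)

    def edit(self, author, text, when=None):
        # Append a new revision; earlier revisions are never overwritten.
        when = when or datetime.now(timezone.utc)
        self.revisions.append(Revision(author, text, when))

    def current_text(self):
        # The latest revision is the visible text of the page.
        return self.revisions[-1].text if self.revisions else ""

    def contributors(self):
        # Authors in order of their first contribution.
        seen = []
        for rev in self.revisions:
            if rev.author not in seen:
                seen.append(rev.author)
        return seen


def pages_with_tag(pages, tag):
    # A very simple tag search over a collection of pages.
    return [p.title for p in pages if tag in p.tags]


# Illustrative usage with made-up authors and content.
page = NotebookPage("Lower bound idea", tags={"complexity", "open-problem"})
page.edit("alice", "First rough idea.")
page.edit("bob", "Alice's idea, with a corrected lemma.")
```

Because every edit is appended rather than overwritten, the full history and the list of contributors can be reconstructed at any time, which is the property that matters for questions of credit and priority.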

The use of digital media also means that notes may be dynamic and interactive. For example, instead of writing a mathematical formula, a digital notebook might contain a form for evaluating that formula with different inputs. A more trivial example of interactivity is that digital notebooks allow the reader to go back and forth between the main body of a text and a footnote[14], or between different pages or even different documents, very quickly, by clicking with a mouse or using keyboard shortcuts instead of fumbling with physical sheets of paper.
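To make the idea of an evaluable formula concrete, here is a small sketch in which a formula is stored as a function and evaluated on several inputs, much as a form in a digital notebook might do. The binary entropy function is chosen purely as an example, and the helper `evaluate` is a hypothetical name; neither is part of the software described in this essay.

```python
import math


def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def evaluate(formula, inputs):
    # Evaluate the formula at each input, as a notebook form might.
    return {x: formula(x) for x in inputs}


table = evaluate(binary_entropy, [0.0, 0.25, 0.5, 1.0])
for p, h in table.items():
    print(f"H({p}) = {h:.4f}")
```

A reader of such a notebook could change the list of inputs and see the results immediately, rather than recomputing the formula by hand.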

Making a notebook public brings several benefits in addition to the above. The sharing of notes is made much easier if one can simply pass around a URL (or DOI) instead of copies. An open notebook is available “everywhere” (limited by Internet access, which is ubiquitous at all research institutions anyway), and the notebook cannot be misplaced.

Many of the advantages of open source software also apply, mutatis mutandis, to open notebook science. An open notebook potentially has “many eyeballs” on it, allowing researchers other than the main investigators to submit “patches”, i.e., contribute slight tweaks or even major suggestions. This facilitates scientific discovery and creates more opportunities for better scientific conversations.

Considering the large number of open scientific problems and the diverse backgrounds of scientists, there are bound to be many instances in which the solutions to problems are known to or can be easily produced by researchers other than those who are actively examining them. Conversely, there may be discoveries which do not appear important in one field but which can bring about significant progress in another. Open notebooks allow minor questions or serendipitous discoveries to be made publicly known without having to attach them to a paper primarily about another more major (and possibly not very related) result. This is, in fact, what is already happening in several blogs maintained by scientists[15].

The open availability of “insider information”, such as data, ideas, and detailed experimental procedures or proofs of theorems which have been condensed in published journal papers for brevity, allows these to be verified and increases the accountability of scientists and the transparency of science.

By performing their work in the open, scientists can allocate their resources more efficiently, contributing their expertise where it is needed and summoning the expertise of others when that is required. The scientific community as a whole benefits from this openness. Keeping an open notebook also helps the individual scientist, because it increases his or her visibility in the major search engines.

There are benefits to the general public as well. The public accessibility of notebooks which are research diaries serves as a kind of real-time science journalism. This gives the public a closer look at the actual lives of scientists (which one hopes would dispel negative stereotypes), and helps to acquaint new researchers with what the lives of scientists are actually like. The low cost of entry of starting a blog or a wiki might also serve to get young scientists, such as ambitious high school students, involved in the research process early.

Credit, Plagiarism, and Other Issues

The concept of open notebook science is not without its problems. The primary concern seems to be the issue of priority, or, in the common parlance, “getting scooped”.

Science has changed considerably since the days when scientists such as Galileo Galilei or Robert Hooke would encrypt their findings as Latin anagrams to establish their claim while concealing the actual contents of their discovery to give them a head start on research over their rivals. The history of science is filled with rancorous debates over who first came up with certain ideas, or whether an idea should be attributed to an originator who subsequently did very little with it or to someone who later developed and expanded on it.

Given the principle of “no insider information”, a scientist might be understandably concerned about intellectual property theft. There are, essentially, two ways to address the issue: social and legal.

As some commentators[16] have observed, there are already social norms in place which heavily discourage plagiarism and lack of proper attribution by scientists and scholars. This may be called the “French chef” approach to protecting intellectual property, after a study by Fauchart and von Hippel[17] who showed that the contents of recipes among accomplished French chefs are protected by a system of implicit social norms, rather than by law. Such a “norms-based” intellectual property system may deter would-be plagiarists and people who do not give proper credit. More public information about when each scientist came up with or worked on a particular idea should in theory decrease disputes about priority rather than increase them. When scientific conversations are timestamped, viewable to the public, and indexed and cached by multiple search engines and crawlers, it would be extraordinarily difficult to have honest disputes over the history of events related to a claim of priority. Social norms among scientists should encourage offers of collaboration and discourage “scooping”.

When norms-based deterrents against intellectual property theft are insufficient, there is always recourse to law-based systems. Creative Commons[18], for example, provides a number of legal tools for sharing intellectual property, including various free licenses with different stipulations[19] which may be attached to creative works.

An open notebook science web site or project may choose a license based on its specific needs and legal requirements. Most open notebook science web sites have licenses that allow sharing and distribution conditioned only on attribution. Others add a “share alike” clause, which allows derivative works conditioned on these being under the same, similar, or compatible license. There are also licenses which forbid derivative works, commercial usage, or both.

Some proposed solutions to the problem of intellectual property theft with open notebooks are technical rather than social or legal in nature, such as restricting information to registered users or introducing a delay in the notebook. However, such solutions violate the spirit of “no insider information”, and the result could not be properly called open notebook science.

There are several other intellectual property issues related to open notebook science, such as what constitutes prior publication for peer-reviewed journals, and premature disclosure for the purposes of obtaining patents. Many journals already accept papers which have previously appeared as preprints in repositories such as the arXiv[20]. Patent law, however, has yet to catch up to many recent developments in technology.

The release of large amounts of data which have not undergone careful review raises a number of problems in trust and credibility. If a set of data is posted publicly before it is thoroughly checked, it may contain errors which may be propagated. However, as the open source movement shows, for active projects, bugs are usually caught and fixed quickly. Furthermore, it is not clear that reviewers of published papers necessarily check the data used to support the arguments of the papers they review very thoroughly (although, in principle, they should).

The use of data from open notebooks, then, imposes a duty upon the user to be vigilant and to take on some aspects of the role of reviewer. This, however, should be the case anyway, even with peer-reviewed published data. The recent Merck/Elsevier fake journals scandal[21] amply illustrates the point that even an established publisher of scientific journals should not be naïvely trusted.

Other criticisms of open notebook science deal with the user experience, whether from the point of view of a producer or a consumer of open notebooks. On the one hand, a scientist’s notekeeping style may consist of what may appear to others to be random scribbles, and keeping an open notebook would force him to clean up his spontaneous output and restrain his creativity. On the other hand, most of a scientist’s notes may not be very meaningful to others. But not all open notebooks are meant to be read as research diaries. Rather, search and organisational tools should allow relevant information to be found even in a sea of irrelevancies.

Finally, another criticism of open notebooks concerns their stability, that is, the ability to reference their data reliably over time. But this problem is not unique to open notebooks, and applies to all electronically published information. Solutions include stable URLs or DOIs, or the use of an archival service such as WebCite[22].

Proposed Process

I believe that computer scientists should adopt open notebook science not only for the benefits discussed above, but because it is a subject that is inherently a part of computer science. Computer scientists, as computer scientists, have a duty to explore technologies for archiving and communicating information as well as their social and legal implications.

I can’t very well urge computer scientists to make their notebooks publicly accessible without doing so myself. This is why I created the ★grads.net web site (read as “star grads dot net”), as a platform on which to experiment with various aspects of open notebook science.

Currently, the web site consists of four software packages:

  • MediaWiki for taking notes and writing papers,
  • WordPress for keeping a research diary,
  • bbPress for collaborative discussions, and
  • Wikindx for bibliography management.

The software packages are integrated with one another, to the extent that this is possible to my knowledge at the moment. For example, it is possible to generate citations using the bibliography management system in both the wikis and the blogs. Furthermore, through the use of the jsMath package, it is possible to typeset mathematical equations using the widely known TeX syntax throughout the web site.

The intention is to use the blog to keep a diary of research activities, such as meetings or talks that I attend, or ideas that I have. These ideas may then be discussed in the blog comments, or if there is serious interest, in the forums. When an idea gets developed enough, a wiki page will be started for it. Some of these wiki pages may then evolve into papers, such as the one you are currently reading. Since this is an experiment, the roles of these tools may change over time.

One of the goals of the web site is to experiment with collaborative paper-writing. Currently, papers are written by passing around the document source through e-mail along with an indication of who is in possession of a virtual “token”. When a paper is sufficiently mature, a draft may be passed around to other researchers for comments. There is in principle no reason why the draft should not be publicly available at every stage for outside comments or suggestions. Wikipedia[23] has shown that this sort of collaborative process can result in documents of reasonably good quality.

As the history[24] of this paper shows, it began with a list of headings for the sections that I planned to write. The sections were then filled with rough notes, which were subsequently expanded into this paper.

Papers written in this fashion have not passed peer review, and thus should be treated like preprints on the arXiv or blog posts. Ideally, however, if enough experts participate in the commenting process, such a paper should have been reviewed quite thoroughly. This distributed evaluation process may be called “soft” peer review, as opposed to the traditional “hard” peer review.

At the moment, I am the only user on this web site capable of editing wiki pages, and hence of authoring papers. But the idea is that multiple authors can work on a paper, and contributors may be elevated to the status of co-author if they make significant contributions.

The contents of the ★grads.net wiki are under a Creative Commons Attribution-Noncommercial-No Derivative Works license. The “No Derivative Works” clause may seem a little restrictive, but as the license notes, any of its conditions “can be waived if you get permission from the copyright holder”. The purpose of this clause is to protect the contents of paper drafts until they are finished. Drafts should have the same social protection and status as if they had been written using more traditional means, that is, they should not be cited or used without the authors’ explicit permission.

(I will write out a more comprehensive copyright policy for each section of the web site when I get around to it.)

This paper is still a draft[25], and I invite your comments and suggestions, which I will take into consideration in revising it. There is an acknowledgements section at the end of this paper for listing contributors. The goal is to submit this paper somewhere for “official” publication after it is sufficiently polished[26].

For Further Reading

The Wikipedia article on Open Notebook Science actually isn’t bad and discusses many of the same points.

Acknowledgements

The author would like to thank Mina Razaghpour for her suggestions.

Bibliography

Notes

  1. ↑1 “Open Notebook Science”, URL: http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html
  2. ↑2 ↑5 UsefulChem, URL: http://usefulchem.wikispaces.com/
  3. ↑3 ↑8 SourceForge, URL: http://sourceforge.net
  4. ↑4 OpenWetWare, URL: http://www.openwetware.org/
  5. ↑6 OpenWetWare mission statement
  6. ↑7 UsefulChem blog, URL: http://usefulchem.blogspot.com/
  7. ↑9 Google Code, URL: http://code.google.com
  8. ↑10 ↑20 arXiv, URL: http://www.arxiv.org/
  9. ↑11 Quantiki, URL: http://www.quantiki.org/
  10. ↑12 Qwiki, URL: http://qwiki.stanford.edu/
  11. ↑13 BlogScholar, URL: http://www.blogscholar.com/
  12. ↑14 See? Click on the up arrow (↑) to the left to go back to where you were just reading.
  13. ↑15 See http://scottaaronson.com/blog/?p=112 for an example, and note especially the comments, and the pingbacks from later posts.
  14. ↑16 e.g., http://onthecommons.org/content.php?id=916, http://www.earlham.edu/~peters/fos/2006/09/attacking-plagiarism-with-cultural.html
  15. ↑17
  16. ↑18 Creative Commons, URL: http://creativecommons.org/
  17. ↑19 Creative Commons license selector
  18. ↑21 For details, see: http://www.the-scientist.com/blog/display/55671/, http://www.the-scientist.com/blog/display/55679/
  19. ↑22 WebCite, URL: http://www.webcitation.org/
  20. ↑23 Wikipedia, URL: http://wikipedia.org/
  21. ↑24 Unfortunately, this essay’s MediaWiki history was not preserved when it was copied to a WordPress post.
  22. ↑25 Not any more.
  23. ↑26 Again, not any more.
