Paper preprints: making them work for you

Pilar Cacheiro & Altea Lorenzo

After reading the news that PLOS Genetics is to actively solicit manuscripts from pre-print servers (PPS; read here) as a way to “improve the efficiency and accessibility of science communication”, we decided to write a quick overview of some of the most popular repositories.

arXiv is probably the most widely known PPS: it has been available since 1991 and mainly covers publications from the fields of mathematics, physics and computer science. Although a later development (2013), bioRxiv, the PPS for biology, is rapidly gaining popularity, particularly in the fields of genomics and bioinformatics (see post here). SocArXiv for the social sciences is even more recent (July 2016), so we still have to wait to see how it is received by the community.

Some other repositories go beyond manuscripts to incorporate additional material. On Figshare, for instance, researchers can freely share their research outputs, including figures, data sets, images, and videos. GitHub, although mainly focused on source code, also offers similar utilities (read previous posts here and here).

The main advantage of these PPS is the speed at which you can make your work available to the scientific world, thereby maximising its impact and outreach. Additionally, they allow for suggestions and comments from peers, which makes the process more interactive.

At a time when research consortiums are starting to require submission of manuscripts to online PPS ahead of peer review (4D Nucleome being a prominent recent example, as reported in Nature News), and even governmental agencies (e.g., the National Institutes of Health in the US; read here) are enquiring about the possibility of allowing preprints to be cited in grant applications and reports, pre-prints are bound to play a big role in the dissemination of scientific research.

Related tools such as OpenCitations, which provides information on downloads or citations of these pre-prints, and Wikidata, which serves as open data storage, are other examples of resources framed within the Creative Commons Public Domain philosophy of free open tools, and they will surely have a positive impact on efforts towards guaranteeing the reproducibility and replicability of scientific research (read about a recent paper reproducibility hack event here).

We are keen to give it a try, are you?


Increasing your research visibility -through a personal website with GitHub-

Letting the world know about your work is important. This is true for any field of research, but for those of us working in any kind of computing-related field, it becomes even more relevant.

Professional networks, distribution lists, blogs and forums are an excellent way to keep up to date with research in your field (you can check a previous post on this topic here). Even social networks can be a great way to do so, depending on how you decide to use them. For instance, you can find quite a number of postdoc positions posted on Twitter.

There is another important aspect to consider as well. By being an active part of the community, not only do you gain visibility, but you also get the chance to contribute in return… How many issues have you solved through StackOverflow, Biostars, R-bloggers posts and so on? How many articles were you able to get through ResearchGate?

Back to the matter at hand, I have only recently found out how to create a personal website with GitHub. If, like me, you do not know much about HTML, you might want to use the automatic page generator. I can tell you: if you have already figured out the content you want on it, and provided you already have a GitHub account (see a previous post on how to create an account here), it will take you less than 5 minutes. Starting from a simple static website (simple is beautiful), it can evolve into a more sophisticated site.
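For those who would rather write the page by hand than use the generator, here is a minimal sketch of a static site prepared from the command line. It relies on GitHub's convention that a repository named <username>.github.io is served as that user's site. The account name freshbiostats, the page text and the name and e-mail in the git config lines are placeholders chosen for illustration, not something prescribed by GitHub.

```shell
# Placeholder account name: replace freshbiostats with your own username.
site="$(mktemp -d)/freshbiostats.github.io"
mkdir -p "$site" && cd "$site"

# A single hand-written page; the content is only an example.
cat > index.html <<'EOF'
<!DOCTYPE html>
<html>
  <head><title>Fresh Biostats</title></head>
  <body>
    <h1>Fresh Biostats</h1>
    <p>Research interests, publications and contact details go here.</p>
  </body>
</html>
EOF

# Put the page under version control (identity values are placeholders).
git init -q
git config user.name "Fresh Biostats"
git config user.email "fresh@example.com"
git add index.html
git commit -q -m "First version of the personal page"

# Publishing needs the repository to exist on GitHub first; then:
#   git remote add origin https://github.com/freshbiostats/freshbiostats.github.io.git
#   git push -u origin <your default branch>
```

The last two commands are left as comments because they only work once the repository has been created on GitHub under your own account.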


GitHub: An essential file repository in the scientific community

Rapid technological change and the need to innovate quickly have created a need for collaboration among groups in different parts of the world. For instance, in large-scale worldwide scientific research projects, it is important to share all the information and keep it updated. Such projects rely on cloud-based systems, not only to coordinate and manage the information, but also to handle peaks of computational effort at strategic points in the project. “Ancient” approaches such as working offline cannot sustain this kind of development.

Google Drive and Dropbox are currently the best-known cloud file storage services. The use of these platforms has increased heavily but, in the scientific field, another file repository has stepped forward: GitHub. This file hosting platform provides several collaboration features, such as task management and feature requests for every project, among others. It has become such a central platform for the open-source development community that many scientists have begun to consider it as a conventional file repository.

At first glance it seems difficult to use (maybe this is the first time you have heard about GitHub, or you have never had the chance to try it!), but once you start using it, you will not want to replace it with any other file storage. Follow the basic steps below carefully and you will master it:


– Steps needed before getting started

1. Download the free Git Bash application. With a few commands, this application helps you to download/upload files between a working directory on your computer and the corresponding directory in the cloud.

2. If you do not have a GitHub account (for example, username: freshbiostats), create one here. This site will be your working place in the cloud. The https address of your site will be (for example): https://github.com/freshbiostats.


– Steps needed during the working process


3. Once you have a GitHub profile, it is highly recommended to create a new repository (i.e., a new folder) for each project you are working on. Choose an adequate name describing its content (for example: fresh); your files will be uploaded there so that you can share all the information with other people working on the same project. These repositories are public by default, but you can choose to make them private.


4. On your computer, make a directory named, for example, fresh. Using Git Bash and the cd command, go to the created folder. For simplicity, this directory will be referred to from now on as master. After that, type

git init

This initializes a Git repository in your project folder. It means that the folder is now under Git version control on your computer, and you are ready to start working with GitHub.
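As a quick check of what step 4 does, the sketch below runs it in a throwaway location and lists the result. The temporary path is an assumption made so that the example does not touch your real files; in practice you would use your own fresh folder.

```shell
# Scratch location for the example; any empty folder works.
workdir="$(mktemp -d)/fresh"
mkdir -p "$workdir"
cd "$workdir"

git init     # creates the hidden .git folder where Git keeps the history

ls -a        # the listing now includes .git
```

Everything Git knows about the project lives in that .git folder; deleting it turns the directory back into an ordinary folder.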


5. It is time to create files and upload them to the cloud. First, check the current state of the master folder, in case it is empty and you have to create a document to upload to the web-storage system. To do that:

git status          this command shows the state of the folder: which files are new, modified or staged


If the master directory is empty and you want to create a file “README.doc” containing the sentence “Hello World”:

 echo "Hello World" > README.doc


Now that the file is created, the next step is to upload it to the web. To this end, execute:

git add README.doc                          it stages the document for the next commit

     git commit -m "Readme document"          it records the staged document with a descriptive message

     git remote add origin https://github.com/freshbiostats/fresh.git


The last command sets the destination where the file is to be uploaded. It creates a remote connection, named “origin”, pointing at the GitHub repository just created.
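To see how each command in step 5 changes what Git reports, here is a minimal sketch run in a scratch folder. The temporary path and the name and e-mail passed to git config are placeholders for illustration, and the file is written with echo and a redirect so that it actually contains the sentence.

```shell
# Scratch repository (placeholder location and identity).
workdir="$(mktemp -d)/fresh"
mkdir -p "$workdir" && cd "$workdir"
git init -q
git config user.name "Fresh Biostats"
git config user.email "fresh@example.com"

echo "Hello World" > README.doc
git status --short      # README.doc is listed as untracked (??)

git add README.doc
git status --short      # README.doc is now listed as staged (A)

git commit -q -m "Readme document"
git remote add origin https://github.com/freshbiostats/fresh.git

git log --oneline       # one entry with the message "Readme document"
git remote -v           # origin is listed for both fetch and push
```

Nothing has left your computer at this point: adding a remote only records the address, and the transfer itself happens with the push in the next step.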

6. After that, type:

git push -u origin master

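With this command, you send all the committed files in master (the fresh folder) to the repository located on the web (https://github.com/freshbiostats/fresh.git). Here master is also the name Git gives to the default branch; if your installation calls it main, use that name instead. AND…THAT’S ALL! The file is already in the cloud!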
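If you want to rehearse the push before touching a real GitHub repository, the sketch below uses a local bare repository as a stand-in for GitHub. That stand-in, the temporary paths and the identity values are assumptions made for the example; with a real account, origin would be the https address shown earlier. The branch name is read from Git rather than typed, since it may be master or main depending on your configuration.

```shell
base="$(mktemp -d)"
git init -q --bare "$base/remote.git"       # local stand-in for the GitHub repository

# Working folder with one committed file (identity values are placeholders).
mkdir "$base/fresh" && cd "$base/fresh"
git init -q
git config user.name "Fresh Biostats"
git config user.email "fresh@example.com"
echo "Hello World" > README.doc
git add README.doc
git commit -q -m "Readme document"

git remote add origin "$base/remote.git"
branch="$(git symbolic-ref --short HEAD)"   # master or main, depending on Git's settings
git push -q -u origin "$branch"

git ls-remote --heads origin                # shows the branch now present on the remote
```

The -u option remembers origin as the default destination for this branch, so later uploads from the same folder only need git push.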



Now, it is your turn, it is worth a try: GitHub, commit and push… !!


Side effects of open health data

Recent improvements in data-recording technology have coincided with calls for making these data freely available. Health-related studies are a particular case in point.

At the forefront of these changes, reputable publications have taken measures to set transparency standards. Since January, for example, the British Medical Journal “will no longer publish research papers on drug trials unless all the clinical data on what happened to the patients is made available to anyone who wants to see it.” (Significance magazine, Volume 9 Issue 6, December 2012)

In a sector that has often been accused of secrecy, GlaxoSmithKline are also engaged in this spirit of openness. They recently announced that they would make available “anonymised patient-level data from their clinical trials” to researchers “with a reasonable scientific question, a protocol, and a commitment from the researchers to publish their results” (ibid).


Fig 1. Death rates per 1000 in Virginia (USA) in 1940, VADeaths R dataset (for illustration only)

However, in the past few weeks two stories seem to challenge this trend towards greater transparency. At the same time as rumours grow in the UK of cuts in the publication of well-being data (The Guardian, 10th of July 2013), controversy has arisen regarding the recently released individual performance records of National Health Service (NHS) vascular surgeons (BBC News, 28th of June 2013).

While the measure has been welcomed by some sectors of the general public, there have been important criticisms coming from the medical side. Several doctors within the speciality, with perfectly satisfactory records, are refusing to agree to the metric. The argument is that the different types and numbers of procedures, coupled with the variability of prognoses, make published indicators such as death rates misleading to patients.

In general, calls have been made for further research into performance indicators that ensure information provided to the end-users is efficacious. As an example of this, back in 2011 when the first attempts to publicise this kind of information started, Significance magazine (Volume 8 Issue 3, September 2011) reported as one of the causes for the lack of success, failure to agree on “which indicators to use”, and also mentioned “discussions with the Royal College of General Practitioners to establish the most meaningful set.”

Tensions between opening up areas of genuine interest to the widest audience and ensuring that there are no unintended side effects are a societal challenge in which statisticians can play a vital role: sometimes numbers cannot speak for themselves, and appropriate interpretations might be required to avoid wrong conclusions. This becomes particularly important when dealing with health issues…

Note: in Spain, it would seem that there is still much work to be done in terms of open data… A PricewaterhouseCoopers report (pp. 120-131) highlights the issue as one of the ten hot topics in the Spanish Health System for 2013, and welcomes the creation of the website www.datos.gob.es as one of the first steps towards openness in this and other sectors.

What are your thoughts on this issue? Are there any similar measures being taken in your country or organisation?