GitHub: An essential file repository in the scientific community

Rapid technological change and the need to innovate quickly have created a demand for collaboration among groups spread across the world. In large-scale international research projects, for instance, it is important to share all the information and keep it up to date. Such projects rely on cloud-based systems, not only to coordinate and manage the information, but also to handle the computational load at peak moments of the project. “Ancient” ways of working, such as keeping everything offline, cannot sustain this kind of development.

Google Drive and Dropbox are currently the two best-known cloud file storage services. Use of these platforms has grown considerably but, in the scientific field, another file repository has stepped forward: GitHub. This hosting platform provides several collaboration features for every project, such as task management and feature requests, among others. It has become the main platform of the open-source development community, to the point that many scientists have begun to consider it as a conventional file repository.

At first glance it may seem difficult to use (perhaps this is the first time you have heard of GitHub, or you have never had the chance to try it!), but once you start using it you will not want to replace it with any other file storage service. Follow the basic steps below carefully and you will master it:


– Steps needed before getting started

1. Download the free Git Bash application. With a few commands, this application lets you upload files from a working directory on your desktop to the corresponding directory in the cloud, and download them back again.

2. If you do not have a GitHub account (for example, username: freshbiostats), create one here. This site will be your workplace in the cloud. The https address of your site will be (for example): https://github.com/freshbiostats.


– Steps needed during the working process


3. Once you have a GitHub profile, it is highly recommended to create a new repository (i.e., a new folder) for each project you are working on. Choose a suitable name that describes its content (for example: fresh); your files will be uploaded there so that you can share all the information with other people working on the same assignment. These repositories are public by default, but you can choose to make them private.


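4. On your computer, make a directory named, for example, fresh. Using Git Bash and the cd command, go to the newly created folder. For simplicity, this directory will be referred to as master from now on (strictly speaking, master is the traditional name of the repository's default branch, not of the folder). Then type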

git init

This initializes a Git repository in your project folder. From this point on, Git can track the files in it and you are ready to work with the system.
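Step 4 can be sketched as a short sequence of commands. This is only an illustration: the folder is created inside a temporary directory (an assumption made here so that nothing on your desktop is touched), and the name fresh simply follows the example in the text.

```shell
# Create the project folder inside a throwaway temporary directory.
workdir="$(mktemp -d)"
mkdir "$workdir/fresh"
cd "$workdir/fresh"

# Initialize the repository; Git keeps its bookkeeping in a hidden .git folder.
git init

# List everything, hidden entries included; the listing should now contain .git
ls -a
```

If you later delete the .git folder, the directory goes back to being an ordinary folder and its history is lost.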


5. It is time to create files and upload them to the cloud. If there are no files in the master folder yet, you first have to create a document to upload. To see what the folder currently holds from Git's point of view:

git status          # shows the state of the folder: new, modified and staged files


If the master directory is empty and you want to create a file “README.doc” containing the sentence “Hello World”:

 echo "Hello World" > README.doc
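To see how Git reacts to a new file, here is a minimal sketch that checks the status before and after creating it. It assumes a fresh repository in a temporary folder (not your real project) and uses the compact --short form of git status.

```shell
# Fresh, empty repository in a temporary folder (illustration only).
workdir="$(mktemp -d)"
cd "$workdir"
git init --quiet

# Nothing to report yet: the folder is empty, so this prints nothing.
git status --short

# Create the file with one line of text in it.
echo "Hello World" > README.doc

# Git now sees the file but does not track it yet; untracked files are marked with ??
git status --short    # prints: ?? README.doc
```

The ?? mark is a reminder that creating a file is not enough: Git only starts following it once it has been added.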


Now that the file has been created, the next step is to upload it to the web. To this end, execute:

git add README.doc                          # stages the document for the next commit

git commit -m "Readme document"             # records the staged document with a descriptive message

git remote add origin https://github.com/freshbiostats/fresh.git


The last command sets the destination to which the file will be uploaded. It creates a remote connection, named “origin”, pointing at the GitHub repository just created.
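If you want to check that step 5 did what you expected, the following sketch repeats it in a temporary folder and then inspects the result. The user name and e-mail are placeholder values (an assumption, needed only so that the commit works on a machine where Git has not been configured yet); nothing is sent to GitHub at this stage.

```shell
# Fresh repository in a temporary folder (illustration only).
workdir="$(mktemp -d)"
cd "$workdir"
git init --quiet

# Placeholder identity for this repository only (hypothetical values).
git config user.name "Fresh Biostats"
git config user.email "fresh@example.com"

# Create, stage and commit the document, then register the remote.
echo "Hello World" > README.doc
git add README.doc
git commit --quiet -m "Readme document"
git remote add origin https://github.com/freshbiostats/fresh.git

# Inspect the result.
git log --oneline    # one commit, labelled "Readme document"
git remote -v        # origin is listed twice: once for fetch, once for push
```

Note that git remote add only stores the address under the name origin; it does not contact the server, so a typo in the URL will go unnoticed until the first push.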

6. After that, type:

git push -u origin master

With this command, you send all the committed files in master (the fresh folder) to the repository on the web (https://github.com/freshbiostats/fresh.git). AND… THAT’S ALL! The file is already in the cloud!
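You can rehearse the whole upload without a GitHub account. The sketch below uses a bare repository in a temporary folder as a stand-in for GitHub (an assumption made so that the example runs offline; with GitHub you would use the https address instead). The identity values are placeholders too.

```shell
# Stand-in for GitHub: a bare repository in a temporary folder.
base="$(mktemp -d)"
git init --quiet --bare "$base/remote.git"

# Local project with one commit, as in step 5.
mkdir "$base/fresh"
cd "$base/fresh"
git init --quiet
git config user.name "Fresh Biostats"        # placeholder identity
git config user.email "fresh@example.com"    # placeholder address
echo "Hello World" > README.doc
git add README.doc
git commit --quiet -m "Readme document"
git branch -M master    # make sure the branch is called master, whatever the local default is

# Connect the stand-in remote and push. The -u option records origin/master
# as the upstream of master, so later uploads need only "git push".
git remote add origin "$base/remote.git"
git push --quiet -u origin master

# The commit is now stored in the remote repository as well.
git --git-dir="$base/remote.git" log --oneline master
```

The -u option matters only for the first push; once the upstream has been recorded, a plain git push from the same branch goes to the same place.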



Now it is your turn, and it is worth a try: GitHub, commit and push!


…a scientific crowd

While researching scale-free networks, I found this book, which happens to include the very interesting article The structure of scientific collaboration networks; it will serve as a follow-up to my previous post on social networks here.

Collaborative efforts lie at the foundation of the daily work of biostatisticians. As such, the analysis of these relationships (or of the lack of interaction, in some cases) seems fascinating to me.

The article itself deals with the wider community of scientists, and connections are understood in terms of co-authorship of papers. The study seems to show that small-world networks are highly prevalent in the scientific community. However short the distance between pairs of scientists may be, I wonder how hard it is to cover that path, i.e., are we really willing to interact with colleagues outside our environment? Is the fear of stepping out of our comfort zone stopping us from pursuing new biostatistical challenges? Interestingly, one of Newman's findings among researchers in the areas of physics, computer science, biology and medicine is that “two scientists are much more likely to have collaborated if they have a third common collaborator than are two scientists chosen at random from the community.”

Interactions analyzed through social network diagrams like the one shown in Fig. 1 can give us a hint about these patterns of collaboration, but can also be a means towards understanding the spread of information and research in the area (ironically, in a similar fashion to the spread of diseases, as explained here).

Fig.1. Biostatistics sociogram (for illustration purposes only; R code adapted from here and here)

In my previous post on the topic, I focused on the great LinkedIn InMaps. This time I will be looking at Twitter, as an example of the huge amount of information and the great opportunities for analysis that the platform provides. R, with its package twitteR, makes it even easier… After adapting the code from a really useful post (see here), I obtained data on Twitter users and the number of times they used certain hashtags (see plots in Fig. 2).


Fig.2. Frequency counts for #bio (top left), #statistics (top right), #biostatistics (bottom left), and #epidemiology (bottom right). Twitter account accessed on the 17th of May 2013.

Although this is not an exhaustive analysis, it is interesting to notice the lower figures for #biostatistics (turquoise) and #statistics (pink) compared to, for example, #bio (green) and #epidemiology (blue) (please note the different scales on the y axis of the four plots). It makes me wonder whether activity on social media is not our field's strongest point, and whether more of it would be a fantastic way to promote our profession. I am certainly convinced of the great benefits that a higher presence in the media would have, particularly in making the profession more attractive to younger generations.

That was just a little peek at the even more exciting analyses to come in future posts. Meanwhile, see you on the media!

Do you make any use of social networks in your work? Any interesting findings? Can't wait to hear them all!