Project Templates
By Daniel Chen
May 30, 2017
Project templates provide a standardized way to organize files. Our lab uses a template based on the Noble 2009 paper, “A Quick Guide to Organizing Computational Biology Projects”. I’ve created a simple shell script that automatically generates this folder structure, and there’s also an rr-init project by the Reproducible Science Curriculum folks.
The structure we have in our lab looks like this:
project
|
|- data                  # raw and primary data, not changed once created
|  |
|  |- project_data       # subfolder that links to an encrypted data storage container
|  |  |
|  |  |- original        # raw data, will not be altered
|  |  |- working         # intermediate datasets from src code
|  |  +- final           # datasets used in analysis
|
|- src/                  # any programmatic code
|  |- user1              # user1 assigned to the project
|  +- user2              # user2 assigned to the project
|
|- output                # all output and results from workflows and analyses
|  |- figures/           # graphs, likely designated for manuscript figures
|  |- pictures/          # diagrams, images, and other non-graph graphics
|  +- analysis/          # generated reports (e.g. rmarkdown output)
|
|- README.md             # the top level description of content
|
|- Makefile              # Makefile, if applicable
|- .gitignore            # git ignore file
+- project.Rproj         # RStudio project
At the top level there are data, src, and output folders, as well as *.Rproj, .gitignore, and potentially Makefile files.
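As a rough illustration of how such a layout could be generated, here is a minimal shell sketch. It is not the script mentioned above; the function name make_project and the scratch directory are assumptions made for the example, and user1 and user2 are the placeholder names from the tree.

```shell
#!/bin/sh
# Sketch of a scaffold for the layout above; not the lab's actual script.
set -eu

# make_project is a hypothetical helper name used only in this example.
make_project() {
  root="$1"   # path of the new project folder

  # Placeholder data folders. In the lab's setup, project_data is a link to an
  # encrypted container rather than an ordinary directory.
  mkdir -p "$root/data/project_data/original" \
           "$root/data/project_data/working" \
           "$root/data/project_data/final"

  # One source folder per person (user1 and user2 are example names).
  mkdir -p "$root/src/user1" "$root/src/user2"

  # Output placeholders.
  mkdir -p "$root/output/figures" "$root/output/pictures" "$root/output/analysis"

  # Empty top-level files; the .Rproj file is left for RStudio to create.
  touch "$root/README.md" "$root/Makefile" "$root/.gitignore"
}

# Example: build the tree in a scratch directory and list its folders.
scratch="$(mktemp -d)"
make_project "$scratch/project"
find "$scratch/project" -type d | sort
```

Wrapping the steps in a function keeps the target path explicit, so the same sketch can be pointed at any new project folder.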
The .Rproj file
Since we are primarily an R lab that runs RStudio Server, we use .Rproj files to organize the various projects. There are a few benefits to this. Opening a project through its .Rproj file automatically sets the working directory in RStudio to the location of that file. This makes the project more reproducible by avoiding the setwd() command in R, and since multiple people work on the same project, references to other people’s source code and data outputs all stem from a common location.
The .gitignore file
The .gitignore file is there to ignore various outputs from the src code. This includes things like .pdf or .html output from knitr and rmarkdown documents, as well as things in the first level of the data folder. In general, the files and folders in the .gitignore file are things that can be reproduced/regenerated by running code from the src folder.
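A .gitignore along these lines might look like the sketch below. The exact patterns are assumptions based on the description in this post, not a copy of the lab's file, and the scratch directory stands in for a project root.

```shell
#!/bin/sh
# Sketch: write an example .gitignore matching the description above.
set -eu

scratch="$(mktemp -d)"   # stand-in for the project root

# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat > "$scratch/.gitignore" <<'EOF'
# rendered knitr / rmarkdown documents
*.pdf
*.html

# everything in the first level of the data folder
data/*

# results that the src scripts can regenerate
output/*
EOF

cat "$scratch/.gitignore"
```

Note that git does not track empty directories, so a project using patterns like these would need some tracked placeholder file if the data and output folders should appear after a clone.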
The src folder
The src folder contains all the analysis and code for the project. It should only contain the code for the project and not any kind of output from the code, i.e., data, reports, etc. Since all the projects in the lab are separate git repositories, each person working on the project creates a separate folder with their user name (e.g. user1, user2) under the src directory to minimize potential conflicts within the code.
The output folder
The output folder is there for any type of ‘final’ non-data output. The usage is ambiguous on purpose, but it is typically used for some kind of plot or table that will go into a final publication or report. The figures, pictures, and analysis subfolders are just placeholders for what could be placed in the folder; users have the freedom to adapt the contents to the project at hand. Things in the output folder are, by default, ignored by git, since they should be re-creatable with one of the src scripts. The only things that should not be in the output folder are datasets. Those should all be under one of the data subfolders described below.
The data folder
Since the data folder is part of the code repository (i.e., it comes when you git clone the repository), the contents of the folder are, by default, ignored in the .gitignore file. Additionally, because of data privacy concerns, all of our project data are on separate encrypted LUKS (Linux Unified Key Setup) partitions. The data folder contains a shortcut to the relevant encrypted data container. This is one way to prevent data from being checked into the code repository and potentially leaving the server.
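One way such a shortcut could be set up is with a symbolic link, sketched below. This assumes the encrypted partition is already unlocked and mounted; the paths are hypothetical, and a scratch directory stands in for the real mount point so the example can run anywhere.

```shell
#!/bin/sh
# Sketch: link a project's data folder to a (stand-in) encrypted data container.
set -eu

scratch="$(mktemp -d)"
container="$scratch/encrypted_mount"   # hypothetical stand-in for the mounted LUKS volume
project="$scratch/project"             # hypothetical project root

# Pretend the container already holds the three data folders.
mkdir -p "$container/original" "$container/working" "$container/final"
mkdir -p "$project/data"

# The data itself stays in the container; the repository only sees a link.
ln -s "$container" "$project/data/project_data"

ls "$project/data/project_data"
```

Because git stores a symbolic link as a link rather than following it, the files behind it are not added to the repository even if the link itself is.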
Within the encrypted data folder, there are three main folders: original, working, and final.
The original data are the rawest datasets available. Typically these are datasets we are given by sponsors or found online. These datasets, in combination with the code in src, should be able to regenerate any of the datasets in working and final.
Data provenance is the chronology of how data is transformed through the cleaning and analysis phases. It’s important for reproducibility/replicability, and it means that original data should never be altered directly. Any modification of the original data should happen only through the code in src. Also, because the original dataset is never altered, any bugs or alterations in the code can be fixed without compromising the integrity of the dataset.
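This post does not say how the rule is enforced in practice. One common option, assumed here purely for illustration, is to remove write permission from the original folder once the raw files are in place. The file name survey.csv and the scratch directory are hypothetical.

```shell
#!/bin/sh
# Sketch: make a stand-in "original" data folder read-only.
set -eu

scratch="$(mktemp -d)"
original="$scratch/original"   # stand-in for data/project_data/original
mkdir -p "$original"
printf 'id,value\n1,10\n' > "$original/survey.csv"   # hypothetical raw file

# Remove write permission for everyone, on the folder and the files inside it.
chmod -R a-w "$original"

ls -l "$original"
```

Write access can be restored with chmod -R u+w when new raw data needs to be added. This guards against accidental edits only; it is not a security boundary.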
The working data folder is mainly used for intermediate datasets. For example, when a particular data step takes “a long time” to run, the output of that data step can be saved in the working directory and be used in a new R script to resume any additional data cleaning steps.
The final data folder is usually used for datasets that have been cleaned and are ready for analysis. No dataset is ever fully cleaned, since you can probably always perform some other data transformation on it, but this folder is mainly reserved for datasets from which an analysis, report, or plot is generated.
Conclusion
Project templates provide a standard for sharing code with other people. With a standardized folder structure, a new member of a project can easily start to understand where the data, code, documentation, and results are.
They also make code reproducible/replicable and provide a common location (working directory) to run the code.
Finally, because there are specifically designated areas for various components of a project, things become easier to find because everything is not simply placed in the same folder for “convenience”.