Rethinking Project Organization

How do you organize your research projects?

For the last three years or so, I have adopted a project organization structure along the lines recommended by William Noble in PLoS Computational Biology: "A Quick Guide to Organizing Computational Biology Projects."

The basic organizational idea here is that each project gets its own directory. The top level of that directory is organized categorically: in my case I have Doc, Data, Results, Code, and sometimes some other things. Code is version managed, and split into Functions (code that takes arguments and will be reused repeatedly) and Scripts (code I intend to run only once). Since Scripts rapidly gets very large, and is run only once, maybe it shouldn't be version managed, but I do occasionally re-run or write over scripts. Also, version management acts as a backup if I actually push the versions somewhere (which I don't at present), and as a fail-safe against accidental deletion and overwriting of files.

For a number of projects I recently added a third folder to Code called FigureCode. These scripts start off in the Scripts folder, but once they are running and producing final versions of figures that I plan to present somewhere, these working figure-producing scripts are saved in the new folder. All Scripts and FigureCode files have the date (yymmdd) appended to the end of the filename.
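
As a sketch, the skeleton above and the yymmdd naming convention could be automated with a couple of small helpers (the helper names here are hypothetical, not part of my actual setup):

```python
import os
from datetime import date

def make_project(root):
    """Create the per-project skeleton: Doc, Data, Results, and the Code subfolders."""
    for sub in ("Doc", "Data", "Results",
                os.path.join("Code", "Functions"),
                os.path.join("Code", "Scripts"),
                os.path.join("Code", "FigureCode")):
        os.makedirs(os.path.join(root, sub), exist_ok=True)

def dated_name(stem, when=None, ext=".m"):
    """Append a yymmdd date stamp to a script filename, e.g. FitSpots_140214.m."""
    when = when or date.today()
    return "%s_%s%s" % (stem, when.strftime("%y%m%d"), ext)
```

A new project then just needs `make_project("SomeNewProject")`, and every script saved that day gets its name from `dated_name`.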

I've encountered several problems making this system work for my needs. I take the time here to reflect on these problems and sketch some ways forward.

Big data projects

My project folders live in My Documents on C:\, whereas my data is spread across (at present) 19 different external hard-drives (the most recent of which have 32 TB capacity). The data goes onto these hard-drives roughly as a function of the date on which it was collected, but gets shuffled around to fill dead space (e.g. drive 10 has 500 GB free but tonight's imaging session requires 600 GB, while a few weeks later I have a 500 GB set that will just fit; or I develop a cool compression algorithm for some data, or decide some other data was a failed run and delete it, opening up space on older disks which then gets filled). Consequently, these data addresses aren't permanent upon creation.

Ideally I would like the directory for the project to have links to the relevant data folders. Maybe this just requires more manual overhead to make sure I link and re-link files to the Data folder of the project — but that still sounds like a recipe for trouble.

It would be awesome instead to just tag data with a project name and have that automatically hook the data into the correct project folder. If my data folders include categorical tags (e.g. in the filename), I could write a little daemon that looks through the data hard-drives and the project Data folders and updates all the links. This would greatly improve the tracking of datasets. One substantial challenge is that some of the hard-drives go offline (to free up USB ports, etc.), and the daemon would need to be able to tell whether a dataset moved somewhere else or its drive is just disconnected. I think I could do it though: for any file for which it can't find a new location, it just leaves the old link intact. It would also need to deal with Windows assigning different drive letters to disks as you plug them into different ports (even though the volume has a common name). I'm sure I can write something smart enough to ignore the drive letter (search all drives) and just use the hard-drive's volume name (e.g. Alistair2).
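
A minimal sketch of the daemon's re-linking pass might look like the following, assuming a simple JSON file mapping dataset folder names to their last known paths (the file format, function name, and paths here are all hypothetical):

```python
import os
import json

def refresh_links(link_file, mount_roots):
    """Re-resolve dataset locations across removable drives.

    link_file maps dataset folder name -> last known path. For each dataset,
    search every currently mounted root; if the dataset is found, update its
    path, otherwise leave the stale link intact, since the drive holding it
    may simply be offline rather than the data having moved.
    """
    with open(link_file) as f:
        links = json.load(f)
    for dataset, old_path in links.items():
        for root in mount_roots:
            candidate = os.path.join(root, dataset)
            if os.path.isdir(candidate):
                links[dataset] = candidate
                break
        # no match found: keep old_path untouched
    with open(link_file, "w") as f:
        json.dump(links, f, indent=2)
    return links
```

The drive-letter problem then reduces to building `mount_roots` from whatever drives are currently attached, matched by volume name (e.g. Alistair2) rather than by letter.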

Folder names — dates vs. labels

William Noble uses categorical labels at the top directory level, with date labels at the next level (e.g. Results\2014-02-14\). The date labeling is great: I like the alphanumeric sorting by date (year-month-day is a great idea), especially since date-modified indexing reflects recent modifications, not the date the results/data were taken, which is what I want. But a big folder full of dates is actually not too transparent. I stare at these and ask "which date did I take that data?" and then flip through a bunch of them. Sometimes I get frustrated and go look up the date in my electronic lab notebook, but it's nicer to have an at-a-glance system.

As a workaround, I have adopted a mixed date/categorical structure for this second level of organization, where a short tag is appended to the date to help trigger my memory about what it was: Results\2014-02-14_PHmutPSC.
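
Building these second-level folder names is trivial to script; a sketch (the helper name is made up for illustration):

```python
from datetime import date

def results_subfolder(tag="", when=None):
    """Second-level folder name: ISO date plus an optional short memory tag."""
    when = when or date.today()
    stamp = when.strftime("%Y-%m-%d")
    return stamp + "_" + tag if tag else stamp
```

With a tag, `results_subfolder("PHmutPSC", ...)` yields names like 2014-02-14_PHmutPSC; without one, it falls back to the plain date folder, so both styles still sort correctly together.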

Work not related to a specific project

I also spend a decent amount of time working on code that is not for a specific project (maybe that says something about my scientific efficiency). The idea, though, is that these additional coding projects make multiple aspects of my research faster, more accurate, and better documented. For example, several of my projects use STORM as an imaging method, and all my projects use Matlab to analyze data, so in collaboration with Jeff Moffitt (a fellow post-doc in the lab) I write code for a git repository we call matlab-storm. I also have a bunch of Matlab functions from my PhD work which I called Image-Analysis. These work alright as projects in their own right, since their scope is well defined. Though maybe they shouldn't be, since they aren't really leading towards separate publications, although I suppose they could if we found the time to write them up.

A bigger challenge has been all the little functions I write to help with one project, which I then find myself re-using for other projects. My current (trial) solution for this is taking me further away from the basic project organization I started with.

Master Toolbox Approach: matlab-functions

I recently started a new git repository called matlab-functions. As the name suggests, the goal of this project is to collect a large family of functions in one place. This way, when I want to use that cool new function I wrote last week, I don't have to remember whether it was part of the Chromatin project's Functions or the pH Polymerization project's Functions or whatever. The second and more important motivation is that it makes it easier to share these functions. When it's just me, it's not hard to have all the folders that contain custom functions added to my path via Matlab's startup.m.

I'm also experimenting with more subfolders to organize these functions. These are reasonably specific: for instance, qPCR, BLAST, and DeepSequencing are separate subfolders instead of a single Genome Tools folder. More subfolders have the advantage of making it easier to understand what the project contains, since files are now grouped and organized. I guess ideally I would just tag files rather than put them in different subfolders, and the Windows Explorer window (or Nautilus / GNOME Files on my Linux box) could display a list of tags instead of a list of folders, but operating systems seem to be behind the curve on this one.

This repository is now a private git repo, with my research collaborators added as collaborators.

Old vs. new projects

As I accumulate projects in my Projects directory, it would be nice to see just the active ones (sorting by date does okay here). Some of these projects are published, pretty much wrapped up, and I haven't done further work on them in years. Some are projects I started and abandoned. And some, of course, are the active ones that I hop between all the time.

This entry was posted in Research Planning.