Class 22: Organizing your files

Methodology of Scientific Research

Andrés Aravena, PhD

May 16, 2024

The goal of Science is to produce and communicate new knowledge

The key word here is communicate

What is the value of a result that is not made public?

We communicate with our collaborators

Most of research is done in teams

Good practices help teamwork by:

  • Keeping track of what was (or was not) done
  • Coordinating next steps
  • Avoiding duplicated work

…but I work alone…

Even if we work alone, we are still communicating

  • with your supervisor or advisor
  • with the referees of your paper
  • with other scientists who read (and cite) you
  • with the next Ph.D. student in your lab
  • with the general public
  • with your future self

Each of these interactions improves when we follow good practices

Communicate with your supervisor

Research results are not enough

You must convince your boss (and the jury) that you deserve to be called “Doctor”

  • Make your work easy to understand

  • Make clear what your original contribution is

…with the referees of your paper

Referees are busy people who work for free

  • Give them all they need to replicate and validate your work

  • Being clear and transparent helps them to decide fast

You will get published faster
(or at least get good feedback)

…with other scientists in your field…

…that will read your paper (and hopefully cite it)

The game does not end when you publish

About half of all papers are reportedly read only by their referees

  • Make your work easy to understand and replicate

Evans, J. A. (2008). Electronic Publication and the Narrowing of Science and Scholarship. Science, 321(5887), 395–399.

…with the general public

Eventually, your work will have an impact outside academia

(the end goal is to make a better world, no?)

We need to be aware of the ethical implications

  • Access, licensing, copyright models
  • Privacy of test subjects
  • Truth and academic integrity

…with your future self

Nothing is more frustrating than reading your old work

As they say: “The past is a foreign country”

Undocumented code/protocols are hard to understand…

and you can only blame yourself

Prepare your files for the next user

Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why

 

The ideas of this section are mostly based on
William Stafford Noble. “A Quick Guide to Organizing Computational Biology Projects.” PLoS Computational Biology 5, no. 7 (2009): 1–5. https://doi.org/10.1371/journal.pcbi.1000424.

This “someone” could be

  • someone who wants to try to reproduce your work,
  • a collaborator who wants to understand your experiments,
  • a future student in your lab extending your work
    • after you have moved on to a new job,
  • your research advisor evaluating your research skills.

Most commonly, however, that “someone” is you.

William Stafford Noble. “A Quick Guide to Organizing Computational Biology Projects.” PLoS Computational Biology 5, no. 7 (2009): 1–5.

Everything you do, you will probably have to do over again

Folder structure for data projects

Role of each folder

  • docs is where you write your paper/talk/thesis
  • data is anything that you get from outside the computer
  • results is what your code produces
  • code is where you write your code
  • bib is where you store the documents you cite
    • if it has a doi, it goes here
    • bibliographic database goes here
  • extra for other documents without doi

Use a script to build the structure

Cookiecutter is a Python tool to create new projects from templates

You can search for templates on GitHub with a query like topic:cookiecutter topic:r
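If Cookiecutter is more than you need, a few lines of Python can build the same skeleton. The sketch below is only an illustration: the folder names come from the "Role of each folder" slide, while the function name create_project and the README.txt stub files are assumptions made for this example.

```python
from pathlib import Path

# Folder names and roles, as listed in "Role of each folder".
FOLDERS = {
    "docs": "Paper, talk, or thesis text.",
    "data": "Anything that comes from outside the computer.",
    "results": "Output produced by the code.",
    "code": "Source code.",
    "bib": "Cited documents (anything with a DOI) and the bibliographic database.",
    "extra": "Other documents without a DOI.",
}


def create_project(root):
    """Create the folder skeleton under `root`, with a stub README in each folder.

    `create_project` and the README.txt stubs are hypothetical choices
    for this sketch, not part of any standard tool.
    """
    root = Path(root)
    for name, role in FOLDERS.items():
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.txt"
        if not readme.exists():  # never overwrite a README that was already written
            readme.write_text(f"{name}: {role}\n", encoding="utf-8")
    return root
```

Calling create_project("my-thesis") (a made-up project name) creates the six folders, and running it again on an existing project changes nothing.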

Raw Data is Sacred

Producing data is expensive and time consuming

You don’t want to lose it. Mark it read-only immediately
(and make backups)

Never modify raw data. Use a script to make a clean version

Use folders raw and clean inside data/YYYY-MM-DD
Keep the code that produces the clean version in scripts
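As a rough sketch of these two rules, the Python functions below first remove write permission from the raw files and then write a cleaned copy into a separate folder. The function names, the restriction to .csv files, and the cleaning rule itself (strip trailing spaces, drop blank lines) are assumptions chosen only to make the example concrete.

```python
import stat
from pathlib import Path


def make_read_only(folder):
    """Remove write permission from every file under `folder`."""
    protected = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            mode = path.stat().st_mode
            # Clear the write bits for user, group, and others.
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            protected.append(path)
    return protected


def write_clean_version(raw_dir, clean_dir):
    """Write a cleaned copy of each raw .csv file; raw files are only read."""
    clean_dir = Path(clean_dir)
    clean_dir.mkdir(parents=True, exist_ok=True)
    for raw_file in sorted(Path(raw_dir).glob("*.csv")):
        lines = raw_file.read_text(encoding="utf-8").splitlines()
        # Hypothetical cleaning rule: strip trailing spaces, drop blank lines.
        kept = [line.rstrip() for line in lines if line.strip()]
        (clean_dir / raw_file.name).write_text("\n".join(kept) + "\n", encoding="utf-8")
```

A typical call would be make_read_only("data/2024-05-16/raw") followed by write_clean_version("data/2024-05-16/raw", "data/2024-05-16/clean"), where the date is only an example. Read-only flags protect against accidents, not against disk failure, so backups are still needed.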

Each folder needs a README file

Good filenames help a lot to understand the project

But they are usually not enough

A README file in each folder can explain the purpose of each file

It takes time to write them, but it saves time in the long run
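One way to keep this habit is to list the folders that still lack a README. The following sketch assumes that any file whose name starts with "README" counts (README, README.txt, README.md) and that hidden folders such as .git should be skipped; both are choices made for this example.

```python
from pathlib import Path


def folders_missing_readme(root):
    """Return the folders under `root` (root included) that have no README* file."""
    root = Path(root)
    folders = [root] + [p for p in root.rglob("*") if p.is_dir()]
    missing = []
    for folder in folders:
        # Skip hidden folders (for example .git) and everything inside them.
        if any(part.startswith(".") for part in folder.relative_to(root).parts):
            continue
        has_readme = any(
            f.is_file() and f.name.upper().startswith("README")
            for f in folder.iterdir()
        )
        if not has_readme:
            missing.append(folder)
    return sorted(missing)
```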

Define your projects

What is a “project”?

We can distinguish four categories

  • Projects with well-defined goals and deadlines, e.g. a thesis
  • Areas that are permanently active, like “health” or “family”
  • Resources that can be useful for several projects, like code libraries, or general interest papers
  • Archives: anything that is no longer active. They can be copied to external media and stored off the computer

Each one requires a separate folder

Tiago Forte. Building a Second Brain. Simon and Schuster, 2022.

Spaces

Personally, I like to group my Projects/Areas/Resources/Archives by major topic

  • Teaching
    • Each course is a project
  • Research
  • Work
    • Contracts, bureaucracy
  • Personal
    • Health, Bank, Travel, Family
  • Learning
  • Hobby

Filenames

Be consistent when choosing filenames

Decide when to use ., -, and _

Avoid spaces in filenames

Either John-Smith.txt or John_Smith.txt

Usually . separates filetypes, like .csv or .yml

Define a standard with your collaborators

Check periodically that you are following your standard
(maybe with a script)
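Such a check can be very small. The sketch below assumes one possible standard: words made of letters and digits, joined by "-", with a single "." before a lowercase extension. The pattern and the function name nonconforming are illustrative; replace them with whatever you agree on with your collaborators.

```python
import re

# Assumed standard: letters/digits joined by "-", one "." before the extension.
FILENAME_PATTERN = re.compile(r"[A-Za-z0-9]+(-[A-Za-z0-9]+)*\.[a-z0-9]+")


def nonconforming(filenames):
    """Return the filenames that do not follow the assumed standard."""
    return [name for name in filenames if not FILENAME_PATTERN.fullmatch(name)]


names = [
    "01-Introduction.docx",
    "2_Methods.docx",      # uses "_" instead of "-"
    "3.Results.docx",      # uses "." as a separator
    "4 discussion.docx",   # contains a space
    "2009-01-03-results.txt",
]
print(nonconforming(names))
```

Note that a pattern like this checks only the separators. It cannot tell that a date such as 01-03-09 is ambiguous, so that still needs a human decision.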

Examples

Bad Example

1-Introduction.docx
2_Methods.docx
3.Results.docx
4 discussion.docx
10-conclusions.docx
results-01-03-09.txt

Good Example

01-Introduction.docx
02-Methods.docx
03-Results.docx
04-Discussion.docx
10-Conclusions.docx
2009-01-03-results.txt

Another Good Example

01_Introduction.docx
02_Methods.docx
03_Results.docx
04_Discussion.docx
10_Conclusions.docx
20090103results.txt

Both are good, but use only one

Write dates as YYYY-MM-DD

  • When was 8/3/1965? August or March?

  • Is today 6/10/2023 or 10/6/2023?

It is better to write YYYY-MM-DD. This is the ISO 8601 standard

There is no ambiguity of meaning

Sorting alphabetically, numerically, or chronologically gives the same result
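The sorting property is easy to check. This short Python sketch uses the ambiguous dates from the questions above, under both possible readings, and compares alphabetical order with chronological order for the ISO format and for a day/month/year format.

```python
from datetime import date

dates = [date(2023, 10, 6), date(1965, 3, 8), date(2023, 6, 10), date(1965, 8, 3)]

iso = [d.isoformat() for d in dates]            # YYYY-MM-DD
dmy = [d.strftime("%d/%m/%Y") for d in dates]   # day/month/year, for comparison

chronological_iso = [d.isoformat() for d in sorted(dates)]
chronological_dmy = [d.strftime("%d/%m/%Y") for d in sorted(dates)]

# Sorting the ISO strings as plain text gives the chronological order...
print(sorted(iso) == chronological_iso)
# ...while sorting the day/month/year strings as plain text does not.
print(sorted(dmy) == chronological_dmy)
```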

Collaborating


Sharing Word documents by email is a VERY BAD IDEA
It leads to chaos and confusion

Use an online service

You can share your document via Dropbox or Google Drive

You can edit online using Microsoft Office 365 or Google Docs

Several people can work on the same document at the same time

Advantage: better spelling and grammar correction

But they require a permanent internet connection

Where to store it

  • On the server only

  • Cloud drive like Dropbox, Google Drive

    • Good to share large data and non-text files
    • Bad if two people change the same file
    • Works better with permanent internet access
  • Version control system like GitHub, GitLab, Bitbucket

    • Good for text and code, bad for big files
    • Keeps history
    • Works well without internet access

Never keep a Git repository inside a shared cloud folder

The repository can easily become corrupted

Sharing

  • Hybrid, using symbolic links

  • Or use an online editor

    • Google Docs
    • HackMD.io
    • Overleaf
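One possible reading of the hybrid option, offered here as an assumption rather than as the setup the slides prescribe: keep the Git repository in a local folder, keep the large data files in the cloud drive, and connect the two with a symbolic link. The function name link_shared_data and the example paths are hypothetical.

```python
from pathlib import Path


def link_shared_data(project_dir, shared_data_dir, link_name="data"):
    """Create project_dir/link_name as a symbolic link to shared_data_dir.

    Assumed layout: the project (and its Git repository) is local,
    while the data folder lives in a cloud-synced drive.
    """
    link = Path(project_dir) / link_name
    target = Path(shared_data_dir).resolve()
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} already exists")
    link.symlink_to(target, target_is_directory=True)
    return link
```

For example, link_shared_data("projects/thesis", "Dropbox/thesis-data") would make projects/thesis/data point at the cloud folder. If the linked data should stay out of version control, add the link name to .gitignore. On Windows, creating symbolic links may need extra privileges.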