You should always act as if your data files will last much longer than you initially imagined. Data archaeology actually does happen, and one example comes from the field of gamma-ray bursts. The famous “March 5th” gamma-ray transient, now seen as a soft gamma repeater (SGR), was observed in 1979 by nine satellites in the solar system. Consensus on the March 5th event was reached almost two decades later, and part of the research that went in to it was the improved modelling ability of the 1990s. Once you could throw better software at a problem, you needed better information about the satellite construction than was commonly available. The designers of the gamma-ray burst detector (OGBD) had built a wonderful instrument and knew its specification and responses properties, but they did not predict that some day one could simulate the passage of radiation through the satellite bus. A couple of decades later people looking at this might find conflicting versions of the blueprints for a given satellite. They now had the ability to model radiation passing through the structural elements of the satellite, but could not find the definitive information of how that satellite launched – did they put in that planned steel plate or decide not to?
So take a pledge right now: never just write a file that is a simple bunch of numbers. Promise that you will always have even the most basic metadata in the file itself.
And here’s a prescription for how to make this easy: just a single print statement will make it much better, and a few will make it great.
For example, you might record the height of your child Susan as a function of time like this in a file called heights.txt:
0 50 1 74 2 86.5 3 95 4 102.5 5 109.5 […]
Column 0 has age and column 1 has the height measured in centimeters, and the data refers to your child Susan. But if this file ends up in someone else’s hands, the descriptive information might have gotten lost. This is awful and you should never do it.
It is already much better to add a single line at the top:
##COMMENT: Height (cm) as a function of age (years) for Susan 0 50 1 74 2 86.5 3 95 4 102.5 5 109.5 […]
This is human readable, and most plotting and analysis programs will ignore the structured comment lines that start with a #, so it is already a huge improvement on the initial file. Do at least that much!
But you can do more: that version was human readable, but you could make it machine-readable with a simple bit of work:
##COMMENT: Height as a function of age for Susan ##LAST_MODIFIED: 2020-05-12 ##FILENAME: heights.txt ##WRITTEN_BY: record-kid-height.py ##WRITTEN_BY_VERSION: 1.3.2~alpha.0 ##COLUMN0: age ##COLUMN0_UNITS: years ##COLUMN1: height ##COLUMN1_UNITS: cm 0 50 1 74 2 86.5 3 95 4 102.5 5 109.5 […]
A couple of decades ago I started proposing this approach, and have now started calling it Low Effort Metadata (LEM, or LeMeta). When I’m involved in a project and see bare-numbered data files I quickly add in the few lines needed to give LEM.
At one point I had developed C and Python code to generate LEM, but there’s really no need and a library just gives a barrier to entry. Just write the information out yourself with print statements, keeping in mind a few guiding principles:
- if nothing else, just print what you’re thinking as a simple English phrase (or another language if the project is very highly localized)
- make it column-oriented
- document each column’s meaning and units of measure
- feel free to put ##COMMENT: lines to tell a story
- put in fields that relate to when the software that wrote this file was built
You could then go to town on other things, like a file format revision number, but you must walk a line of barrier to adoption by your team and full and useful information.