I spent some time this last winter examining a variety of file formats for storing scientific data. I also organized a “debate” in Los Alamos with people presenting various file formats. I reached the following conclusions:
- For itsy-bitsy-teeny-weeny data files, or configuration information, CSV (comma-separated values) is OK. There is no clear standard, but reader libraries exist for many languages. If your file starts getting parsed incorrectly because of that lack of standardization, you should move to another format!
- For small files, less than 100 megabytes, I like ASCII files with “structured comment metadata” (more below).
- For medium-large files, 100 megabytes to 10 gigabytes, I like hdf5 with simple and somewhat ad-hoc approaches to metadata.
- For very large data sets you have to start using a database approach, and there is great loss of innocence (i.e. introduction of complexity) here. Current SQL databases have sad limitations vis-a-vis scientific data, but are probably required for effective searches. A hybrid approach of an SQL database pointing to hdf5 for mass storage might work. PostgreSQL is probably the best current choice for scientific data.
- SciDB is designed very carefully to address these limitations in SQL databases. It’s a work in progress and not yet ready for robust deployment.
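To make the hybrid idea from the list above a bit more concrete, here is a minimal sketch of a searchable catalog whose rows point at bulk data stored in hdf5 files. It uses Python's built-in sqlite3 module purely as a stand-in for PostgreSQL so that it runs anywhere. The table name, the column names, and the helper functions open_catalog, register, and find are all hypothetical, invented for this illustration. The sketch never opens the hdf5 files; it only records where the data lives.

```python
import sqlite3

# Hypothetical catalog schema: one row per dataset stored inside an hdf5 file.
# The database holds only searchable metadata plus a pointer to the bulk data.
SCHEMA = """
CREATE TABLE IF NOT EXISTS datasets (
    id          INTEGER PRIMARY KEY,
    instrument  TEXT NOT NULL,
    observed_at TEXT NOT NULL,  -- ISO 8601 timestamp, so text order is time order
    hdf5_file   TEXT NOT NULL,  -- path to the hdf5 file holding the bulk data
    hdf5_path   TEXT NOT NULL   -- path of the dataset inside that file
)
"""


def open_catalog(db_path=":memory:"):
    """Open (or create) the catalog database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def register(conn, instrument, observed_at, hdf5_file, hdf5_path):
    """Record where one dataset lives; the data itself stays in the hdf5 file."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO datasets (instrument, observed_at, hdf5_file, hdf5_path) "
            "VALUES (?, ?, ?, ?)",
            (instrument, observed_at, hdf5_file, hdf5_path),
        )


def find(conn, instrument, start, end):
    """Return (hdf5_file, hdf5_path) pointers for one instrument in [start, end)."""
    cursor = conn.execute(
        "SELECT hdf5_file, hdf5_path FROM datasets "
        "WHERE instrument = ? AND observed_at >= ? AND observed_at < ? "
        "ORDER BY observed_at",
        (instrument, start, end),
    )
    return cursor.fetchall()
```

A search then returns only file and dataset pointers, and the caller opens those files with whatever hdf5 library it prefers. The design choice is that the database never holds the large arrays themselves.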
There are also formats whose adoption is a historical mistake and these formats should not be used to store scientific data. People in the fields in which these formats are used should be leading their communities to move away from those formats:
- FITS: this format is widely used in astronomy, and its very severe limitations make it an albatross around the neck of astronomical data work. Sometimes you have a poor format with a nice way for programs to read that format, and you can put up with it. FITS, however, has both a poor file format and a reference API for file access (CFITSIO) that is about as bad a library interface as I have seen.
- CDF: this was NASA’s original attempt at a proper data format. It was an OK design for its time, but it suffers from a dated design and a lack of development momentum. Its portion of the data storage world should migrate to hdf5.
So how do you lead a community away from a file format? Arguing against it does not help much. What will make the difference is when a good hacker on a project that uses FITS presents a parallel way of working with both FITS and hdf5. There is a fits2hdf script, and if all astronomy programs can be adapted to read the hdf5 file as well as the FITS file, then it’s a good start.
That way you show your collaborators that you are not requiring them to make a sudden change, and they can look at the new approach while not having been forced out of their comfort zone.
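One small piece of that parallel approach can be sketched in a few lines: a program that accepts either kind of file by sniffing the leading bytes and dispatching to the matching reader. The sketch below relies on two things from the format specifications: an hdf5 file begins with the 8-byte signature \x89HDF\r\n\x1a\n, and a FITS file begins with the SIMPLE keyword followed by an equals sign in column 9. It assumes the hdf5 signature sits at offset 0 (no user block). The function names sniff_format and load, and the readers dictionary, are hypothetical names chosen for this illustration; the actual readers would come from whichever FITS and hdf5 libraries the project already uses.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # 8-byte hdf5 superblock signature
FITS_SIGNATURE = b"SIMPLE  ="          # start of the first FITS header card


def sniff_format(path):
    """Guess the file format from its leading bytes.

    Returns "hdf5", "fits", or "unknown". Assumes an hdf5 file has no
    user block, so its signature is at offset 0.
    """
    with open(path, "rb") as f:
        head = f.read(len(FITS_SIGNATURE))
    if head.startswith(HDF5_SIGNATURE):
        return "hdf5"
    if head.startswith(FITS_SIGNATURE):
        return "fits"
    return "unknown"


def load(path, readers):
    """Dispatch to the reader registered for the detected format.

    `readers` maps a format name ("fits", "hdf5") to a callable that
    takes a path and returns the loaded data.
    """
    fmt = sniff_format(path)
    if fmt not in readers:
        raise ValueError(f"no reader registered for {fmt!r} file: {path}")
    return readers[fmt](path)
```

With this in place, existing code keeps its FITS reader and gains an hdf5 one side by side, so nobody is forced to switch on day one.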