file formats for scientific data

I spent some time this last winter examining a variety of file formats for storing scientific data.  I also organized a “debate” in Los Alamos with people presenting various file formats.  I reached the following conclusions:

  • For itsy-bitsy-teeny-weeny data files, or configuration information, CSV (comma separated values) is OK.  There is no clear standard, but reader libraries exist for many languages; if your files start getting parsed incorrectly because of the lack of standardization, you should move to another format!
  • For small files, less than 100 megabytes, I like ascii files with “structured comment metadata” (more below).
  • For medium-large files, 100 megabytes to 10 gigabytes, I like hdf5 with simple and somewhat ad-hoc approaches to metadata (both this and the structured-comment approach are sketched just after this list).
  • For very large data sets you have to start using a database approach, and there is great loss of innocence (i.e. introduction of complexity) here.  Current SQL databases have sad limitations vis-a-vis scientific data, but are probably required for effective searches.  A hybrid approach of an SQL database pointing to hdf5 for mass storage might work.  PostgreSQL is probably the best current choice for scientific data.
  • scidb is designed very carefully to address these limitations in SQL databases.  It’s a work in progress and not yet ready for robust deployment.
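
To make the small-file and medium-file recommendations concrete, here is a minimal sketch (assuming Python with numpy and h5py installed; the file names and metadata keys are just examples) of the same little table written once as an ascii file with structured comment metadata and once as an hdf5 file with the metadata attached as attributes:

    import numpy as np
    import h5py

    # A tiny example table: 5 rows of (time, voltage).
    data = np.column_stack([np.linspace(0.0, 1.0, 5), np.random.rand(5)])

    # ascii with structured comment metadata: "# key: value" lines before the data.
    with open("run42.dat", "w") as f:
        f.write("# experiment: example-run-42\n")
        f.write("# date: 2012-02-14\n")
        f.write("# columns: time_s voltage_V\n")
        for t, v in data:
            f.write("%.6f %.6f\n" % (t, v))

    # hdf5 with the same metadata stored as attributes on the dataset.
    with h5py.File("run42.h5", "w") as f:
        dset = f.create_dataset("measurements", data=data)
        dset.attrs["experiment"] = "example-run-42"
        dset.attrs["date"] = "2012-02-14"
        dset.attrs["columns"] = "time_s voltage_V"

The point is that the ascii header and the hdf5 attributes carry exactly the same metadata, so moving from one to the other as files grow is painless.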

There are also formats whose adoption is a historical mistake and these formats should not be used to store scientific data.  People in the fields in which these formats are used should be leading their communities to move away from those formats:

  • FITS: this format is widely used in astronomy and it has very severe limitations which make it an albatross around the neck of astronomical data work.  Sometimes you have a poor format with a nice way for programs to read that format, and you can put up with it, but FITS pairs a poor file format with a reference API for file access (CFITSIO) that is about as bad a library interface as I have seen.
  • CDF: this was NASA’s original attempt at a proper data format.  It was an OK design for its time, but it suffers from its age and from a lack of development momentum.  Its portion of the data storage world should migrate to hdf5.

So how do you lead a community away from a file format?  Arguing against it does not help too much — what will make the difference is when a good hacker on a project that uses FITS presents a parallel way of doing FITS and hdf5.  There is a fits2hdf script, and if all astronomy programs can be adapted to read the hdf5 file as well as the fits file then it’s a good start.

That way you show your collaborators that you are not requiring them to make a sudden change, and they can look at the new approach without being forced out of their comfort zone.
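
As an illustration of the parallel approach (not the fits2hdf script itself, just a sketch assuming Python with h5py and astropy installed, and a hypothetical dataset name “image” inside the hdf5 file), a reader can prefer the hdf5 copy when it exists and fall back to the original fits file otherwise:

    import os
    import h5py
    from astropy.io import fits

    def load_image(basename):
        """Return the image array from basename.h5 if present, else from basename.fits."""
        h5name = basename + ".h5"
        if os.path.exists(h5name):
            with h5py.File(h5name, "r") as f:
                return f["image"][:]      # dataset name is an assumption
        with fits.open(basename + ".fits") as hdul:
            return hdul[0].data           # primary HDU of the fits file

Collaborators keep using the fits files they already have, while anyone who runs fits2hdf on their archive gets the hdf5 path for free.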

Related posts:

starting with hdf5 — writing a data cube in C



3 Responses to file formats for scientific data

  1. Jesus Pestana says:

    I am a PhD student and I need to select a binary scientific data format.  There seem to be 3 choices: CDF, netCDF, HDF5.  I need to perform data acquisition while running a parallel control thread (for instance, some nested PID controllers).  I am selecting a binary format because 10-25 variables at 200Hz for about 10 minutes is quite a ton of data for an ASCII file :(.

    As far as I can tell the biggest difference between HDF5 and the other two is the possibility of organizing the data in groups.  I am also not sure what “thread-safe” and “parallel I/O” mean in this context.  I have found the HDF5 FAQ quite confusing regarding this issue.

    Could you please clarify your point about HDF5 being so much better than CDF and netCDF?

    Thanks for your help,

    Jesus

    • markgalassi says:

      Dear Jesus,

      If you don’t use multithreading and you don’t do parallel programming then you can ignore those issues.

      The story on CDF and NetCDF is easily settled because CDF is pretty much dead, so you will not have any serious future development. And it’s a very ugly format and library, obviously developed with old FORTRAN programmers in mind.

      NetCDF has now moved to using HDF5 as its underlying file format (with slight restrictions), so if you write NetCDF files you are really writing HDF5 files. My understanding is that NetCDF might have a slightly nicer API, but is slower.
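
      For your acquisition rates (10-25 variables at 200Hz for ten minutes) a single chunked, extendable hdf5 dataset is plenty.  Here is a minimal sketch, assuming Python with numpy and h5py, and a hypothetical read_block() standing in for your real acquisition code:

          import numpy as np
          import h5py

          NCHAN = 25        # number of logged variables
          RATE = 200        # samples per second

          def read_block():
              # placeholder for the real acquisition code: one second of samples
              return np.random.rand(RATE, NCHAN)

          with h5py.File("acquisition.h5", "w") as f:
              dset = f.create_dataset("samples", shape=(0, NCHAN),
                                      maxshape=(None, NCHAN),
                                      chunks=(RATE, NCHAN), dtype="f8")
              dset.attrs["sample_rate_hz"] = RATE
              for _ in range(10 * 60):          # ten minutes, one block per second
                  block = read_block()
                  n = dset.shape[0]
                  dset.resize(n + block.shape[0], axis=0)
                  dset[n:] = block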

      • jespestana says:

        I am taking a look at the HDF5 format and it seems pretty nice.
        Thanks for the information :).

        By the way, I agree about the curriculum problem in universities (another of your blog entries).  I think that good programming is really important for scientists.  It’s a pity that I cannot choose to have more C++ programming courses (and projects) instead of other less interesting subjects.

        Jesus
