file formats for scientific data

I spent some time this last winter examining a variety of file formats for storing scientific data.  I also organized a “debate” in Los Alamos with people presenting various file formats.  I reached the following conclusions:

  • For itsy-bitsy-teeny-weeny data files, or configuration information, CSV (comma separated values) is OK.  There is no clear standard, but reader libraries exist for many languages; if your file starts getting parsed incorrectly because of the lack of standardization, you should move to another format!
  • For small files, less than 100 megabytes, I like ascii files with “structured comment metadata” (more below).
  • For medium-large files, 100 megabytes to 10 gigabytes, I like hdf5 with simple and somewhat ad-hoc approaches to metadata.
  • For very large data sets you have to start using a database approach, and there is a great loss of innocence (i.e. introduction of complexity) here.  Current SQL databases have sad limitations vis-a-vis scientific data, but are probably required for effective searches.  A hybrid approach of an SQL database pointing to hdf5 for mass storage might work.  PostgreSQL is probably the best current choice for scientific data.
  • scidb is designed very carefully to address these limitations of SQL databases.  It’s a work in progress and not yet ready for robust deployment.
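To make “structured comment metadata” concrete, here is a minimal sketch in python; the “## key: value” convention and the field names are my own illustration, not a standard:

```python
import io

# A small ascii data file whose comment lines carry "key: value" metadata.
# The "## key: value" convention here is just an illustration, not a standard.
sample = """\
## experiment: test-run
## date: 2011-03-01
## columns: time voltage
0.0  1.2
0.1  1.5
0.2  1.4
"""

def read_ascii_with_metadata(f):
    """Return (metadata dict, list of rows of floats)."""
    metadata, rows = {}, []
    for line in f:
        line = line.strip()
        if line.startswith("##"):
            key, _, value = line[2:].partition(":")
            metadata[key.strip()] = value.strip()
        elif line:
            rows.append([float(x) for x in line.split()])
    return metadata, rows

meta, rows = read_ascii_with_metadata(io.StringIO(sample))
print(meta["columns"])   # -> time voltage
print(rows[1])           # -> [0.1, 1.5]
```

The nice thing about this scheme is that the file stays a plain ascii table: naive tools that skip comment lines can still read the numbers, while smarter tools get the metadata for free.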

There are also formats whose adoption is a historical mistake and these formats should not be used to store scientific data.  People in the fields in which these formats are used should be leading their communities to move away from those formats:

  • FITS: this format is widely used in astronomy and it has very severe limitations which make it an albatross around the neck of astronomical data work.  Sometimes you have a poor format with a nice way for programs to read it, and you can put up with it, but FITS pairs a poor file format with a reference access API (CFITSIO) that is about as bad a library interface as I have seen.
  • CDF: this was NASA’s original attempt at a proper data format.  It was an OK design for its time, but it suffers from its age and from a lack of development momentum.  Its portion of the data storage world should migrate to hdf5.

So how do you lead a community away from a file format?  Arguing against it does not help much — what will make the difference is a good hacker on a project that uses FITS presenting a parallel way of doing FITS and hdf5.  There is a fits2hdf script, and if all astronomy programs can be adapted to read the hdf5 file as well as the FITS file then that’s a good start.

That way you show your collaborators that you are not requiring them to make a sudden change, and they can look at the new approach while not having been forced out of their comfort zone.

Related posts:

starting with hdf5 — writing a data cube in C


can anyone believe your results? (reproducibility)

Here are things I have heard working scientists say:

  • I made the plot in excel
  • I played with the plot in excel until it looked good
  • I’m not sure how I got from the raw data to the plot
  • I’m not sure which data collection run gave me that data file
  • my collaborator sent me that data file in an email attachment; I fixed a problem and sent it back to him in an email attachment

I have the impression that many of the papers I look at have results that are not reproducible.  If those papers claimed a major discovery with great practical consequence, they would be examined very closely and might be shown to be irreproducible.

The Wikipedia article on Cold Fusion has an instructive and sadly amusing blow-by-blow description of the circus that followed the 1989 announcement that Fleischmann and Pons had found the solution to the world’s energy problems with cold fusion.  This was not a typical boring and pointless paper, so people all over the world tried to reproduce their experiment, and it turned out there was nothing there, although interestingly some groups proved that wishful thinking can be quite powerful.

Much has been written about Fleischmann and Pons, but I have not found out whether their problem was a lack of reproducibility (so they found a plot that looked good and just focused on that) or whether there was something darker at work.  (Does anyone know the inner story on this?)

But what is quite clear is that if your bit of research is important then it will be scrutinized, and you had better have a clear, reproducible trail leading from the experiment (or simulation), through the various stages of calibration adjustment and analysis, up to the plots in the paper.  If you do not have an automatic way of doing all that, then you will be embarrassed by this scrutiny.  If your collected data cannot be automatically linked (probably via metadata) to the exact instrument configuration at the time of collection, then you do not really have a scientific result: you have something suggestive but not convincing.

A good example I have seen of this rigor for a large scale project is Andy Fraser’s book on Hidden Markov Models.  You can download the book and what you get is a source code archive — you run the compilation scripts and they build the book, including running all the programs which generate the data for the plots, and then generate the plots themselves.  Dependencies are tracked: if the time stamp on a file is changed, everything that requires that file’s information will be rebuilt.  (Yes, it uses “make”.)

In an experimental setting this is even more important: raw data is taken once, but the information used to process that raw data might be updated, at which point the raw data has to be turned into finished data products automatically, without human intervention.  This is often not done, or is left until later (i.e. never).

I think that the solution to this problem involves a cocktail of the following, which really mesh with each other:

  • availability of raw data
  • availability of all processing codes
  • provenance
  • metadata
  • version control
  • software pipelines
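As a small illustration of how provenance and metadata mesh, here is a sketch of a low-tech provenance “sidecar” file; the file paths, field names, and parameters are all invented for the example:

```python
import hashlib, json, sys, time

def write_provenance(data_path, prov_path, params):
    """Record what produced a data file, plus a checksum of its contents,
    so a finished product can be traced back to its inputs."""
    with open(data_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    record = {
        "data_file": data_path,
        "sha256": digest,
        "script": sys.argv[0],
        "processed_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "parameters": params,
    }
    with open(prov_path, "w") as f:
        json.dump(record, f, indent=2)
    return record

# Example: a raw data file and its provenance sidecar.
with open("/tmp/run42.dat", "w") as f:
    f.write("0.0 1.2\n0.1 1.5\n")
rec = write_provenance("/tmp/run42.dat", "/tmp/run42.prov.json",
                       {"calibration": "v3", "threshold": 0.5})
print(rec["parameters"]["calibration"])   # -> v3
```

The point is that the sidecar travels with the data file, so the trail from raw data to finished plot survives copies and e-mail attachments.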

I plan to discuss how these ideas play into good reproducible science and how one should program to guarantee reproducibility.

… some vaguely related links:

Carlo Graziani’s article on Ed Fenimore and honesty in science (in particular his link on the Ginga lines)

Discussions of the Fermilab tevatron mirage

Blas Cabrera’s possible detection of a magnetic monopole in 1982

The Atlantic Monthly’s article Lies, Damned Lies, and Medical Science


Kevin McCarty’s “physics software rant”

I have always liked Kevin McCarty’s Physics Software Rant.  I think he’s on the money in pretty much all his points.

He starts out with “First and most importantly — Choose a license!”, goes on to show many specific examples of physics software whose releases could be done much better, and concludes with:

Although I have been harsh, it was not my intent to insult anyone. I am certainly grateful for the wide variety of free physics software available on the Internet. I just wish that, after spending man-years developing their software, people would take a few extra days putting finishing touches on it to avoid problems like those discussed above. This would go a long way towards making physicists and sysadmins everywhere happy, and wasting a lot less of their time.

I recommend reading through it if you are writing software which you intend to release to others.  You can find it here:


volume visualization — mayavi and mayavi2

Some programs are truly delightful; I have long had a list of those that charmed me.

One of these was mayavi, the volume visualization program.  Volume visualization is quite different from what people think of as “3D graphics” — a surface plot might be called 3D graphics, although you are really looking at a 2D surface embedded in 3D space.

A simple example of surface plotting with gnuplot would be:

$ gnuplot
gnuplot> set pm3d
gnuplot> splot (x**2)*(y**2)

which shows the function f(x, y) = x^2 * y^2 plotted on the z axis.

Volume visualization would involve showing functions of (x, y, z) (not just x, y).  Since we have no fourth axis upon which to show this, we use a variety of tricks to visualize it.  One example of such a function would be density.
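For concreteness, here is a toy scalar field f(x, y, z) — a Gaussian density blob — of the kind a volume visualizer renders; the grid size and sampling are just a sketch:

```python
import math

def density(x, y, z):
    """A toy scalar field: a Gaussian density blob centered at the origin."""
    return math.exp(-(x*x + y*y + z*z))

# Sample the field on a small 3D grid over [-1, 1]^3; a tool like mayavi
# would render this volume with isosurfaces, cut planes, and so on.
n = 5
grid = [[[density(-1 + 2*i/(n-1), -1 + 2*j/(n-1), -1 + 2*k/(n-1))
          for k in range(n)]
         for j in range(n)]
        for i in range(n)]
print(grid[2][2][2])   # value at the origin -> 1.0
```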

mayavi implements many of these tricks and allows you to navigate with the best virtual reality navigation mouse bindings I have seen — left for rotation, middle for translation, right for zoom.

You can run a simple command line example of mayavi2 like this (paths from the Debian/Ubuntu distribution):

mayavi2 -d /path/to/heart.vtk -m Outline -m IsoSurface -m GridPlane -m ScalarCutPlane &

I was disappointed with mayavi2, and if mayavi 1.5 were still packaged I would probably still use it.  mayavi2 has all sorts of fancy features, but it is huge and slow, and no longer the tight little program focused on what I used to use it for.  Also: the authors have tried to turn it into a platform, but I’m not so much of a platform kind-o-guy — I would have preferred that they simply improve the API to allow a Python program to invoke mayavi on a memory-resident data set.  But it looks like mayavi2 is now the mainstream version.

For a tutorial on mayavi2 and their heart.vtk example try this: mayavi2 heart.vtk tutorial



should a scientist be a hacker?

I always enjoyed programming, so I’m inclined to think that all scientists should also be good hackers.  On the other hand many very good scientists are not, so my inclination is probably incorrect.

Still, every scientist should know how to program to some extent, and there should be a simple path as scientists move through university and the PhD process to let them choose a level of programming expertise and learn up to that level, as well as to know what levels come beyond that.

It’s important to teach researchers to do real programming and not just to use user-level software: a physics department should make sure that its students know how to write serious standalone programs.  It is a disservice to let students finish university thinking that they will never need to know more than mathematica or matlab.

For example, here is a possible outline of levels of programming a physicist should know about:

  1. learn python so that you can read a data file and manipulate the file, possibly converting it to other formats
  2. learn a plotting system to use with python, such as matplotlib
  3. learn to program in C so that you can write low-level software to control hardware experiments
  4. learn to make a build system for your software: make, and possibly more advanced tools like autotools or cmake
  5. use version control (RCS for “just yourself” single files, Mercurial for larger distributed projects)
  6. learn to be a sysadmin
  7. learn to use metadata and some file formats, from ascii columns with structured comment metadata to CSV files to HDF5
  8. learn to package your software for Debian-based or RPM-based distributions
  9. learn to write a graphical user interface in Python, possibly using wxPython as your widget set
  10. learn to program SQL databases like postgres (maybe some day better things like SciDB will be available) using the very high level language library bindings
  11. now that you have learned so many software systems, spend some time learning how to navigate the trade-offs between simplicity and power that all these choices bring
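As a sketch of level 1 — reading a data file and converting it to another format — here is what a first exercise might look like; the data and column names are made up:

```python
import csv, io

# Level 1 in action: read a whitespace-separated ascii table
# and convert it to CSV.
raw = """\
# time voltage
0.0 1.2
0.1 1.5
0.2 1.4
"""

# Split each non-comment line on whitespace.
rows = [line.split() for line in raw.splitlines()
        if line and not line.startswith("#")]

out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue().splitlines()[0])   # -> 0.0,1.2
```

In real use `raw` would come from `open(filename)` and `out` would be a file, but the whole conversion is the same few lines.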

One of your advisor’s jobs at university would be to help you identify which of these levels you want to reach and to find a mentor to help you reach that level.

For life science and social science the requirements are probably fewer.  Physicists develop instrumentation and electronics, but biologists and anthropologists usually don’t, so programming in C is probably less important for them.  On the other hand, it is important for them to learn pre-packaged statistical systems like R.


Should a scientist be a power user?

The answer is quite obvious: a scientist should have a mastery of her computer tools.

A school curriculum in science should offer its students a clear progression from user to power user.

It used to be that people studying science either did not use computers at all, or used them quite deeply because there was no user-friendly layer.  Ever since the web came along, the computing platform is often the web browser rather than a shell with close contact with the file system, so many students now need to be taught how to navigate the file system.

This could consist of some of the following steps (please provide others!):

  1. emphasis on only using free software (I will probably discuss this more in some rant-like postings)
  2. learning the UNIX command line and developing a close coupling in one’s mind between graphical and command-line ways of doing things
  3. learning the file system, both as a basic concept and to see how operating system distributions organize it
  4. learning a programming editor and the meaning of ascii, markup, and what file formats “mean”
  5. knowing the difference between vector graphics and bitmap graphics
  6. knowing in depth a “standard set” of applications: word processor, spreadsheet, presentation tool, image editor, … (others?)

So there should be a path from user to power user, and this is probably appropriate for high school kids or very nerdy younger kids.


is there a curriculum problem in universities?

I have thought for a long time that most universities are neglecting at least one important part of preparing students for real research work: the confluence of mathematics, computing, and the specific scientific topic.

A student’s first research project in physics might consist of a junior-year internship with an advisor who says:

“I wrote this FORTRAN program back in the 1980s and I need you to modify it for situation blah blah blah…”

My personal feeling is that at some point in your career you have to start telling your boss “I think I know how to do this better than you are suggesting, so I’ll do it my way”, and this “modify the old FORTRAN program” request is a great moment to start this.

Of course you need to find a good dialectic synthesis between the exaggerated humility of saying “yes, sir, I will modify your old FORTRAN program” (which will leave the research project in a rut and not give you a good learning experience in your internship) and the exaggerated arrogance of saying “I will now rewrite this from scratch and make it so much better” — the internship world is littered with the remains of unfinished rewriting projects.

Unfortunately most students will take the “exaggerated humility” approach because this internship is the first time they have mixed computing with their science research.  If a student already knows how to program in C and Python, has already written simple GUIs, and can quickly manipulate data files, then she will be ready to bring value to a project.

From an informal survey of what happens on college campuses in physics, it appears that students are taught calculus/analysis in the math department, they are taught electrodynamics and quantum mechanics in the physics department, and they might take a Java class in the computer science department.  And the physics department might also have offered them an introductory Mathematica class.

When they work on an internship they might be asked to generate numerical solutions to diffusion equations as part of a large software processing pipeline which requires various types of visualization and access to multi-gigabyte files in HDF5 format.  Or they might be asked to write a data acquisition program for a particular microcontroller board that reads data from an experiment’s electronics.  The curriculum snippet I mentioned earlier will not help at all with these tasks — they will need to know how to do real-world programming that brings together physics, mathematics, experiment control and visualization.
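To give a feel for the kind of task described above, here is a minimal sketch of an explicit finite-difference solver for the 1D diffusion equation du/dt = D d²u/dx², with invented parameters and fixed zero boundaries; a real pipeline would wrap something like this in I/O, calibration, and visualization:

```python
def diffuse(u, D, dx, dt, steps):
    """Explicit finite-difference step for du/dt = D * d2u/dx2 on a 1D grid
    with fixed (zero) boundaries.  Stable only when D*dt/dx**2 <= 0.5."""
    u = list(u)
    r = D * dt / (dx * dx)
    for _ in range(steps):
        new = u[:]
        for i in range(1, len(u) - 1):
            new[i] = u[i] + r * (u[i+1] - 2*u[i] + u[i-1])
        u = new
    return u

# An initial spike of heat in the middle of a rod spreads out over time.
u0 = [0.0] * 11
u0[5] = 1.0
u = diffuse(u0, D=1.0, dx=1.0, dt=0.25, steps=20)
print(u[5] < u0[5])   # the central spike has spread out -> True
```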

I’d be curious to hear people’s thoughts on this issue — is it as bad as I think it is?  Should universities take on the burden of teaching real-world research skills, or should it be left to internships?  What do people need to know?
