Today I saw a couple of scientific papers which were written with a word processor (probably Microsoft Word, but possibly libreoffice) instead of a typesetting approach (such as TeX/LaTeX). I realized how fortunate I am that I seldom have to read papers like that: it is tiring to read. Most papers I read fortunately are typeset with TeX as a back end engine

There are two major reasons for which a scientist writing a paper (or anyone writing a nontrivial document) should not make typesetting decisions:

1. Typesetting is an old and deep art which really matters. You can get fatigue from reading a book or article that is typeset poorly. Amateurs don’t know this art. One simple example is the “optimal line length”: lines should have about 60 to 65 characters, and reading is impaired when they are longer or shorter. (Try looking through a typical book and count how many letters are on an average line.) People who set their own margins will violate this and their readers will get fatigue. There are very many more areas in typesetting where amateurish tinkering will make the reading experience subtly more tiring.

2. It is a waste of time. Fiddling with margins, fonts, boldface is not the author’s job. The author should worry about the meaning of what s/he is writing, not the format. By playing with your word processor you are indulging in something that doesn’t give that much satisfaction and certainly does not increase your productivity.

This applies to any document that is more than a few paragraphs long, but it gets even worse when mathematical notation is involved.

In LaTeX you write your text and the program does the structuring and typesetting for you, applying a style that was designed by professionals who spent a lot of time researching how to do it and who have the benefit of centuries of wisdom on book layout. A very good introduction is the classic article The Not So Short Introduction to LaTeX

There are a couple of things to keep in mind when getting going with LaTeX rather than a word processor approach:

1. Someone put it well once saying that “in TeX the easy things are hard to do and the hard things are easy to do”. Getting started on your document involves including some preamble text to specify the document structure, and using special markup commands to make lists, which might seem harder than using a word processor. If you write a lot this learning curve disappears in a hurry, and you reap the benefits of having the truly difficult and time-consuming tasks done for you automatically.

2. Make sure that you remember that “user friendliness” is a subtle matter. The acronym WYSIWYG (“what you see is what you get”) has been well parodied as WYSIAYG (“what you see is all you get”).

Now although LaTeX guides you toward a well-structured document, many people go and find the special commands that can be used to change the layout. The most frequent thing I see is people wanting tiny margins to make a paper look like it has fewer pages (two-column mode might be better for this purpose).

There is an old saying among programmers from my generation, who were writing code in the mid-1980s: you can “write FORTRAN in any language”. Since this is a rambling long-winded rant I will explain this old adage by reminding you of the context in those years. FORTRAN, which appeared in 1957, could be said to be the “first programming language”, preceding LISP (1958) and COBOL (1959).

People who learned to program with languages like C and Pascal in the 1980s tended to poo-poo FORTRAN code as being very ugly, and it really is: the widely used paradigms for programming in FORTRAN were almost painfully hard to follow. But mostly FORTRAN suffered from the fact that it was designed before people used keyboards on computers: FORTRAN was made to be put on stacks of “punch cards” and fed into a mechanical reader, so the code was formatted strangely and hard to read.

Hackers like to have “religious wars” on which programming language is “better” — just recently three of the programmers I admire most, and who are not young anymore, were having at it again on the topic of C++ and ugliness. Back in the mid-1980s young programmers were learning C and proud of it, and maybe experimenting with some trendy new languages like Smalltalk and Modula-II, and we would say that FORTRAN was gross.

But eventually, as the dialectic process of understanding such issues had time to work its magic, we realized that a lot of C code is dreadfully difficult to read, and at a certain point I even realized that some FORTRAN code was intelligently written (and modern versions of FORTRAN are almost-but-not-quite readable). Hence the expression “you can write FORTRAN in any language”, which is probably equivalent to “l’abito non fa il monaco”, or any other expression about pigs and lipstick. There is a scholarly article on it by Donn Seeley (make sure you read the PDF, their text rendition is poor).

(Don’t take this to mean that I think all languages are OK; many programming languages lack in expressiveness, or are really ugly in a variety of ways.)

So in the spirit of a rant I spent much more time on a side show. Ah well… The point is: you can do poor typesetting in LaTeX too, just as you can in MS-Word. The difference is that in LaTeX you have to work hard to do poor typesetting, while in Word you cannot do good typesetting. Some day the word processors might add some of the important points of typesetting (ligatures, intelligent spacing, good math formulae), at which point it will become possible to do good typesetting, but it will still be very difficult.

If you want to see more on this, including a lot of rants on how bad word processors are, a web search for the title of this posting gives some funny ones.

Posted on by | Leave a comment

starting with hdf5 — writing a data cube in C

pre(r)amble

Let us start using hdf5 to write out a “data cube” — in our example a series of scalar values f(x, y, z).

hdf5 is a library written in C but it can be called from many programming languages.  We will start by creating and writing a data cube from a C program (don’t worry, I will also show you a python version in another post), and then we will discuss various ways of examining this data cube. Edit a file, for example make-sample-cube.c, and start with:

the code

First include these .h files:

#include <hdf5.h>
#include <hdf5_hl.h>

This gives you access to the API. We will also need some other include files

We will need some more include files:

#include <stdlib.h>
#include <math.h>
#include <assert.h>

Let’s go top-down and write a main() routine that shows what we want to do in this program. We will have a fill_cube() routine which creates our data cube from mathematical functions, and a write_cube() routine which saves them to disk as hdf5 files. Since we’re taking a top-down approach, we declare their prototypes before the main() function:

double *fill_cube(int nx, int ny, int nz);
void write_cube(double *data, int nx, int ny, int nz, char *fname);

then the actual main():

int main(int argc, char *argv[])
{
  double *data_cube;
  int nx, ny, nz;
  nx = 64;
  ny = 57;
  nz = 50;
  char *fname = "sample_cube.h5";
  data_cube = fill_cube(nx, ny, nz);
  write_cube(data_cube, nx, ny, nz, fname);
  free(data_cube);
  return 0;
}

Now let’s write the functions we use:

/** 
 * Fills up a cube with artificial data so that the various slices
 * will be interesting to examine and visualize.  Note that this
 * routine uses malloc() to allocate memory for the cube, so the
 * caller has to use free() when it has finished using this cube.
 * 
 * @param nx 
 * @param ny 
 * @param nz 
 * 
 * @return the data cube, with space freshly allocated by malloc()
 */
double *fill_cube(int nx, int ny, int nz)
{
  int i, j, k;
  double x, y, z;
  double *data;
  data = (double *) malloc(nx*ny*nz*sizeof(*data));
  for (i = 0; i < nx; ++i) {
    for (j = 0; j < ny; ++j) {
      for (k = 0; k < nz; ++k) {
        x = (i - nx/2)/100.0;
        y = (j - ny/2)/100.0;
        z = (k - nz/2)/100.0;
        double val = (exp((-(z)*(z)/1.0
                           -(y)*(y)/1.0)
                          * (i+1)*(i+1)/10.0)
                      * (x+1)*(x+1));
        data[i*ny*nz + j*nz + k] = val;
      }
    }
  }
  return data;
}

Finally the write_cube() routine:

/** 
 * Writes a simple data cube out to disk as an hdf5 file
 * 
 * @param data 
 * @param nx 
 * @param ny 
 * @param nz 
 * @param fname 
 */
void write_cube(double *data, int nx, int ny, int nz, char *fname)
{
  /* now the various steps involved in preparing an hdf5 file */
  hid_t file_id;
  /* at this time our basic file just has the values strung out,
     so the hdf5 rank is 1 */
  hsize_t dims[3] = {nx, ny, nz};
  herr_t status;

  /* create a HDF5 file */
  file_id = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  /* create and write a double type dataset named "/datacube" */
  status = H5LTmake_dataset(file_id, "/datacube", 3, dims,
                            H5T_NATIVE_DOUBLE, data);
  assert(status != -1);
  /* add some hdf5 attributes with the metadata we need */
  H5LTset_attribute_int(file_id, "/datacube", "nx", &nx, 1);
  H5LTset_attribute_int(file_id, "/datacube", "nx", &nx, 1);
  H5LTset_attribute_int(file_id, "/datacube", "ny", &ny, 1);
  H5LTset_attribute_string(file_id, "/datacube", "x_units", "m");
  H5LTset_attribute_string(file_id, "/datacube", "y_units", "kg");
  H5LTset_attribute_string(file_id, "/datacube", "z_units", "degK");

  status = H5Fclose (file_id); assert(status != -1);
  assert(status != -1);
}

compiling and running

You will need the hdf5 library, which you can get on an Ubuntu or Debian system with:

sudo apt-get -u install libhdf5-serial-dev

You can compile this program with:

gcc -o make-sample-cube make-sample-cube.c -lm -lhdf5 -lhdf5_hl

You can run this program with:

./make-sample-cube

This creates a file called sample_cube.h5, and in a future post I will discuss various ways of looking at this data cube: slicing it and visualizing it.

discussion 1

Now I should find those honeyed words which allow you to wrap your head around the use of hdf5, understanding the broad strokes and the nuances at the same time. This is partly possible for this program, but you will soon learn that hdf5 goes into some brutal (and powerful) nitty gritty, at which point simple explanations miss the point entirely.

As usual with any library we #include the hdf5 .h files. Then we build a cube of data, and it’s worth mentioning in passing how we store the cube data in a C array:

C is clumsy for storing multidimensional arrays: you either hope you’re lucky and you have fixed size arrays (this happens when you’re first learning C, but not after). Otherwise you have to represent your multidimensional array using a C one-dimensional array, or do many nested memory allocations so that you can still use the syntax a[i][j][k].

After experimenting with various approaches people usually end up not using multiple memory allocations, but rather using a single one-dimensional array and index arithmetic to access the data in there. The routine fill_cube(), for example, uses this approach: what you might ideally represent as:

data[i][j][k] = val;

is instead represented as:

data[i*ny*nz + j*nz + k] = val;

this is called “row-major ordering” (where the last index is the one that varies most rapidly) and it is the convention used in C and many other languages (with the exceptions of FORTRAN and Matlab, which use column-major ordering). You probably don’t need to learn too much more, but you can read the wikipedia article on row-major ordering.

how the hdf5 writing is done

With that out of the way, let us see how we write the hdf5 file. hdf5 allows very complex and diverse ways of saving data. For example, you can have multiple datasets (these are almost always arrays of numbers) with all sorts of attributes (this is almost always metadata).

In this example we have a single dataset (our cube) and we start with very simple metadata/attributes (just the sizes and the units, which we have playfully set to be meters, kilograms and degrees Kelvin). This allows us to use the “hdf5 lite (H5LT)

The approach you can see in the code is that you take the following steps:

  1. Create an hdf5 file (it starts out empty).
  2. Make a dataset — this turns your array of data into an hdf5 “dataset”.
  3. Set some attributes (i.e. metadata).
  4. Close the file.

Note that there was no write() type of call: whatever you do with the data sets associated with an open file are guaranteed to be written out when you close the file.

discussion 2 with mild rant

If this were all you ever do with hdf5 I would say that the C interface to writing hdf5 files is a very good example of a C API. Unfortunately as soon as we go beyond this simple type of example (for example to read slices of data in efficient ways) the programming gets very detailed and ugly, and in my opinion this is one of the current design problems with hdf5: there is a nice API (the H5LT layer) for very simple cases (which you will very soon outgrow), and a powerful (but not nice) API for the full-blown library, but there is no intermediate level API for what most programmers will need most of the time.

Posted in data, rant, try this | Tagged , , , | Leave a comment

file formats for scientific data

I spent some time this last winter examining a variety of file formats for storing scientific data.  I also organized a “debate” in Los Alamos with people presenting various file formats.  I reached the following conclusions:

  • For itsy-bitsy-teeny-weeny data files, or configuration information, CSV (comma separated values) is OK.  There is not a clear standard, but reader libraries exist for many languages and if your file starts getting parsed incorrectly because of lack of standardization then you should move to another format!
  • For small files, less than 100 megabytes, I like ascii files with “structured comment metadata” (more below).
  • For medium-large files, 100 megabytes to 10 gigabytes, I like hdf5 with simple and somewhat ad-hoc approaches to
  • For very large data sets you have to start using a database approach, and there is great loss of innocence (i.e. introduction of complexity) here.  Current SQL databases have sad limitations vis-a-vis scientific data, but are probably required for effective searches.  A hybrid approach of and SQL database pointing to hdf5 for mass storage might work. PostgreSQL is probably the best current choice for scientific data.
  • scidb is designed very carefully to address these limitations in SQL databases.  It’s a work in progress and not yet ready for robust deployment.

There are also formats whose adoption is a historical mistake and these formats should not be used to store scientific data.  People in the fields in which these formats are used should be leading their communities to move away from those formats:

  • FITS: this format is widely used in astronomy and it has very severe limitations which make it an albatross around the neck of astronomical data work.  Sometimes you have a poor format with a nice way for programs to read that format, and you can put up with it, but in the case of FITS it has a poor file format and the reference API for file access (CFITSIO) is about as bad a library interface as I have seen.
  • CDF: this was NASA’s original attempt at a proper data format.  It was an OK design for its time, but it suffers from old design and lack of development momentum.  Its portion of the data storage world should migrate to hdf5.

So how do you lead a community away from a file format?  Arguing against it does not help too much — what will make the difference is when a good hacker on a project that uses FITS presents a parallel way of doing FITS and hdf5.  There is a fits2hdf script, and if all astronomy programs can be adapted to read the hdf5 file as well as the fits file then it’s a good start.

That way you show your collaborators that you are not requiring them to make a sudden change, and they can look at the new approach while not having been forced out of their comfort zone.

Posted in data | Tagged , , , , , | 3 Comments

can anyone believe your results? (reproducibility)

Here are things I have heard working scientists say:

  • I made the plot in excel
  • I played with the plot in excel until it looked good
  • I’m not sure how I got from the raw data to the plot
  • I’m not sure which data collection run gave me that data file
  • my collaborator sent me that data file in an email attachment; I fixed a problem and sent it back to him in an email attachment

I have the impression that a lot of papers I look at have results that are not reproducible.  If the papers made a claim to a major discovery with great practical consequence, they would be examined very closely and might shown to be unreproducible.

The Wikipedia article on Cold Fusion has an instructive and sadly amusing blow-by-blow description of the circus that followed the 1988 announcement that Fleischmann and Pons had found the solution to the world’s energy problems with cold fusion.  This was not a typical boring and pointless paper, so people all over the world tried to reproduce their experiment, and it turns out there was nothing there, although interestingly some groups proved that wishful thinking can be quite powerful.

Much has been written about Fleischmann and Pons, but I have not found if their problem was lack of reproducibility (so they found a plot that looked good and just focused on that) or if there was something darker at work.  (Does anyone know the inner story on this?)

But what is quite clear is that if your bit of research is important then it will be scrutinized and you had better have a clear reproducible trail leading from the experiment (or simulation), through the various stages of calibration adjustment and analysis, up to the plots in the paper.  If you do not have  an automatic way of doing all that, then you will be embarrassed by this scrutiny.  If your collected data cannot be automatically linked (probably via metadata) to the exact instrument configuration at the time of collection, then you do not really have a scientific result: you have something suggestive but not convincing.

A good example I have seen of this rigor for a large scale project is Andy Fraser’s book on Hidden Markov Models.  You can download the book and what you get is a source code archive — you run the compilation scripts and they build the book, including running all the programs which generate the data for the plots, and then generate the plots themselves.  Dependencies are tracked: if the time stamp on a file is changed, everything that requires that file’s information will be rebuilt.  (Yes, it uses “make”.)

In an experimental setting this is even more important: raw data is taken once, but the information used to process that raw data might be updated, at which point the raw data has to be turned into finished data products automatically, without human intervention.  This is often not done or left until later (i.e. never).

I think that the solution to this problem involves a cocktail of the following, which really mesh with each other:

  • availability of raw data
  • availability of all processing codes
  • provenance
  • metadata
  • version control
  • software pipelines

I plan to discuss how these ideas play into good reproducible science and how one should program to guarantee reproducibility.

… some vaguely related links:

Carlo Graziani’s article on Ed Fenimore and honesty in science (in particular his link on the Ginga lines)

Discussions of the Fermilab tevatron mirage

Blas Cabrera’s possible detection of a magnetic monopole in 1982

The Atlantic Monthly’s article Lies, Damned Lies, and Medical Science

Posted in meta, rant | Tagged | 1 Comment

Kevin McCarty’s “physics software rant”

I have always liked Kevin McCarty’s Physics Software Rant.  I think he’s on the money in pretty much all his points.

He starts out with “First and most importantly — Choose a license!”, then goes on to show many specific examples of Physics software which could do their releasing much better, then concludes with:

Although I have been harsh, it was not my intent to insult anyone. I am certainly grateful for the wide variety of free physics software available on the Internet. I just wish that, after spending man-years developing their software, people would take a few extra days putting finishing touches on it to avoid problems like those discussed above. This would go a long way towards making physicists and sysadmins everywhere happy, and wasting a lot less of their time.

I recommend reading through it if you are writing software which you intend to release to others.  You can find it here: http://starplot.org/articles/physics-software-rant.html

Posted in rant | Tagged , , | Leave a comment

volume visualization — mayavi and mayavi2

Some programs are truly delightful; I have long had a list of those that charmed me.

One of these was  mayavi, the volume visualization program.  Volume visualization is quite different from what people think of as “3D graphics” — a surface plot might be called 3D graphics, although you are really looking at a 2D surface embedded in a 3D space.

A simple example of surface plotting with gnuplot would be:


$ gnuplot

gnuplot> set pm3d

gnuplot> splot (x**2)*(y**2)

which shows the function f(x, y) = x^2 * y^2 plotted on the z axis.

Volume visualization would involve showing functions of (x, y, z) (not just x, y).  Since we have no fourth axis upon which to show this, we use a variety of tricks to visualize it.  One example of such a function would be density.

mayavi implements many of these tricks and allows you to navigate with the best virtual reality navigation mouse bindings I have seen — left for rotation, middle for translation, right for zoom.

You can run a simple command line example of mayavi2 like this (paths from the Debian/Ubuntu distribution):

mayavi2 -d /path/to/heart.vtk -m Outline -m IsoSurface -m GridPlane -m ScalarCutPlane &

I was disappointed with mayavi2, and if mayavi1.5 were still packaged I would probably still use that.  mayavi2 has all sorts of fancy features, but it is huge and slow and not the tight little program that focuses on what I used to use it for.  Also: the authors have tried to turn it into a platform, but I’m not so much of a platform kind-o-guy — I would have preferred that they simply improve the API to allow a python program to invoke mayavi on a memory-resident data set, but it looks like mayavi2 is now the mainstream version.

For a tutorial on mayavi2 and their heart.vtk example try this: mayavi2 heart.vtk tutorial

Posted in try this | Leave a comment

should a scientist be a hacker?

I always enjoyed programming, so I’m inclined to think that all scientists should also be good hackers.  On the other hand many very good scientists are not, so my inclination is probably incorrect.

Still, every scientist should know how to program to some extent, and there should be a simple path as scientists move through university and the PhD process to let them choose a level of programming expertise and learn up to that level, as well as to know what levels come beyond that.

It’s important to teach researchers to do real programming and not just using user-level software, so a physics department should make sure that their students know how to write serious standalone programs: it is a disservice to let students finish university thinking that they will never need to know more than mathematica or matlab.

For example, here is a possible outline of levels of programming a physicist should know about:

  1. learn python so that you can read a data file and manipulate the file, possibly converting it to other formats
  2. learn a plotting system to use with python, such as matplotlib
  3. learn to program in C so that you can write low-level software to control hardware experiments
  4. learn to make a build system for your software, make and possibly more advanced stuff like autotools or cmake’
  5. use version control (RCS for “just yourself” single files, Mercurial for larger distributed projects)
  6. learn to be a sysadmin
  7. learn to use metadata and some file formats, from ascii columns with structured comment metadata to CSV files to HDF5
  8. learn to package your software for Debian-based or RPM-based distributions
  9. learn to write a graphical user interface in Python, possibly using wxPython as your widget set
  10. learn to program SQL databases like postgres (maybe some day better things like SciDB will be available) using the very high level language library bindings
  11. now that you have learned so many software systems, spend some time learning how to navigate the trade-offs between simplicity and power that all these choices bring

one of your advisor’s jobs at university would be to help you identify which of these levels you want to reach and to find a mentor to help you reach that level.

For life science and social science the requirements are probably fewer.  Physicists develop instrumentation and electronics, but biologists and anthropologists usually don’t, so programming in C is probably less important.  On the other hand for them it’s important to learn pre-packaged statistical systems like  R (http://www.r-project.org/)

Posted in meta, unix | 2 Comments