a bit of knowledge of regular expressions

Today a collaborator who is working toward more automation in her everyday computing asked me how to use the amazing command-line download tool wget to get the images out of a web page.

wget has options to grab a whole web site: if you want a page and everything in child directories linked on that server, you can use the “recursive”, “no parent” and “no clobber” options:

wget -r -np -nc http://www.gnu.org/software/wget/manual/html_node/index.html

will give you a basic mirror of that site.  It works quite well for this particular site because there is no active content that needs to be followed heuristically, but it will not retrieve linked files or images unless they are under the top-level directory of that URL.

So this specific page is “The Illustrated Guide to a Ph.D.”, which has gone viral recently.  Note that when the openculture web site took it in they separated out the images: if you look at the page’s source (Control-U in most browsers) you will see that the page lives under http://www.openculture.com/2010/09/, but its images are stored at locations such as http://www.openculture.com/images/PhDKnowledge.001.jpg

So the simple wget command will not grab the images.  You might want to automate the process of downloading them all, and there is a very simple shell command sequence to do so.  Start by making yourself a sandbox area:

mkdir ~/sandbox
cd ~/sandbox

Then grab the top level URL:

wget http://www.openculture.com/2010/09/the_illustrated_guide_to_a_phd.html

Now you will have the file the_illustrated_guide_to_a_phd.html in your current directory.  Next experiment with grep and sed to get the list of jpg URLs:

grep '\.jpg' the_illustrated_guide_to_a_phd.html

Now that you see what lines have a .jpg in them, use the sed command to edit out everything except for the URL.  A typical line looks like:

<div class="graphic"><img alt="" src="http://www.openculture.com/images/PhDKnowledge.006.jpg" width="440" /></div>

which clearly needs to be cleaned up.  This sed command uses a regular expression that matches everything between the =” and the jpg” and prints out just that part.  The \( and \) make a “group” of the interesting part, and that group is what gets printed by the \1:

grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/'

which gives you lines like:

http://www.openculture.com/images/PhDKnowledge.005.jpg

You are now ready to use wget on those individual image URLs.  First capture the list of URLs in a shell variable:

JPG_LIST=`grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/'`

Then iterate through that list grabbing each URL:

for jpg_url in $JPG_LIST
do
    echo $jpg_url
    wget "$jpg_url"
done

Here is the whole inline script in one piece (you can paste it in as is):

mkdir ~/sandbox
cd ~/sandbox
wget http://www.openculture.com/2010/09/the_illustrated_guide_to_a_phd.html
JPG_LIST=`grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/'`

for jpg_url in $JPG_LIST
do
    echo $jpg_url
    wget "$jpg_url"
done
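
The same idea can be sketched in Python, mainly to show the regular expression “group” trick again in another setting: the parentheses in the pattern play the role of sed’s \( and \).  This is an illustration of mine (Python 3, standard library only), not part of the original shell recipe:

import re
import urllib.request

url = "http://www.openculture.com/2010/09/the_illustrated_guide_to_a_phd.html"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

# findall() returns just the grouped part of each match: the image URL
jpg_urls = re.findall(r'src="([^"]*\.jpg)"', html)

for jpg_url in jpg_urls:
    print(jpg_url)
    urllib.request.urlretrieve(jpg_url, jpg_url.split("/")[-1])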

A final note: the original article by Matthew Might was on his own web site, and he organized his page so that the images live in the same directory hierarchy as the HTML.  This is a more robust web site layout, and the recursive wget command would have mirrored it well:

wget -r -np -nc http://matt.might.net/articles/phd-school-in-pictures/
find matt.might.net -name '*.jpg' -print

programmers who get respect

I was just re-reading this interesting rant from Zed Shaw, author of “Learn Python The Hard Way” and other “the hard way” books: http://learnpythonthehardway.org/book/advice.html

The whole blurb is worth reading to see his point of view, although I disagree with some of what he says. I found myself really enjoying this paragraph:

People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines.

I also started looking at Shaw’s new book “Learn C The Hard Way” and came upon a page in which he gives this advice:

WARNING: Do Not Use An IDE

An IDE, or “Integrated Development Environment” will turn you stupid. They are the worst tools if you want to be a good programmer because they hide what’s going on from you, and your job is to know what’s going on. They are useful if you’re trying to get something done and the platform is designed around a particular IDE, but for learning to code C (and many other languages) they are pointless.

I like it when other people save me the trouble of using the harsh words.


Using PyEphem to get the ground coordinates of a satellite

The PyEphem package is pretty amazing. The tutorial will show you many things.

I had a simple application for it: I needed to take a satellite’s orbital information and get the ground longitude/latitude under that satellite at a given time.

A satellite’s orbital information is usually given by what are called “two line elements” (TLEs) (see the example below), but some calculation is necessary to go from a TLE to the position over earth at a given time. PyEphem does it with great simplicity.

You can install PyEphem with Python’s pip packaging system:

pip install pyephem

and then:

  1. start with the satellite TLE, which you can obtain in a variety of ways. In real life we might get this from a program which grabs it from the web (for example here), but to hard-code a simple case (international space station) we can simply paste this code into a Python interpreter:
    import ephem
    import datetime
    ## [...]
    name = "ISS (ZARYA)"
    line1 = "1 25544U 98067A   12304.22916904  .00016548  00000-0  28330-3 0  5509"
    line2 = "2 25544  51.6482 170.5822 0016684 224.8813 236.0409 15.51231918798998"
  2. create a PyEphem Body object from it:
    tle_rec = ephem.readtle(name, line1, line2)
    tle_rec.compute()

    Note that the ephem.readtle() routine creates a PyEphem Body object from that TLE, and the compute() method recalculates all the parameters in the Ephem body for the current moment in time (datetime.datetime.now()).

    You can use any moment in time and calculate the values at that time with something like tle_rec.compute(datetime.datetime(2012, 1, 8, 11, 23, 42)) for January 8, 2012, 11:23:42am.

  3. Obtain the longitude and latitude from the tle_rec object:
    print tle_rec.sublong, tle_rec.sublat

    These sublong and sublat values are expressed as PyEphem Angle objects; when printed they are human readable longitude/latitude strings: -44:41:57.2 47:52:58.9. When accessed as real numbers they are in radians, so keep that in mind when adapting them to use with something like Python’s excellent Basemap package, which uses degrees instead of radians.

    So, two lines of code to go from TLE to ground longitude/latitude; pretty good, eh?
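
    To tie those steps together, here is the whole thing as one small script.  It is just a consolidation of the snippets above; the conversion to degrees at the end is my addition, using Python’s math.degrees(), for when you need degrees (for example with Basemap):

    import math
    import datetime
    import ephem

    name = "ISS (ZARYA)"
    line1 = "1 25544U 98067A   12304.22916904  .00016548  00000-0  28330-3 0  5509"
    line2 = "2 25544  51.6482 170.5822 0016684 224.8813 236.0409 15.51231918798998"

    tle_rec = ephem.readtle(name, line1, line2)
    # compute the orbital parameters for a specific moment in time
    tle_rec.compute(datetime.datetime(2012, 1, 8, 11, 23, 42))

    # printed, the Angle objects are human-readable strings; as numbers they are radians
    print("subpoint: %s %s" % (tle_rec.sublong, tle_rec.sublat))
    print("in degrees: %.4f %.4f" % (math.degrees(tle_rec.sublong),
                                     math.degrees(tle_rec.sublat)))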


word processor versus typesetter

Today I saw a couple of scientific papers that were written with a word processor (probably Microsoft Word, but possibly LibreOffice) instead of a typesetting system (such as TeX/LaTeX).  I realized how fortunate I am that I seldom have to read papers like that: they are tiring to read.  Fortunately, most of the papers I read are typeset with TeX as the back-end engine.

There are two major reasons for which a scientist writing a paper (or anyone writing a nontrivial document) should not make typesetting decisions:

1. Typesetting is an old and deep art which really matters.  You can become fatigued by reading a book or article that is typeset poorly, and amateurs do not know this art.  One simple example is the “optimal line length”: lines should have about 60 to 65 characters, and reading is impaired when they are longer or shorter.  (Try looking through a typical book and counting how many letters are on an average line.)  People who set their own margins violate this, and their readers tire.  There are many more areas of typesetting where amateurish tinkering makes the reading experience subtly more tiring.

2. It is a waste of time.  Fiddling with margins, fonts and boldface is not the author’s job.  The author should worry about the meaning of what s/he is writing, not the format.  By playing with your word processor you are indulging in something that does not give much satisfaction and certainly does not increase your productivity.

This applies to any document that is more than a few paragraphs long, but it gets even worse when mathematical notation is involved.

In LaTeX you write your text and the program does the structuring and typesetting for you, applying a style that was designed by professionals who spent a lot of time researching how to do it and who have the benefit of centuries of wisdom on book layout.  A very good introduction is the classic article “The Not So Short Introduction to LaTeX”.

There are a couple of things to keep in mind when getting going with LaTeX rather than a word processor approach:

1. Someone once put it well by saying that “in TeX the easy things are hard to do and the hard things are easy to do”.  Getting started on your document involves including some preamble text to specify the document structure, and using special markup commands to make lists, which might seem harder than using a word processor.  If you write a lot this learning curve disappears in a hurry, and you reap the benefits of having the truly difficult and time-consuming tasks done for you automatically.

2. Make sure that you remember that “user friendliness” is a subtle matter. The acronym WYSIWYG (“what you see is what you get”) has been well parodied as WYSIAYG (“what you see is all you get”).

Now although LaTeX guides you toward a well-structured document, many people go and find the special commands that can be used to change the layout. The most frequent thing I see is people wanting tiny margins to make a paper look like it has fewer pages (two-column mode might be better for this purpose).

There is an old saying among programmers from my generation, who were writing code in the mid-1980s: you can “write FORTRAN in any language”. Since this is a rambling long-winded rant I will explain this old adage by reminding you of the context in those years. FORTRAN, which appeared in 1957, could be said to be the “first programming language”, preceding LISP (1958) and COBOL (1959).

People who learned to program with languages like C and Pascal in the 1980s tended to pooh-pooh FORTRAN code as being very ugly, and it really is: the widely used paradigms for programming in FORTRAN were almost painfully hard to follow.  But mostly FORTRAN suffered from the fact that it was designed before people used keyboards on computers: FORTRAN was made to be punched onto stacks of “punch cards” and fed into a mechanical reader, so the code was formatted strangely and was hard to read.

Hackers like to have “religious wars” about which programming language is “better” — just recently three of the programmers I admire most, and who are not young anymore, were at it again on the topic of C++ and ugliness.  Back in the mid-1980s young programmers were learning C and proud of it, perhaps experimenting with trendy new languages like Smalltalk and Modula-2, and we would say that FORTRAN was gross.

But eventually, as the dialectic process of understanding such issues had time to work its magic, we realized that a lot of C code is dreadfully difficult to read, and at a certain point I even realized that some FORTRAN code was intelligently written (and modern versions of FORTRAN are almost-but-not-quite readable).  Hence the expression “you can write FORTRAN in any language”, which is probably equivalent to “l’abito non fa il monaco” (“the habit does not make the monk”), or any other expression about pigs and lipstick.  There is a scholarly article on it by Donn Seeley (make sure you read the PDF; their text rendition is poor).

(Don’t take this to mean that I think all languages are OK; many programming languages are lacking in expressiveness, or are really ugly in a variety of ways.)

So, in the spirit of a rant, I have spent much more time on a sideshow.  Ah well… The point is: you can do poor typesetting in LaTeX too, just as you can in MS Word.  The difference is that in LaTeX you have to work hard to do poor typesetting, while in Word you cannot do good typesetting.  Some day word processors might add some of the important elements of typesetting (ligatures, intelligent spacing, good math formulae), at which point good typesetting will become possible in them, but it will still be very difficult.

If you want to see more on this, including a lot of rants on how bad word processors are, a web search for the title of this posting gives some funny ones.


starting with hdf5 — writing a data cube in C

pre(r)amble

Let us start using hdf5 to write out a “data cube” — in our example a series of scalar values f(x, y, z).

hdf5 is a library written in C, but it can be called from many programming languages.  We will start by creating and writing a data cube from a C program (don’t worry, I will also show you a python version in another post), and then we will discuss various ways of examining this data cube.  Edit a file, for example make-sample-cube.c, and build it up as follows.

the code

First include these .h files:

#include <hdf5.h>
#include <hdf5_hl.h>

This gives you access to the hdf5 API, including the high-level “lite” (H5LT) interface.

We will also need some standard include files:

#include <stdlib.h>
#include <math.h>
#include <assert.h>

Let’s go top-down and write a main() routine that shows what we want to do in this program. We will have a fill_cube() routine which creates our data cube from mathematical functions, and a write_cube() routine which saves them to disk as hdf5 files. Since we’re taking a top-down approach, we declare their prototypes before the main() function:

double *fill_cube(int nx, int ny, int nz);
void write_cube(double *data, int nx, int ny, int nz, char *fname);

then the actual main():

int main(int argc, char *argv[])
{
  double *data_cube;
  int nx, ny, nz;
  nx = 64;
  ny = 57;
  nz = 50;
  char *fname = "sample_cube.h5";
  data_cube = fill_cube(nx, ny, nz);
  write_cube(data_cube, nx, ny, nz, fname);
  free(data_cube);
  return 0;
}

Now let’s write the functions we use:

/** 
 * Fills up a cube with artificial data so that the various slices
 * will be interesting to examine and visualize.  Note that this
 * routine uses malloc() to allocate memory for the cube, so the
 * caller has to use free() when it has finished using this cube.
 * 
 * @param nx 
 * @param ny 
 * @param nz 
 * 
 * @return the data cube, with space freshly allocated by malloc()
 */
double *fill_cube(int nx, int ny, int nz)
{
  int i, j, k;
  double x, y, z;
  double *data;
  data = (double *) malloc(nx*ny*nz*sizeof(*data));
  for (i = 0; i < nx; ++i) {
    for (j = 0; j < ny; ++j) {
      for (k = 0; k < nz; ++k) {
        x = (i - nx/2)/100.0;
        y = (j - ny/2)/100.0;
        z = (k - nz/2)/100.0;
        double val = (exp((-(z)*(z)/1.0
                           -(y)*(y)/1.0)
                          * (i+1)*(i+1)/10.0)
                      * (x+1)*(x+1));
        data[i*ny*nz + j*nz + k] = val;
      }
    }
  }
  return data;
}

Finally the write_cube() routine:

/** 
 * Writes a simple data cube out to disk as an hdf5 file
 * 
 * @param data 
 * @param nx 
 * @param ny 
 * @param nz 
 * @param fname 
 */
void write_cube(double *data, int nx, int ny, int nz, char *fname)
{
  /* now the various steps involved in preparing an hdf5 file */
  hid_t file_id;
  /* our data cube is a three-dimensional array of nx*ny*nz values,
     so the hdf5 rank is 3 */
  hsize_t dims[3] = {nx, ny, nz};
  herr_t status;

  /* create a HDF5 file */
  file_id = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  /* create and write a double type dataset named "/datacube" */
  status = H5LTmake_dataset(file_id, "/datacube", 3, dims,
                            H5T_NATIVE_DOUBLE, data);
  assert(status != -1);
  /* add some hdf5 attributes with the metadata we need */
  H5LTset_attribute_int(file_id, "/datacube", "nx", &nx, 1);
  H5LTset_attribute_int(file_id, "/datacube", "ny", &ny, 1);
  H5LTset_attribute_int(file_id, "/datacube", "nz", &nz, 1);
  H5LTset_attribute_string(file_id, "/datacube", "x_units", "m");
  H5LTset_attribute_string(file_id, "/datacube", "y_units", "kg");
  H5LTset_attribute_string(file_id, "/datacube", "z_units", "degK");

  status = H5Fclose(file_id);
  assert(status != -1);
}

compiling and running

You will need the hdf5 library, which you can get on an Ubuntu or Debian system with:

sudo apt-get -u install libhdf5-serial-dev

You can compile this program with:

gcc -o make-sample-cube make-sample-cube.c -lm -lhdf5 -lhdf5_hl

You can run this program with:

./make-sample-cube

This creates a file called sample_cube.h5, and in a future post I will discuss various ways of looking at this data cube: slicing it and visualizing it.

discussion 1

Now I should find those honeyed words which allow you to wrap your head around the use of hdf5, understanding the broad strokes and the nuances at the same time. This is partly possible for this program, but you will soon learn that hdf5 goes into some brutal (and powerful) nitty gritty, at which point simple explanations miss the point entirely.

As usual with any library we #include the hdf5 .h files. Then we build a cube of data, and it’s worth mentioning in passing how we store the cube data in a C array:

C is clumsy for storing multidimensional arrays: either you are lucky enough to have fixed-size arrays (this happens when you are first learning C, but not after), or you have to represent your multidimensional array using a one-dimensional C array, or you do many nested memory allocations so that you can still use the syntax a[i][j][k].

After experimenting with various approaches people usually end up not using multiple memory allocations, but rather using a single one-dimensional array and index arithmetic to access the data in there. The routine fill_cube(), for example, uses this approach: what you might ideally represent as:

data[i][j][k] = val;

is instead represented as:

data[i*ny*nz + j*nz + k] = val;

This is called “row-major ordering” (the last index is the one that varies most rapidly), and it is the convention used in C and many other languages (with the exceptions of FORTRAN and Matlab, which use column-major ordering).  You probably do not need to learn much more, but you can read the Wikipedia article on row-major ordering.
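
If you have numpy at hand, you can convince yourself of this index arithmetic in a few lines (numpy uses C’s row-major ordering by default; this little check is my addition, not part of the C program):

import numpy as np

nx, ny, nz = 4, 3, 2
cube = np.arange(nx*ny*nz).reshape((nx, ny, nz))  # a small cube in row-major (C) order
flat = cube.ravel()                               # the same values viewed as a 1-D array

i, j, k = 2, 1, 1
assert cube[i, j, k] == flat[i*ny*nz + j*nz + k]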

how the hdf5 writing is done

With that out of the way, let us see how we write the hdf5 file. hdf5 allows very complex and diverse ways of saving data. For example, you can have multiple datasets (these are almost always arrays of numbers) with all sorts of attributes (this is almost always metadata).

In this example we have a single dataset (our cube) and we start with very simple metadata/attributes (just the sizes and the units, which we have playfully set to be meters, kilograms and degrees Kelvin).  This allows us to use the “hdf5 lite” (H5LT) high-level API.

The approach you can see in the code is that you take the following steps:

  1. Create an hdf5 file (it starts out empty).
  2. Make a dataset — this turns your array of data into an hdf5 “dataset”.
  3. Set some attributes (i.e. metadata).
  4. Close the file.

Note that there was no write() type of call: whatever you do with the datasets associated with an open file is guaranteed to be written out when you close the file.
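
The python version deserves its own post, but as a preview, the same four steps look roughly like this with the h5py package.  This is a sketch of mine (h5py is a separate Python binding, not the C H5LT API used above); the dataset and attribute names follow the C program:

import numpy as np
import h5py

nx, ny, nz = 64, 57, 50
cube = np.zeros((nx, ny, nz))                       # stand-in for fill_cube()

with h5py.File("sample_cube.h5", "w") as f:         # 1. create the file
    dset = f.create_dataset("datacube", data=cube)  # 2. make the dataset
    dset.attrs["nx"] = nx                           # 3. set attributes (metadata)
    dset.attrs["ny"] = ny
    dset.attrs["nz"] = nz
    dset.attrs["x_units"] = "m"
    dset.attrs["y_units"] = "kg"
    dset.attrs["z_units"] = "degK"
# 4. the file is flushed and closed when the "with" block ends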

discussion 2 with mild rant

If this were all you ever did with hdf5, I would say that the C interface for writing hdf5 files is a very good example of a C API.  Unfortunately, as soon as we go beyond this simple type of example (for instance to read slices of data in efficient ways), the programming gets very detailed and ugly, and in my opinion this is one of the current design problems with hdf5: there is a nice API (the H5LT layer) for very simple cases (which you will very soon outgrow), and a powerful (but not nice) API for the full-blown library, but there is no intermediate-level API for what most programmers will need most of the time.


file formats for scientific data

I spent some time this last winter examining a variety of file formats for storing scientific data.  I also organized a “debate” in Los Alamos with people presenting various file formats.  I reached the following conclusions:

  • For itsy-bitsy-teeny-weeny data files, or configuration information, CSV (comma separated values) is OK.  There is not a clear standard, but reader libraries exist for many languages and if your file starts getting parsed incorrectly because of lack of standardization then you should move to another format!
  • For small files, less than 100 megabytes, I like ascii files with “structured comment metadata” (more below).
  • For medium-large files, 100 megabytes to 10 gigabytes, I like hdf5 with simple and somewhat ad-hoc approaches to metadata.
  • For very large data sets you have to start using a database approach, and there is great loss of innocence (i.e. introduction of complexity) here.  Current SQL databases have sad limitations vis-a-vis scientific data, but are probably required for effective searches.  A hybrid approach of an SQL database pointing to hdf5 for mass storage might work.  PostgreSQL is probably the best current choice for scientific data.
  • scidb is designed very carefully to address these limitations in SQL databases.  It’s a work in progress and not yet ready for robust deployment.

There are also formats whose adoption is a historical mistake and these formats should not be used to store scientific data.  People in the fields in which these formats are used should be leading their communities to move away from those formats:

  • FITS: this format is widely used in astronomy, and it has very severe limitations which make it an albatross around the neck of astronomical data work.  Sometimes you have a poor format with a nice way for programs to read it, and you can put up with that, but FITS has both a poor file format and a reference API for file access (CFITSIO) that is about as bad a library interface as I have seen.
  • CDF: this was NASA’s original attempt at a proper data format.  It was an OK design for its time, but it suffers from old design and lack of development momentum.  Its portion of the data storage world should migrate to hdf5.

So how do you lead a community away from a file format?  Arguing against it does not help too much — what will make the difference is when a good hacker on a project that uses FITS presents a parallel way of doing FITS and hdf5.  There is a fits2hdf script, and if all astronomy programs can be adapted to read the hdf5 file as well as the fits file then it’s a good start.

That way you show your collaborators that you are not requiring them to make a sudden change, and they can look at the new approach while not having been forced out of their comfort zone.
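
To make the “parallel way” concrete, the kind of shim I have in mind looks something like this.  It is only a sketch: it assumes the astropy and h5py Python packages, and it assumes that the hdf5 twin of a FITS file stores its array under an agreed-upon dataset name (here “data”), which is my own convention, not fits2hdf’s:

import h5py
from astropy.io import fits

def read_image(path):
    """Read an N-dimensional array from either a FITS file or its hdf5 twin."""
    if path.endswith((".h5", ".hdf5")):
        with h5py.File(path, "r") as f:
            return f["data"][...]   # agreed-upon dataset name (an assumption)
    return fits.getdata(path)       # data from the first HDU that has any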

Related posts:

starting with hdf5 — writing a data cube in C


can anyone believe your results? (reproducibility)

Here are things I have heard working scientists say:

  • I made the plot in excel
  • I played with the plot in excel until it looked good
  • I’m not sure how I got from the raw data to the plot
  • I’m not sure which data collection run gave me that data file
  • my collaborator sent me that data file in an email attachment; I fixed a problem and sent it back to him in an email attachment

I have the impression that a lot of the papers I look at have results that are not reproducible.  If those papers claimed a major discovery with great practical consequence, they would be examined very closely and might be shown to be unreproducible.

The Wikipedia article on Cold Fusion has an instructive and sadly amusing blow-by-blow description of the circus that followed the 1989 announcement that Fleischmann and Pons had found the solution to the world’s energy problems with cold fusion.  This was not a typical boring and pointless paper, so people all over the world tried to reproduce their experiment, and it turns out there was nothing there, although interestingly some groups proved that wishful thinking can be quite powerful.

Much has been written about Fleischmann and Pons, but I have not found out whether their problem was lack of reproducibility (so they found a plot that looked good and just focused on that) or whether there was something darker at work.  (Does anyone know the inner story on this?)

But what is quite clear is that if your bit of research is important then it will be scrutinized and you had better have a clear reproducible trail leading from the experiment (or simulation), through the various stages of calibration adjustment and analysis, up to the plots in the paper.  If you do not have  an automatic way of doing all that, then you will be embarrassed by this scrutiny.  If your collected data cannot be automatically linked (probably via metadata) to the exact instrument configuration at the time of collection, then you do not really have a scientific result: you have something suggestive but not convincing.

A good example I have seen of this rigor for a large scale project is Andy Fraser’s book on Hidden Markov Models.  You can download the book and what you get is a source code archive — you run the compilation scripts and they build the book, including running all the programs which generate the data for the plots, and then generate the plots themselves.  Dependencies are tracked: if the time stamp on a file is changed, everything that requires that file’s information will be rebuilt.  (Yes, it uses “make”.)

In an experimental setting this is even more important: raw data is taken once, but the information used to process that raw data might be updated, at which point the raw data has to be turned into finished data products automatically, without human intervention.  This is often not done or left until later (i.e. never).

I think that the solution to this problem involves a cocktail of the following, which really mesh with each other:

  • availability of raw data
  • availability of all processing codes
  • provenance
  • metadata
  • version control
  • software pipelines

I plan to discuss how these ideas play into good reproducible science and how one should program to guarantee reproducibility.
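
As a tiny illustration of how the “provenance” and “metadata” ingredients above might look in practice, here is the kind of helper I have in mind.  It is a sketch of my own: the JSON “sidecar” convention and the field names are assumptions, not an established standard:

import datetime
import hashlib
import json
import sys

def write_provenance(output_file, input_files, description=""):
    """Write a JSON sidecar recording how output_file was produced."""
    def sha256(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    record = {
        "output": output_file,
        "inputs": {path: sha256(path) for path in input_files},
        "script": sys.argv[0],
        "created": datetime.datetime.now().isoformat(),
        "description": description,
    }
    with open(output_file + ".provenance.json", "w") as f:
        json.dump(record, f, indent=2)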

… some vaguely related links:

Carlo Graziani’s article on Ed Fenimore and honesty in science (in particular his link on the Ginga lines)

Discussions of the Fermilab tevatron mirage

Blas Cabrera’s possible detection of a magnetic monopole in 1982

The Atlantic Monthly’s article Lies, Damned Lies, and Medical Science
