R Surprises

My favourite programming language is Mercury and I have many good reasons for the choice. Sometimes I need to interface with tools implemented in other languages that I know only very superficially. One of them is R, and the number of 'gotchas' has exceeded my threshold. Instead of smashing things, I'll write about it.

The name of R was chosen badly. And so were other names in the R Project.

R is a successor of S, hence the name (please ignore the fact that R comes before S in the alphabet). The choice was made before people searched the web with Google, which somewhat excuses the authors. Nowadays, you have to add the very confusing and ambiguous term "R" to all your Google queries and hope that some mildly relevant results show up somewhere near the top...

See below for other bad naming decisions: strings are called characters, and the acronym for the fundamental data types (SXP) suggests a plural where no plural is meant (REALSXP).

R is documented by random examples.

R has extensive documentation. Loads of pages (manual pages, 'info' (texinfo) pages, html/pdf/ps versions of the manuals, built-in help, gentle introductions, random users' comments, mailing list archives) have been written. Unfortunately, none of the contributors seems to have combined the necessary authority with the necessary nitpicking, so most of the documents are partial comments rather than full descriptions. The same holds for the source code.

Citing the manual, emphasis is mine:

The usual R object modes are given in the table... ... Among the important internal SEXPTYPEs are LANGSXP, CHARSXP, PROMSXP, etc.

In other words, the "authoritative source" lists some, not all, of the defined types. Some of those are further classified as important, yet the wording explicitly suggests that there are other important types that have not been marked as such.

Citing from /usr/lib/R/include/Rinternals.h:

/* Fundamental Data Types:  These are largely Lisp
 * influenced structures, with the exception of LGLSXP,
 * INTSXP, REALSXP, CPLXSXP and STRSXP which are the
 * element types for S-like data objects.

 * Note that the gap of 11 and 12 below is because of
 * the withdrawal of native "factor" and "ordered" types.
 *			--> TypeTable[] in ../main/util.c for  typeof()

#define NILSXP	     0	  /* nil = NULL */
#define SYMSXP	     1	  /* symbols */
#define LISTSXP	     2	  /* lists of dotted pairs */
#define CLOSXP	     3	  /* closures */
#define ENVSXP	     4	  /* environments */
#define PROMSXP	     5	  /* promises: [un]evaluated closure arguments */
#define LANGSXP	     6	  /* language constructs (special lists) */
#define SPECIALSXP   7	  /* special forms */
#define BUILTINSXP   8	  /* builtin non-special forms */
#define CHARSXP	     9	  /* "scalar" string type (internal only)*/
#define LGLSXP	    10	  /* logical vectors */
#define INTSXP	    13	  /* integer vectors */
#define REALSXP	    14	  /* real variables */
#define CPLXSXP	    15	  /* complex variables */
#define STRSXP	    16	  /* string vectors */
#define DOTSXP	    17	  /* dot-dot-dot object */
#define ANYSXP	    18	  /* make "any" args work.
			     Used in specifying types for symbol
			     registration to mean anything is okay  */
#define VECSXP	    19	  /* generic vectors */
#define EXPRSXP	    20	  /* expressions vectors */
#define BCODESXP    21    /* byte code */
#define EXTPTRSXP   22    /* external pointer */
#define WEAKREFSXP  23    /* weak reference */
#define RAWSXP      24    /* raw bytes */
#define S4SXP       25    /* S4, non-vector */

#define FUNSXP      99    /* Closure or Builtin or Special */

To bring some order into this mess, here is a summary of my observations:

Object of type...   ...stores a...
CHARSXP             vector of character (aka char) values
LGLSXP              vector of logical (aka Boolean!) values
INTSXP              vector of integer values
REALSXP             vector of floating point (double? precision) values
VECSXP              vector of pointers
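
The numbering from the header excerpt, including the gap at 11 and 12, can be written down as a small lookup table. This is only an illustrative sketch: sexptype_name and sexptype_code are invented names, and the table is copied from the excerpt above rather than taken from R's own TypeTable[].

```c
#include <stddef.h>
#include <string.h>

/* One row per #define in the quoted Rinternals.h excerpt. */
struct sexptype_entry {
    int code;
    const char *name;
};

static const struct sexptype_entry sexptype_table[] = {
    {  0, "NILSXP"  }, {  1, "SYMSXP"     }, {  2, "LISTSXP"    },
    {  3, "CLOSXP"  }, {  4, "ENVSXP"     }, {  5, "PROMSXP"    },
    {  6, "LANGSXP" }, {  7, "SPECIALSXP" }, {  8, "BUILTINSXP" },
    {  9, "CHARSXP" }, { 10, "LGLSXP"     },
    /* 11 and 12 are missing on purpose: withdrawn "factor" and "ordered" */
    { 13, "INTSXP"  }, { 14, "REALSXP"    }, { 15, "CPLXSXP"    },
    { 16, "STRSXP"  }, { 17, "DOTSXP"     }, { 18, "ANYSXP"     },
    { 19, "VECSXP"  }, { 20, "EXPRSXP"    }, { 21, "BCODESXP"   },
    { 22, "EXTPTRSXP" }, { 23, "WEAKREFSXP" }, { 24, "RAWSXP"   },
    { 25, "S4SXP"   }, { 99, "FUNSXP"     }
};

static const size_t sexptype_count =
    sizeof sexptype_table / sizeof sexptype_table[0];

/* Name for a type code, or NULL if the excerpt defines no such code. */
const char *sexptype_name(int code)
{
    for (size_t i = 0; i < sexptype_count; i++) {
        if (sexptype_table[i].code == code)
            return sexptype_table[i].name;
    }
    return NULL;
}

/* Type code for a name, or -1 if the excerpt defines no such name. */
int sexptype_code(const char *name)
{
    for (size_t i = 0; i < sexptype_count; i++) {
        if (strcmp(sexptype_table[i].name, name) == 0)
            return sexptype_table[i].code;
    }
    return -1;
}
```

Asking for code 11 or 12 returns NULL, which is a cheap way to notice the hole in the numbering when decoding type tags by hand.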

In R, there are no scalars, you always need vectors.

The short story is: if you need a 'float', allocate a vector of floats of length 1 and put the number there. There is an exception at least for strings (a string scalar is internally a character vector).

There is probably a longer story, but it remains hidden from me.

In R, Strings are called Characters.

I'm still not sure if this title is correct, or if 'strings are called chars' is a myth that stems from interpreting the obscure #define of CHARSXP (with its even more cryptic comment: "scalar" string type (internal only)).

Experiments show that CHARSXP represents a vector of characters, which has a natural interpretation as a string. Whether the string is null-terminated, or whether the vector length is actually used, remains unclear. I suspect that one or both of these hold in some parts of the R core and of the add-on packages' code...

Further experiments show that there are two macros to allocate and set strings (mkString and mkChar). The following two code snippets document that the macros are not interchangeable, although the real difference is beyond my comprehension:

    /* both snippets need #include <Rinternals.h> */

    /* create a vector of two strings */
    SEXP e = PROTECT(allocVector(STRSXP, 2));
        /* e is protected because the mkChar calls below allocate as well */
    SET_STRING_ELT(e, 0, mkChar("hello"));
    SET_STRING_ELT(e, 1, mkChar("world"));
        /* yes, crazy R developers call a string a 'Char' */
        /* mkString instead of mkChar would lead to this runtime error:
         * Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'character' */
    UNPROTECT(1);  /* once e is no longer needed */

    /* call this R command: source("FileName") */
    int errorOccurred;
    SEXP call = PROTECT(lang2(install("source"), mkString("FileName")));
        /* mkChar instead of mkString would lead to this runtime error:
         * Error in source(FileName) : unimplemented type 'char' in 'eval' */
    R_tryEval(call, R_GlobalEnv, &errorOccurred);
    UNPROTECT(1);
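
One possible reading of the two error messages, stated here as an assumption rather than as documented fact: mkChar yields a single CHARSXP, while mkString yields a STRSXP of length 1 whose only element is such a CHARSXP. The toy model below uses no R headers and does not reproduce R's real memory layout; toy_char, toy_strvec, toy_mk_char, toy_alloc_strvec and toy_mk_string are invented names.

```c
#include <stdlib.h>
#include <string.h>

/* Tags reuse the numbers 9 and 16 from the header excerpt. */
enum toy_type { TOY_CHARSXP = 9, TOY_STRSXP = 16 };

struct toy_char {              /* models a CHARSXP: one string */
    enum toy_type type;
    size_t length;             /* bytes, not counting the terminator */
    char *bytes;               /* NUL-terminated in this model */
};

struct toy_strvec {            /* models a STRSXP: a vector of strings */
    enum toy_type type;
    size_t length;
    struct toy_char **elts;
};

/* Counterpart of mkChar in this model: builds a single string object. */
struct toy_char *toy_mk_char(const char *s)
{
    struct toy_char *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->type = TOY_CHARSXP;
    c->length = strlen(s);
    c->bytes = malloc(c->length + 1);
    if (c->bytes == NULL) {
        free(c);
        return NULL;
    }
    memcpy(c->bytes, s, c->length + 1);
    return c;
}

/* Counterpart of allocVector(STRSXP, n): all slots start out empty. */
struct toy_strvec *toy_alloc_strvec(size_t n)
{
    struct toy_strvec *v = malloc(sizeof *v);
    if (v == NULL)
        return NULL;
    v->type = TOY_STRSXP;
    v->length = n;
    v->elts = calloc(n ? n : 1, sizeof *v->elts);
    if (v->elts == NULL) {
        free(v);
        return NULL;
    }
    return v;
}

/* Counterpart of mkString: a length-1 vector wrapping one toy_char. */
struct toy_strvec *toy_mk_string(const char *s)
{
    struct toy_strvec *v = toy_alloc_strvec(1);
    if (v == NULL)
        return NULL;
    v->elts[0] = toy_mk_char(s);
    if (v->elts[0] == NULL) {
        free(v->elts);
        free(v);
        return NULL;
    }
    return v;
}
```

In this picture a slot of a string vector wants a toy_char (what the mkChar counterpart returns), while a function argument wants a whole vector (what the mkString counterpart returns), which would be consistent with both runtime errors quoted above.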

R tries hard to hide all errors you make, making debugging more fun.

Type coercion

R automatically coerces objects to what other functions need (or rather to what random R contributors felt was a reasonable coercion). When interfacing with R from C, though, no coercion is done for you, so you have to try hard to figure out what exactly the R function you are calling needs.

Sometimes, though, things that appear the same on the surface differ in behaviour. And very often, the inner representation is changed without you knowing about it:

> w <- t(read.table(param.file, header=F, row.names=1))[1,]
> w
   d   lm  tm1  tm2  tm3  tm4  tm5    w 
 0.2  0.5  0.2  0.2  0.2  0.2  0.2 -1.0 
> is.data.frame(w)
> is.matrix(w)
> is.array(w)
> is.table(w)
> is.vector(w)
[1] TRUE
> w["scale"]
  ## ie. asking a vector for non-existing column returns NA
> d<-data.frame(w)
> is.data.frame(d)
[1] TRUE
> d["scale"]
Error in `[.data.frame`(d, "scale") : undefined columns selected
  ## but asking a dataframe for non-existing column is a runtime error
> d2<-t(d)
> is.data.frame(d2)
> is.vector(d2)
> is.matrix(d2)
[1] TRUE
  ## transposition made the original dataframe 'd' into a matrix
> d2["scale"]
[1] NA
  ## ..so asking for non-existing column now returns NA

Garbage collection

Unclear documentation

R uses a garbage collector. The documentation of memory management is not meant for people who write bindings from other languages; it is meant for people who write R extensions in C. Therefore, the documentation provides only examples where some data are 'protected' against garbage collection, some manipulation is done, and all the protections are then released. When R is embedded in another language, one needs more general guidelines for protecting. Say I create an R object (using the R interface macros) and want to hold a reference to it during some other manipulations in my language, possibly creating further R objects, and I wish to call an R function on these objects only later. How should I protect and unprotect my objects in such scenarios? The manual explicitly discourages protecting every object I create and unprotecting all of them at the end. But if I leave my objects unprotected, they will get garbage collected during the other allocations...

My summary/re-explanation

All R objects have to be PROTECTed until used somewhere (in a function call or as a part of another (protected) object), otherwise R's garbage collector will dispose of them. All R objects need to be UNPROTECTed (by UNPROTECT_PTR or UNPROTECT) once used somewhere, otherwise the garbage collector's fixed stack of ~10000 objects will overflow.

The correct approach therefore is: right after allocating any SEXP, protect it. Right after using any SEXP, e.g. assigning it to a place in a vector or using it as an input argument to an R function call, unprotect it (because the outer structure holds the protection and because function arguments are protected by definition). Care has to be taken when unprotecting: once unprotected, an object must not be unprotected again. If you are familiar with Mercury's mode system and uniqueness, you can think of 'unprotection' as a destructive use of the object.
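
The discipline described above amounts to keeping pushes and pops on a fixed-size stack balanced. The toy model below, written without any R headers, shows why leaked protections eventually overflow. The names toy_pstack, toy_protect, toy_unprotect and toy_run are invented, and the capacity of 10000 is just the approximate figure mentioned above, not a value read from R.

```c
#include <stddef.h>

#define TOY_STACK_SIZE 10000   /* assumed capacity, see the text above */

struct toy_pstack {
    const void *slots[TOY_STACK_SIZE];
    size_t top;                /* number of currently protected pointers */
};

/* Push a pointer; returns 0 on success, -1 when the stack is full. */
int toy_protect(struct toy_pstack *st, const void *obj)
{
    if (st->top >= TOY_STACK_SIZE)
        return -1;
    st->slots[st->top++] = obj;
    return 0;
}

/* Pop the n most recent protections; returns -1 if fewer are present. */
int toy_unprotect(struct toy_pstack *st, size_t n)
{
    if (n > st->top)
        return -1;
    st->top -= n;
    return 0;
}

/* Simulate `rounds` allocate-and-use cycles on an emptied stack.
 * With balanced != 0 each object is unprotected right after use;
 * otherwise every protection is leaked.
 * Returns the number of cycles completed before the stack overflowed. */
size_t toy_run(struct toy_pstack *st, size_t rounds, int balanced)
{
    static const int dummy = 0;   /* stands in for a fresh R object */
    size_t done = 0;

    st->top = 0;
    while (done < rounds) {
        if (toy_protect(st, &dummy) != 0)
            break;                /* overflow: give up */
        if (balanced)
            toy_unprotect(st, 1);
        done++;
    }
    return done;
}
```

With balanced use the stack depth never exceeds one, however many cycles are run; with leaked protections the run stops after TOY_STACK_SIZE cycles.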

A speedup is possible if structures are built outside-in (a hint clearly missing in the R manual). If structures are built outside-in, most calls to PROTECT can be saved: the main SEXP needs to be created protected so that the allocation of 'subobjects' does not harm it. Once a new sub-object is allocated, it goes right away into the main SEXP, making use of the main SEXP's protection. If any R allocation could happen before the inner object is incorporated into the main one, the inner object needs to be protected, too.
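
Why outside-in building saves protections can be seen in a toy mark-and-sweep collector. This is a sketch under strong simplifications and has nothing to do with R's actual collector: the heap is a fixed array, collection is triggered by hand instead of by allocation, and all names (toy_obj, toy_heap, toy_heap_alloc, toy_heap_protect, toy_heap_gc) are invented.

```c
#include <stddef.h>

#define TOY_MAX_OBJECTS  16
#define TOY_MAX_CHILDREN 4

struct toy_obj {
    int alive;                                /* cleared once swept */
    int marked;                               /* scratch flag for the collector */
    struct toy_obj *children[TOY_MAX_CHILDREN];
};

/* Zero-initialise before use, e.g. `struct toy_heap h = {0};`. */
struct toy_heap {
    struct toy_obj objects[TOY_MAX_OBJECTS];
    size_t used;
    struct toy_obj *roots[TOY_MAX_OBJECTS];   /* plays the protection stack */
    size_t nroots;
};

/* Hand out the next free object, or NULL when the toy heap is exhausted. */
struct toy_obj *toy_heap_alloc(struct toy_heap *h)
{
    if (h->used >= TOY_MAX_OBJECTS)
        return NULL;
    struct toy_obj *o = &h->objects[h->used++];
    o->alive = 1;
    o->marked = 0;
    for (size_t i = 0; i < TOY_MAX_CHILDREN; i++)
        o->children[i] = NULL;
    return o;
}

/* Register o as a root and return it, so calls can be nested. */
struct toy_obj *toy_heap_protect(struct toy_heap *h, struct toy_obj *o)
{
    if (o != NULL && h->nroots < TOY_MAX_OBJECTS)
        h->roots[h->nroots++] = o;
    return o;
}

/* Recursively flag o and everything reachable through its children. */
static void toy_mark(struct toy_obj *o)
{
    if (o == NULL || !o->alive || o->marked)
        return;
    o->marked = 1;
    for (size_t i = 0; i < TOY_MAX_CHILDREN; i++)
        toy_mark(o->children[i]);
}

/* Mark what is reachable from the roots, sweep the rest.
 * Returns the number of objects collected in this run. */
size_t toy_heap_gc(struct toy_heap *h)
{
    size_t swept = 0;

    for (size_t i = 0; i < h->used; i++)
        h->objects[i].marked = 0;
    for (size_t i = 0; i < h->nroots; i++)
        toy_mark(h->roots[i]);
    for (size_t i = 0; i < h->used; i++) {
        if (h->objects[i].alive && !h->objects[i].marked) {
            h->objects[i].alive = 0;
            swept++;
        }
    }
    return swept;
}
```

If the parent is allocated and protected first and each child is stored into it immediately, a collection leaves the children alone even though they were never protected themselves. A child that is allocated but neither protected nor stored is swept by the next collection.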

Allocation/protection errors are generally obscured

Moreover, R coercion obscures all protection errors, because garbage collected objects are probably recognized as NULL and NULL is probably coerced into NA (not-available) or into an empty vector/matrix. And an empty vector is treated as a valid vector of length 0, until e.g. some matrix multiplication fails...

Fortunately, calling gctorture(on=TRUE) allows you to observe your errors a bit earlier, although never at the place where you actually made them.

[Ondrej Bojar - OBOproduct] $Date: 2007/09/23 12:56:25 $