My favourite programming language is Mercury and I have many good reasons for the choice. Sometimes I need to interface with tools implemented in other languages that I know only very superficially. One of them is R, and the number of 'gotchas' has exceeded my threshold. Instead of smashing things, I'll write about it.
R is the successor of S, hence the name (please ignore the fact that R comes before S in the alphabet). The choice was made before the web was googled, which might partly excuse the authors. Nowadays, you have to add the very confusing and ambiguous term "R" to all your Google queries and hope that some mildly relevant results show up somewhere near the top...
See below for other bad naming decisions: strings are called characters, and the acronym for the fundamental data types (SXP) suggests a plural where no plural is meant (REALSXP).
R has extensive documentation. Loads of pages (manual pages, 'info' (texinfo) pages, html/pdf/ps versions of the manuals, built-in help, gentle introductions, random users' comments, mailing list archives) have been written. Unfortunately, none of the contributors seems to have felt they had the authority, or the nitpicking stamina, to write a complete reference, so most of the documents are partial comments rather than full descriptions. The same holds for the source code.
Citing the manual, emphasis is mine:
The usual R object modes are given in the table... ... Among the important internal SEXPTYPEs are LANGSXP, CHARSXP, PROMSXP, etc.
In other words, the "authoritative source" lists some, but not all, of the defined types, classifies some of those as important, and explicitly suggests that there are other important types that have not been marked as such.
Citing from /usr/lib/R/include/Rinternals.h:
    /* Fundamental Data Types: These are largely Lisp
     * influenced structures, with the exception of LGLSXP,
     * INTSXP, REALSXP, CPLXSXP and STRSXP which are the
     * element types for S-like data objects.
     * Note that the gap of 11 and 12 below is because of
     * the withdrawal of native "factor" and "ordered" types.
     *
     * --> TypeTable[] in ../main/util.c for typeof()
     */
    ...
    #define NILSXP       0    /* nil = NULL */
    #define SYMSXP       1    /* symbols */
    #define LISTSXP      2    /* lists of dotted pairs */
    #define CLOSXP       3    /* closures */
    #define ENVSXP       4    /* environments */
    #define PROMSXP      5    /* promises: [un]evaluated closure arguments */
    #define LANGSXP      6    /* language constructs (special lists) */
    #define SPECIALSXP   7    /* special forms */
    #define BUILTINSXP   8    /* builtin non-special forms */
    #define CHARSXP      9    /* "scalar" string type (internal only)*/
    #define LGLSXP      10    /* logical vectors */
    #define INTSXP      13    /* integer vectors */
    #define REALSXP     14    /* real variables */
    #define CPLXSXP     15    /* complex variables */
    #define STRSXP      16    /* string vectors */
    #define DOTSXP      17    /* dot-dot-dot object */
    #define ANYSXP      18    /* make "any" args work.
                                 Used in specifying types for symbol
                                 registration to mean anything is okay */
    #define VECSXP      19    /* generic vectors */
    #define EXPRSXP     20    /* expressions vectors */
    #define BCODESXP    21    /* byte code */
    #define EXTPTRSXP   22    /* external pointer */
    #define WEAKREFSXP  23    /* weak reference */
    #define RAWSXP      24    /* raw bytes */
    #define S4SXP       25    /* S4, non-vector */
    #define FUNSXP      99    /* Closure or Builtin or Special */
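As a reading aid, the codes from the header excerpt can be put into a small lookup table. This is only a sketch: `sexptype_entry`, `sexptype_table` and `sexptype_name` are hypothetical names made up for illustration and are not part of R's API. The numbers are copied from the excerpt, so they may not match other R versions.

```c
#include <stddef.h>

/* Hypothetical helper types and names; only the codes come from the
 * Rinternals.h excerpt quoted above. */
struct sexptype_entry {
    int code;
    const char *name;
};

static const struct sexptype_entry sexptype_table[] = {
    {0, "NILSXP"},      {1, "SYMSXP"},      {2, "LISTSXP"},
    {3, "CLOSXP"},      {4, "ENVSXP"},      {5, "PROMSXP"},
    {6, "LANGSXP"},     {7, "SPECIALSXP"},  {8, "BUILTINSXP"},
    {9, "CHARSXP"},     {10, "LGLSXP"},     {13, "INTSXP"},
    {14, "REALSXP"},    {15, "CPLXSXP"},    {16, "STRSXP"},
    {17, "DOTSXP"},     {18, "ANYSXP"},     {19, "VECSXP"},
    {20, "EXPRSXP"},    {21, "BCODESXP"},   {22, "EXTPTRSXP"},
    {23, "WEAKREFSXP"}, {24, "RAWSXP"},     {25, "S4SXP"},
    {99, "FUNSXP"},
};

/* Return the name for a type code, or NULL for codes the excerpt leaves
 * undefined (such as the withdrawn 11 and 12). */
const char *sexptype_name(int code)
{
    size_t n = sizeof sexptype_table / sizeof sexptype_table[0];
    for (size_t i = 0; i < n; i++) {
        if (sexptype_table[i].code == code)
            return sexptype_table[i].name;
    }
    return NULL;
}
```

With this table, `sexptype_name(14)` yields "REALSXP", while `sexptype_name(11)` yields NULL, which makes the gap mentioned in the header comment visible.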
To bring some order to this mess, here is a summary of my observations:
Object of type... | ...stores a... |
---|---|
CHARSXP | vector of character (aka char) values |
LGLSXP | vector of logical (aka Boolean!) values |
INTSXP | vector of integer values |
REALSXP | vector of floating point (double precision) values |
VECSXP | vector of pointers |
The short story is: if you need a 'float', allocate a vector of floats of length 1 and put the number there. There is an exception at least for strings (a string scalar is internally a character vector).
There is probably a longer story but it remains hidden to me.
I'm still not sure whether 'strings are called chars' is accurate, or whether it is a myth that stems from interpreting the obscure #define of CHARSXP (with its even more cryptic comment: "scalar" string type (internal only)).
Experiments show that CHARSXP represents a vector of characters, which is naturally interpreted as a string. Whether the string is null-terminated, or whether the vector length is actually used, remains unclear. I suspect that either or both might hold in some parts of the R core and add-on packages' code...
Further experiments show that there are two macros to allocate and set strings (mkString and mkChar). The following two code snippets document that they are not interchangeable. Judging by the error messages, mkChar produces a bare CHARSXP (a single string element), while mkString produces a 'character' vector (STRSXP) of length 1 wrapping such an element:
    /* create a vector of two strings */
    SEXP e = PROTECT(allocVector(STRSXP, 2)); /* protect e: mkChar allocates */
    /* yes, crazy R developers call a string a 'Char' */
    /* mkString instead of mkChar would lead to this runtime error:
     *   Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'character' */
    SET_STRING_ELT(e, 0, mkChar("hello"));
    SET_STRING_ELT(e, 1, mkChar("world"));
    UNPROTECT(1); /* once e has been handed on */

    /* call this R command: source("FileName") */
    int errorOccurred;
    SEXP call = PROTECT(lang2(install("source"), mkString("FileName")));
    /* mkChar instead of mkString would lead to this runtime error:
     *   Error in source(FileName) : unimplemented type 'char' in 'eval' */
    R_tryEval(call, R_GlobalEnv, &errorOccurred);
    UNPROTECT(1);
R automatically coerces objects to whatever other functions need (or to whatever random R contributors felt was a reasonable coercion). When interfacing R from C, though, no coercion is done for you, so you have to try hard to figure out exactly what the R function you're calling needs.
Sometimes, though, things that appear the same on the surface differ in behaviour. And very often, the inner representation changes without you knowing about it:
    > w <- t(read.table(param.file, header=F, row.names=1))[1,]
    > w
       d   lm  tm1  tm2  tm3  tm4  tm5    w
     0.2  0.5  0.2  0.2  0.2  0.2  0.2 -1.0
    > is.data.frame(w)
    [1] FALSE
    > is.matrix(w)
    [1] FALSE
    > is.array(w)
    [1] FALSE
    > is.table(w)
    [1] FALSE
    > is.vector(w)
    [1] TRUE
    > w["scale"]
    ## ie. asking a vector for non-existing column returns NA
    > d<-data.frame(w)
    > is.data.frame(d)
    [1] TRUE
    > d["scale"]
    Error in `[.data.frame`(d, "scale") : undefined columns selected
    ## but asking a dataframe for non-existing column is a runtime error
    > d2<-t(d)
    > is.data.frame(d2)
    [1] FALSE
    > is.vector(d2)
    [1] FALSE
    > is.matrix(d2)
    [1] TRUE
    ## transposition made the original dataframe 'd' into a matrix
    > d2["scale"]
    [1] NA
    ## ..so asking for non-existing column now returns NA
R uses a garbage collector. The documentation of memory management is not meant for people who write bindings from other languages; it is for people who write R extensions in C. Therefore, the documentation provides only examples where some data are 'protected' against garbage collection, some manipulation is done, and all the protections are then released. When R is embedded in another language, one needs more general guidelines for protecting. Suppose I create an R object (using the R interface macros) and want to hold a reference to it while I do other things in my own language, possibly creating further R objects, and I wish to call an R function on all of them only later. How should I protect and unprotect my objects in such a scenario? The manual explicitly discourages protecting every object I create and unprotecting all of them at the end. But if I leave my objects unprotected, they will get garbage collected during the other allocations...
All R objects have to be PROTECTed until used somewhere (in a function call or as a part of another (protected) object), otherwise R's garbage collector will dispose of them. All R objects need to be UNPROTECTed (by UNPROTECT_PTR or UNPROTECT) once used somewhere, otherwise the garbage collector's fixed stack of ~10000 objects will overflow.
The correct approach therefore is: right after allocating any sexp, protect it. Right after using any sexp, e.g. assigning it to a place in a vector or passing it as an input argument to an R function call, unprotect it (because the outer structure holds the protection and because function arguments are protected by definition). Care has to be taken when unprotecting: once unprotected, an object must not be unprotected again. If you are familiar with Mercury's mode system and uniqueness, you can think of 'unprotection' as a destructive use of the object.
A speedup is possible if structures are built outside-in (a hint clearly missing in the R manual). If structures are built outside-in, most calls to PROTECT can be saved: the main sexp needs to be created protected so that the allocation of 'subobjects' does not harm it. Once a new sub-object is allocated, it goes right away into the main sexp, making use of the main sexp's protection. If any R allocation could happen before the inner object is incorporated into the main one, the inner object needs to be protected, too.
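To make the bookkeeping concrete, here is a toy model of a protection stack in plain C. It is not R's implementation and uses no R headers: every name in it (`toy_protect`, `toy_unprotect`, `toy_build_inside_out`, `toy_build_outside_in`, the sizes) is made up for illustration, and "allocation" is faked with static objects. It only counts how many stack slots each building order occupies at its peak.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names or sizes exist in R. */
#define TOY_STACK_SIZE   8
#define TOY_MAX_CHILDREN 4

struct toy_parent {
    const int *slots[TOY_MAX_CHILDREN];
};

static const void *toy_stack[TOY_STACK_SIZE];
static size_t toy_depth = 0;
static int toy_children[TOY_MAX_CHILDREN]; /* stand-ins for sub-objects */
static struct toy_parent toy_root;         /* stand-in for the main object */

/* Push an object onto the fixed-size protection stack. */
void toy_protect(const void *obj)
{
    assert(toy_depth < TOY_STACK_SIZE); /* a fixed stack can overflow */
    toy_stack[toy_depth++] = obj;
}

/* Pop the n most recently protected objects. */
void toy_unprotect(size_t n)
{
    assert(n <= toy_depth); /* unprotecting twice would trip here */
    toy_depth -= n;
}

size_t toy_protect_depth(void)
{
    return toy_depth;
}

/* Inside-out: children first, parent last. Every child occupies a stack
 * slot until it is stored. Returns the peak number of slots used. */
size_t toy_build_inside_out(void)
{
    size_t base = toy_depth, peak;

    for (size_t i = 0; i < TOY_MAX_CHILDREN; i++)
        toy_protect(&toy_children[i]);
    toy_protect(&toy_root);
    peak = toy_depth - base;
    for (size_t i = 0; i < TOY_MAX_CHILDREN; i++)
        toy_root.slots[i] = &toy_children[i];
    toy_unprotect(TOY_MAX_CHILDREN + 1);
    return peak;
}

/* Outside-in: parent first and protected; each child is stored at once
 * and relies on the parent's protection. */
size_t toy_build_outside_in(void)
{
    size_t base = toy_depth, peak;

    toy_protect(&toy_root);
    peak = toy_depth - base;
    for (size_t i = 0; i < TOY_MAX_CHILDREN; i++)
        toy_root.slots[i] = &toy_children[i];
    toy_unprotect(1);
    return peak;
}
```

In this model the inside-out build peaks at one slot per child plus one for the parent, while the outside-in build never needs more than one slot. Both leave the stack balanced, which is the property that matters for the real thing.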
Moreover, R coercion obscures all protection errors, because garbage collected objects are probably recognized as NULL and NULL is probably coerced into NA (not-available) or into an empty vector/matrix. And an empty vector is treated as a valid vector of length 0, until e.g. some matrix multiplication fails...
Fortunately, calling gctorture(on=TRUE) allows you to observe your errors a bit earlier, although never at the place where you actually made them.