9 December 13.

### On Julia

A colleague talked me into trying Julia. My resistance was based on two immediate objections.

First, do we really need a new language?

Second, do we really need a new domain-specific language? DSLs are usually reserved for very specific problems, like the regular expression sublanguage used to describe regularities in text; the XPath sublanguage used to find bits of XML; SQL for talking to tables. A good DSL knows its place. Even if your flavor of SQL is Turing complete, using it to write programs that do more than subset and join tables is little more than an interesting exercise.

Conversely, now and then you find a DSL that went too far. Matlab and R were originally built around certain types of workflow, explicitly described in early documentation (in the case of S and R, see Chambers's book; for Matlab, see this discussion of Matlab's early history). Within a field, people try to apply the DSL to an ever-expanding range of uses, and shoehorning happens. By the way, shoehorns are great. Using one is incredibly smooth and just feels good. Yet the word shoehorning has negative connotations. Across fields, having multiple DSLs expanding their scope is the setup for little battles: will the language used by traditional statisticians expand its empire into fields where everybody is using Matlab? Will the language electrical engineers know and rely on be able to win over statisticians? The result is Balkanization and isolation for no reason.

After all, here is the procedure for building the solution to any problem in code:

• Break down your problem into subproblems.
• Find the existing code that solved each subproblem.
• Duct tape them together to solve the main problem.

If all the favorite electrical engineering procedures are in one language and all the regression procedures are in another, then findability and composability suffer, and people waste time rewriting the same thing in each subfield.

So the first reason I'm trying Julia is that my colleague convinced me that Julia has pretensions toward being a general-purpose language that happens to be good with numeric code. The culture behind Julia, by my reading, pushes reuse over reinvention, and the language is written from the ground up to make reuse easy (as the manual's introduction explicitly states, as of this writing). Julia impresses me as an effort to provide a common front end, inclusive rather than exclusive.

One of my motivations in writing Apophenia in C was that just about every language has a hook to call C code, so if we have a reliable library of statistics functions in C, then people writing new systems could link to that rather than rewriting everything yet again. Some systems make this linking easier than others. With Julia, it is really easy to call C code. Apophenia has two types to speak of: apop_data and apop_model. Build conduits for those two types, and everything else falls into place. With very little knowledge of Julia, writing a working wrapper for apop_data took me about half an hour. I spent an hour or two doing the "this should work; why doesn't it work; oh wait, lemme try this" sort of debugging with the apop_model, but throughout I had the sense that I was making progress and it was just a question of having 100% of the details right. Next time, I'll demo the results.

It was a downright wonderful experience compared to writing wrappers for another DSL, or even producing a Python wrapper using Swig, the general-purpose wrapper generator (which has a couple of undocumented eccentricities that I ran into head-on). Julia didn't assume that C code consists exclusively of loose arrays of int, float, and char* (even though that's how C textbooks from the 1980s characterize the language's use). The documentation was clear and immediately available, and at no point did it hint that if I were a better person I wouldn't be using C at all and would be doing everything in Julia.

Julia's package management system is named Git. A central list keeps the locations of the repositories, which are in a predefined format. Once again, reuse saves the day. I've blogged on the Balkanization of package managers before, and because Git is neutral ground, I have some hope that it could perhaps become the infrastructure for a cross-language package system. We'll see where Julia goes and whether it bureaucratizes with age, but at the moment, you don't have to go through the opposite-of-open-source process of satisfying a committee to have a package distributed widely.

As for the details of the language, like the well-thought-out type system, the Matlab-like block statements which everybody (myself included) would rather see changed to Python-like block statements, and other such minutiæ, I'll refrain from comment. Also, I still like being able to exclaim, when I have a new problem, Do it to Julia!, though I expect that novelty will wear off eventually.

As a postscript, the thing that makes all of this possible is LLVM, a language-agnostic compiler for which it is relatively easy to write new front-ends for new languages. LLVM is already taking over the world, and its ability to cross languages also gives me hope for the future.

Next time: the aforementioned draft of a Julia wrapper, and a demo re-running an earlier sample program in Julia.

7 October 13.

### Another two-line Logit analysis

Today's post will not be especially intense: I point to a researcher doing some interesting things, complain about nomenclature, and give you a two-line example of a binomial model + logistic link using Apophenia.

Paul Gribble's lab studies the computational aspects of motor control. For example, his lab has a few papers on learning motor control by watching others. And here's an old pop writeup of this paper (PDF) on learning movement by observing. It's not my field, so I just read the abstracts and admire the march of science.

One of Paul's little side projects is a routine to estimate a Psychometric function, aka a binomial model + logit link function, aka a Logit.

So first, allow me a digression about nomenclature. A rose by any other name may smell as sweet, but it is more difficult to locate in the literature.

• psychometric function is a field-specific name, which leads to Balkanization, as every field that uses a Logit could develop a different name for the same thing.

• Binomial model + logistic link is a description of a specific form of the model. Using Paul's notation: $$y = \beta_0 + \beta_1 x$$ $$p(x) = \Pr(\text{response}|x) = \frac{1}{1 + \exp(-y)}$$ where $x$ is the observed input, the $\beta$s are parameters to be estimated, and $y$ is the linear index that the logistic function maps to the probability of a positive response. This is a reminder that we could change that second function to any of a number of other probability functions, another favorite being a simple Normal CDF (the Probit). Calling it a binomial model is a reminder that there are two options for the response. So this naming scheme generalizes somewhat, but it is also a certain way of thinking about the problem. For my tastes, it doesn't generalize far enough, being that it still presumes a fundamentally linear process, but I've already written that ten-part series of blog posts.

• Logit refers to a specific form, and doesn't really describe the model, except for the cutesy custom that models ending in -it (Logit, Probit, Tobit, Normit, …) are about categorical data. But Logit is just a vocabulary term, like magma or Markov chain, which you have to memorize. I think the name Logit came before the GLM-style binomial+logistic link form (where GLM=Generalized linear model).

I asked a librarian of Congress. She was furloughed and had nothing else to do. The task of a librarian is to store information so that it is easy to retrieve, meaning that your typical librarian is an expert in taxonomy and ontology. Of these options, she recommended the functional naming scheme, under the rationale that somebody who has no idea could conceivably find it. She suggested that the other options are less discoverable.

Naming something after a certain framework can be risky; we sometimes wind up with terms that made sense to the first author but are now a little confusing: regression, or Gamma distribution (which is based on, but does not behave like, the Gamma function). But digressions from the original meaning happen in every field (sneakers, cab, office, even computer).

Anyway, having identified the psychometric function Paul estimates as a Logit, it's pretty easy to estimate. Here's a shell script to cut and paste onto your command line, which will download his version and sample data and run it through Apophenia's Logit estimator.


mkdir logit_eg
cd logit_eg
git clone https://github.com/paulgribble/psychometric.git

cat >> makefile << '----'
CFLAGS =-g -Wall -O3 `pkg-config --cflags apophenia`
LDLIBS=`pkg-config --libs apophenia`
CC=c99
171-a_logit:
----

cat >> 171-a_logit.c << '----'
#include <apop.h>

int main(){
    apop_text_to_db("psychometric/exdata", "d", .delimiters=" ", .has_col_names='n', .field_names=(char*[]){"measure", "outcome"});
    apop_model_show(apop_estimate(apop_query_to_data("select outcome, measure from d"), apop_logit));
}
----

make && ./171-a_logit


The results are the same, though the methods of calculation are a little different. The Logit objective function is globally concave, which makes it a good candidate for gradient-descent-type MLE search algorithms. Also, the Gribble psychometric edition adds some additional diagnostics and has some nice plots. Somebody could easily add some of those to Apophenia's Logit estimation routine….

26 September 13.

### Interrater agreement via entropy

Last time, I wrote about interrater agreement, in which we seek a measure of the agreement rate between two people categorizing the same set of observations. We want it to be complexity-adjusted, because two raters putting data into two categories can achieve 50% agreement by flipping coins, while 50% agreement on a task with a hundred fine-distinction categories may be an impressively high agreement rate.

The literature I described did this via model-based means: set up a scenario where categories are assigned at random, and compare the observed agreement rate with that at-random rate. The problem is that this opens the door to a dozen different means of defining the at-random scenario. I gave the distinction between Scott's Pi and Cohen's Kappa as an initial example, but the literature goes on: what if the task is putting a continuous observation into discrete bins, so if rater one picked 50, rater two is very likely to pick 49, 50, or 51? What if one rater abuses the "other" category, but the raters are otherwise always in agreement? Would it be a reasonable model to presume that raters are accurate a fraction $k$ of the time, but flub and draw a random category the remaining $1-k$ of the time? Andrés and Marzo (2004) propose this model.

When I first encountered this literature, it brought to mind entropy, which is often informally billed as a measure of the complexity of the system. It's also billed as a measure of the information in a system: where entropy is large (e.g., a hundred equiprobable bins), then each new draw provides new information; where entropy is near zero (e.g., only one category, or 90% of type A and 10% of type B), we know what the next draw will probably be before seeing it.

The entropy of a probability mass function with categories indexed by $i$ is defined as $H =\sum_i -p_i \log_2(p_i)$. If a bin has $p_i=0$, we define that bin to have zero contribution to the entropy (which is the correct limit of $p_i \log_2(p_i)$ as $p_i\to 0$). Notice that the log of a number between zero and one is negative (or zero if $p_i=1$), so the negation is positive (or zero).
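Here's a minimal sketch of that definition in plain C (the function name is mine, not Apophenia's), with the $p_i=0$ convention handled explicitly:

#include <math.h>

//Entropy in bits of a probability mass function over n categories.
//Bins with p==0 contribute zero, matching the limit noted above.
double entropy(double const *p, int n){
    double H = 0;
    for (int i=0; i< n; i++)
        if (p[i] > 0) H -= p[i] * log2(p[i]);
    return H;
}

So entropy((double[]){.9, .1}, 2) comes to about 0.47 bits, while a hundred equiprobable bins give $\log_2(100)\approx 6.64$ bits.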

Entropists think of information as a real quantity, which can be measured like anything else. Its unit is the bit, short for binary digit. Notably, if there are two streams, there is a total entropy, and some amount of mutual information between them. Let $I(X, Y)$ be the mutual information; if I know the streams are in sync, then seeing one observation tells me exactly what the other stream is saying, so we should have that $H(X) = H(Y) = I(X, Y)$. If the streams are independent (knowing $X$ gives no information about $Y$), then we should have $I(X, Y)=0$.

The formula for mutual information given a set of categories is

$$I(X, Y) = \sum_j \sum_i p_{ij} \log_2\left(\frac{p_{ij}}{p_{\cdot{}j}p_{i\cdot}}\right),$$

where $p_{\cdot{}j}$ is the odds of category $j$ regardless of $i$, and similarly for $p_{i\cdot}$. We are looking for information in agreement, so define $IA(X, Y)$ to be this sum only over indices where $i=j$, throwing out those elements of the sum where $i\neq j$.

The raw percent agreement is $P_o\equiv$ (count of observations in agreement)/(total count of observations). By analogy, define

$$P_I(X, Y)\equiv \frac{IA(X, Y)}{(H(X)+H(Y))/2},$$

that is, the (information in agreement)/(mean total individual information).

This is a sensible measure of interrater agreement:

• It is one when it should be (full agreement).

• It is zero when it should be (agreement equal to the random rate, or no agreement at all). In the no-agreement case, Cohen's Kappa is negative. Exercise: prove that the range of Cohen's Kappa is from $-1/(B-1)$ to $1$, where $B$ is the number of bins.

• It is complexity-adjusting in the expected ways.

• If you are comfortable with the concept of measuring information as bits of entropy, then it is a simple and natural fraction of agreement-to-total, akin to $P_o$.

• It is model-independent: we used information accounting identities which work for any pair of distributions, so we entirely sidestep the debate about how to define agreement at random.
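For concreteness, here is a stripped-down restatement of the calculation in plain C (a sketch; last time's example code does the same accounting via Apophenia, with error checking and normalization):

#include <math.h>
#include <stdio.h>

#define B 2 //bin count for the examples below

//Given a normalized confusion matrix (cells sum to one), return P_I.
double p_i_measure(double const m[B][B]){
    double row_ent = 0, col_ent = 0, ia = 0;
    for (int i=0; i< B; i++){
        double rsum = 0, csum = 0;
        for (int j=0; j< B; j++){
            rsum += m[i][j]; //row marginal
            csum += m[j][i]; //column marginal
        }
        if (rsum > 0) row_ent -= rsum * log2(rsum);
        if (csum > 0) col_ent -= csum * log2(csum);
        if (m[i][i] > 0) ia += m[i][i] * log2(m[i][i]/(rsum*csum));
    }
    return ia/((row_ent + col_ent)/2);
}

int main(){
    //Full agreement gives one; independent coin flips give zero.
    printf("%g\n", p_i_measure((double const[B][B]){{.5, 0},{0, .5}}));
    printf("%g\n", p_i_measure((double const[B][B]){{.25, .25},{.25, .25}}));
}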

I wrote up $P_I$ and its properties in detail in this Journal of Official Statistics paper. I hate to be immodest, but I really do like $P_I$ better than the alternatives. As I discuss in the paper, there are natural extensions to ordered categories (by redefining $IA(X, Y)$ to accommodate near-misses) and to multiple raters (by extending the numerator to include information in agreement across all pairs and the denominator to mean total entropy for all pairs). I think that having a system that sidesteps the problem of developing a baseline is a real win. The payoff comes when writing the paper using the measure, because using a statistic designed around a baseline at-random model requires a paragraph or two explaining why the situation at hand is compatible with the assumptions underlying that model, and explaining why those assumptions are more appropriate than those for the models underlying other measures referees might be used to. Meanwhile, it is theoretically valid to measure the mutual information for any two streams of data.

Last time, I presented a code example that calculates $P_I$ from the confusion matrix, and now would be a good time to remind you that all the code is available on GitHub. If you're not a C user, you should have no problem translating the C code into your favorite language. You are welcome to try it out and see if it does what an intercoder agreement index should.

22 September 13.

### Interrater agreement

This and the next entry are blog-level discussions of this Journal of Official Statistics paper.

I first met Cohen's Kappa maybe three years ago. I did a joint project with a friendly coworker who had spent decades dealing with the sort of data that requires such measures. After some amused discussion of our different paths and how I could spend years dealing with data and never meet it, yet she had to use agreement measures on every project, she apologized. Cohen's Kappa and the many related measures (Fleiss's Kappa, Scott's Pi, Krippendorff's Alpha, …) don't make a lot of sense, she told me, and don't seem to have much theoretical basis, but it's what everybody uses.

The basic problem: you have two sequences of observations about the same data. The typical situation is that there is non-categorical data, like free-response questions on a survey or a set of chest X-rays, and two or more raters independently decide into which category each observation falls, like a check-box summary or a diagnosis. We'd like to know how often the raters are in agreement.

If there are two categories, two raters flipping coins would agree 50% of the time. Given two categories that have an expected 90-10 split, we expect raters to be in agreement 82% of the time. Given 100 equiprobable categories, the expected agreement entirely at random is 1%. So it seems inequitable to compare raw agreement rates across situations, and it seems we should have some sort of 'complexity adjustment' to the raw observed rate of agreement to take into account how much room there is for agreement over the at-random case.
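In general, if both raters pick category $i$ with probability $p_i$, then matching on category $i$ has probability $p_i^2$, so the expected at-random agreement is

$$P_e = \sum_i p_i^2; \qquad \text{e.g., } 0.9^2 + 0.1^2 = 0.82.$$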

From here, complications ensue. Below, I'll discuss the difference in baseline calculations between Cohen's Kappa and Scott's Pi (and define both at the same time). Or what if the categories are ordered; should a near-miss count for something? What if some categories are really hard to identify and others are very easy to identify? And, for that matter, why are we taking two people drawing values at random as a baseline at all?

That's the state of the literature: everybody has a different baseline concept of randomness, which implies that everybody has a different measure of the distance to randomness. For example, this paper (PDF), sent to me by Dr MH of SF, CA, characterizes the entire interrater agreement literature as a big mess, with too many answers to the same question.

###### The math

The calculations for these things all begin by generating what I've been referring to as a crosstab, but many in this literature call the confusion matrix (from the Latin prefix con-, meaning with, melded with the root word fusion). Cell (1,1) is the percent of times both raters chose category one, (1, 2) is the percent of times the row rater chose 1 and the column rater chose 2, and so on.

I'll be assuming a normalized confusion matrix from here on in, where the grand total of all cells is exactly one. That means that having large or small $N$ is irrelevant to the measurements.

The basic formula for both Cohen's Kappa and Scott's Pi is

$$\frac{P_o - P_e}{1 - P_e}.$$

Here, $P_o$ is the observed percent agreement—the sum of the main diagonal of the confusion matrix. Cohen finds the expected percent agreement for category one by multiplying the percent of times the row rater chose category one by the percent of times the column rater chose category one (i.e., the total weight in row one times the total weight in column one). Then $P_e$ is the sum of the percent agreement for every category. For Scott, we find the agreement for category one by finding the percent of all classifications (both for row and column) that went to category one, and squaring that.

That is, for Cohen, the row rater and column rater are distinct random number generators, with distinct odds of picking category one, and the at-random odds of agreement is the product of the two. For Scott, the row rater and column rater are both instances of a generic rater, and our best estimate of the generic rater's odds of picking category one is from pooling both row and column categorizations together; the at-random odds of agreement is then the square of the odds the generic rater picks a category.
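In symbols, write $r_i$ for the row rater's total weight on category $i$ and $c_i$ for the column rater's. Then

$$P_e^{\text{Cohen}} = \sum_i r_i c_i, \qquad P_e^{\text{Scott}} = \sum_i \left(\frac{r_i + c_i}{2}\right)^2.$$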

So we already have two plausible models of the at-random null hypothesis. I have yet to see an argument advocating for one over the other that did not reduce to "this one just seems more plausible to me than the other one." Of the two, the literature I have read seems to have a preference for Cohen's Kappa. I suspect this is because Cohen's Kappa is always larger than Scott's Pi, so you look like you have better raters using it. Proving this is left as an exercise for the reader.

The intuition for both measures largely works out: if there is full agreement, then $P_o=1$ and both measures are $(1-P_e)/(1-P_e)=1$ (given raters that used more than one category). If agreement at random is more likely, so $P_e$ is larger, then the same $P_o$ produces a smaller measure, so we are successfully complexity-adjusting $P_o$. If there is worse-than-random agreement, then these measures are negative, which is a little awkward; the usual advice in the literature (and I agree) is that if you get a negative Kappa, then your protocol is seriously broken and you should focus on why your raters can't agree rather than fretting over the exact value of a statistic.

###### The sample code

Below is code calculating both Scott's Pi and Cohen's Kappa. I'll use the Amash amendment data again, which I first used and described in this earlier entry. Here, the 'raters' are (1) votes cast on the amendment, and (2) the party affiliation of the voter. A claim that this was a party vote is a claim that the agreement between these two ratings should be high.

When you run this, you'll see that the rate of agreement between the first and second rating schemes is not all that far from the coin-flip odds of 50%, so all of the measures give the pair of rating schemes a low agreement score.

What about campaign contributions by defense industry members? The second half of main divides voters into high- and low-contribution recipients. To improve the comparison with the above, we'd like a division that produces roughly the same bin counts as the split by party did. So the SQL first counts the Democrats (there are 194), and then finds a contribution cutoff that would produce 194 low-contribution voters. A complication: the correct contribution cutoff is \$13,000, but there are four people who got that level of contributions, so there's no way to get a perfect match between counts. To two decimal places, it makes no difference in the statistics into which bin we put these four voters. Then, we try the same measures using these two systems of categorization. It turns out that the campaign contributions give a marginally greater agreement rate on all measures calculated, though it is not enough to be a significant difference. To get variances for these measures, by the way, I recommend bootstrapping. Next time, I'll present the alternative measure of intercoder agreement that I developed based on entropy measures. As a preview, it's already calculated in the example code (as $P_I$).


#include "169-kappa_and_pi.h"

apop_data *kappa_and_pi(apop_data const *tab_in){
    apop_data *out = apop_data_alloc();
    Apop_stopif(!tab_in, out->error='n'; return out, 0, "NULL input. Returning output with 'n' error code.");
    Apop_stopif(!tab_in->matrix, out->error='m'; return out, 0, "NULL input matrix. Returning output with 'm' error code.");
    Apop_stopif(tab_in->matrix->size1 != tab_in->matrix->size2, out->error='s'; return out, 0,
            "Input rows=%zu; input cols=%zu; these need to be equal. "
            "Returning output with error code 's'.", tab_in->matrix->size1, tab_in->matrix->size2);

    //Copy the input and normalize so the grand total of all cells is one.
    apop_data *tab = apop_data_copy(tab_in);
    double total = apop_matrix_sum(tab->matrix);
    gsl_matrix_scale(tab->matrix, 1./total);

    double p_o = 0, p_e = 0, scott_pe = 0, ia = 0, row_ent = 0, col_ent = 0;
    for (int c=0; c< tab->matrix->size1; c++){
        double this_obs = apop_data_get(tab, c, c);
        p_o += this_obs;                    //observed agreement: the main diagonal

        Apop_matrix_row(tab->matrix, c, row);
        Apop_matrix_col(tab->matrix, c, col);
        double rsum = apop_sum(row);        //row marginal
        double csum = apop_sum(col);        //column marginal
        p_e += rsum * csum;                 //Cohen's at-random agreement
        scott_pe += pow((rsum+csum)/2, 2);  //Scott's at-random agreement

        ia += this_obs * log2(this_obs/(rsum * csum)); //information in agreement

        row_ent -= rsum * log2(rsum);
        col_ent -= csum * log2(csum);
    }
    apop_data_free(tab);

    asprintf(&out->names->title, "Scott's π and Cohen's κ");
    apop_data_add_named_elmt(out, "total count", total);
    apop_data_add_named_elmt(out, "percent agreement", p_o);
    apop_data_add_named_elmt(out, "κ", ((p_e==1)? 0: (p_o - p_e) / (1-p_e)));
    apop_data_add_named_elmt(out, "π", ((scott_pe==1)? 0: (p_o - scott_pe) / (1-scott_pe)));
    apop_data_add_named_elmt(out, "P_I", ia/((row_ent+col_ent)/2));
    apop_data_add_named_elmt(out, "Cohen's p_e", p_e);
    apop_data_add_named_elmt(out, "Scott's p_e", scott_pe);
    apop_data_add_named_elmt(out, "information in agreement", ia);
    apop_data_add_named_elmt(out, "row entropy", row_ent);
    apop_data_add_named_elmt(out, "column entropy", col_ent);
    return out;
}


#include "169-kappa_and_pi.h"

int main(){
    int readin_status = apop_text_to_db("amash_vote_analysis.csv", .tabname="amash");
    Apop_stopif(readin_status== -1, exit(1), 0, "Trouble reading in the data. "
                                "Have you downloaded it to this directory?");
    apop_query("create table summed as "
               "select vote, party, count(*) as ct from amash group by vote, party");
    apop_data *confusion = apop_db_to_crosstab("summed", "vote", "party", "ct");
    apop_data_show(confusion);
    apop_data_show(kappa_and_pi(confusion));

    //Find the contribution level that matches the count of Democrats;
    //use that cutoff to produce a binary small-contrib|big-contrib categorization.
    //There are 194 Dems; the cutoff is thus $13,000. There are four people at
    //exactly that level, so the bin counts can't match perfectly.
    double dem_count = apop_query_to_float("select count(*) from amash where party='Democrat'");
    apop_query("create table contrib_sums as select vote, "
            "contribs> (select max(contribs) from "
            "(select contribs from amash order by contribs limit %g)) "
            "as big_dollars, "
            "count(*) as ct from amash group by vote, big_dollars", dem_count);

    confusion = apop_db_to_crosstab("contrib_sums", "vote", "big_dollars", "ct");
    apop_data_show(confusion);
    apop_data_show(kappa_and_pi(confusion));
}


9 September 13.

### Making a data structure comfortable

This is the last entry in a series of posts on the design and use of the apop_data struct, beginning with this entry. Next time, I'll get to some more statistical application, regarding interrater agreement.

The purpose of this five-entry series was to consider all the little details of what makes a good structure for data, and this final episode covers four additions to the structure so far to make this structure extensible and, well, comfortable.

###### Names

I haven't mentioned column and row names, though they are absolutely essential for us as humans to understand our data, and are a key difference between a matrix manipulation library and a data analysis library. But they largely take care of themselves. When you query from the database, the names get put in place, and then it's my problem as the author of functions like apop_data_transpose, apop_data_stack and apop_data_split to make sure that the names get put in the right place.

The most basic use of names is to use them interchangeably with the numeric index of the row or column:

apop_data *d = apop_data_calloc(3, 3); //allocate 3x3 matrix filled with zeros.
apop_data_add_names(d, 'r', "top", "middle", "bottom"); //row names, so the by-name sets below have something to match
apop_data_add_names(d, 'c', "left", "center", "right"); //column names

apop_data_set(d, 2, 2, 2.2);
apop_data_set(d, .rowname="middle", .colname="right", .val=1.2);
apop_data_set(d, .rowname="top", .col=0, .val=0.0);

//view columns or rows by name:
Apop_data_col_t(d, "right", rightcol);
double rightsum = apop_vector_sum(rightcol);


The names are in their own mitochondria-like struct, the apop_name, which is rarely if ever seen outside of apop_data structs, but provides its host apop_data struct with much of its power. As with mitochondria, names being in a separate structure was an evolutionary anomaly, but there isn't much reason to merge them fully into the host structure either.

Also, having a separate struct makes it easier to implement hashes for the text of names, to speed up name lookups. (This is another to-do item for anybody interested in working on Apophenia).

###### Weights

Real-world data sets are often weighted, so our data set will need a weights vector. At my workplace, we have surveys where there are more weights columns than data columns, but those can always be reduced to a single weight column for any given analysis.

So let's add that as another vector to the apop_data set.

There used to be separate apop_ols and apop_wls (weighted least squares) models. But there was no need: if you give it a data set with weights, apop_ols uses them to implement WLS; otherwise it runs as ordinary least squares.

An empirical distribution is one where the probability of an event equals its frequency in observed data. So we take a data set, add weights, and call it a model. The most common weighting is where every element has equal weight, so that is what is assumed if your_data->weights==NULL. In either case, apop_estimate(your_data, apop_pmf) outputs a PMF model that appropriately wraps the data. See also apop_data_pmf_compress which reduces redundant observations to one weighted observation.
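For instance, a minimal sketch (the table and column names here are hypothetical):

//Read in observations; the table and column names are made up for the example.
apop_data *d = apop_query_to_data("select age, income from survey");

//Attach a weights vector. Setting every weight to one is equivalent
//to leaving d->weights == NULL.
d->weights = gsl_vector_alloc(d->matrix->size1);
gsl_vector_set_all(d->weights, 1);

//Wrap the weighted data set as an empirical distribution.
apop_model *empirical = apop_estimate(d, apop_pmf);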

###### Pages

At this point, we have rectangular data covered. Apophenia tries not to require that the vector and matrix have the same number of rows, so you could even have one column of numbers that doesn't quite fit the rectangle.

But some data doesn't fit a single grid at all. Maybe your aggregate model requires some data at the individual level and distinct additional data at the aggregate level. Maybe you have a parameter set that is a vector, a matrix, and a handful of loose scalars.

So the apop_data struct includes a pointer to another apop_data set. Suddenly it's a linked list of data sets. Given two rectangles of data as per the examples above, link one to the next (via apop_data_add_page) and there you have it: one data set with two distinct pages that you can send to a custom model written around this data configuration.

The MLE routine will gladly optimize over parameters that span several pages. I can't believe I haven't demo'ed this yet, but this is yet another opportunity to build bigger models by taping together simple models.

Also, there are footnotes. Maybe you have the covariance of the main data set, or a list of factor conversions, or even text footnotes referenced in the main data (see NaN boxing at this earlier blog post or in 21st Century C). You could put that in a separate apop_data set, but everywhere you send the main data, you have to send the footnote table; might as well tack it on to the main data set.

I wound up with this rule: if the name of a data set is in <html-style angle brackets>, then it's not data per se but information about data. Those operations that act on all pages of a data set (like apop_data_pack and apop_data_unpack, to de-structure a data set into one long vector) will typically ignore info pages unless you ask them not to. For example, Apophenia's MLE routine has to use apop_data_pack to de-structure the parameters, because the GSL's optimization routines work on gsl_vectors, so info pages in a parameter set are ignored.
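A sketch of both kinds of page, with hypothetical table and page names:

//A main table of individuals plus a second page of aggregate-level data.
apop_data *indiv = apop_query_to_data("select * from persons");
apop_data *agg = apop_query_to_data("select * from regions");
apop_data_add_page(indiv, agg, "regional totals");

//An info page: the angle brackets mean that apop_data_pack and friends
//will ignore it unless asked otherwise.
apop_data_add_page(indiv, apop_data_covariance(indiv), "<Covariance>");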

###### The error marker

Oh, and one last thing your math textbook didn't tell you about: processing errors happen. The error element is a single character, and if it isn't nul, then something went wrong with processing your data set. This makes for a nice processing flow, because you can directly ask the data set returned by a function whether the process went OK.
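A sketch of that flow, assuming a hypothetical table named data:

apop_data *d = apop_query_to_data("select * from data");
Apop_stopif(!d || d->error, exit(1), 0, "Trouble pulling the data set.");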

###### Now have some fun

So there's the apop_data struct. It started with just a rectangular matrix, but expanded to be a vector-matrix hybrid, with a text grid, and weights, and names, and extra pages for additional rectangles of data, and a marker for processing errors. This isn't Nobel-worthy innovation, but designing a structure with enough hooks so that a diverse range of data sets can be handled comfortably and reliably is not entirely trivial, either.

Working with it feels like manipulating building blocks, and it's structured enough that it's easy (albeit sometimes tedious) to write functions manipulating those blocks in predictable ways. I've found working with it to be largely natural, and very predictable. There are many situations where I'm not doing statistics per se, but use Apophenia to keep tables of data organized.

Scroll down to the 49 functions currently listed in the apop_data_… section of Apophenia's function index to see the sort of utility functions that I've written around the structure. Besides the ones that shunt data around, there are also a few, like apop_data_summarize, apop_data_correlation, or apop_data_covariance, that do basic statistical work.

30 August 13.

### Dealing with arrays of text

Dealing with text in C is annoying for several reasons, primary among them the fact that changing a string often involves a memory reallocation. In 21st Century C, I talk about the lovely asprintf function; this was one of the most popular bits of the book; e.g., see the summary on this blog.

Here, I consider the next step: an array of text, which we get when we have text data in a data set. This entry will describe where I wound up in trying to design a setup for dealing with grids of text data.

Last time, I talked about the straightforward gsl_vector and gsl_matrix functions, and the interface functions in Apophenia that smooth over some of the details. Similarly, the text element of the apop_data set is in no way special—it is declared as char ***text—but with the right interface functions, we can use this unremarkable structure reasonably gracefully.

• Declare with apop_text_alloc. This generates the pointer-to-pointer-to-pointer grid for you, which is already saving a lot of pain. This is really a realloc function, so if you change the size to larger or smaller, there's no memory leak. You can only produce a rectangular array of elements, though having a long list of pointers-to-NULL (as Apophenia v1 will have) is rarely a serious problem.
• Add text with apop_text_add. This is really a realloc function, so if there's already text in that slot, there's no memory leak.
• Add lots of static text at once with apop_text_fill.
• Free the whole thing with apop_text_free, which you'll never use anyway, because you'll use apop_data_free to free the entire data set at once.

OK, we're done with memory management of strings in C: just use those three functions, and it's all taken care of. Again, other systems do it in other ways, such as generating a struct that includes string metadata, thus forcing users to always use the accessor methods of the object.

Accessing the elements of a 2D grid of structs is also straightforward, once you have the rules down. Given a data set declared via apop_data *td = apop_text_alloc(NULL, 3, 3):

• td->textsize[0] is the row count; td->textsize[1] is the column count.

• Element $(i, j)$ is td->text[i][j].

With the matrices and vectors, I was concerned with having a setup that gracefully dealt with matrices, vectors, and scalars. Why should I always have to specify two indices when I sometimes only need one or zero? We can do a similar thing using the rule that *x and x[0] are equivalent in C; you can think of one as syntactic sugar for the other (see §6.5.2.1(2) of the C standard).

• The oft-used row count is *td->textsize.

• If you know you have only a single column of text, get item $i$ via *td->text[i]. If you know you have a single text element, it is **td->text.

So there's the string-handling system via the apop_data struct. Write using the allocate and add functions, free with the free function, and read directly from the text and textsize elements.
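Here's a minimal sketch putting those pieces together:

//Allocate a 2x2 text grid, fill it, read it, and resize it.
apop_data *td = apop_text_alloc(NULL, 2, 2);
apop_text_add(td, 0, 0, "one");
apop_text_add(td, 0, 1, "two");
apop_text_add(td, 1, 0, "three");
apop_text_add(td, 1, 1, "four");

printf("%zu rows, %zu columns\n", td->textsize[0], td->textsize[1]);
printf("element (1, 0): %s\n", td->text[1][0]);

apop_text_alloc(td, 3, 2); //realloc to three rows; no leak.
apop_data_free(td);        //frees the text grid along with the rest.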

###### Some bonus functions

Transpose via apop_data_transpose, which transposes both the matrix and text elements of the apop_data set that gets input. This is especially useful when you have a function that needs a list of strings: a column of the text grid is really a scattered set of pointers to single strings, but each row of the text array is one pointer to a list of strings, as desired.

It might be nice to have a few view functions that view columns or subgrids. One could implement them the way that vector views were implemented in the GSL, by setting up a list of starting points and changing the textsizes accordingly. That's a hint, for anybody who's interested in contributing. It's nontrivial, though.

Use apop_text_paste to merge a grid of strings into a single string. This is an emulation of R's paste function, modified to more naturally handle a 2-D grid of text. With creative separators, you can produce all sorts of things with this. E.g., the documentation for the function prints a text grid as an HTML table, by putting the appropriate HTML tags between elements, between rows, and at the head and tail.

Conversely, if you need to split a string into parts, use apop_regex. Using parens in a regular expression tells the regex parser to return those parts of the regex to the user. The POSIX-standard regex parser will return a pointer to the end of the matched string, and we can pick up where that leaves off to parse again. The output from apop_regex will have columns of text corresponding to the parens in your regex, and columns corresponding to repeated application.

Here's an example that takes in a regex and a string, then parses it down and prints the results in a table. This would be pulling teeth in raw C, but with the right interface functions writing this in C is much like writing it in a pointer-free scripting language.

I wrote this as a shell script. Cut and paste it to your command line to have gcc produce a.out and run it as per the examples given.


gcc -xc - `pkg-config --libs --cflags apophenia` -g -Wall --std=gnu99 -O3 << "----"
#include <apop.h>

int main(int argc, char **argv){
    Apop_stopif(argc < 3, return 1, 0, "Give two arguments: a regex and a string.");
    apop_data *txt;
    int returnval = apop_regex(argv[2], argv[1], &txt);
    Apop_stopif(!returnval, return 1, 0, "No matches found.");
    Apop_stopif(returnval==-1, return 2, 0, "Bad regex.");
    printf("%s", apop_text_paste(txt, .before="∙ [", .between_cols="] [", .between="]\n∙ [", .after="]\n"));
}
----

a.out "([[:alpha:]]+), *([-[:digit:].]+)" "{pants, 23} {plants,-14} {dirigibles, 12.81}"
echo —--
a.out "((<[^<]*>)([^<]*)</[^>]*>)" "<b>pants</b> <em>blazers</em>"
echo —--
a.out "([[:alnum:].])" " hello. "