26 January 12.

Tip 58: Destroy your inputs

[PDF version]

level: function writer
purpose: use the declarations you already have

This is not a new tip. In fact, it's on page 24 of the 1ed. of K&R:

Call by value is an asset, however, not a liability. It usually leads to more compact programs with fewer extraneous variables, because arguments can be treated as conveniently initialized local variables in the called routine.

I really hope this is an ex post justification for the system, not their real motivation, because the technique of screwing around with the input variables knowing they can't affect the main program has somewhat limited utility.

Integers can often be used as countdown variables. To rewrite K&R's example:

double power(double x, int n){ //assume n>=0
    double out =1;
    for (; n>0; n--) out *= x;
    return out;
}

int main(){
    printf("2^5: %g\n", power(2, 5));
    printf("2.5^5: %g\n", power(2.5, 5));
    printf("3^5: %g\n", power(3, 5));
}

If you have a NULL-terminated list, you already have what you need to step through it. We guaranteed such a list in Tip 26, so let's rewrite the code there without bothering with an index in the sum_base function:

#include <math.h> //NAN
#include <stdio.h>

#define sum(...) sum_base((double[]){__VA_ARGS__, NAN})

double sum_base(double  in[]){
    double out=0;
    for ( ; !isnan(*in); in++) out += *in;
    return out;
}

int main(){
    double two_and_two = sum(2, 2);
    printf("2+2 = %g\n", two_and_two);
    printf("(2+2)*3 = %g\n", sum(two_and_two, two_and_two, two_and_two));
    printf("sum(asst) = %g\n", sum(3.1415, two_and_two, 3, 8, 98.4));
}

The code isn't actually shorter, but the for loop uses the input pointer to step through the array, rather than using an index to step through. If we had a linked list, then it may be impossible to use the index-based form, so you're forced to use a form like this one.

This may come up frequently in main, because argv is just the sort of list you would want to step through--and the standard says that its last element (argv[argc]) is NULL (C99 & C11 §5.1.2.2.1.2, which also says that you are allowed to modify the values of argv in place--they are not string constants). So, in fact, argc may be entirely redundant.

OK, you can count down with integers, and you can use pointers to arrays or lists to step through the structure and check up on each element individually. Are there any other ways to take advantage of inputs as conveniently initialized variables?


[link][no comments]


24 January 12.

Tip 57: Base your code on pointers to objects

[PDF version]

level: Your code depends on a class of objects
purpose: don't worry about scope when you don't want to

So you've seen several examples at this point of an object setup, with a typedef and new/copy/free functions. There was the fstr*, a pointer to a file-as-string, the group*, a pointer to a social group which our imaginary agents joined and left, and the apop_model *, a pointer to a statistical or simulation (or other) model.

Why did I base all of these things on pointers to data structures, instead of just passing around data structures directly? Using a plain struct is as easy as rolling out of bed. If you use a pointer-to-struct, you require a new/copy/free function; if you use a plain struct, then:

new
Use designated initializers on the first line where you need a struct.
copy
The equals sign does this.
free
Don't bother; it'll go out of scope soon enough.

So we're making things more difficult for ourselves with pointers. Yet from what I've seen, there's consensus on using pointers to objects as the base of our designs.

Pros to using pointers:

  • Copying a single pointer is cheaper than copying a full structure, so you save a picosecond on every function call with a struct as an input. Of course, this only adds up after a few hundred million function calls.
  • Data structure libraries (your trees and linked lists) all have hooks for a void pointer.1
  • Now that you're filling a data structure, having the system automatically free the struct at the end of the scope in which they were created may be an annoyance.
  • Many of your functions that take in a struct will modify the struct's contents, meaning that you've got to pass a pointer to the struct anyway. Having some functions that just take in the struct and some that take in a pointer to struct is confusing (I have one interface like this and I regret it), so you might as well just send a pointer every time.
  • Once you have a pointer inside the struct, then the convenience bonus from using a plain struct evaporates anyway: if you want a deep copy (wherein the data pointed to is copied, not just the pointer) then you need a copy function, and you will probably want a free function to make sure the internal data is eliminated.

As your project gets bigger, and a throwaway struct grows into a core of how your data is organized, the pros for pointers wax and the pros for non-pointers wane. That's why these tips, taken together, don't provide a one-size-fits-all set of rules for using structs. The best techniques for using the throwaway structs are different from those for using structs as the core of your data organization.


Footnotes

... pointer.1
A side note on the void pointer thing: using a void pointer is giving up the type-checking that is decidedly a good thing and which has saved all of us heartache at some time or another. With regards to linked lists and such, however, I've never perceived a problem. If I have a linked list named active_groups and another list named persons, both of which take in a void pointer, then the compiler would let me append person to the active_groups list, but it is obvious to me as a human that I'm touching the wrong list. As a practical matter, it doesn't take all that much care to check that you don't hook the wrong thing onto a data structure's void pointer hook.

Things get much worse when you have a void pointer as input to a function, and when using that kind of mechanism, I try to keep the function and the calls to the function close together.



[link][no comments]


21 January 12.

Tip 56: Enums---don't bother

[PDF version]

level: You have laundry lists
purpose: don't force users to memorize your laundry lists

Enums are a good idea that went bad.

The benefit is clear enough: integers are not at all mnemonic, and so wherever you are about to put a short list of integers in your code, you are better off naming them. Here's (the even worse means of) how we could do it without the enum keyword:

#define NORTH 0
#define SOUTH 1
#define EAST 2
#define WEST 3

With enum, we can shrink that down to one line of source code, and our debugger is more likely to know what EAST means; here's the improvement over the sequence of #defines:

enum directions {NORTH, SOUTH, EAST, WEST};

But we now have five (5) new symbols in our global space, including directions and NORTH, SOUTH, et al.

For an enum to be useful, it typically has to have global scope. For example, you'll often find enums typedefed in the public header file for a library. Having a global variable creates responsibilities. To minimize the chance of namespace clashes, library authors use names like G_CONVERT_ERROR_NOT_ABSOLUTE_PATH or the relatively brief CblasConjTrans. I have to look them up every time and they take up half the line.

At which point an innocuous and sensible idea has fallen apart. I don't want to type these messes, and I use them so infrequently that I have to look them up every time (especially since many are infrequently used error values or input flags). Also, all caps continues to read like yelling.

My own habit is to use single characters, wherein I would mark transposition with 't' and a path error with 'p'. I think this is enough to be mnemonic--in fact, I'm far more likely to remember how to spell 'p' than how to spell the above all-caps mess--and it requires no new entries in the namespace.

Before you start arguing easy-to-parody efficiency issues, bear in mind that an enumeration is typically an integer, while char is really just C-speak for a single byte. So when comparing enums, you will likely need to compare the states of about sixteen bits, while with a char, you need compare only eight. &#161;A two-fold speed gain!, dropping the time for this comparison from effectively zero and not worth tracking to effectively zero and still not worth tracking.

We sometimes need to combine flags. When opening a file using the open system call, you may need to send O_RDWR|O_CREAT, which is the bitwise combination of the two enums. You probably don't use open directly all that often; you are probably making more use of fopen, which is more user friendly. Instead of using an enum, it uses a one- or two-letter string, like "r" or "r+" to indicate whether something is readable, writeable, both, et cetera.

In the context, you know "r" stands for read, and if you don't have the convention memorized, you can confidently expect that you will after a half-dozen more uses of fopen. Whereas I still have to check whether I need CblasTrans, CBLASTrans CblasTranspose, ..., every time.

Again, caring about the runtime efficiency thing is `70s. The twenty seconds it takes to look up an awkward enum times a dozen re-lookups is equivalent to a few billion strcmps between two two- or three-letter strings. If you really think it matters, then, as above, you'd rather use a single character than an enum.

There are reasons for using enums: sometimes you have an array that makes no sense as a struct but that nonetheless requires named elements; when doing kernel-level work, giving names to bit patterns is essential. But in those cases where enums are used to indicate a short list of options or a short list of error codes, a single character or a short string can serve the purpose without cluttering up the namespace or authors' memory.


[link][no comments]


19 January 12.

Tip 55: Mark input pointers with const

[PDF version]

level: Not just writing code: writing code with style
purpose: Clarify your intent

Early in your life, you learned that copies of input data are passed to functions, but you can still have functions that change input data by sending in a copy of a pointer to data.

When you see that an input is plain, not-pointer data, then you know that the caller's original version of the variable won't change. When you see a pointer input, it's unclear. Lists and strings are naturally pointers, so the pointer input could be data to be modified, or it could just be a string.

The const keyword is a literary device for you, the author, to make your code more readable. It is a type modifier indicating that the data pointed to by the input pointer will not change over the course of the function.

As a literary device, however, it is problematic. The compiler does not lock down the data being pointed to against all modification. In the following example, a and b point to the same data, but because b is not const in the header for set_elmt, it can change the third element of the a array.

void set_elmt(const int *a, int *b){
    b[0] = 3;
}

int main(){
    const int a[10];
    int *b = (int*)a;
    set_elmt(a, b);
}

So it is a literary device, not a lock on the data.

In practice, you will find that the keyword sometimes creates tension that needs to be resolved, when you have a pointer that is marked const, but want to send it as an input to a function that does not have a const marker in the right place. Maybe the author read somewhere that the keyword was too much trouble, or believes the chatter about how shorter code is always better code, or just forgot.

Before proceeding, you'll have to ask yourself if there is any way in which the pointer could change in the const-less function being called. There may be an edge case where something gets changed, or some other odd reason. This is stuff worth knowing anyway.

If you've established that the function does not break the promise of const-ness that you made with your pointer, then it is entirely appropriate to cheat and cast your const pointer to a non-const for the sake of quieting the compiler. I actually did that above with the line about int *b = (int*)a. Without the cast from const int* to int*, the compiler would have complained.

When sending a const pointer to a function without a const in the header, we can use the same casting to quiet the compiler:

//no const in the header this time:
void set_elmt(int *a, int *b){ b[0] = 3; }

int main(){
    const int a[10];
    int *b = (int *) a;
    set_elmt((int*)b, b);
}

The rule seems reasonable to me. You can override the compiler's const-checking, as long as you are explicit about it and indicate that you know what you are doing.

If you are worried that the function you are calling won't fulfill your promise of const-ness, then you can take one step further and make a full copy of the data, not just an alias. You should get the same result from the function either way.

Constness lacks depth
Let us say that we have a structure, typedef struct {int *list, int list_length} list_t, and we send in a pointer to this structure. Are the elements of the structure const or not?

Let's try it: the following program generates a struct with two pointers, and in check_all_counters that struct becomes const, yet when we send one of the pointers held by the structure to the const-less subfunction, the compiler doesn't complain.

typedef struct {
    int *counter1, *counter2;
} counters;

void check_counter(int *ctr){ assert(*ctr !=0); }

void check_all_counters(const counters *in){
    check_counter(in->counter2);
}

int main(){
    counters cc = {.counter1=malloc(sizeof(int)), .counter2=malloc(sizeof(int))};
    *cc.counter1= *cc.counter2=1;
    check_all_counters(&cc);
}

If you really need to protect an intermediate level in your hierarchy of types, like you have a list of pointers that can't move but the data in the pointers can change, then maybe you're better off just documenting the promise. If you do const only an intermediate level in the hierarchy, do your readers a favor and document that too.

So there are the problems: if you send a const pointer to a non-const function, you'll need to cast; if you have a const struct, its pointer elements are not const, and so you'll have to keep an eye on them without the compiler's help.

As literature goes, that isn't all that problematic, and the recommendation that you add const to your headers as often as appropriate still stands--don't just grumble about how the people who came before you didn't provide the right headers. After all, some day somebody else will use your code, and you don't want them grumbling about how they can't use the const keyword because your functions don't have the right headers.


[link][no comments]


17 January 12.

Tip 54: Put functions in your structs

[PDF version]

level: your structures are getting diverse
purpose: maintain a standard form over diversity

Since the last two episodes included some long sample code, this time I'm going to just tell you about Apophenia, the library of stats functions that co-evolved with Modeling with Data.

One of the key data structures is intended to represent a statistical model. Broadly, statistical models are all the same: the box diagram would have parameters on one side (think of the mean and variance of a Normal distribution, μ and σ), then another arrow for the data (a data point, x), and the black box would spit out a probability, P(x| μ, σ).

So our struct will have elements named parameters and data, which are not especially challenging. There should be a probability function to go from parameters and data to P(parameters, data), and here we run into a problem: there is a different calculation for every model. The math for a Normal distribution has nothing to do with the math for a Poisson distribution or a Binomial distribution. So having a single function prob(parameters, data, model) won't work.

The cleanest solution is to put a slot for the function inside the struct typedef, and associate a new function with every struct:

typedef double (*model_to_double)(apop_model *);

typedef struct {
    apop_data *parameters, *data;
    model_to_double *prob, *log_likelihood;
} apop_model;

Simulation models typically use the probability, but people who work with probability distribution-based models tend to work with the log probability (i.e., the log likelihood, where the distinction between probability and likelihood is irrelevant for our purposes), so I threw in a slot for both.

Now, when we define a new model object, we set the functions as needed.

/* This is very cut down from the Apophenia source, and I'm not going 
to explain the math here.  
Focus on the declaration below the function.
*/
static double apply_me(double x, void *mu){ return gsl_pow_2(x - *(double *)mu); }

static double normal_log_likelihood(apop_data *d, apop_model *params){
    //check that params->parameters is not null here.
    double mu = apop_data_get(params->parameters, 0, -1);
    double sd = apop_data_get(params->parameters, 1, -1);
    return -apop_map_sum(d, .fn_dp = apply_me, .param = &mu)/(2*gsl_pow_2(sd))
            - tsize*(M_LNPI+M_LN2+log(sd));
}

apop_model apop_normal= {.name="Normal distribution", .log_likelihood=normal_log_likelihood};

Hey, wait--after all that I didn't define the probability function. But it's easy to calculate: the log probability is log(probability), and probability = exp(log likelihood). Instead, here is a dispactch function prob, which lives outside of ths struct, and would be used the way all the functions worked before we started putting methods inside structs. We usually see the struct as the first input to such functions, but Apophenia's rule for ordering the inputs is that the data always comes first.

double prob(apop_data *d, apop_model *p){
    assert(p->parameters); //I expect users have set this before calling.
    if (p->prob)
        return p->prob(d, p);
    else if (p->log_likelihood)
        return exp(p->log_likelihood(d, p));
    else Apop_error("I need either the prob or log likelihood "
                    "methods in the input model.");
}

If there were a sensible default method for calculating the probability given data and parameters, we'd put it here in the dispatch function. For the estimation of a model (find the most likely parameters given data), there actually is a default, in the form of maximum likelihood methods. So the estimate routine looks like

//again, this is cut.
apop_model * apop_estimate(apop_data *d, apop_model *p){
    if (p->estimate)
        return p->estimate(d, p);
    else 
        return apop_maximum_likelihood(d, p);
}

If we put a function inside of the struct, then we need to point the struct to the right function on initialization every time. If we can reasonably expect that the function will be different every time, then that's exactly what we need.

If you have a new_struct function that gathers together the input data and functions and spits out a cleaned-up struct, then you can use that setup function to assign a default function, the way that apop_estimate used the maximum likelihood function when the struct didn't have its own estimate routine.

I prefer to set up structures using designated initializers, so that means I need a dispatch function that checks for the method being called and uses a default as needed.

I want this
Further, let me point out a sad fact about methods inside of structs. You pretty much always need to send in the struct itself, which means redundancy. If we didn't have the dispatch function, we'd have calls like normal_dist.p(data, normal_dist).

C++-type languages have a special rule that the first element of a function inside a struct will be the struct itself. Either you write a method with the appropriate first argument, and so a header like normal_estimate(apop_model this, apop_data *d); or the system may just define a special variable this to point to the parent struct.

C doesn't define magic variables for you, and it is always honest and transparent about what parameters get sent in to a function. Normally, if we want to futz around with the parameters of a function, we do it with the preprocessor, which will gladly rewrite f(anything) to f(anything else). However, all of the transformations are a function of what goes on inside of the parens. There's no way to get the preprocessor to transform the text s.prob(d) to s.prob(s, d). If you don't want to slavishly imitate C++-type syntax, you can write a macro

#define prob(s, ...) s.prob(s, __VA_ARGS__)

But now you've cluttered up the global namespace with this prob symbol. So there's one more reason to use a dispatch function: it can take care of the redundancy for you.


[link][no comments]


15 January 12.

Tip 53: Count references

[PDF version]

level: your data structures are getting busy
purpose: Know when to hold `em; fold `em

Last time, we saw the situation where there was one data set being viewed many times, meaning that one struct was canonical and the others, although otherwise identical, were subsidiary.

This time, we'll point to the same data many times over, and each pointer will be identical. After we have two or three references to the data, we won't know or care which came first. But the problem is similar: we need to free the struct and its contents when there are no references to it left, and no sooner.

Adding an owner bit to the structure was an easy four-step process last time, and it'll be approximately the same process this time:

  • The type definition includes an integer named counter.
  • The new function sets counter = 1.
  • The boilerplate copy function sets counter++.
  • The free function queries if(-counter==0), and if yes, then free all shared data; else, just leave everything as is, because we know there are still references to the structure.

Again, as long as your work with the structure is entirely via the new/copy/free functions, this will work fine.

Here's another full example. It is an agent-based model of group membership. Agents are on a two-dimensional preference space (because we'll plot the groups) in the square between (- 1, - 1) and (1, 1). Their utility from a group is -(distance to group's mean position + M*number of members). The group's mean position is simply the mean of the positions of the group's members (excluding the agent querying the group), and M is a constant that scales how much the agents care about being in a large group relative to how much they care about the group's mean position. At each round, agents will join the group with the best utility to the agent.

With some random odds, the agent will originate a new group. However, because agents are picking a new group every period, the agent may abandon that newly originated group in the next period.

Brace yourself--this takes almost 125 lines of code.

The header. What I call the join and exit functions might more commonly be read as the copy and free functions. The group_t structure has a size element, which is the number of group members--the reference count. You can see that I use Apophenia and Glib. Notably, the groups are held in a linked list, private to the groups.c file; maintaining that list will require fully two lines of code, including a call to g_list_append and g_list_remove.

#include <apop.h>
#include <glib.h>

typedef struct {
    gsl_vector *positions;
    int id, size;
} group_t;

group_t* group_new(gsl_vector *positions);
group_t* group_join(group_t *joinme, gsl_vector *position);
void group_exit(group_t *leaveme, gsl_vector *position);
group_t* group_closest(gsl_vector *position, double mb);
void print_groups();

Now for the file defining the details of the group object. Notice that the printout function is Gnuplot friendly. If you use another plotting system, it shouldn't take but a few seconds to rewrite it as needed.

#include "groups.h"

GList *group_list;

group_t *group_new(gsl_vector *positions){
    static int id=0;
    group_t *out = malloc(sizeof(group_t));
    *out = (group_t) {.positions=apop_vector_copy(positions), .id=id++, .size=1};
    group_list = g_list_append(group_list, out);
    return out;
}

//The copy function
group_t *group_join(group_t *joinme, gsl_vector *position){
   int n = ++joinme->size;  //increment the reference count
   for (int i=0; i< joinme->positions->size; i++){
       joinme->positions->data[i] *= (n-1.)/n;
       joinme->positions->data[i] += position->data[i]/n;
   }
   return joinme;
}
    
//The free function
void group_exit(group_t *leaveme, gsl_vector *position){
   int n = leaveme->size--;  //lower the reference count
   for (int i=0; i< leaveme->positions->size; i++){
       leaveme->positions->data[i] -= position->data[i]/n;
       leaveme->positions->data[i] *= n/(n-1.);
   }
   if (leaveme->size == 0){ //garbage collect?
       gsl_vector_free(leaveme->positions);
       group_list= g_list_remove(group_list, leaveme);
       free(leaveme);
   }
}

group_t *group_closest(gsl_vector *position, double mass_benefit){
    group_t *fave=NULL;
    double smallest_dist=GSL_POSINF;
    for (GList *gl=group_list; gl!= NULL; gl = gl->next){
        group_t *g = gl->data;
        double dist= apop_vector_distance(g->positions, position, 'L', 3)-
            mass_benefit*g->size;
        if(dist < smallest_dist){
            smallest_dist = dist;
            fave = g;
        }
    }
    return fave;
}

void print_groups(){
    printf("plot '-' with points pointtype 6\n");
    for (GList *gl=group_list; gl!= NULL; gl = gl->next)
        apop_vector_print(((group_t*)gl->data)->positions);
    printf("e\n");
}

The program file, defining the array of persons, and the main loop of rechecking memberships and printing out. At this point all interface with the groups happens via the new/join/exit/print functions. There is zero memory management code in this file--the reference counting guarantees us that when the last member exits the group, it will be freed.

Again, main does some Gnuplot-specific stuff, so if you saved this as groupabm.c, then call groupabm | gnuplot on your command line.

#include "groups.h"

int pop=2000, 
    periods=200,
    dimension=2;

typedef struct {
    gsl_vector *positions;
    group_t *group;
} person_t;

void check_membership(person_t *p, gsl_rng *r, double mass_benefit, double new_group_odds){
    group_exit(p->group, p->positions);
    p->group=NULL;
    if (!p->group) p->group =
        (gsl_rng_uniform(r) < new_group_odds)
             ? group_new(p->positions)
             : group_join(group_closest(p->positions, mass_benefit), p->positions);
}

person_t person_setup(gsl_rng *r){
    gsl_vector *posn = gsl_vector_alloc(dimension);
    for (int i=0; i< dimension; i++)
        gsl_vector_set(posn, i, 2*gsl_rng_uniform(r)-1);
    return (person_t){.positions=posn};
}

void init(person_t *people, int pop, gsl_rng *r){
    for (int i=0; i< pop; i++)
        people[i] = person_setup(r);
    //start with ten groups
    for (int i=0; i< 10; i++)
        people[i].group = group_new(people[i].positions);
    for (int i=10; i< pop; i++)
        people[i].group = group_join(people[i%10].group, people[i].positions);
}

int main(){
    double  new_group_odds = 1./pop,
            mass_benefit = .7/pop;
    gsl_rng *r = apop_rng_alloc(1234);
    person_t people[pop];
    init(people, pop, r);
    printf("unset key;set xrange [-1:1]\nset yrange [-1:1]\n");
    for (int t=0; t< periods; t++){
        print_groups();
        for (int i=0; i< pop; i++)
            check_membership(&people[i], r, mass_benefit, new_group_odds);
    }
}


[link][no comments]