Thursday, September 23, 2010

The variant record update problem

Now, for something completely different...

I've been helping out with the design and implementation of a languge called C0 that is being used in a new course at Carnegie Mellon, 15-122 Principles of Imperative Computation. I'm giving my own point of view here, but I'd say that the goals of this language were four-fold:
  • Programs should be memory safe (a pointer either contians NULL or the address of valid, allocated memory, and you can't read off the end of an array).
  • The language should have no unspecified behavior. This is related to the point above ("oh, you did something undefined here so it's not my fault this pointer got decremented by 17 afterwards") but is still a distinct point - OCaml can be memory safe without specifying the order of evaluation of function arguments, for instance.
  • Programs (at least the family of programs required to do homework tasks for 15-122) should look and act like C programs. (Featherweight Java programs act, but do not look, like Java programs.) And not "more or less" the same as C - GCC should basically be a viable way of compiling C0 programs if they don't rely on behavior that is specified in C0 but unspecified in C.
  • The language should be amenable to garbage collection.
The language design process is primarily the work of Rob Arnold (who just presented his master's thesis yesterday) and his advisor, Frank Pfenning (also my advisor), who is teaching 15-122.

After Rob Arnold's defense, he, William Lovas, and I got into a discussion about adding unions in a safe way to the C0 language. In the process we had to re-derive the fundamental problem with having remotely C-like unions in memory-safe programming languages. I am 90% certain that this is what the literature occasionally alludes to as "the variant record update problem." This is me writing that down so I won't have to derive it for myself yet again.

A piece of "reasonable" C, using unions

It is often the clearest if we think of all data in C as simply a bundle of bits; unions in C seem mostly like a way of providing different "views" for reading and writing into a bundle of bits. This is pretty well-illustrated by the following example, which we can then use to explain a very old problem with trying to tame this kind of union and treat it as a sum type or tagged unions in a memory-safe way.
  1.  typedef struct pair { 
2. unsigned char fst;
3. char snd;
4. } pair;
5. typedef union {
6. unsigned short inj1;
7. pair inj2;
8. } foo;
9. int f(foo* foo) {
10. foo->inj1 = 1536;
11. return 7;
12. }
13. int main() {
14. foo foo;
15. foo.inj1 = 1535;
16. // foo = Inj1(1535)
17.
18. foo.inj2.fst = 9;
19. foo.inj2.snd = 24;
20. // foo = Inj2(9, 24)
21.
22. foo.inj2.fst = f(&foo);
23. // WTF? Is foo Inj1 or Inj2?
24.
25. return 1;
26. }
So, in C, we could pretty reasonably predict what should happen here given three pieces of information:
  1. The size of a char and a short. (A char is 1 byte and a short is 2 bytes, in this case.)
  2. The way C lays out structs in memory. (This is actually reasonably well-defined.)
  3. The "endian-ness" of the machine we're working on. (I just worked it out from observerd results, myself: I can never remember which is which.)
In particular, this is what the 16 bits of stack storage allocated to foo looks like at various points in this program:
                       01234567 89ABCDEF
| fst | snd |
At line 16 (in main) 11111111 10100000
At line 20 (in main) 10010000 00011000
At line 11 (in f) 00000000 01100000
At line 23 (in main) 11100000 01100000
The assignment in f() happened before the crazy assignment was complete, writing a valid 16-bit short into the 2 bytes available to it. Then, the crazy assignment completed after f() returned, causing the first byte of that 2-byte short to get overwritten with the integer value 7 that was returned from f.

A non-starter

The basic idea that everyone has when trying to turn an unsafe language with untagged unions into a tagged language with tagged unions is "all we need to do is add tags and check them at the correct time." This example shows that there is a more fundamental problem, because there are a lot of different arguments for what the code above should even do in a safe programming language. The C semantics don't work: if foo.inj1 and foo.inj2.snd were pointers, then at line 23 foo can not safely be tagged as inj1 (part of its pointer got overwritten by the number 7) or as inj2 (because the second component is a pointer that got partially overwritten by a pointer in f().

We came up with four alternatives for what could mean, and none of us could agree on any one option; this is, of course, a sure sign that none of them make much sense.
  1. Check, throw an error ("the safe option"). When we make an assignment into a union tagged with inj2, we're asserting that when the right-hand side finishes evaluating to a value, that tag will still be inj2. We have to check, after evaluating the right-hand-side to a value, that this property was maintained - and it's a runtime error if that isn't the case, kind of like running off the end of an array. Technically, this is a bit difficult, but it's certainly possible.
  2. Re-initialize memory. C0 has a notion of default values; whenever we allocate heap space for a new integer, that integer is 0. We'd actually want, after the assignment on line 18, to initialize the value of foo.inj2.snd to a default value (again, if it's a pointer we'd want to be sure it's NULL rather than some arbitrary string of bits that was there before). So, when we're ready to finally write a value, we check its tag - if the tag is wrong, we re-initialize it with the right tag and then perform the assignment; this is what happened on line 18. On line 22, it means that, because f() set the tag of foo to inj1 and we're making an assignment into inj2, we re-initialize the union as an inj2 and then perform the assignment. At line 23, foo has the tag inj2, foo.inj2.fst == 7, and foo.inj2.snd == 0.
  3. Duplicate the entire union's memory block. Rather than after-the-fact tag checking, another option is to observe that we are reading into a union/struct, copy the entire union/struct into some temporary storage after evaluating the left-hand side, make the assignment into the temporarily storage, and then write the whole of the temporary storage back to its original location. In this case, at line 23 foo has the tag inj2, foo.inj2.fst == 7, and foo.inj2.snd == 24, the value it was set to before the assignment.
  4. Secretly allocate memory. If we secretly say that a union is a pointer to some allocated memory, then when we change the tag we have to allocate some new memory. In this case, when f() changes the tag of foo, some new memory is allocated, tagged inj1, and the integer value 1536 is written there. foo is now a tag (inj1) and a pointer to the newly allocated memory, but the old memory (which foo pointed to before the assignment) is still in existence, and so we can safely write the value 7 to the memory location we came up with when we evaluated the left hand side of line 22. When we get to line 23, then, the memory where we just wrote the 7 has been orphaned, foo has a tag inj1, and contains the short integer value 1536. Note that this sort of trickery only makes any sense if we have a garbage-collected programming language; however, C0 is a garbage collected language. However, it's also the only one of these four options that preserves the metaphor that a LHS evaluates to a memory location, a RHS evaluates to a value, and then the assignment stores the RHS value in the LHS memory location without further ado.
I'd be quite interested if anybody had information about real imperative programming languages that made any one of the four choices above, or some other choices.

What's actually going on here?

The real problem here is that C's notation for unions is deeply broken from a "mathematical perspective" if we want to think of unions as sum types. A sum type is not something you "project the inj1 field from" like a struct, it's something that you inject into the inj1th branch of. So a much better notation than foo.inj1 = 4 is foo = inj1(4). We would think of this use of inj1 as somewhat equivalent to a constructor in ML, which leads us to a better understanding of the problem - when injecting into the inj2 branch, it's clear from the constructor notation that we need assign all the constituent values at the same time: foo = inj2(1,2). If this was the only way we could write to inj2, it would remove the difference between options 2 and 3 above - we would have to be clear in advance what we expected foo.inj2.snd to be, that's what it means to write into a branch.

Something like this solution is what was adopted in Ada, if we can believe the following caveat in the Ada Wikibook:
When "mutating" a record, you must assign all components of the variant structure which you are mutating at once, replacing the record with a complete variant structure.
This really seems like it would be the only remotely correct way of dealing with the problem.

Conclusion

Should unions be added to C0, and how? Well, I don't actually have too much of a dog in that fight, and even if I did it's not my decision. Because we don't have "struct literal" values, we're unable to write something like foo.inj2 = { fst = 1, snd = 2 }, so the mostly-reasonable Ada solution is unavailable. A first step might be to disallow structs inside of unions, which means that every union can only ever contain one thing. This prevents the awkward question of "what is this other part of the union after the assignment" from ever arising as an issue - which is really the distinction between options 2 and 3 above.

The restriction of union members to "small values only" (the C0 way of saying "no structs") is limiting from the perspective of C code that one might imagine wanting to write (and is therefore problematic for the goal that "good C0 programs look like good C programs,") but it seems preferable to any of the four options above. In particular, if whenever you wanted to have a struct in a union you instead had a struct pointer, that would basically amount to implementing Option 4 in a non-secretive way.

No comments:

Post a Comment