Recommended reads – the writings of Jon Bentley

Not everyone is interested in reading old stuff, but there are some interesting lessons to be learned from many of the older computer science writings, because they were so well written. A case in point: the writings of Jon Bentley. He wrote a great column, Programming Pearls, in Communications of the ACM for many years starting in the 1980s. Most of the columns are also available in book form as Programming Pearls (2nd ed.), ISBN 0-201-65788-0.


First programming language – industrial vs academic

A language is a language is a language – or so they say.

In the 1950s, when Fortran emerged from the depths of IBM, there were no computer scientists. Let's not mince words: by today's standards John Backus (who developed Fortran) was a computer scientist, but the reality is that early "computer scientists" were mathematicians. So began the long road of programming language development, with many of the most popular languages developed by industry (including languages designed for scientific use):

Fortran (IBM, John Backus)
Cobol (Conference on Data Systems Languages, US DOD)
C (Bell Labs, Dennis Ritchie)
C++ (Bell Labs, Bjarne Stroustrup)
Ada (CII Honeywell Bull, Jean Ichbiah, US DOD)
Java (Sun Microsystems, James Gosling)
Python (Guido van Rossum)
Ruby (Yukihiro Matsumoto)

There are of course good languages that had their birth in academe, the likes of ALGOL, Pascal, Modula-2, and Lisp. Many of these languages contributed to the evolution of others – but very few were successful in industry. Languages used in introductory programming courses generally tend to follow one of two paths: (i) a commercial language, or (ii) an "academic" language. The former are dominated by the likes of Java and Python. The latter can be scoped into functional languages such as Scheme and Haskell, which still have limited industrial scope.

So which is better – industrial or academic? There is of course a good argument to be made for industrial languages – best to use what industry advocates, or at least obtain a grounding in it. But one can go too industry-specific too early, e.g. Objective-C, and end up having learnt a language which is being supplanted (by Swift, in the case of iOS). There is also a case for using simple languages which favour the teaching of programming concepts rather than syntax. For the purpose of teaching the craft of programming, I would advocate a language which has some industry force, but in which complexity such as pointers can be minimized, and language idiosyncrasies such as the dangling-else can be avoided. It would also be nice to dodge compiler issues like the lack of array-out-of-bounds checking. Foremost, a language used in introductory programming is used to illustrate programming concepts. The syntax of a language should not overwhelm novice programmers.

Good vis-à-vis bad choices for a first programming language, based on the idea of teaching the craft of structured programming?

  • GOOD: Pascal, Ada, Python, Julia, Fortran
  • BAD: Scheme (and anything based on Lisp), industry-specific languages, Java, Cobol, Smalltalk (and anything too OO-specific)
  • MAYBE?: C, C++

When the Pi kills your program.

Yesterday I wrote a simple C program to perform median filtering on a 4352 × 7712 pixel grayscale image (taken from a Nokia Lumia 1020 41MP camera). A MacBook Pro with a 2.5 GHz Intel Core i5 processor (8GB memory) took between 200 and 500 milliseconds to process the image and produce a filtered image. I dumped the same code on my Raspberry Pi (model B), which contains a 700 MHz ARM1176JZF-S processor with 256MB of memory.

What happened? The Pi killed the program. I was a little befuddled, to say the least. I have never had an OS kill a program. I even used the heap (mainly because the C program wouldn't even run with static arrays of 33,562,624 pixels × 4 bytes (int) ≈ 128MB each). Which was probably my downfall on the Pi, largely because I created two of these images, which would have eaten up the whole memory. Of course the Pi just gives you the message: "killed". Was it an algorithmic homicide? To investigate this further, I looked in the kernel log: /var/log/kern.log. What did I find?

raspberrypi kernel: [ 2164.651521] Out of memory: Kill process 1558 (a.out) score 905 or sacrifice child

And so it ends. Things that we take for granted on "normal" systems don't work so well on constrained systems such as the Raspberry Pi. It forces us to rethink the way an algorithm is designed. The obvious solution is to reduce the size of the image – but is this realistic? The better approach may be to process the image in parts, perform some form of better memory management, or save the processed image to file rather than store it in memory.
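
By way of illustration, here is a minimal sketch (not my original program) of the "process it in parts" idea: a 3×3 median filter computed row by row, so that only a handful of row buffers are ever resident in memory rather than two full images. The raw 8-bit grayscale file names and the WIDTH constant are assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WIDTH 7712   /* assumed row length in pixels */

static int cmpByte(const void *a, const void *b)
{
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

/* median of the 3x3 window centred on column x, with edge columns clamped */
static unsigned char median3x3(unsigned char *rows[3], long x)
{
    unsigned char win[9];
    int n = 0, r, dx;

    for (r = 0; r < 3; r = r + 1)
        for (dx = -1; dx <= 1; dx = dx + 1) {
            long xx = x + dx;
            if (xx < 0) xx = 0;
            if (xx >= WIDTH) xx = WIDTH - 1;
            win[n] = rows[r][xx];
            n = n + 1;
        }
    qsort(win, 9, 1, cmpByte);
    return win[4];
}

int main(void)
{
    FILE *in = fopen("image.raw", "rb");      /* assumed input file  */
    FILE *out = fopen("filtered.raw", "wb");  /* assumed output file */
    unsigned char *prev = malloc(WIDTH), *curr = malloc(WIDTH);
    unsigned char *next = malloc(WIDTH), *outrow = malloc(WIDTH);
    unsigned char *rows[3];
    long x;

    if (!in || !out || !prev || !curr || !next || !outrow)
        return 1;

    /* prime the 3-row window: duplicate the first row as its own "previous" */
    if (fread(curr, 1, WIDTH, in) != WIDTH)
        return 1;
    memcpy(prev, curr, WIDTH);

    while (fread(next, 1, WIDTH, in) == WIDTH) {
        unsigned char *tmp;
        rows[0] = prev; rows[1] = curr; rows[2] = next;
        for (x = 0; x < WIDTH; x = x + 1)
            outrow[x] = median3x3(rows, x);
        fwrite(outrow, 1, WIDTH, out);
        tmp = prev; prev = curr; curr = next; next = tmp;   /* slide the window */
    }

    /* last row: duplicate it as its own "next" row */
    rows[0] = prev; rows[1] = curr; rows[2] = curr;
    for (x = 0; x < WIDTH; x = x + 1)
        outrow[x] = median3x3(rows, x);
    fwrite(outrow, 1, WIDTH, out);

    free(prev); free(curr); free(next); free(outrow);
    fclose(in); fclose(out);
    return 0;
}

With three row buffers plus an output row, the working set is a few tens of kilobytes, rather than the two full-size images (roughly 256MB as 4-byte ints) that got the original program killed.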

Cobol – the elephant in the room

The elephant in the room is likely Cobol (and to a lesser extent Fortran). Still widely used in industry, yet often maligned in academic circles. Who needs to teach that language anymore? Cobol isn't well liked because of its verbosity, its English-like syntax (who would have thought), and its lack of structured programming. Yet it is said that more than 70% of the world's business runs on Cobol. True, Cobol isn't cool. It is criticized for being "out-of-date, not attractive, complex, and expensive". Sure, there isn't a huge demand for people with an understanding of Cobol – but neither is there a glut of people who can decipher and re-engineer Cobol programs – you see, the crux is re-engineering, not coding per se.

Think everyone will be porting their Cobol code to Java? Not likely. Here is a case in point – healthcare insurer BlueCross BlueShield (check out the Computerworld article written by Robert L. Mitchell). They process nearly 10% of all healthcare claims in the U.S., and run millions of lines of optimized Cobol to process 19.4 billion online healthcare transactions annually. Cobol handles transactional workloads on mainframes extremely well, and is hard to replace in this role. Contrary to popular belief, the mainframe is not dead – Unisys, Groupe Bull, Fujitsu, and IBM all still make mainframes.

Food for thought.

The reality is that in 10 years' time, iOS will have been replaced by something even newer, just as application development on iOS devices morphs from Objective-C to Swift. Such platforms and applications are transitional, i.e. they don't have a long lifespan. Yet in 20 years, Cobol will still be with us – in some form or other.

Learn a legacy language if you want to differentiate yourself from everyone else. Or better still – learn a legacy language, AND how to re-engineer it.

Want to read more? – here's a great article by Micro Focus.


FORTRAN is dead! Long live Fortran.

FORTRAN is considered by many to be a dead language. Who even codes in *that* language anymore?

Surprisingly, many scientists do – for numerically intensive computing, including weather prediction models, climate models, and other fluid dynamics simulations. But why, when we have C and C++? The reality is that there is nothing inherently wrong with writing programs in Fortran. It's a language, and it's somewhat "older" than other languages, but are we going to be ageist about it? In fact, in its 60-odd years Fortran has actually evolved – from an unstructured spaghetti-mess to a structured, well-designed language. You see, the keyword is evolved – although C has had minor tweaks, there has been no attempt made to evolve out its idiosyncrasies (dangling-else anyone?).
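
As an aside, for anyone who hasn't met it, here is a small made-up example of C's dangling-else: the else binds to the nearest unmatched if, regardless of what the indentation suggests.

#include <stdio.h>

int main(void)
{
    int x = 5, y = 0;

    if (x > 0)
        if (y > 0)
            printf("x and y are positive\n");
    else                                /* actually pairs with if (y > 0) */
        printf("x is not positive\n");  /* printed, even though x > 0 */

    return 0;
}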

Amongst computer scientists Fortran has a reputation for being archaic. Few would code in it anymore, even though, as far as languages go, Fortran is a versatile, efficient, modern programming language. Sure, Fortran 77 was somewhat tawdry, and the unstructured nature of the Fortrans before F77 was downright nauseating. Some of these features hurt the language by impacting its readability, for instance: common blocks; implicit variables; DO loops with shared CONTINUE statements; GOTOs in place of DO loops; arithmetic IF statements (effectively a goto); and computed GOTOs.

So what makes Fortran special? Largely, its ability to deal with mathematical issues, and its propensity to produce efficient code. Languages such as C require only a basic double-precision math library, whereas Fortran requires single AND double-precision math, with quad precision offered by many vendors. The latter is sometimes important to minimize roundoff errors. Fortran deals with arrays effectively, allowing arbitrary indexing, slicing, and whole-array operations. Fortran also has better I/O routines than languages such as C – the use of read, write, and format (for formatted I/O) is simpler and yet more powerful than the equivalent scanf/printf in C. For example, C requires explicit notification of the types of data to be input; Fortran does not require it (but you can provide it). Fortran doesn't rely on the use of pointers as much as the C family of languages does. Although explicit pointers are lacking in earlier versions of Fortran, Fortran 90 does allow them; however, they are restricted to pointing only to variables declared with the "target" attribute (more in an upcoming post).

Fortran is also quite easy to learn. Fortran 77 was easier for non-experts to learn than C, even though it still forced the user to write column-dependent code. This is largely because the novice programmer did not have to concern themselves with pointers and memory addresses, something that is broached in C with the first scanf used. Not that it isn't important to understand the use of memory, but not everyone is concerned with this. C was designed as a systems language; Fortran was not – its main goal was number crunching, and still is.


The stack versus the heap

What does a program look like that just uses the heap to store its variables? Well, in truth, the stack is still used – you almost can’t avoid using it when designing a program. Here is a piece of C code that is implemented just using stack memory.

#include <stdio.h>

double multiplyIt(double aNum, double multP)
{
    double result;
    result = aNum * multP;
    return result;
}

int main (void)
{
    int i, what=24;
    double isIt[4] = {1.6, 3.2, 6.4, 12.8};

    for (i = 0; i < 4; i = i + 1)
        printf("%.1lf\n", multiplyIt(what, isIt[i]));

    return 0;
}

In the first two lines of the main() function, three variables are created: two ints, and a double array with 4 elements. These are pushed onto the stack as soon as main() is activated. When the function main exits, they are all popped off the stack and destroyed. Similarly, the double variable result in the function multiplyIt() is pushed onto the stack when multiplyIt() is activated (and popped from the stack when multiplyIt() exits).

In the next example, the same code has been written with the heap in mind.

#include <stdio.h>
#include <stdlib.h>

double *multiplyIt(double *aNum, double *multP)
{
    double *result = malloc(sizeof(double));   /* the caller is responsible for freeing this */
    *result = *aNum * *multP;
    return result;
}

int main (void)
{
    int *i = malloc(sizeof(int));              /* heap-allocated loop counter */
    double *what = malloc(sizeof(double));     /* heap-allocated multiplicand */
    double *result;                            /* will point to memory allocated by multiplyIt() */
    double *isIt = malloc(4*sizeof(double));   /* heap-allocated 4-element array */

    *isIt = 1.6;
    *(isIt+1) = 3.2;
    *(isIt+2) = 6.4;
    *(isIt+3) = 12.8;

    *what = 24;

    for (*i=0; *i<4; *i=*i+1)
    {
        result = multiplyIt(what,&isIt[*i]);
        printf("%.1lf\n", *result);
        free(result);   /* each call to multiplyIt() allocates a new result */
    }

    free(i);
    free(what);
    free(isIt);
    return 0;
}

This program uses malloc() to allocate memory on the heap and then uses free() to deallocate it, which is not such a big deal – but it does take extra effort (note that each result returned by multiplyIt() must also be freed). All of the data lives on the heap: an integer, a double, a 4-element array of doubles, and the result allocated inside multiplyIt(). This is not really a more efficient way of coding this program – heap allocation carries its own overhead – and it is much more complex, with *'s and &'s all over the place.

One language to rule them all.

There is always one programmer who believes that the language they use is the best – that there is no other.

Define best. Does the compiler write the code for you? Debug itself? Run copious tests on the programs it creates to make sure they are accurate?

Unlikely.

There are a myriad of languages, but the thing with programming is that it's not a "one language to rule them all" kind of thing. Cobol is good at processing financial data – sure, it traditionally lacked recursion, but nor did it need it. C is great for system-level programming. Matlab deals with matrices *really* well. Fortran is good at crunching numbers. Ada is good in real-time situations. Python is good at gluing things together. There is no necessity to port everything to Java.

Every programming language has things that it excels at, and things it doesn’t.

P.S. If there were one language to rule them all it would of course be Cobol. 70% of the world's business infrastructure runs on Cobol. From using a debit card, to making an airplane reservation, to making a call on a mobile – all of those depend significantly on Cobol… which means it really does rule… our lives!


C obscenities – weird array notation

There are things one discovers about C that are just plainly stupid. Array notation falls into that category. What seems simple in some languages is made challenging in C due to the multiple ways an array element can be accessed. Part of this is attributable to the relationship between arrays and pointers, which is kind of murky to say the least. In the most basic sense, a C array declaration of the form:

int array[20];

will create an integer array containing 20 elements. The ith array element is then accessed in the following manner: array[i]. What this is really doing is accessing the memory at that particular location, which looks something like this (in pseudocode):

address(array[i]) = address(array[0]) + i * size(int)
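
A quick way to see this in a running program (the array name and index here are just for illustration):

#include <stdio.h>

int main(void)
{
    int array[20];
    int i = 5;

    /* &array[i] and (array + i) are the same address */
    printf("%p\n", (void *)&array[i]);
    printf("%p\n", (void *)(array + i));

    /* the offset from element 0 is i * sizeof(int) bytes */
    printf("%td\n", (char *)&array[i] - (char *)&array[0]);

    return 0;
}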

For any array in C, that first element, array[0] is the linchpin, allowing access to the rest of the array. So setting the ith element of an array to the value 12 is normally achieved in the following manner:

array[i] = 12;

However when Kernighan and Ritchie designed C, they tried to create a unified treatment of arrays and pointers, one that would expose the “pointer-like” qualities of arrays using the “decay convention” – an array is treated as a pointer that points to the first element of the array. Therefore taking a subscript with value i is equivalent to the operation: “pointer-add i and then type-dereference the sum”. For example:

array[i] = *(array + i)

Similarly, for a 2D array :

array[i][j] = *(*(array + i) + j)

In many languages, this type of notation is "hidden" from the programmer. As array dimensions increase, so too does the complexity of the expression. So here are the myriad ways that the value 12 can be assigned to the 3rd element of an array a.

a[2] = 12;
*(a+2) = 12;
2[a] = 12;

Whooooooa… what is that last piece of notation – THAT can’t be legal? Oh but it is. In other languages the use of such an obscenity would result in a syntax error of some sort. Not so in C. Due to the [] operator being defined as above, the following are equivalent:

array[i] = *(array + i)
array[i] = *(i + array)

which implies

*(i + array) = i[array]
array[i] = i[array]

This is a direct artifact of arrays behaving like pointers. Nasty… and confusing.
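
For the doubters, here is a short, self-contained program (my own illustration, not from K&R) that assigns to and reads the same element using all three notations:

#include <stdio.h>

int main(void)
{
    int a[5] = {0};

    a[2] = 12;                  /* conventional subscripting */
    printf("%d\n", *(a + 2));   /* prints 12 */

    2[a] = 12;                  /* legal, since 2[a] is *(2 + a) */
    printf("%d\n", a[2]);       /* prints 12 */

    return 0;
}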

Why global variables are considered harmful.

Like the infamous goto statement, the use of global variables in programs has been hotly debated over the past decades. Why? – because if used incorrectly they (like goto) can lead to bad programming practices. So what is a global variable? Well, a global variable is one whose scope is effectively the whole program. For example:

#include <stdio.h>

int aGlobal;

void aFunc(void)
{
    printf("%d\n", aGlobal);   /* aGlobal is visible here */
}

int main(void)
{
    printf("%d\n", aGlobal);   /* ... and here as well */

    return 0;
}

Here the variable aGlobal is a global variable – it can be accessed both in the function aFunc and in main. Normally, variables stored on the stack are temporary in nature, i.e. when the function aFunc ends, all its variables are destroyed, and the memory associated with them on the stack is reclaimed. Global variables aren't stored on the stack, but rather in a separate piece of memory sometimes referred to as static memory (also where static variables are stored, since they must retain their value between invocations of a function).
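
A quick illustration of that last point (a made-up example, not part of the original post): a static local variable also lives in static memory, so it retains its value between calls.

#include <stdio.h>

void counter(void)
{
    static int calls = 0;   /* stored in static memory, initialized once */
    calls = calls + 1;
    printf("called %d times\n", calls);
}

int main(void)
{
    counter();   /* called 1 times */
    counter();   /* called 2 times */
    counter();   /* called 3 times */
    return 0;
}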

So why are global variables bad? Firstly, they are non-local, which means they are often harder to keep track of, as they can be modified anywhere in the program – anywhere. As the number of parts of the program that use global variables increases, so too does the complexity of the interactions between program components (i.e. increased coupling). One of the purposes of modularity is to decrease coupling in order to improve program maintainability. Another problem is namespace pollution – a global may be unknowingly used, or silently shadowed, because you think you are using a local variable with the same name. Take for example the following piece of code:

#include <stdio.h>

int a = 6; //global

int main (void)
{
    int a = 12; //local
    printf("%d\n", a); 

    return 0;
}

What happens here? The value 12 is output, with the local variable a declared in main shadowing (effectively overriding) the global variable a.

So where can they be used?

There are situations where the use of a global declaration becomes more meaningful. A good example is the creation of a user-defined structure (a type) which must be accessed throughout the program. Consider the example below, of a struct created to store a linked list node. The structure can be accessed as a parameter to a function, as a variable within a function, or as a variable within main. It would be impossible to declare such a struct in main and have its core data structure available to functions elsewhere. For example:

#include <stdio.h>

struct node {
    int item;
    struct node *next;
};

int makeList(struct node *ptr)
{
    struct node *temp;
    // ...
}

int main(void)
{
    struct node *list;
    // ...
    return 0;
}

However, in reality, such structure declarations (as well as function prototypes) may be better situated in a header file.

[1] Wulf, W.A., Shaw, M., “Global Variables Considered Harmful”, ACM SIGPLAN Notices, 8(2), pp.80-86 (1973).


Code Tricks: Swapping values without a third variable

The classic way to swap two values in most languages (C code shown) is by using a temporary third variable. For example:

void swapBYtemp(int *a, int *b)
{
    int temp;
    temp = *a;
    *a = *b;
    *b = temp;
}

Python supports parallel assignment of the form:

a, b = b, a

Swapping can also be achieved without the use of a temporary variable in at least two other ways: (i) XOR swap, and (ii) arithmetic swap.

Bitwise XOR

In the first case, swapping using bitwise XOR – for example, if a = 2 (010) and b = 4 (100):

a = a XOR b = 010 XOR 100 = 110
b = b XOR a = 100 XOR 110 = 010
a = a XOR b = 110 XOR 010 = 100

The associated C code would look like:

void swapBYbit(int *a, int *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

The use of this technique for swapping values is limited to [embedded] situations where memory may be extremely constrained; in most cases the bitwise XOR swap may be more expensive than the traditional swap.

Arithmetic swap

The second case involves swapping by using arithmetic, of which there are two forms. The first form involves the use of addition and subtraction:

void swapBYmath1(int *a, int *b)
{
    *a = *a + *b;
    *b = *a - *b;
    *a = *a - *b;
}

The second form involves multiplication and division:

void swapBYmath2(int *a, int *b)
{
    *a = *a * *b;
    *b = *a / *b;
    *a = *a / *b;
}

Note that swapBYmath2 does not work if one of the numbers to be swapped has the value 0, as the first multiplication produces a zero and the subsequent division by zero is undefined. Both of these methods may also result in some form of arithmetic overflow if large enough numbers are being swapped.

Caveats

If an attempt is made somewhere in an algorithm to swap two variables that point to the same memory location, the methods that don't use a third variable will fail. For example, with the bitwise XOR swap, if the addresses are equal the algorithm folds into a triple *x ^= *x, resulting in zero. So if you are going to incorporate these algorithms in your code, make sure to add some defensive programming.

void swapBYmath1(int *a, int *b)
{
    if (a == b)     /* same memory location - nothing to swap */
        return;
    *a = *a + *b;
    *b = *a - *b;
    *a = *a - *b;
}
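
The same guard applies to the XOR version – a minimal sketch:

void swapBYbit(int *a, int *b)
{
    if (a == b)     /* same memory location - nothing to swap */
        return;
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}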