C and core dumps.

On the U.S.S. Enterprise, warp cores tend to “dump” a lot (alright… they get ejected – but dump, eject, jettison, all mean roughly the same thing!). But those are warp cores, and vastly different from the cores you find on Unix systems. When I first learned to program C on Unix, core dumps happened quite often when there was some sort of program crash. The term “core” is attributed to the magnetic core memory found in computing systems of the 1950s to 1970s. Core dumps are triggered by a fatal error of some sort, and are effectively a dump of the program’s memory “space” into a file. On early systems, we compiled code in /tmp to avoid running out of quota space, because /tmp was *huge*.

Here is a piece of code which creates a core dump:

#include <stdio.h>

int main(void)
{
    char *seg_flt = "winter";
    *seg_flt = 'X';

    return 0;
}

If your system does not create a core dump (and by default most don’t these days), you’ll have to activate it. In the Bash shell on a Raspberry Pi, this involves running “ulimit -c unlimited”. If the above code is then compiled and executed, a “Segmentation fault (core dumped)” message is shown to the user. The string literal “winter” is stored in read-only memory, and the variable seg_flt is set to point to it. When an attempt is made to write the character X through the pointer, into that read-only memory, a segmentation fault occurs. Looking at the file core:

-rw------- 1 pi pi 217088 Nov 25 00:01 core

What can be done with core dumps? They can be analyzed using gdb, but it is not for the faint-hearted.

 

Infinite recursion and the stack

One of the issues with recursion is the possibility that it will go postal, which is probably why code containing recursive structures is often banned from things like aerospace applications. Infinite recursion is probably the worst-case-scenario – supposedly recursion that is limitless. However the opposite is true – recursion relies on the use of the stack, and stacks are not infinite. Every time a recursive function call occurs, a stack frame is created for that function. As this continues, the stack begins to fill up, eventually leading to a stack overflow. Here is a super simple piece of code that will do just that:

#include <stdio.h>

int main(void)
{
    main();
    return 0;
}

Nothing to it really, except that compiling and running this code will simply produce a “Segmentation fault” error, quite quickly in fact. Can we dig a little deeper? Sure, and it’s as simple as adding a variable, say an integer to the mix, and printing out its address. The code from above can be modified in the following manner:

#include <stdio.h>

int main(void)
{
    int x;
    printf("x lives at %p.\n", (void*)&x);
    main();
    return 0;
}

When compiled on a Raspberry Pi running Linux, this program prints out the memory address of the variable x every time a recursive call to the function main is made. Don’t watch the screen, it will take a while to process – instead redirect the output to a file. Looking at the output file, the first x is stored at location 0xbe8266ec, giving a rough indication of where the stack starts in memory. Counting the lines in the file (with wc, for example) returns 523,931, which means that the function main was called 523,931 times before the stack overflowed, and the program “seg-faulted”.

 

History of Computing 101: Preparing a program… 60 years ago

In 1954, MIT held a “Summer Session” on “Digital Computers – Advanced Coding Techniques“, a likely prelude to the conferences that would follow. This was the era which spawned the first true “programming” languages, so it is an interesting read. In the introduction the writer, one C.W. Adams, describes the process which is required in preparing a program for a computer:

  1. analyzing problem
  2. planning
  3. coding
  4. typing (or keypunching)
  5. trying
  6. debugging
  7. running
  8. analyzing results

Ironically, sixty years on, the process of writing a program is not vastly different. He goes on to say “the running may be made more efficient by careful coding”, thereby reducing the computer time. Programming languages were ostensibly developed in part “due to the need for simplification of coding to accommodate the new and the non-professional programmers – the amateurs who regard programming merely as a necessary evil”.

Why C is NOT a memory safe language.

When you become more experienced in C programming you eventually hit pointers, and the memory problems associated with them. Well, let’s face it, the minute you start dealing with arrays and strings in C there is a chance of something going awry. Consider the following piece of code:

#include <stdio.h>
#include <string.h>

void buffer_overflow(char *str)
{
    char buffer[10];
    strcpy(buffer, str);
}

int main (void)
{
    char *str = "Do or do not, there is no try";
    // length of str = 30 bytes
    buffer_overflow(str);
    return 0;
}

When this code is compiled (with gcc) and executed on OSX, it fails, returning “Abort trap: 6”. This implies that the program failed due to a SIGABRT signal, in this case signal number 6. Why did it fail? Most likely because of the buffer overflow that occurred when an attempt was made to copy the contents of the string str into the array buffer: the latter only has room for 9 characters plus the terminating '\0', while str needs 30 bytes. The same program compiled in a similar manner on a Raspberry Pi running Linux (gcc V.4.5.1) will produce the ubiquitous “segmentation fault”.

Another problem is the dangling pointer, which points to memory that has already been released, i.e. the storage is no longer allocated. Consider the following piece of code:

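#include <stdio.h>
#include <string.h>

char* faulty_string(void)
{
    char str[20];                 // local array, lives on the stack
    strcpy(str,"Fawlty Towers");
    return(str);                  // BUG: str no longer exists after return
}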

int main (void)
{
    char *str;
    str = faulty_string();
    printf("%s", str);
    return 0;
}

The function faulty_string does not work as expected because the string being returned is a local variable, whose storage disappears when the function returns. The program’s behaviour at this point is undefined. In this case it printed garbage (“�”), but it could just as easily have crashed. The code would have worked had str been dynamically allocated (with malloc, for instance). Thankfully, most compilers provide some form of warning that this is happening, usually in a message of the form: “warning: function returns address of local variable” (even without the use of -Wall in gcc). Another form of dangling pointer is when a memory location is accessed after it has been freed. A good example of this is:

int *aNumber = malloc(sizeof(int));
free(aNumber);
*aNumber = 12;
printf("%d", *aNumber);

And when compiled, this code *actually* runs with no issues, printing the value 12. That is luck rather than correctness: using freed memory is undefined behaviour, and the same code could just as well crash or corrupt other data. Alternatively one could also experience a memory leak: heap memory which has not been freed, but whose pointer is now out of scope, so there is no way to access the memory – basically orphaned memory. A good example is shown in the code below:

void memory_leak()
{
    char *str;
    str = (char*) malloc(100);
    // ...
}

As the memory pointed to by str cannot be accessed outside the function memory_leak, the 100 bytes are essentially lost. Too many memory leaks, and the system itself will start to show problems.

 

It’s a trap! – Inputting sentences in C programs

If you are inputting a sentence into a C program, then there are a number of choices: scanf, gets, and fgets. Let’s consider that the string being input is “winter is coming”. The lamest choice is probably scanf with %s, which will only read in the first word of the sentence.

char str[100];
scanf("%s", str);

The remainder of the string will remain in the buffer (unless you use the code trick published previously). The most obvious choice is gets(), but it has a real shady background. When compiling a program using gets(), no warnings will show up (at least not with gcc)… however when the program is run, the following warning appears “warning: this program uses gets(), which is unsafe.” Long story short, the gets() function does not perform bounds checking, therefore it’s extremely vulnerable to buffer-overflow attacks. For example, consider the code below:

char str[100];
gets(str);

If fewer than 100 characters are entered as input, there is no problem whatsoever. However, if more than 99 characters are entered, gets() will not stop writing at the end of the array. Instead, it continues writing past the end and into memory it doesn’t own (scanf with a bare %s is known to do similarly stupid things). The problem will manifest itself in a number of ways: immediately crashing, incorrect program behavior, or possibly no visible effect – it really depends on the amount of extra text entered. Here’s a piece of code to illustrate this phenomenon:


#include <stdio.h>
#include <math.h>
#include <string.h>

typedef struct person
{
    char name[5];
    int dob;
} person_t;

int main(void)
{
    person_t me;

    me.dob = 1970;
    printf ("me.dob is %d\n", me.dob);
    printf ("Enter the persons name: ");
    gets(me.name);
    printf ("me.name is %s\n", me.name);
    printf ("me.dob is %d\n", me.dob);

    return 0;
}

Now here’s the program running, with the input for the string name being “Skywalker”, clearly larger than the 4 characters (plus terminator) available.

me.dob is 1970
Enter the persons name: Skywalker
me.name is Skywalker
me.dob is 114

Notice how the value stored in dob has changed? Not the best scenario in the world. This is worse here because of the use of a struct, whereby the memory for the fields name and dob is closely coupled together. So the way to fix this problem is “you don’t need to use gets(), try fgets(), now move along”. fgets() is just the file version of gets() – however it allows control over how many characters are read into the string.

char str[100];
fgets(str,sizeof(str),stdin);

In this case fgets() reads in at most one less than sizeof(str) (or 99) characters from the input stream stdin (standard input) and stores them into the string pointed to by str. Reading stops after an EOF or a newline (Enter). If a newline is read, it is stored into the buffer. A '\0' is stored after the last character in the buffer. The one caveat with this is that fgets() *may* place the newline character into the string if there is enough room to store it. This can cause problems further down the road when processing the string. For example, if we were to enter “winter is coming”, this is what would be stored:

winter is coming\n\0

The length of this string is actually 17, because of the '\n' stored at the end. If str had been declared with a length of 17, it would not be a problem, as there would be no room left to store the newline. So to avoid this, the '\0' should be moved up one element in the string, overwriting the newline, using some code of the form:

str[strlen(str)-1] = '\0';

But only if the newline was actually stored, i.e. the length of the string input is less than the declared string size (minus 1).

 

Code Tricks: Strings of words with scanf

One of the caveats of scanf and strings is that %s only reads characters up until the next whitespace encountered. To read a full sentence of characters, you often have to use fgets. However, there is another way of reading strings with blank characters: the %[..] specifier, which reads a string of words. The use of %[c] means that only the characters specified within the brackets (c) are permissible in the input string. If the input contains any other character, reading stops at the first instance of such a character. The specifier %[^c] does the reverse, reading until one of the listed characters is found. For example:

char sentence[80];
scanf("%79[^\n]", sentence);   // width 79 leaves room for the '\0'

This will read all the characters in a sentence until a newline is encountered, effectively storing a whole sentence, spaces included, in the string sentence.

Code Tricks: The * I/O modifier

The * character can be used within the format strings of both printf and scanf. Instead of using a number for the field width in a specifier in a printf statement, you can use *. The value still has to be supplied, but this way it is supplied with a value passed to printf. For example:

int num, width;
printf("%*d", width, num);

This uses the value supplied by the variable width to specify the field width. The * in scanf serves another purpose. When placed between the % and the type specifier, it causes scanf to skip over the corresponding input. For example:

int num;
char ch;
scanf("%d%*c%c", &num, &ch);

If the user enters “12 b”, this causes the integer 12 to be stored in the variable num, the next character (a space) to be discarded, and the character “b” to be stored in the variable ch. This deals nicely with the problem of %c reading the next character in the buffer, which in this case is a space.

Code Tricks: Printing fancy characters in C

Sometimes in output we might want to use something like the Greek character π. Printing the extended characters can be quite challenging in C. Although many systems don’t support the extended character set, it can be achieved using UTF-8 character encoding. It is able to represent any character in the Unicode standard, yet it is backwards compatible with ASCII. UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters.

For instance in Xcode, we can print characters to stdout using:

printf("%c%c\n", 0xcf, 0x80);

This prints the character equivalent to the UTF-8 code cf80, which is π. The next example prints the phrase πr²:

printf("%c%cr%c%c\n", 0xcf, 0x80, 0xc2, 0xb2);

Here are some of the more commonly used codes:

0xcf, 0x80       π
0xc2, 0xb0       °
0xc2, 0xb2       ²
0xc2, 0xb3       ³
0xc2, 0xbd       ½

The best language for a novice to learn programming?

Is there such a thing as an ideal language with which to learn programming? Probably not.

Learning to program is a process whereby one learns not only the syntax of a particular language, but also the constructs which form the basis of writing programs. In many respects, the language then helps to provide a conceptual foundation in programming. For example, a decision construct in the form of the if statement exists in nearly every programming language, and can be used to formulate the idea of decision making in an algorithm. Is there one particular language where the if stands out as being above all other if‘s? No, not really. There are some languages which have more side-effects, or more idiosyncrasies; that much is certainly true. For example, C has issues with the “dangling-else” in code of the form:

if (n > 0)
    if (a > b)
        c = a;
else
    c = b;

The else statement will be associated with the inner if statement, rather than the outer one, due to C’s association rules. To fix it would require embedding the inner if statement inside a block. Fortran, on the other hand, uses delimiters for its control structures. So the above code would be written as:

if (n > 0) then
    if (a > b) then
        c = a
    endif
else
    c = b
endif

Even easier is Python, which uses indenting to specify association. Python also has the benefit of forcing the user to indent properly, a behaviour which can be easily transferred to other languages.

if n > 0:
    if a > b:
        c = a
else:
    c = b

When learning to program, the emphasis should be placed on the conceptual ideas: decision and repetitive structures, how to represent data, modularity etc. Too much emphasis on a language’s specific syntax, due to language oddities, can detract from learning these concepts properly.

There is also the notion of transitioning from one language to another. Yes, learning C will help you transition to C++, or Java, but once you understand the basic concepts of programming, learning another language shouldn’t be that difficult. Lastly there is marketability – “knowing this language will help me get a job”. Actually, knowing an assortment of languages will better help you get a job.

So which language is the best for learning how to program? For the novice who is interested in learning the core programming concepts, the most learnable language is likely Python. Python allows for fast, efficient programming, and produces code which is eminently readable. It also has some features which make constructs easier to learn, for example passing information to and from functions (without C’s taxing use of pointers). It also allows for easy programming on devices such as the Raspberry Pi.

For the novice who is more scientifically inclined, the language of choice might be something new like Julia. Julia allows for the simplicity of Python, without the slowness, and incorporates many constructs which make programming easier to learn, such as control structures terminated with the keyword end. Unlike C, Julia only offers two forms of loop, the for and while, and an if-elseif-else construct, as well as exception handling. Julia also has arrays whose indexing starts at 1 – YEAH, that’s what I’m talking about!

For the novice who is historically inclined, Pascal is a good choice. Pascal does some things much more nicely than many contemporary languages. One of those things is the difference between assignment and equality, which in Pascal are := and = respectively. Pascal was also designed as a teaching language, and has a somewhat English-like syntax.

Software complexity and bugs in the code.

Consider this: Holzmann[1] estimates 50 software errors remain in every 1,000 lines of recently written code, and code that has been thoroughly tested still contains 10. McConnell[2] estimates the industry average at 10-50 per 1,000 lines of delivered code. Should we be frightened? Sure, mobile devices and desktop machines aren’t likely to cause fatal errors that lead to physical devices going haywire, but software controlling objects that move just might.

Android OS has about 10 million lines of code. Now consider the software that runs Paris Metro Line 14: 87,000 lines of Ada. It controls the line’s train traffic, regulates the train speed, manages several alarm devices and allows for traffic of both automatic and non-automatic trains on the same line. One piece of software controls moving trains, and the other controls a mobile device.

The F-22 Raptor consists of about 1.7 million LOC, the F-35 Joint Strike Fighter, 5.7 million LOC. Boeing’s 787 Dreamliner, requires about 6.5 million LOC to operate its avionics and onboard support systems.

Food for thought.

[1] Holzmann, G.J., “The logic of bugs”, Foundations of Software Engineering, pp.81–87 (2002).
[2] McConnell, S., Code Complete (2nd ed.), Redmond, WA: Microsoft Press (2004).