How efficient is your code?

How often do you think about how efficient your code is? Coding on an average computer is easy because of all the resources at hand: lots of disk space, lots of memory. The average programmer probably doesn’t think much about efficiency until they have to work on software for mobile devices and the like. Sure, solid-state memory is quite good on these devices, but let’s face it, battery life isn’t, and that’s the crux of many apps – too complicated and they will churn over the processor, which in turn will eat up the battery, more so if the app uses the camera a lot. Imagine stepping back to the early 1970s and having to develop code for the Voyager deep space probes.

The hardware on both Voyagers had 69.63 kilobytes of memory. The processors in the Flight Data System could manage 81,000 operations a second. To put that into context, the A13 Bionic chip inside many of Apple’s newest phones can manage 5 trillion operations per second. One device is over 21 billion kilometres from Earth, the other doesn’t move anywhere (or really do much of anything).

Principles of program efficiency

Most people treat electricity as a limitless resource. We use it to power up our mobile devices with very little thought to whether it is green or not. It tends to be cheap, so we don’t think twice about it. Programmers tend to treat CPU cycles and memory in a similar way. It wasn’t always this way.

First, an observation by Gordon Bell:

The cheapest, fastest, and most reliable components of a computer system are those that aren’t there.

Which, in an ideal world, would make programs omnipotent. But computers exist, and so do programs. Yet for the most part, programmers are warned off playing around with a program’s efficiency. Consider Donald Knuth’s 1974 paper “Structured Programming with go to Statements” [1], in which he wrote:

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. (…)

Yet have programmers forgotten about efficiency altogether? Many people tend to forget that even in a techno-obsessed world where memory in one form or another is cheap, not all software is efficient. Ever downloaded a webpage or app that just consumes copious amounts of resources? On mobile devices this can be devastating, draining the battery. Even in digital cameras, some functions are battery bleeders, best used sparingly. Some algorithms are just impossible to make more efficient; some programmers write horrible code. The basics of program efficiency rely on understanding the algorithm and the environment in which it will be used.

[1] Donald Knuth, Structured Programming with go to Statements, ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974, p. 268.

 

Timing and efficiency in C programs (ii)

TIMING WITH NANOSECONDS

Now a nanosecond is one-billionth of a second. That’s pretty small, and essentially regardless of the system, a version of clock_gettime() can be cobbled together. The header time.h contains a structure timespec, which has the following members:

time_t tv_sec;    /* seconds */
long   tv_nsec;   /* nanoseconds */

If the system is OS X, then the following code will include the appropriate headers (in addition to stdio.h, math.h, time.h, and sys/time.h):

#ifdef __APPLE__
#include <mach/clock.h>
#include <mach/mach.h>
#endif

Then a function can be built to return the time, stored in a timespec structure, using the Mach clock.

void get_Mach_time(struct timespec *ts) {

#ifdef __APPLE__ // OS X does not have clock_gettime, use clock_get_time
   clock_serv_t cclock;
   mach_timespec_t mts;
   host_get_clock_service(mach_host_self(), CALENDAR_CLOCK, &cclock);
   clock_get_time(cclock, &mts);
   mach_port_deallocate(mach_task_self(), cclock);
   ts->tv_sec = mts.tv_sec;
   ts->tv_nsec = mts.tv_nsec;
#else
   clock_gettime(CLOCK_REALTIME, ts);
#endif

}

The call to host_get_clock_service() obtains a Mach clock port, returned in cclock; the parameter CALENDAR_CLOCK requests time since the epoch (1970-01-01). The function clock_get_time() is then used to get the time, returned in the variable mts, a mach_timespec_t structure with the same layout as timespec from time.h. The port is deallocated with mach_port_deallocate(), and finally the seconds and nanoseconds are copied into the timespec struct ts, which is returned through the function’s pointer argument. This is used in the following manner:

int main(void)
{
   struct timespec ts1, ts2;
   unsigned long ts1nano, ts2nano, diffN;

   get_Mach_time(&ts1);

   // code to do something

   get_Mach_time(&ts2);

   ts1nano = ts1.tv_sec * 1000000000UL + ts1.tv_nsec;
   ts2nano = ts2.tv_sec * 1000000000UL + ts2.tv_nsec;
   diffN = ts2nano - ts1nano;

   printf("%lu nanoseconds\n", diffN);
   printf("%.6lf milliseconds\n", diffN / 1000000.0);

   return 0;
}

The calculations are performed in nanoseconds, and the result is output in both nanoseconds and milliseconds.

EXPERIMENTS

Consider some experiments using Ackermann’s function. Calculating ackermann(4,1) using all three timing methods (clock(), gettimeofday(), and get_Mach_time()) results in the following:

clock           8565 milliseconds

gettimeofday    8700 milliseconds

get_Mach_time   8593.151 milliseconds

The timings differ slightly because the algorithm had to be run three separate times, once for each timing method.

Timing and efficiency in C programs (i)

What would algorithms be without time? Efficiency. Sure, there will always be faster machines to run algorithms on, so efficiency may not be that big a deal. Or is it? Certainly resources on a mobile device, or a probe going into space, are limited. More resources = more energy requirements = bigger batteries = less room for increased functionality (e.g. higher-resolution cameras).

So how do we calculate how fast an algorithm runs? Firstly, what are the properties of the speed calculation – real/user/system time? seconds/milliseconds/microseconds?

THE MOST BASIC TIMING REGIME

The most basic timing uses the time.h header and the clock_t type. It allows timing in seconds using the function clock(), which provides a minimal amount of feedback.

The function clock() returns the sum of user and system time, represented as CPU time in clock ticks; POSIX requires CLOCKS_PER_SEC to be 1000000, giving a maximum possible precision of 1 µs. Here is a sample piece of code:

clock_t start, end;
double timeshift;

start = clock();

// Perform some operation

end = clock();

timeshift = (double)(end - start) / CLOCKS_PER_SEC;

The variable timeshift will contain the timing for the algorithm in seconds. Note that the subtraction should be cast to double, since clock_t is an integer type and plain integer division truncates. This works okay, but if the time resolution between algorithms is finer, then chances are this will produce the same time for both. To calculate the value in milliseconds:

timeshift = ((double)(end - start) * 1000.0) / CLOCKS_PER_SEC;

TIMING WITH FINER RESOLUTION

A better measure of timing can be had using the sys/time.h header and the timeval structure. The structure looks like this:

struct timeval {
    time_t      tv_sec;    /* seconds */
    suseconds_t tv_usec;   /* microseconds */
};

The function gettimeofday() returns the current wall clock time.

struct timeval t1, t2;
long sec, usec, tmilli;

gettimeofday(&t1,NULL);

// Perform some operation

gettimeofday(&t2,NULL);

sec = (t2.tv_sec - t1.tv_sec);
usec = (t2.tv_usec - t1.tv_usec);

tmilli = (1000 * sec) + (usec * 0.001);

printf("%ld milliseconds\n", tmilli);

This timer has 1 microsecond resolution, where 1,000,000 microseconds = 1 second.

Are there problems with this approach? Yes, some to do with NTP (network time protocol) and the fact that processes on the system can potentially change the timer, making it jump backward and forward in time.

Sometimes it returns a negative value. That’s right, your program was *so* efficient it executed before you even ran it. Note too that gettimeofday() can’t reliably distinguish a 2 microsecond difference from a 3 microsecond one. The deeper problem lies in how the difference is derived. Derived as above, the code only works as long as t2.tv_usec >= t1.tv_usec. The relationship t2.tv_sec >= t1.tv_sec always holds, but the same is not true for the microseconds field, which wraps back to zero every second. For example, consider the following two timestamps (seconds, then microseconds):

t1 = 1370282687 434738
t2 = 1370282711 289326

Calculating the differences results in 24 for seconds, and -145412 for microseconds. A better way is to convert both elements of the structure to microseconds before calculating the difference.

long long t1_usec, t2_usec, diff;

t1_usec = (1000000LL * t1.tv_sec) + t1.tv_usec;
t2_usec = (1000000LL * t2.tv_sec) + t2.tv_usec;

diff = t2_usec - t1_usec;

(The variables are named _usec here because they hold microseconds; long long avoids overflow on systems where long is 32 bits.)

Or, even better, use timersub(), also found in sys/time.h. It basically works like this:

timersub(&t2, &t1, &tv);
tMilli = (1000000 * tv.tv_sec + tv.tv_usec) / 1000;
printf("%ld milliseconds\n", tMilli);

The difference is calculated and stored in the structure tv as seconds and microseconds. Microseconds may be too fine a resolution, but dividing by 1000 gives a measure in milliseconds.

Note that on some systems (Linux, BSD) gettimeofday() has been deprecated in favour of clock_gettime(), which OS X does not provide.

Free smart thermostats? – no free lunch

As I mentioned in my previous post, Ontario is pushing homeowners to install “smart” thermostats, for controlling their heating and cooling, by making them free. But there is no free lunch. The Government says people will save up to 15% on energy bills. Will these savings be realized? – Likely not. Is this to make us all feel better because of the government’s energy boondoggle? My house is already fairly efficient, I have energy efficient appliances and all my light bulbs are LED. In reality annual demand on Ontario’s power grid has declined over the last 10 years.

But what are the problems with something that is free?

  1. Security – Nest claims that its thermostat “knows when you’re away”. Yeah, so if it knows when you’re away, then chances are someone else could find that out too. Cars can be hacked, why not houses? Some would argue it’s just a matter of time. How good is the security on wireless thermostats? It’s also easy for companies to modify things at will – I found this out the hard way when Nest turned off the “hand-wavy” feature on the Nest smoke detector I bought. Do we need this sort of control?
  2. Privacy – How will the data be used? Data, it’s the new gold. If you think companies (or the government) won’t use it, then think again.
  3. Claims on energy saving – I tried the Nest “Real Savings” calculator – and it told me I could save $46-$127 a year. Not exactly a precise figure. The problem is that it doesn’t take into account the envelope of my house – how well insulated it is, whether I have the right size of furnace, whether I have in-floor heating in some rooms, etc. It has no clue, so I guess it is better to make broad calculations. It also doesn’t know how I manipulate the thermostat on a daily basis – turn it down at night, only turn the air conditioner on when the house becomes humid in summer, keep it off when on vacation, etc. Sure, I live in a small semi-detached house, so it’s convenient… but how well do you think a smart thermostat works in a 4000 ft² house?
  4. Greener living – People think that smart thermostats will promote greener living. Not. It’s not like the thermostat is going to suggest “Maybe you should downsize, because your huge house is not energy efficient”. Greener living means smaller living, buying less crap, not purchasing 24 packs of bottled water.

Let’s try something new – making people smarter.

Efficiency considerations in loops

It is rare these days to think about efficiency considerations in loops; I mean, if an algorithm runs in 0 milliseconds, it couldn’t get better, could it? Likely not… on a fast machine. Consider, however, writing a similar loop structure for a process that is much more computationally intensive. Getting into the habit of ignoring efficiency will have repercussions, for instance when you port your algorithm to a mobile device. In for loops this is especially true of the inner portion of nested loops, so an effort should be made to keep computationally intensive operations out of inner loops. Consider the following piece of code:

sum = 0
for i = 1:n
    for j = 1:i
        for k = j:i
            sum = sum + f(i,j) * g(j,k)
        end
    end
end

The problem with this piece of code is that the function f(i,j) is evaluated every time the innermost loop runs, yet it has no dependency on that loop. If the value of n is 100, the innermost loop will execute 171,700 times. The efficiency of this loop will improve if f(i,j) is moved out of the innermost loop.

sum = 0
for i = 1:n
    for j = 1:i
        fij = f(i,j)
        for k = j:i
            sum = sum + fij * g(j,k)
        end
    end
end

Now for n=100, f(i,j) is only calculated 5050 times. This only works if the computation time of f(i,j) is greater than that of g(j,k). If g(j,k) is the computationally intensive one, then it might be better to rearrange the nested loop hierarchy, as in the following re-configuration:

sum = 0
for j = 1:n
    for k = j:n
        Gjk = g(j,k)
        for i = k:n
            sum = sum + f(i,j) * Gjk
        end
    end
end

Now g(j,k) is only calculated 5050 times. This makes sense when g(j,k) is of a higher computational order than f(i,j). Such small changes may seem trivial on a system with plenty of resources, but on resource-constrained devices such as mobile devices they may make the difference between an app being successful or not.

 

Python is fast enough

I recently read in a post somewhere that one of the benefits of Python was that it is “fast enough”. The justification was kind-of lame though – efficiency bottlenecks could easily be addressed by writing some C code, and gluing it in. Really?

Okay, so maybe for a routine that’s super resource hungry I would implement it in C, and call it from Python. Maybe. But just maybe I wouldn’t even bother using Python for anything except making my data play nicer with the user. User inputs data, Python sends it to C, C processes the data, and Python outputs the result. The fact that Python is written in C makes no difference – who cares? Doesn’t make it any faster. Other things are written in C. Many compilers are.

Column-major vs row-major arrays: Does it matter?

The way that arrays are stored in languages doesn’t get much in the way of discussion.  C is row-major (as is C++, Pascal, and Python), Fortran is column-major (as is Julia, R and Matlab). But what does this really mean? It’s all about how the arrays are stored in memory: row major stores an array row-by-row, and column major stores an array column-by-column. Now in the overall scope of things, it doesn’t really matter which is used. Here’s what it means from a visual perspective.

[Figure: row-major vs column-major storage of a 2D array in memory]

Does it affect processing efficiency? Kind-of, but only in the sense that you should always traverse the data in the order it was laid out. Basically in a 2D column-major array, it is the row indices that change the fastest. This means that if you are using a nested loop, it is more efficient to process the columns in the outer loop and the rows in the inner loop. In a 2D row-major array, the opposite is true. Of course if you mix up the order of the indices, it’s unlikely to make a difference in a small 2D array.

In a large 2D array however, there is a good chance some of it will be stored in the cache. Accessing a column-major array in the column-row order will be efficient because the array is stored sequentially in this manner (and the CPU pre-fetches data required next). In the figure above this means the data would be accessed as 1, 4, 7. Accessing it in row-column format will lack efficiency because the data is not stored sequentially. In the figure above, this means the data would be accessed as 1, 2, 3. This means more cache-work, and a reduction in efficiency.

In Julia this means that for a 2D array, the following code is more efficient:

dx,dy = size(imgY)
for j=1:dy, i=1:dx
    if (imgY[i,j] > imgCb[i,j])
        im_e3[i,j] = 255
    end
end

than:

dx,dy = size(imgY)
for i=1:dx, j=1:dy
    if (imgY[i,j] > imgCb[i,j])
        im_e3[i,j] = 255
    end
end

So know how the data is laid out in the language you are using, and access it accordingly.

Re-engineering loops for efficiency (ii)

There are a couple more ways of improving loop efficiencies:

Loop unrolling

Every iteration of a loop requires modification and testing of the loop variable. This overhead can be reduced by making each iteration do more work, a technique called unrolling the loop. The previous example from loop fusion can be unrolled by a factor of two, halving the overhead:

int i, a[1000], b[1000];
for (i=0; i<1000; i=i+2)
{
    a[i] = 0;
    b[i] = 0;
    a[i+1] = 0;
    b[i+1] = 0; 
}

The most extreme case is to eliminate the loop completely and replace it with straight-line code, which is practical when the number of iterations is small, say less than five. For example:

for (i=1; i<=3; i=i+1) {
    x = x + sqrt(i);
}

would arguably be better expressed as:

x = x + sqrt(1) + sqrt(2) + sqrt(3);

Removing loop-independent expressions

Expressions in loops that perform a calculation independent of the loop should be evaluated outside the loop. This is sometimes known as loop streamlining:

for (i=1; i<=1000; i=i+1)
    sum = sum + pow(x,4.0);

should be replaced by

powX = pow(x,4.0);
for (i=1; i<=1000; i=i+1)
    sum = sum + powX;

We remove the expression pow(x,4.0) because its calculation is independent of the loop and would otherwise be calculated 1000 times. Consider the following nested loop:

for (i=1; i<=100; i=i+1)
    for (j=1; j<=100; j=j+1)
        for (k=1; k<=100; k=k+1)

The nested structure loops 100 × 100 × 100 = 1,000,000 times, meaning that a statement such as pow(x,4.0) would be executed 1 million times. For a nested loop structure, the deeper a loop is nested, the greater the payoff from optimizing it: always optimize inner loops first. The same principle can be applied to nested loops by moving independent expressions from an inner loop outward. For example, some repeated calculations make use of the subscript as part of the calculation:

for (i=1; i<=100; i=i+1)
    for (j=1; j<=100; j=j+1)
        A[i][j] = B[i][j] + c/i;

The calculation c/i is a repeat calculation and should be removed from the inner loop. A better way to code this is

for (i=1; i<=100; i=i+1){
    ci = c / i;
    for (j=1; j<=100; j=j+1)
        A[i][j] = B[i][j] + ci;
}

Now the value c/i is calculated 100 times instead of 10,000 times.

 

Re-engineering loops for efficiency (i)

When it comes to program efficiency, most people tend to ignore it. I mean, computers get faster all the time, so inefficiencies in a program will eventually dissipate, right? One of the more interesting areas for efficiency is loops. Why? Because loops, more often than not, are used to process large pieces of data, like images. Sometimes it is the language being used that is inherently inefficient when it comes to loops – prime example? Python.

Loop Fusion

The best way of cutting down the amount of time spent in loops is to decrease the number of loops, hence decreasing overhead associated with incrementing indexes and testing. This is termed loop fusion, jamming, or merging. For example, consider the following two loops, each of which initializes an array:

int a[1000], b[1000], i, j;
for (i=0; i<1000; i=i+1)    
    a[i] = 0;
for (i=0; i<1000; i=i+1)
    b[i] = 0;

These loops involve 2000 index increases and tests. This can be reduced by half by merging the loops:

int a[1000], b[1000], i;

for (i=0; i<1000; i=i+1){
    a[i] = 0;
    b[i] = 0;
}

LOOP PEELING

Loop peeling (or loop splitting) attempts to simplify a loop or eliminate dependencies by breaking a loop into multiple loops that have the same bodies but iterate over different contiguous portions of the index range. A useful special case is simplifying a loop with a problematic first (or first few) iteration by performing that iteration separately before entering the loop. Here is an example of loop peeling. Suppose the original code looks like this:

int vector[100], r[100], i;
for (i=0; i<100; i=i+1){
    r[i] = vector[i-1] * vector[i];
}

The problem here is that the loop cannot safely handle vector[i-1] when i = 0, for the index becomes -1, which is illegal. A better solution is to peel the first iteration off:

int vector[100], r[100], i;
r[0] = vector[0];
for (i=1; i<100; i=i+1){
    r[i] = vector[i-1] * vector[i];
}