Zero-based indexing considered harmful

Apart from goto statements, there aren’t many things that cultivate such heated discussion as the type of array indexing a language has. This article originally had the title “Array indexing: 0, 1 or arbitrary?”, but I thought I might spice it up by changing it to its current rendition. But the article isn’t about hitting on 0-based indexing. Typically languages take one of three options, allowing for 0-based, 1-based, or arbitrarily based array indices. Languages such as Python, C, C++, and Java have 0-based indexing, whereas Julia and Matlab are in the 1-based index club, and a whole slew of languages like Fortran and Ada prefer to allow the programming to make the decision, in cases even allowing negative indices. Is any one of them better than the other?

Some people can’t perceive of there being anything other than 0-based indices. For C it somewhat makes sense, as 0-based indices were are artifact of the fact the language uses pointers to store things, supposedly. More specifically, the expression array[x] in C refers to a memory location x elements away from the origin element, so the index is used as an offset. The first element is therefore array[0] to denote the fact that it is 0 elements away. But it likely wasn’t C that started the trend. C inherited its array semantics from its predecessor B (which called them vectors), which in turn inherited them from BCPL. Notably, BCPL supported 0-based indexing, and was itself derived from CPL (1963), which seemed to use arbitrary-based indexing [1]. Interestingly, in B [1], a vector v[10] allocated a vector v, of 11 consecutive words, v[0]..v[10], with the extra word being a pointer to the first element of the array.

The beast of two indexes

The truth of the matter is that before people started thinking about pointers and memory management, before the existence of Unix and C, many people were using either one-based or arbitrary-based indexing. That is before someone decided that array indexes should start at zero. It is therefore BCPL, created in 1967 by Martin Richards which seems to be the source of pure 0-based indexing. The BCPL was a typeless language close to the low end of machine does not make this all that surprising. Basically BCPL variables represented pointers to one of more consecutive words of memory [3]. So if b is a pointer, then b+1 points to the word after the one b points to, and hence b+0 points to the same value as b.

This is not surprising because as an array of 100 32-bit integers (4 bytes) elements is stored as 100 words at memory addresses: 1000, 1004, 1008,…, 1396. The element with index b, then can be calculated as 1000+(b×4). So it completely makes sense from the perspective of memory inside the machine. Languages that use 1-based indexing or arbitrary indexing just use some mechanism to offset the index. BCPL was compiled on an IBM 7094, one of the first machines with a time-sharing OS. Offset calculations for arrays were often done by the compiler, and hence 0-based indexing likely evolved to reduce the amount of compile-time required. This article by Mike Hoye outlines things a little more, but basically IBM’s use of computer time at MIT to calculate yacht handicapping meant that compiled programs had to be efficient, possibly contributing to the use of 0-based indexing.

Around the mid-1960s three essential approaches to indexing existed.

  • Zero-based indexing
  • One-based indexing – used some CPU instruction, or extra word to manage the offset.
  • Arbitrary indexing – range from X to Y, where X and Y can be positive or negative.

In reality, no one method of indexing is better than the other. There are situation where each method shines, and fails. In many languages a multidimensional array is stored sequentially. For example a 10×10 array, V[10,10], would be stored sequentially in memory from position v[0] to position v[99]. For languages that are row-major, v[0] to v[9] represent the first row, v[10] to v[19] represent the second row, etc. For column-major languages, v[0] to v[9] represent the first column, etc. So to find element (6,4) in the array using 0-based indexing means looking for element V[5,3]. This can be easily calculated using the equation 10r+c (where r=rows, c=columns), so 10×5+3 = v[53]. 0-based indexing works because if the element required is V[0,8], then memory position is v[8]. The same formula does not work for 1-based indexing, where V[1,9] would result in memory position v[19]. The formula would have to change to 10(r-1)+(c-1).

There is a reason why some languages choose 1-based indexes. From a scientific viewpoint, it may just be more logical. 1-based indexing in languages like Matlab and Julia may be because 1-based indexing is the convention for matrix notation. In languages using arbitrary-based indices, it may be the desire to provide the programmer with the ability to choose. Some problems lend themselves to use arbitrary indices. Take for example the Eight Queens Problem. In Niklaus Wirth’s classic take on the problem [4] he uses four arrays, two of which use arbitrary indices, b[2..16], and c[-7..7]. Why? Because as he explains the choice of indices was based on how the indices of b and c would be calculated. For the ⬋ diagonal of b, all the elements of each diagonal have the same sum of their coordinates 2..16. For the ⬊ diagonal of c, the coordinate differences are constant, -7..7. To have a language, Pascal which allowed this made for a natural implementation of the algorithm.

Consider the following piece of code written in C, which calculates the histogram of an image (which is 8-bit, i.e. 256 values from 0-255):

int img[1000][1000], hist[256];
...
for (int x=0; x<1000; x=x+1)
   for (int y=0; y<1000; y=y+1)
      hist[img[x][y]] = hist[img[x][y]] + 1;

Some argue doing this in a language such as Fortran is harder, but it isn’t really. Here’s the Fortran equivalent, which because it allows arbitrary indices, means the image can be 1-based, and the histogram 0-based.

integer, dimension(1000,1000) :: img
integer, dimension(0:255) :: hist
...
do x = 1,nrows
   do y = 1,ncols
      hist(img(x,y)) = hist(img(x,y)) + 1  

Supporters of 0-based indexing often say it’s more natural – in terms of memory addresses yes, in terms of human factors not so much. If an egg carton is empty, and somebody is asked “How many eggs are in the carton”, most likely the answer will be “none”, although some people may answer 0. If there are 12 eggs in a carton, the first egg is not egg zero, because it makes no logical sense – there are not 0-11 eggs in a carton, there are 1-12. Zero implies emptiness, nothing, nil, zilch, nix. Zero represents whether or not there are eggs in the carton. To have it represent the first element of an array then makes no logical sense from a human factors viewpoint. If there is one egg on the carton, it is not the 0th egg, it is the first. If there are no eggs in the carton, only the carton exists (this quickly turns into a discussion on the philosophy of zero).

Yes, there are instances where 0 exists in our lives. 24-hour clocks, and stopwatches start from 00:00:00, yet don’t stay there very long, and if something takes 0 seconds it didn’t really occur. Some people say rulers start at zero, but that argument is silly, because 0 is the edge of the ruler, and not really a measurement. A line of no length is not a line. Some people even put forth the argument of centuries – that the 1800’s is called the 19th century, however that is logical because it is the 19th century, i.e. the years 1-100 were the first century, 101-200 were the second century etc. Zero doesn’t even have a place (and there is no year 0). There are lots of these examples of numeric things in our lives, but none really relate to indexing, they just relate to zero.

There are many people who vehemently opposed to the idea of there being anything but zero-based indexing, and go to great lengths to justify their cause. They are often the same people who believe there is only one true language (whatever that may be). The reality is, who really cares what type of array indexing is used? Either that or go with an index-nonspecific language.

0-based indexingC, C++, Common Lisp, Haskell, Java, Oberon-2, Perl, Python, Ruby, Swift
1-based indexingAWK, Cobol, Fortran (def), Julia (def), Lua, Matlab, Algol-60, Algol-68
arbitrary indexingAda, Pascal, Fortran, Julia, Modula-2
Languages and indexing (def=the default setting)

  1. Barron, D.W., Buxton, J.N., Hartley, D.F., Nixon, E., Strachey, C., “The main features of CPL”, The Computer Journal, 6(2) pp.134-143 (1963)
  2. Kernighan, B.W., “A Tutorial Introduction to the Language B”, Bell Laboratories (1973).
  3. Richards, M., “The BCPL Reference Manual”, MIT, Memorandum-M-352 (1967).
  4. Wirth, N., Algorithms + Data Structures = Programs, Prentice-Hall (1976).

3 thoughts on “Zero-based indexing considered harmful

  1. nevillednz says:

    For the record, IIRC, both Algol60 and Algol68 allow arrays to be declared as “arbitrary indexing”, or maybe (def=the default setting). 

    Even AWK allows a zero index to an array (indeed an AWK array is really an unordered dictionary lookup).
    I can suggest that adding (def=the default setting) to both Algol60 and Algol68 to recognise their default?

    [ Key: `lwb/upb` is to mean the both the array lower and upper bounds ]

    The less obvious examples of “bias” might be where “1” as a lower-bound might be:
      * Algol60: The `outchar` function assumes the lower-bound of a `string` is `1`, eg. `outchar(1,”ABCD”,4)`
      * Algol60: When an array is passed as a parameter, then (unless the `upb/lwb` are “secretly” pushed by the specific implementation eg. “marst”) , what actually is pushed onto the call stack is the address “zeroth” element of the array, avoiding the need to pass lwb/upb… (n.b.: Matrices+ force the coder to pass the 1st `lwb/upb` on the stack)
      * Algol60: However: both bounds (including the arbitrary lwb) *must* be provided in declarations… eg. `integer array A[start:end];`
      * Algol68: the lower bound is omitted from a declaration. eg `[10]INT a` the default is 1, vs `[0:10]INT a`
      * Algol68: an array (or `STRING`) has been sliced. eg. `a[2:5]` vs `a[2:5][@0]`
      * Algol68: the builtin operator `ELEM` assumes `ELEM 1` is the 1st element for packed `BITS` and `BYTES`.
      * Algol68: the integer `CASE` statement (choice clause) assumes the integer `1` will select the `1st` statement. eg. `CASE 1 IN “A”,”B”,”C” OUT “Z” ESAC` yields “A”.
      * AWK: The function `substr` assumes the first `char` is at index=1
      * AWK: The builtin variable NR starts at `1`

    I suspect similar conditions to above also apply to all of the “arbitrary indexed” language eg. in Ada, Pascal, Fortran, Julia, Modula-2

    Maybe a litmus test might be to review how Ada, Pascal, Fortran, Julia & Modula-2 default, declare and index their own sub-strings.  Maybe they should enjoy a (def=the default setting).  Ironically, Algol60 coders have no optional lower bound default, and they are *forced* to declare arrays with their own specific lower bound.

  2. 8d5d2dderek (spoopy 🎃) (@852derek852) says:

    love having to to add and subtract one all because my language uses 1 based indexing /s

    converting x & y to offset:

    zero indexing:
    img(x,y,width) = img[y*width+x]

    one indexing:
    img(x,y,width) = img[(y-1)*width+x]

    converting offset to x & y
    zero indexing:
    x(offset,width) = offset%width
    y(offset,width) = offset/width

    one indexing:
    x(offset,width) = (offset-1)%width+1
    y(offset,width) = (offset-1)/width+1

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.