A crazy command line

Here is a piece of code from unix.stackexchange which counts the number of occurrences of words in a text file.

sed -e 's/[^[:alpha:]]/ /g' vader.txt | tr '\n' " " |  tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl 

It basically does the following:

  1. Substitute all non alphanumeric characters with a blank space.
  2. All line breaks are converted to spaces.
  3. Reduces all multiple blank spaces to one blank space
  4. All spaces are now converted to line breaks. Each word in a line.
  5. Translates all words to lower case to avoid ‘Hello’ and ‘hello’ to be different words
  6. Sorts the text
  7. Counts and remove the equal lines
  8. Sorts reverse in order to count the most frequent words
  9. Add a line number to each word in order to know the word posotion in the whole

Now what it really does is show the power of Unix command line utilities.

Here’s the input file vader.txt:

I’ve been waiting for you Obi-Wan. We meet again, at last.
The circle is now complete; when I left you, I was but the
learner, now I am the master. Only a master of evil, Darth.

and the output:

1 4 i
2 3 the
3 2 you
4 2 now
5 2 master
6 1 when
7 1 we
8 1 was
9 1 wan
10 1 waiting
11 1 ve
12 1 only
13 1 of
14 1 obi
15 1 meet
16 1 left
17 1 learner
18 1 last
19 1 is
20 1 for
21 1 evil
22 1 darth
23 1 complete
24 1 circle
25 1 but
26 1 been
27 1 at
28 1 am
29 1 again
30 1 a

Advertisements

A cool software engineering degree

What should a software engineering degree look like? I think the Computer Science University of Gothenburg in Sweden has a clue. Their Software Engineering and Management degree is 6 semesters in length, broken up into the following segments:

  1. Team Programming
  2. Systems Development
  3. Distributed Dystems Development
  4. Cyber Physical Systems and Systems of Systems
  5. Global Enterprise Software Development
  6. Software Engineering Research and Practice

Succinctly put, their program starts with “application development in the small, and gradually proceeds to software engineering in the large”.

No huge emphasis on math, or hardware, or the high level theory normally associated with a traditional computer science degree – but rather a degree wholly focused on software engineering.

Now *that’s* refreshing!

Why long computer science degrees are passé

Some people won’t like what I have to say here, but here goes.

You don’t have to have a 5 year education in computer science to be successful in this field. The fact is we may spend too many years in “education”, and not enough years learning how things work in the real world. I do understand the concept of higher education, but I don’t believe it’s necessary to be successful. All the programming I learned over the years I taught myself. No-one ever spent any time teaching me to program. My first computer science class was taught by an instructor who few in the class could understand, and was surely talented in some aspect of computer science, but certainly not teaching introductory programming. I think we spent more time playing pool at the university bar than actually in class. I don’t think he even realized there were like 5 people in the classroom. So, I taught myself Pascal. It wasn’t exactly hard because there were ample books on Pascal, and the language itself wasn’t overwhelming.

This first year programming class was interspersed with a weekly class on computer math, which was presented by an instructor who was more mathematician than computer scientist – and it turns out, sometimes enjoyed playing Rogue more than turning up to class (in honours classes anyways). One of the funniest things happened one morning in class. Someone had put a phone on the front podium, and it rang. The instructor picked it up, but there was no-one there. This happened 4 times. It was like a skit from Fawlty Towers. But I digress. I took ten computer science classes as an undergrad – most of which I taught myself… reading books in the library (like this was years before the Internet). The same instructor that taught us computer math, taught a second year class, “Unix and C”,… in which we learned C programming by watching videos (i.e. VHS)… because the instructor could not code in C. In hind-sight, the only classes I took of any significance were with an applied math prof (thanks for the classes Gary Bunting), where I learned about numerical math and Fortran.

Honestly, if you enjoy programming you can teach yourself. There are so many resources these days, it is sometimes overwhelming. What you really need are things that years in higher education won’t provide – a zest for exploration, and an ability to think outside the box. In fact, academia may dampen these things with years spent writing design documents, and learning theoretical concepts, with little ability to express yourself. Computer science should only have 3-year degrees, and they should be completely experiential. That, unfortunately will likely never happen. I realize that 5-year degrees involve coop, which is an excellent way of doing a degree because it is experiential. But does it have to be 5 years?

The British do three-year degrees, and four year degrees with what they term “industrial experience”. They market the 3-year degrees as “the fast route to graduation”. Partially this works because these are focused degrees (i.e. very little in the way of breadth – you don’t have to necessarily take geography classes to pad out your degree). Here is the first year line-up of CS classes from the University of Birmingham (ranked 16th of 102 from The Guardian, 2017 University Guide).

  • Fall: Elements of Functional Programming, Introduction to AI, Language and Logic, Software Workshop I
  • Spring: Data Structures and Algorithms, Introduction to Software Engineering, Robot Programming, Software Workshop I

Germany offers 3 year degrees, Sweden offers 3 year degrees, even New Zealand offers 3 year degrees. Very few places in Canada offer three year CS degrees, but in an industry like computer science it just makes good sense.

The mind is not a vessel to be filled, but a fire to be ignited.
– Plutarch

Image binarization (8) : An example of differing approaches

Now, consider thresholding the following image of ants.

“Ants”

Why? Let’s say for some purpose of detecting or tracking these insects! It seems simple right? The ants are dark, the background is light, so segmentation should be simple using thresholding right? Let’s look at the histogram.

Not… so… inspiring – right? One huge peak biased to the light end of the intensity spectrum, and likely associated with the background. The ants are on the other end  -and not very distinct. Local or global threshold?

Binary images: Otsu (top-left), Minimum Error (top-right), Phansalkar (bottom-left), and Sauvola (bottom-right)

Here are some of the best approaches. First, for classical reasons, Otsu – which extracts the ants, but also some of their shadows. The other global method, Minimum Error extracts each of the ants nicely, but there are holes in the ants body, likely attributable to reflections. From a local thresholding perspective, Souvola extracts the outlines of the ants, and Phansalkar’s algorithm (a modification of Sauvola’s threshold which deals better with low contrast images) produces a result somewhere between Minimum-Error and Sauvola.

 

Ada loves exceptions

One of the great things about Ada is it loves exceptions. Exceptions can be used to deal with issues that other languages (let’s call them “C”-ish) tend to ignore. Take for example the following piece of code which does nothing but read in a number… if a valid integer is entered, the program ends.  If something other than an integer is entered, say a character, then an exception is triggered, and the message in the exception part of the program is output, and the program loops to ask for input again. The function skip_line() jumps to the start of the next line (avoiding anything in the buffer).

If an error occurs during execution of get(), the exception Data_Error is raised in the block, and execution of statements in the block is abandoned. Control is then transferred to the Data_Error exception handler associated with the block. In this case, the exception handler displays an error message, then executes skip_line(), going back to the next iteration of the loop. Data_Error is propagated if the input character sequence to get() does not belong to the range of the required subtype.

Fun eh? But certainly easy.

with Ada.Text_IO; use Ada.Text_IO;
with ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Xcept is
    n : integer;
begin
    loop
      begin
        put("Enter a number: ");
        get(n);
        exit;
      exception
        when Data_Error => put_line("Invalid number.");
        skip_line;
      end;
    end loop;
end Xcept;

Why you *have* to understand memory constraints

People tend to avoid learning about memory. It’s almost impossible in C,… but C lets you do things that other languages won’t have a bar of. You can’t assume that all things are possible. In languages such as Ada, the program will throw an exception if memory constraints are violated. Take for example the following simple piece of Ada code:

with Ada.Text_IO; use Ada.Text_IO;
with ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure primes is
    n : integer;
    primeNumbers : array(0..integer'last) of integer;

begin
    put_line("Number of primes? ");
    get(n);
    put(n);
end primes;

It compiles okay, but when run, produces the following message:

raised CONSTRAINT_ERROR : erroneous memory access

Why? Print out the value of integer’last = 2147483647. Stacks only have so much memory, and aren’t bottomless. To get a better message, compile the program with  -fstack-check. Next time the program is run it produces the following error:

raised STORAGE_ERROR : stack overflow

This says it all. Over 2 billion elements is too much for a static array stored in the stack. It could be fixed by increasing the stack size, OR using dynamic memory. But more likely it could be fixed by not creating an array that large. Too many programmers assume resources are infinite, and they aren’t.  Even C doesn’t allow an array with 2147483647 elements in it – it’s just too large to store on the stack, and will result in a “Segmentation fault: 11” runtime error.

When data gets that big in most languages, you have to think beyond the stack and go dynamic.

 

What is sed?

Some people, when confronted with a problem, think “I know, I’ll use sed.” Now they have two problems.
– Jamie Zawinski, in The UNIX-HATERS Handbook

One of those long forgotten (by many) tools is sed, short for stream editor, a utility that parses and transforms text. It is a simple and compact programming language, developed from 1973 to 1974 by Lee E. McMahon of Bell Labs. It is one of a number of tools, like awk, tr, and now perl, which can be used to manipulate text. It is considered somewhat antiquated now, but still worth looking at. People tend to shy away from it because it is cryptic. Here’s a simple example:

sed '/pattern/d' file

This will delete all lines in which the pattern exists, and print the results to the standard output. So if the pattern is circle, and the input (in file vader.txt) is

I’ve been waiting for you Obi-Wan. We meet again, at last.
The circle is now complete; when I left you, I was but the
learner, now I am the master. Only a master of evil, Darth.

the output will be:

I’ve been waiting for you Obi-Wan. We meet again, at last.
learner, now I am the master. Only a master of evil, Darth.

Here is a second example which strips a C program of inline comments using // (anywhere they occur).

sed 's://.*$::g' comments.c

The s is the substitution command, more commonly seen as s/…/…/, but as in the case above using a colon as the delimiter, i.e. s:…:…:. The // implies the string to search for. The .* matches zero or more characters after the first match. The $ appended on specifies the whole line. The g implies global replacement. So given this input (in comments.c):

#include <stdio.h>

int main(void)
{
    // Declare the variables for input
    double a,b;

    // Read in the values for the two numbers
    scanf("%lf%lf", &a, &b);
    // Compare the numbers
    if (a == b)
       printf("The numbers are equal\n"); // If equal
    else
       printf("The numbers are not equal\n"); // If not equal

    return 0;
}

The output would be:

#include <stdio.h>

int main(void)
{

    double a,b;


    scanf("%lf%lf", &a, &b);

    if (a == b)
        printf("The numbers are equal\n");
    else
        printf("The numbers are not equal\n");

    return 0;
}

Nice? This is essentially what the C compiler does before it starts parsing the code.

Go on… teach yourself sed!

A short guide to sed can be found here.

Image binarization (7) : What about local thresholding algorithms?

Binarization come sin two flavours: global and local. Global techniques, like those reliant on histograms, use information from the whole image to find one threshold which will make a binary image. Local techniques determine a different threshold value for every pixel based on characteristics of their surrounding area. Localized techniques are sometimes seen as a panacea for images with small details. But do they work? Sometimes they do, but not always. They often rely on the use of statistics from a local neighbourhood, and as such suffer from a lack of global context, i.e. they don’t know that one pixel may be associated with another in an object or region. Consider the following example which thresholds a piece of writing with a paper background suffering from some form of deterioration.

Here is its histogram:

Histogram of text image

Firstly, consider the global version of Otsu. Clearly the result is not optimal, because half the text has been obliterated by the background.

Global Otsu

Now consider three localized  thresholding algorithms local Otsu, Niblack, and Sauvola. The first algorithm tested is a localized version of Otsu. Localized Otsu does segment the text, but also segments every other artifact in the image. The region around the stain in the paper is difficult to differentiate text from stain.

Localized Otsu

Niblack’s localized algorithm does not fair much better, in fact it produces even more non-text related artifacts than Otsu.

Niblack

Finally Sauvola’s algorithm almost perfectly segments the text from the background, practically ignoring the stain or discontinuities in the background of the paper.

So what are these algorithms used for? They have primarily been developed over the years for the task of text segmentation, and usually they do a magnificent job in that realm… sometimes the image does need to be cleaned up, e.g. background suppression applied. Finally, consider the global version of Otsu. Clearly the result is not optimal, because half the text has been obliterated by the background.

 

 

Automatic line numbering

Ever wanted to include a piece of code somewhere, and also have line numbers in order to have something to refer to when you’re discussing the code.

Use nl.

It’s another of those totally useful Unix utilities – short for line numbering filter. And it’s simple to use. Used as nl test.c, it just numbers all non-empty lines, however if all lines should be numbered,  the parameters can be modified to nl -ba test.c. This takes a file like this:

#include <stdio.h>

int main(void)
{
    int a, b;
    double d;

    if (a == b)
        d = a * b;
    return 0;
}

And turns it into this:

     1  #include <stdio.h>
     2
     3  int main(void)
     4  {
     5      int a, b;
     6      double d;
     7
     8      if (a == b)
     9          d = a * b;
    10      return 0;
    11  }

 

 

 

Nested functions and scope in Julia

If you have never dealt with nested functions, you haven’t seen what their scope can do. Here’s a piece of code with one function main(). that contains two nested functions at the same level: nested1() and nested2(). Both these nested functions have access to both the parameter i, and the variable k. However, as they sit at the same level, they cannot access each others parameters.

function main(i)

    function nested1(j)
        nested2(8)
    end

    function nested2(j)
        println(i)
        println(j)
        println(k)
    end

    k = 7
    nested1(i)
end

main(5)

Now, if there is a name conflict, i.e. two variables having the same name in different scopes, the variable being used has local scope. For example if nested2() were replaced with the code below,  printing k would print the value 12, as opposed to 7 in the code above.

function nested2(j)
    k = 12
    println(j)
    println(k)
end