Are there too many PhDs?

I have supervised two PhD students over my career, which was more than enough. But don’t we need more highly educated individuals? I would argue no, and here’s why. Unfortunately there are just too many PhDs out there, and not enough academic jobs to absorb the oversupply.

But you say PhDs could work in industry? Sure, some could, but the reality is that people with doctoral degrees are often (a) overqualified in a narrow field of expertise, and (b) not trained to work in anything but academia. It’s easy to become overqualified. No company is going to hire a handful of PhDs in computer science, certainly not when they can hire people with a basic degree and 6-8 years of equivalent experience. Just because someone has written a thesis on some esoteric aspect of AI does not mean they can actually design or implement a large-scale AI system. There is no barrier to someone with experience being able to design an AI system, none whatsoever. So people are actually paying for a piece of paper. And most people who get a PhD want to work in academia – in Nature’s 2019 PhD survey 56% of respondents said that academia is their first choice for a career. Just under 30% chose industry as their preferred destination.

Academia generally trains people for academia, not the greater world of industry, or even government jobs. The emphasis is on research and writing papers, not things like management and other organizational skills. They don’t even teach people how to teach (which is somewhat ironic). Teaching is something people learn as a side-gig, teaching the odd class as a sessional. That’s not to say that people with PhDs don’t find work in industry; they do. Of the five individuals that worked in our computer systems engineering lab at RMIT, I’m the only one that ended up in academia – the rest have had successful industry careers, but few work in the exact topic they did their research in. But a PhD two decades ago was more likely to get you an academic job.

A 2021 article in University Affairs described the contradiction between PhDs and career prospects. Between 2002 and 2017 the number of PhDs graduated in Canadian universities more than doubled, from 3,723 to 8,000. But the number of tenure-stream professors hasn’t changed much – from 36,053 to 45,660 (roughly a 27% increase). This means greater competition for fewer jobs, and non-academic sectors have not increased their uptake. Part of the lack of jobs does stem from university faculty not retiring (due to the end of mandatory retirement at 65, which happened in Ontario in 2006). In 2016, 14,217 (31%) faculty were in their 50s and 10,560 (23%) were 60+, so more than half the academics in Canada were over the age of 50. That’s a real problem because there is now a substantial generational gap in many departments. Institutions have made this worse by failing to invest in faculty hires over the last decade. A lack of renewal? Most certainly – over the same period the 29 and under group represented a paltry 0.5%. Even the 30s age group only comprised 14.8%. And that’s a problem.

Ultimately the profusion of PhDs is caused by departments wanting to expand their graduate programs without any regard for employment prospects of graduating students. This is made worse by funding bodies who feel that a majority of grant money should go to paying students – it’s a bit of a vicious cycle. A PhD also does not guarantee better earnings, especially for some STEM fields. According to a 2020 StatsCan survey, a (male) CS PhD has median earnings (2017) of C$98,484, and that’s in the upper echelon. A PhD in statistics sits at C$86,247, and most biological fields are below C$59,275 (that’s below a PhD in history, whose median earnings are C$68,120). Not surprisingly salaries for female graduates are generally lower, and it is really quite sad that two sets of statistics are needed at all – so much for equity. For a female CS PhD the median is C$73,678.

The message here is that if you are considering doing a PhD, please look at both job prospects and potential earnings before spending 4-6 years doing something you may later regret.

Using awk: Adding line numbers to a program

There might be times when you want to add line numbers to a program, for whatever reason. Maybe you want to post the program somewhere, or just print it out. You could do it by hand, but why, when Unix provides the simplest of tools to do it – awk? Here is the code (where filename is the file to be processed):

awk '{printf "%d\t%s\n", NR, $0}' < filename

Basically all this does is process every line of the file, printing out the result to standard output. This is done by printf using the format string "%d\t%s\n". The %d specifies that the first thing to be printed will be an integer, in this case NR, which is a built-in variable that stands for “number of records”. Each line is a record, so the value printed will be the line number. The \t inserts a tab in the output. The %s prints out a string, in this case $0, which represents the entire line of input. The \n inserts a newline in the output. So below is the input and output for a Pascal program.

program factorial;

var i, n, fact : integer;

begin
    writeln('Enter a number: ');
    read(n);
    fact := 1;
    for i := 2 to n do
        fact := fact * i;
    write(n,'! = ', fact);
end.

1	program factorial;
2
3	var i, n, fact : integer;
4
5	begin
6	    writeln('Enter a number: ');
7	    read(n);
8	    fact := 1;
9	    for i := 2 to n do
10	        fact := fact * i;
11	    write(n,'! = ', fact);
12	end.

Obviously to move the result to a file, just redirect the output.
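As a small variation on the one-liner above, here is a sketch that right-aligns the numbers in a fixed-width column and redirects the result to a file. The function name number_lines, the four-line sample program and the temporary file names are made up for illustration; only the awk printf call mirrors the original.

```shell
# Hypothetical sample file so the sketch runs on its own.
src=$(mktemp)
printf 'program factorial;\n\nbegin\nend.\n' > "$src"

# Number every line; %4d right-aligns the number in a 4-character column,
# so the code still lines up once the file passes line 9, 99 or 999.
number_lines() {
    awk '{ printf "%4d\t%s\n", NR, $0 }' "$1"
}

# Redirect the numbered listing to a file, then show it.
number_lines "$src" > "$src.numbered"
cat "$src.numbered"
```

The standard utilities nl -ba and cat -n do much the same job, but the awk version makes it easy to change the format string.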

The shell: Calculating the size of directories

Sometimes you want to find the size of a folder, but honestly it is somewhat annoying to have to do it via ⌘i. All I want is a summary, not having to look at each folder individually. Shell to the rescue again with the disk usage command, du. So if we use du -h -s * it will print a list of the sizes of all the files/directories in the current directory. For example:

% du -h -s *
 20K	brunel
 32K	egfor
 16K	ffwi
 12K	gamePIG
8.0K	intmult
380K	mugWump
296K	nsqrt
  0B	ordinance
 44K	pascalsTriangle
416K	plurals
 28K	shelfLoad
860K	soundex

The parameters are -h to print human readable sizes (B, K etc.) and -s to print a summary usage of each argument. Since du prints one summary per argument, I pass *, which the shell expands to all the files/dirs in the current directory. To deal with only directories we can instead use:

du -h -s */

If you want to go a step further and display only folders that are gigabytes in size (replace G with M for megabytes):

du -h -s */ | grep -E '^[[:space:]]*[0-9.]+G'

You can also add the -c flag to print the total size.
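To illustrate the -c flag, here is a rough sketch that summarises each directory, sorts the result by size and ends with the grand total. It swaps -h for -k (sizes in kilobytes) so that sort -n can order the numbers. The dir_sizes function and the scratch directories alpha and beta are made up for the example, and the actual sizes printed will depend on the filesystem.

```shell
# Hypothetical scratch tree so the sketch runs on its own.
root=$(mktemp -d)
mkdir "$root/alpha" "$root/beta"
printf 'hello\n' > "$root/alpha/a.txt"
printf 'world\n' > "$root/beta/b.txt"

# Summarise each directory in kilobytes (-s -k), add a grand total (-c),
# and sort numerically so the largest entries come last.
# The */ glob only matches directories.
dir_sizes() {
    ( cd "$1" && du -s -k -c */ | sort -n )
}

dir_sizes "$root"
```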

What makes a successful CS grad student?

It sometimes seems like everyone wants to go to grad school. This is especially true of overseas students, for whatever reason. Sometimes it is to broaden their horizons, other times to start a new life somewhere. But it may be surprising to learn that success isn’t predicted by supposed “good grades” at university. In fact just because you get a 95% average doesn’t necessarily mean you will do well in grad school… or rather you might do well in coursework, but fail abysmally when it comes to actually doing research and writing a thesis. Just because you have a master’s degree does not mean you will be successful at a PhD, or that you even have the right mindset.

So what makes a successful CS grad student?

The ability to solve problems. If you have a hard time solving problems, then a graduate degree is not for you. You have to be able to tackle a problem by looking at it from all facets, and to look beyond the problem into other disciplines. You can’t always google the solution to a problem.
The ability to do research. You have to find means of investigating things, and understanding what current approaches are. This does not mean just writing a literature review.
The ability to program. You have to be able to program proficiently in a number of realistic programming languages. I’m not talking Microsoft stuff, and I mean something beyond three different C-based languages. You should be able to be thrown a new language and become proficient in a short period of time.
The ability to write. Ultimately to become successful as a graduate student you have to write a thesis. You have to understand how to write, and honestly have the ability to write more than just in scientific terms. Life is not a bunch of conference or journal papers. In Canada this means you also have to be proficient in English (or French). No professor should be editing a thesis to make it understandable.
The ability to communicate. Beyond writing you also have to be able to verbally communicate. Nobody likes to listen to boring presentations, where people just read off slides.

Finally you have to show some passion about the research you hope to do. The biggest mistake potential students make is sending generic emails saying “your research aligns with what I’m interested in”. That’s baloney. If you have a degree in electrical engineering there is no way your research aligns with mine. And quoting a bunch of my specialties tells me nothing more than that you can cut-and-paste.

Ultimately you have to put the effort into graduate school. Earning a graduate degree however does not guarantee you will get a job any easier than somebody with an undergraduate degree. Why? Because many undergrad CS students do co-op placements these days, and experience always outclasses another two years in a master’s degree… or umpteen years in a PhD. Companies want to hire people who have all the above abilities, and experience, not necessarily someone who has done *more* courses, and has written a very narrowly scoped thesis.

Using awk: Counting words?

Ever wondered how the Unix word count utility wc works? It’s likely written in C, but here is a simple awk version.

{ nc += length($0) + 1
  nw += NF
}
END { print NR, nw, nc }

The code is input into a file, say wcawk, and then this is executed in the following manner:

awk -f wcawk textfile

The first line of code counts the number of characters as the length of the string representing the entire line, plus 1 for the \n character, and the second line counts the number of words, by adding NF, the built-in variable for the number of fields. The fourth line just prints out the number of lines (NR), word count and character count. Here’s a sample:

I've been waiting for you Obi-Wan. We meet again, at last.
The circle is now complete; when I left you, I was but the
learner, now I am the master. Only a master of evil, Darth.

And here’s what happens when awk is run: 3 36 181
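One way to sanity-check the script is to run it next to the real wc on the same file. The sketch below does that with a two-line sample; the awk_wc function and the sample text are made up for the example. It also adds 0 to the counters so that an empty file prints zeros rather than blanks. Note that awk's length() may count characters rather than bytes in some locales, so the last column is only guaranteed to agree with wc for plain ASCII text like this.

```shell
# Hypothetical sample file so the sketch runs on its own.
sample=$(mktemp)
printf 'one two three\nfour five\n' > "$sample"

# Same logic as the script above, as a one-liner:
# nc counts characters (plus 1 per line for the newline), nw counts fields.
awk_wc() {
    awk '{ nc += length($0) + 1; nw += NF }
         END { print NR, nw + 0, nc + 0 }' "$1"
}

awk_wc "$sample"    # lines, words, characters
wc < "$sample"      # the real thing, for comparison
```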

What about word frequencies, could you do that in awk? Sure can. Here’s the code:

awk '   { for (i=1; i<=NF; i=i+1) freq[$i]++ }
END     { for (word in freq) print word, freq[word] }
' "$@"

In this case we have included awk in the script, and just made the file executable. The first for loop looks at each word in the input line, incrementing the element of the array freq subscripted by the word. After the file has been read, the second for loop prints the words and their counts. When run it prints them in a long, arbitrary list. To make the output nicer, it can first be piped into sort, then into column. Below is the output from the sample text:

% wfreq vader.txt | sort | column
Darth. 1	a 1		complete; 1	master 1	was 1
I 3		again, 1	evil, 1		master. 1	when 1
I've 1		am 1		for 1		meet 1		you 1
Obi-Wan. 1	at 1		is 1		now 2		you, 1
Only 1		been 1		last. 1		of 1
The 1		but 1		learner, 1	the 2
We 1		circle 1	left 1		waiting 1
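Sorting alphabetically is fine for a short text, but for anything longer it is usually more useful to see the most common words first. A possible sketch, assuming the same awk logic wrapped in a made-up function word_freq and run on a made-up sample file, sorts on the count (field 2, numeric, descending) and breaks ties alphabetically:

```shell
# Hypothetical sample file so the sketch runs on its own.
text=$(mktemp)
printf 'the cat and the dog\nthe end\n' > "$text"

# Count each whitespace-separated word, then sort by count (highest first),
# falling back to alphabetical order for words with the same count.
word_freq() {
    awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
         END { for (word in freq) print word, freq[word] }' "$1" |
        sort -k2,2nr -k1,1
}

word_freq "$text"
```

Piping the result through head -n 10 would then give a top-ten list.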

Why some Python libraries aren’t all they’re cracked up to be

I like Python, I mean it’s quite good at doing some things. But what I really don’t like is when people say things like “Why are you using Fortran… you should switch to Python.” Why exactly? I mean Fortran is inherently faster. Sure I don’t get access to all the fancy Python libraries (which some may consider a blessing), but does it really matter that much? The biggest problem with Python is that some of its libraries aren’t actually fully written in Python. Take SciPy for example (quoting from their own website):

“SciPy wraps highly-optimized implementations written in low-level languages like Fortran, C, and C++. Enjoy the flexibility of Python with the speed of compiled code.”

So are they effectively implying that it’s not possible to create “highly-optimized” versions of some functions in Python? Likely it isn’t possible, so they rely on C, C++ and Fortran. According to a recent paper [1] which used a program called linguist to analyze SciPy, it is “50% Python, 25% Fortran, 20% C, 3% Cython and 2% C++, with a dash of TeX, Matlab, shell script and Make”. In fact SciPy is full of Fortran code for numerical integration and solution of initial value problems (QUADPACK, ODEPACK), and performing Fourier transforms (FFTPACK), to name but a few. If we look at QUADPACK, it seems like the central algorithms are all written in Fortran, but not modern Fortran – Fortran 77, complete with goto statements and do-continue loops. Cool. 🙃

I get why the libraries use Fortran, I mean it likely speeds up Python. And it’s not that hard to call Fortran from Python. So for SciPy they have Python wrappers, around C-wrappers, around Fortran code. What about some of the other Python libraries (according to GitHub)?

Numpy: Python 62.2%, C 35.3%, C++ 1.3%, Cython 0.9%, Shell 0.2%, Fortran 0.1%.
Pillow: Python 60.7%, C 38%, HTML 0.5%, Postscript 0.4%, Shell 0.2%
Tensorflow: C++ 63.1%, Python 21.4%, MLIR 5.7%, Starlark 3.8%, HTML 2.4%, Go 1.1%

What it says is that libraries which rely heavily on math processing often rely on another language to do the heavy lifting. Maybe that’s why the rest of Python, that relies on more pure Python code, is so slow?

[1] Virtanen, P., et al., “SciPy 1.0: fundamental algorithms for scientific computing in Python”, Nature Methods, 17, pp. 261-272 (2020)

University degrees ≠ intelligence

It’s funny how some people equate the number of university degrees they have with intelligence. Of course some people tend to have an innate level of self-absorption, so it’s not that surprising. But you get people that make those sort of statements, usually when they are having some discussion with or about people with fewer or no degrees. I have never really understood why. I have a bunch of degrees, and I don’t feel that makes me any better than anyone else. They are just degrees, pieces of paper that prove nothing really.

I mean even people with a PhD, the highest level of university education, can show a pure lack of intelligence. They may know a lot about a narrow scope of knowledge, but be completely clueless about the world at large. Or have an inability to change a light-bulb, let alone use a screwdriver. Just because someone works with their hands does not mean they lack intelligence – quite the contrary. Leonardo da Vinci was a spectacular individual who was a painter, a draughtsman, an engineer, a scientist, a theorist, a sculptor, and an architect. It has recently been proven by MIT that his 500 year old bridge design would actually have worked. He had no formal training beyond reading, writing, and arithmetic. Isambard Kingdom Brunel (1806-1859), the prominent English engineer, actually trained as a watchmaker, and learned most of his engineering skills from his father, Marc Isambard Brunel (the first civil engineering degrees were not awarded until the 1890s). The many bridges Brunel built so long ago still stand today – a testament to a skilled individual.

Most degrees are merely pieces of paper suggesting that a person achieved passing grades in the courses they took. This is because many degrees do not actually require any level of actual experiential learning (unlike medical degrees). There are some people who get engineering degrees that wouldn’t even know what the business end of a hammer is. There are computer scientists (some with PhDs) who couldn’t write a piece of software. There are people who get economics degrees that have no clue about their own personal finances. You get the picture.

You don’t need a university degree to be successful, and having one is no guarantee of success. We may never see the likes of a da Vinci again, but then again maybe we just don’t look closely enough? Remember, just because you have a degree doesn’t make you better than anyone else.

The shell: Finding files using a specific name

So what’s the easiest way to find a file if you know part of the name? For example if we want to find all files (recursively from the current directory), with the substring “Leib”:

find . -name '*Leib*'

Maybe you want the search to be case insensitive? So in the example it might search for “Leib”, or “leib”? Then use -iname instead of -name. Here’s some sample output:

find . -iname '*Leib*'
./codeFORTRAN/piLeibniz.f95
./codeC/pracniques/pi_Leibniz
./codeC/pracniques/pi_Leibniz/pi_leibniz.c
./codeC/pracniques/pi_Leibniz/pi_leibnizfunc.c
./codeC/pracniques/pi_Leibniz/pi_leibnizREF.c
./codeC/pracniques/pi_Leibniz/pi_leibniz128.c

If you only want to search for Fortran files then:

find . -iname '*Leib*.f95'

If this doesn’t provide enough information, you can append -ls to the command. This provides detailed information on the file. For example:

find . -iname '*Leib*.f95' -ls
985410  8 -rw-r--r--  1 vader1 staff 1207  1 Dec  2014 ./codeFORTRAN/piLeibniz.f95
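Note that the name search also matches directories (the pi_Leibniz folder shows up in the sample output above). If only regular files are wanted, one option is to add -type f. The sketch below tries this on a throwaway tree; the find_files function and the directory layout are made up for the example and only loosely echo the listing above.

```shell
# Hypothetical scratch tree so the sketch runs on its own.
tree=$(mktemp -d)
mkdir -p "$tree/codeC/pi_Leibniz" "$tree/codeFORTRAN"
: > "$tree/codeC/pi_Leibniz/pi_leibniz.c"
: > "$tree/codeFORTRAN/piLeibniz.f95"

# Case-insensitive substring search restricted to regular files (-type f),
# so the pi_Leibniz directory itself is not listed.
find_files() {
    ( cd "$1" && find . -type f -iname "*$2*" | sort )
}

find_files "$tree" Leib
```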

What is a modern programming language?

I find it funny when people talk about teaching modern programming languages. How exactly is modern defined? Are we talking about languages which are in demand right now (due to job postings perhaps)? Or perhaps languages that developers report they are using? Let’s look at the latter via Stack Overflow’s 2022 Developer Survey of over 70,000 developers. These surveys are kind-of eye-opening, but not in the way you would think.

In the technology section they tend to lump programming, scripting and mark-up languages in one category. Of those JavaScript makes the top of the list at 65.36%. I get it, it’s the most popular because of how much it’s used in websites, but it’s not a hard-core language. Next is HTML/CSS at 55.08% – mark-up languages, NOT programming languages. If we were to look at core programming languages, as we make our way through the list we get: Python 48.07%, Java 33.27%, C# 27.98%, C++ 22.55%, C 19.27%. From there it tends to spiral down to nothing. In this list I don’t really see “modern” programming languages. Ruby only sits at 6.05%, Swift at 4.91%.

Of course there are some issues with these surveys. The respondents are predominantly people who use Stack Overflow, where the dominant age group is 25-34 (39.62%), which is arguably the largest demographic, and 56% of people only have 1-9 years of experience. And 70,000 represents about 0.26% of the 27 million odd software developers in the world. And let’s face it, if you’re writing HTML and CSS, it isn’t exactly software development (sorry, it just isn’t), and JavaScript, well it’s more than likely a bit of a psychosis (and besides can you really trust a language that doesn’t have real integers?).

There are 101 of these “top programming lists”, all of which say the same thing. None come from extensive industry surveys. And none of these languages are “modern” in the real sense of the word. They are modern in the sense that all languages evolve. Fortran 2018 is a modern language. The TIOBE index puts the top languages as Python, C, Java, C++, C#, Visual Basic, JavaScript, etc. But you have to take these rankings with a grain of salt, as Cobol is ranked 25th and is still in heavy demand in financial circles. Still it is telling, none of the top seven are modern languages per se.

So when someone talks about teaching a course on “modern” programming languages what exactly are they talking about? Are they talking about a paradigm? Of the four main paradigms (procedural, OO, scripting and functional), the first three are well covered. So perhaps functional programming languages? They are used in data science and machine learning, and obviously will have a role in the future of programming. So a course could be filled with Clojure, Elixir, Haskell and/or Scala – but then just call it a course on functional programming. But the paradigm has been around since Lisp appeared in 1958, so these are just modern renditions of functional languages.

So what is a modern industry programming language? It doesn’t exist. It’s a bit like someone who says they’re creating a modern hammer – sure it might look more aesthetically pleasing, or be made of titanium, but it still fundamentally works the same as a hammer from 1000 years ago. Perhaps academics have some strange notion of what a modern language is? Besides which, if you are a good programmer you can teach yourself a new “modern” language, right?

Nonsensical recursion algorithms: Stutter

This simple recursive C program returns an integer obtained from n by replacing every digit with two of that digit. For example, stutter(348) returns 334488. It’s a bit nonsensical.

#include <stdio.h>

int stutter(int n)
{
   if (n < 0)
      return -stutter(-n);
   else if (n < 10)
      return n * 11;
   else
      return 100 * stutter(n/10) + 11 * (n % 10);
}

int main(void)
{
   int n;
   printf("Input a number: ");
   scanf("%d", &n);
   printf("The stuttered version is: %d\n", stutter(n));

   return 0;
}

Here is some sample output:

Input a number: 2534
The stuttered version is: 22553344
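For comparison, the same effect can be had without any recursion by treating the number as a string. This is not the recursive technique shown above, just a quick shell sketch using sed to double every digit; the function name stutter is reused purely for illustration. Working on text also sidesteps integer overflow: in the C version a five-digit input such as 20000 already produces a ten-digit result (2200000000) that does not fit in a 32-bit int.

```shell
# String-based stutter: sed replaces each digit with two copies of itself
# (& stands for the matched text). A leading minus sign is left untouched.
stutter() {
    printf '%s\n' "$1" | sed 's/[0-9]/&&/g'
}

stutter 2534     # same input as the sample run above
stutter -348     # negative numbers keep their sign
stutter 20000    # too large for the 32-bit C version, fine as text
```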

The Craft of Coding

Musings on programming and education

Month: November 2022