They control everything you read

There is one organisation that controls and influences every single piece of text you will ever read on a computer screen. If they were not standing between my computer and all of yours, this text would be very different. This is not a conspiracy theory. This is the truth.

I’m not talking about the Illuminati, the White House, NASA, GCHQ or the NSA (I hope – they could be censoring this text), the World Government, the Roswell aliens, the US Army, Facebook, Google, Apple, communists, Walmart, McDonald’s or ISIS. I am talking about a group called the Unicode Consortium.

These guys are (we can only hope) good. They are a non-profit (always a good start) dedicated to developing a list of characters called the Unicode Standard. The fact they do this is a very, very good thing. Without Unicode, it is unlikely that the internet would have ever left the United States. 44% of this blog’s audience is outside the UK – without Unicode, many of them might not be reading this at all.

To explain why, I will have to get a little technical.

On a computer hard drive, all information boils down to 1s and 0s. (These are actually tiny magnetic regions, flipped by the drive’s read/write head, but that’s not really relevant.) When you want to represent a number, that’s fairly easy. We use binary – for each place you move to the left, the value of the digit doubles. 10 represents 2, 100 is 4. But for letters, you need a way of turning numbers into symbols. There are lots of ways of doing this.
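(A quick way to see that doubling for yourself, in Python – the language I’ll use for snippets throughout:)

print(0b10)             # 2
print(0b100)            # 4
print(format(2, 'b'))   # '10' – and back the other way
print(format(4, 'b'))   # '100'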

The piece of software that turns a string of 1s and 0s into letters is an interpreter (a decoder, strictly speaking). It has a list of numbers and the letters that correspond to those numbers. When you throw it a string of 1s and 0s, it will divide the string into chunks (the size of which is determined by the character set), read each chunk as a number, look that number up in its list, and hand the resulting character to the program that wants the text. But you need a character set – that list of numbers and corresponding letters. And when you have some text you want to save, an encoder will use the same character set to write it back to the hard drive as 1s and 0s.
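A hand-rolled toy interpreter makes the idea concrete – the three-entry character set here is invented purely for the example:

charset = {97: 'a', 98: 'b', 99: 'c'}     # the list of numbers and letters
stream = '011000010110001001100011'       # the 1s and 0s off the disk
chunks = [stream[i:i+8] for i in range(0, len(stream), 8)]   # 8-bit chunks
print(''.join(charset[int(chunk, 2)] for chunk in chunks))   # 'abc'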

So first, there was ASCII. The American Standard Code for Information Interchange. This is a 7-bit code with space for 128 different characters, normally stored one per 8-bit byte. For example, the character

a

in ASCII is

0110 0001

In other words, a is ASCII number 97.
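You can confirm that number in Python, which maps between characters and their codes with ord() and chr():

print(ord('a'))                   # 97
print(format(ord('a'), '08b'))    # '01100001'
print(chr(97))                    # 'a'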

Now ASCII is all fine and good with an English speaking and writing userbase. But it does not work across nations. If you need even one more alphabet (say, Cyrillic), there is no room for it among ASCII’s 128 characters.

At this point in the development of the computer and the internet, two things could have happened. Each alphabet could have used a dedicated character set. Everything would fit nicely in each country. But when it came to international information exchange, everything would fall apart. These national character sets keep ASCII’s 128 characters and fill the other 128 slots of the byte with their own alphabet – and each one fills them differently. In Code Page 437, the original IBM PC character set, character 141 is ì. In Code Page 855, its Cyrillic equivalent, character 141 is this.

Ї

Take any text that uses those upper characters – a German memo full of umlauts, say – to a Russian computer with a Code Page 855 interpreter, and it will hideously mangle the message. And converting between the dozens of code pages is a nightmare – I know this from personal experience.
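Python happens to ship codecs for both code pages, so the mangling is easy to reproduce:

print(bytes([141]).decode('cp437'))          # 'ì' – Western European
print(bytes([141]).decode('cp855'))          # 'Ї' – Cyrillic
print('ì'.encode('cp437').decode('cp855'))   # 'Ї' – mojibake in one line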

Or you could create Unicode. Unicode is a global standard for text: one list of characters shared by everyone. It is massive. As of June 2015, there are 120,737 different Unicode characters, covering 129 different scripts from around the world, plus emoji. That is almost every imaginable writing system – English, Tamil, Mandarin, Korean, Hebrew, Tibetan, Cyrillic and Cherokee, to name but a few, along with Braille and every punctuation mark (you’d be surprised by how many there are). The whole list can be found at http://unicode.org/standard/supported.html. Unfortunately for us nerds, the Unicode Consortium rejected an application to have Klingon added.
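Every one of those characters gets a number (a code point) and an official name, and Python’s unicodedata module will look them up:

import unicodedata

for ch in ['a', 'Ї', '∞']:
    print('U+{:04X} {}'.format(ord(ch), unicodedata.name(ch)))
# U+0061 LATIN SMALL LETTER A
# U+0407 CYRILLIC CAPITAL LETTER YI
# U+221E INFINITY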

The true scale of the Unicode project is pretty hard to grasp. That is, until you open the PDF of the latest edition of the Unicode Standard. It is a vast table, over 2,000 pages long, filled with nothing but rows and rows and rows of characters and numbers. You can actually buy Unicode as a book. It comes as two manuals that take up far too much shelf space and are pretty much useless: without a search function, it takes hours to find any individual symbol.

Here’s an example: the character a again. In ASCII, stored as a single 8-bit byte, it looks like this.

0110 0001

Unicode, stored in its simplest form (an encoding called UTF-32), uses a 32-bit key. That means it can address over 33 million times as many characters as ASCII’s 128. That same a looks like this.

0000 0000 0000 0000 0000 0000 0110 0001
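You can reproduce those four bytes by encoding to UTF-32 (big-endian, so the bytes read left to right):

data = 'a'.encode('utf-32-be')
print(data)                       # b'\x00\x00\x00a'
print(' '.join(format(byte, '08b') for byte in data))
# 00000000 00000000 00000000 01100001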

But even with the tens of thousands of Unicode characters, there is a lot of space left. The 120,000-odd characters currently assigned fill less than 0.003% of the space a 32-bit key can address. With more than 4 billion slots available, the Consortium shouldn’t run out any time soon.
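The arithmetic, using the figures quoted above:

assigned = 120737                 # characters in Unicode 8.0 (June 2015)
total = 2 ** 32                   # values a 32-bit key can address
print('{:.4%}'.format(assigned / total))   # 0.0028%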

So far, Unicode sounds like a programmer’s dream, right? A way to standardize data across languages and continents and to make all text readable. It would be so good if that were the truth.

Since 1991, there have been 19 versions of Unicode. Most programs will by default use the newest version (at the time of writing, Unicode 8.0). But when one starts fudging around with old software that no one has updated, you find yourself up against a wall: the newer software uses a version of Unicode that contains characters the older software simply doesn’t recognise. This is usually fine in the UK or US, but for East Asian languages it can be a real problem.
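You can check which Unicode version your own tools were built against – Python, for one, reports it directly:

import unicodedata

print(unicodedata.unidata_version)   # e.g. '8.0.0' on a current build
# Ask an old build about a character added later and unicodedata.name()
# raises ValueError – the character simply isn't in its tables.

And it gets worse.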

Because of the vast size of Unicode, and the huge amounts of space left unused, people have been trying for ages to shorten it. Look at the code for that a again – three of its four bytes are nothing but zero padding. So, there have been several attempts to ‘compact’ Unicode into smaller chunks.

The main attempts have been (a quick size comparison follows the list):

  • UTF-1
  • UTF-7
  • UTF-8
  • UTF-EBCDIC
  • UTF-16
  • UTF-32
  • UCS-2
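To see how they differ, compare how many bytes one short string takes under the encodings Python ships (UTF-1 and UTF-EBCDIC never caught on, so they are left out):

text = 'aЇ€'   # one ASCII letter, one Cyrillic letter, one symbol
for enc in ['utf-8', 'utf-16-be', 'utf-32-be']:
    print(enc, len(text.encode(enc)), 'bytes')
# utf-8 6 bytes
# utf-16-be 6 bytes
# utf-32-be 12 bytes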

The problem arises when one has to work with multiple encodings in one program.

Here’s an example I had to battle with recently. I had about 100KB of text files, containing both German and Russian characters, encoded in UTF-8. UTF-8 is by far the most popular encoding, used by the overwhelming majority of the web. But the graphics package I was using (pygame, if anyone cares) wanted UCS-2. I spent about a week grappling with converters, translators, interpreters and message boards. Nothing worked. In the end I had to go through every single file and convert it to UCS-2 by hand. Not a fun way to spend a lunch break.
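For the record, the conversion itself can be scripted – a minimal sketch, assuming a hypothetical texts/ folder, and leaning on the fact that UTF-16 is byte-for-byte identical to UCS-2 for characters (like German and Russian letters) in the Basic Multilingual Plane:

import pathlib

# Hypothetical folder; every .txt file gets rewritten in place.
for path in pathlib.Path('texts').glob('*.txt'):
    content = path.read_text(encoding='utf-8')    # read as UTF-8...
    path.write_text(content, encoding='utf-16')   # ...write back as UTF-16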

So, Unicode is not always the knight in shining armour. But it did enable the internet. So we have that to thank it for.
