At Micro Focus we are putting some of our best minds on how to make cloud computing simple and effective.
Here are some thoughts on the sort of things we think about so our customers don't have to. However, some of this stuff is very interesting (if you like computer technology and think that way) so I figured I'd have a go at sharing.
With the advent of multi-core computing it was thought that simple multi-threading was the future for improving performance. For many business applications this has proven not to be the best way forward. What is more, we can learn from alternative approaches when considering moving business applications to the cloud.
Multi-threading is a very simple concept. If there are 4 cores (or CPUs, it makes no difference) performing a task, the idea is that each of the four does a quarter of the work. If they all work at the same time, then the task will take a quarter of the time it would have taken with only one core.
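A minimal sketch of that 'quarter each' idea might look like the Go program below (my own illustration, not anything from a real business application): four goroutines each sum a quarter of a slice, and because no element is touched by more than one of them, no locking is needed.

```go
package main

import (
	"fmt"
	"sync"
)

// Four workers each sum a quarter of the data, then the partial
// sums are combined. No element is shared, so no locking is needed.
func main() {
	data := make([]int, 1_000_000)
	for i := range data {
		data[i] = 1
	}

	const workers = 4
	partial := make([]int, workers) // one result slot per worker
	var wg sync.WaitGroup

	chunk := len(data) / workers
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			start, end := w*chunk, (w+1)*chunk
			if w == workers-1 {
				end = len(data) // last worker takes any remainder
			}
			for _, v := range data[start:end] {
				partial[w] += v
			}
		}(w)
	}
	wg.Wait()

	total := 0
	for _, p := range partial {
		total += p
	}
	fmt.Println("total:", total) // 1000000
}
```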
This view of the world is just great if the task can be split simply into four pieces (or however many pieces as there are cores). The snag is that business applications generally process data. They do things like track parcels, manage money or handle bookings. When handling data across multiple cores, there will come a time when two or more cores need to read and write to the same piece of data at the same time.
A perfect example of this is a bank account. If a business application is working out deductions and additions to a bank account, then the order in which they are performed makes a huge difference to the end result. If a wage is paid in after a mortgage payment is taken out, then a large overdraft charge could be triggered. In this situation it is clearly important that the different cores communicate with each other about what they are doing and when. This mechanism is called synchronization.
On the face of it, synchronization looks like it should be very fast and very simple. It is easy to think of a multi-core computer being set up internally a bit like the picture below.
Each core sees the computer's memory (the chip-based random access memory, not disk) directly. This means that if one core makes a change, all the other cores immediately see that change. In effect, the cores communicate with each other by changing the data in main memory. To synchronize, all the cores need to do is check where each one is in the work-flow and perhaps only allow one at a time to write to particular bits of memory. In the bank example, we could make it so that only one core can actually write to the bank account data at once, and writes have to be made in the strict date order of the bank transactions.
In this simple model, we will see that most of the processing happens in parallel (4 cores means four things are happening at once) but some happens serially, with only one core working at a time. Whilst this is not quite as fast as all the cores working all the time, it is not bad either.
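To make that concrete, here is a sketch in Go (again my own illustration; a mutex stands in for the 'only one writer at once' lock described above) of threads taking it in turns to update the shared account:

```go
package main

import (
	"fmt"
	"sync"
)

// Account is the shared piece of data from the bank example.
// The mutex plays the role of the lock described above: only one
// goroutine may read and update the balance at any one time.
type Account struct {
	mu      sync.Mutex
	balance int
}

// Apply adds (or, if amount is negative, deducts) money. The lock
// makes the read-modify-write a single indivisible step; it does
// NOT impose the date ordering, which still has to be arranged by
// whoever hands out the work (e.g. by sorting transactions first).
func (a *Account) Apply(amount int) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.balance += amount
}

func main() {
	acct := &Account{}
	amounts := []int{-1200, 2500, -300, -60} // mortgage, wages, bills...

	var wg sync.WaitGroup
	for _, amt := range amounts {
		wg.Add(1)
		go func(amt int) {
			defer wg.Done()
			acct.Apply(amt) // serialised by the mutex
		}(amt)
	}
	wg.Wait()
	// The final balance is 940 whatever the order, but any overdraft
	// charges would depend on the order, hence the need to control
	// ordering as well as exclusion.
	fmt.Println("final balance:", acct.balance)
}
```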
Reality gets in the way
However, the real situation is nothing like as simple or easy to work with as that! The problem is that central memory is insanely slow compared to modern processors. It is hard to get one's head around just how slow it is. In a modern PC one of the cores can think at least ten times faster than the main memory can supply data to it. When a machine has 2, 4, 8 or more cores, the problem becomes immense. If we really used the simple memory model above, modern computers would spend the vast majority of their time waiting for main memory and achieving nothing at all.
Luckily there is a concept which can come to the rescue. "Locality of access" is a simple observation: a core tends to do a load of processing on one area of memory and then move on to another area. This means that caching can happen. The bit of memory the core is working on can be copied into a very high speed cache memory. Indeed, one can have several levels of cache, each smaller but faster than the last. This means that most of the time the core will be working on a super high speed copy of the memory it is interested in. At some point the hardware of the computer will work out that it has to move that copy back into main memory and perhaps load the cache up with a copy of a different bit of main memory.
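You can actually see locality at work from software. The sketch below (my own illustration, and the exact numbers vary wildly between machines) walks the same slice twice, once in order and once with a large stride; the cache-friendly pass is typically much faster because the strided one keeps waiting on main memory.

```go
package main

import (
	"fmt"
	"time"
)

// A rough illustration of locality of access: touching memory in
// order keeps the working set in the caches, while jumping around
// defeats them. Absolute timings depend entirely on the machine.
func main() {
	const n = 1 << 24 // ~16 million int64s
	data := make([]int64, n)

	// Pass 1: sequential access - the next element is almost always
	// in the cache line that has just been fetched.
	start := time.Now()
	var sum int64
	for i := 0; i < n; i++ {
		sum += data[i]
	}
	fmt.Println("sequential:", time.Since(start))

	// Pass 2: strided access - each read lands on a different cache
	// line, so the core spends its time waiting on main memory.
	const stride = 4096 / 8 // one int64 every 4 KiB
	start = time.Now()
	for offset := 0; offset < stride; offset++ {
		for i := offset; i < n; i += stride {
			sum += data[i]
		}
	}
	fmt.Println("strided:   ", time.Since(start))

	fmt.Println("checksum:", sum) // stops the loops being optimised away
}
```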
When all this is put in place we get a memory plan which looks a little bit like the picture below:
Whilst this is great where each thread is 'doing its own thing', it causes great problems when threads start sharing data. To share data they have to write what is in their caches back to main memory and load back the shared version (there are some clever hardware tricks to improve on this, but the principle remains).
It gets worse before it gets better
This cache issue, in itself, is not too bad; the real problem comes when one considers programs which might share data, i.e. they do not know whether they are sharing data or not. Even if a thread has a piece of memory all to itself, it still needs to 'lock' that piece of memory in case another thread uses it. This synchronization is very expensive, as it requires interaction with main memory even if no data is actually being shared.
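As a crude illustration (mine, not from any real measurement), the sketch below increments a counter fifty million times in a single goroutine, with and without a mutex; the locked version pays a price on every increment even though there is never another thread to conflict with.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// One goroutine, one counter, with and without a mutex around every
// increment. The absolute numbers are machine-dependent; the point
// is only that the lock costs something per operation even though
// the conflict it guards against never happens.
func main() {
	const iterations = 50_000_000

	counter := 0
	start := time.Now()
	for i := 0; i < iterations; i++ {
		counter++
	}
	fmt.Println("no lock:  ", time.Since(start), counter)

	var mu sync.Mutex
	counter = 0
	start = time.Now()
	for i := 0; i < iterations; i++ {
		mu.Lock()
		counter++
		mu.Unlock()
	}
	fmt.Println("with lock:", time.Since(start), counter)
}
```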
One solution to this is to rethink the way threads communicate with each other. The idea is to make each thread only ever use 'its own memory'. One way of doing this is to make each thread an actual process (on Windows or Unix). Google's Chrome browser takes this approach, where each page is managed in its own memory as an individual process. This removes the need for many of the synchronizations that are required in traditional multi-threaded programs.
You might have realised that the above approach, on its own, does not match the picture above and does not work. The problem is that I have got rid of synchronization but not got rid of shared data. There will still be some data which must be shared in a controlled way (like the bank account). The way around this is to pass messages between the threads. They share small, specific bits of memory containing data which they 'send' to one another. No two threads ever read and write the same memory at the same time. This might seem clunky, but it gets rid of all but the absolutely necessary synchronization and so generally works out a better approach, especially when there is a very large number of cores.
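In Go (purely as an illustration; the approach itself is language-neutral), channels play the part of those small, specific bits of memory. In the sketch below a single goroutine owns the account data outright, and everything else influences it only by sending messages.

```go
package main

import "fmt"

// Transaction is the message: a small, self-contained piece of data
// that is handed from one goroutine to another rather than updated
// in place by both.
type Transaction struct {
	Account string
	Amount  int
}

// ledger owns the account balances outright. It is the only
// goroutine that ever touches the map, so no locking is needed;
// other goroutines influence it purely by sending messages.
func ledger(in <-chan Transaction, done chan<- map[string]int) {
	balances := make(map[string]int)
	for tx := range in {
		balances[tx.Account] += tx.Amount
	}
	done <- balances
}

func main() {
	in := make(chan Transaction)
	done := make(chan map[string]int)
	go ledger(in, done)

	// Any number of 'worker' goroutines could send on the channel;
	// here we just send a few transactions from main.
	in <- Transaction{"alice", -1200} // mortgage out
	in <- Transaction{"alice", 2500}  // wages in
	in <- Transaction{"bob", 75}
	close(in) // no more messages

	fmt.Println(<-done) // map[alice:1300 bob:75]
}
```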
We see a general architecture emerge
So, message passing is the most effective (or one of the most effective) ways of making business applications work well on multi-core machines. The rather wonderful thing about it is that the only interconnections between memory areas are the messages. These do not have to be in shared memory. It is just as legitimate to pass messages over some form of network. The architecture we arrive at has separate threads which are completely isolated from one another and which pass messages via a message bus:
That message bus can be shared memory or it can be the high speed network of a Cloud data centre. Yes - this approach means that programs will run well on multi-core machines and in the cloud.
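One way to picture that (again a sketch of mine rather than anything prescriptive) is to hide the bus behind a small interface, so the workers neither know nor care whether it is backed by shared memory or by a network.

```go
package main

import "fmt"

// Bus is the only thing a worker knows about: somewhere to send
// messages and somewhere to receive them. Whether it is backed by a
// channel in shared memory or by a socket to another machine is the
// bus's business, not the worker's.
type Bus interface {
	Send(msg string)
	Receive() (string, bool)
}

// chanBus is the in-process, shared-memory implementation. A
// network-backed implementation (TCP, a cloud message queue, etc.)
// could satisfy the same interface without the workers changing.
type chanBus struct{ ch chan string }

func (b *chanBus) Send(msg string)         { b.ch <- msg }
func (b *chanBus) Receive() (string, bool) { m, ok := <-b.ch; return m, ok }

func worker(bus Bus, done chan<- struct{}) {
	for {
		msg, ok := bus.Receive()
		if !ok {
			break
		}
		fmt.Println("processed:", msg)
	}
	done <- struct{}{}
}

func main() {
	bus := &chanBus{ch: make(chan string, 8)}
	done := make(chan struct{})
	go worker(bus, done)

	bus.Send("parcel 42 dispatched")
	bus.Send("parcel 42 delivered")
	close(bus.ch)
	<-done
}
```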
Labels: cloud computing, high performance computing, hpc