Probably majority of software developers are familiar with multithreading concepts and utilization of multiple cores of a single processor. I’m quite sure however, that there exists a large audience of people, who have just started their adventure in the programming world and may not yet be familiar with the idea.
In this article, I’d like to talk a little bit about the correlation between the number of threads used by an application and the number of cores owned by the processor.
To make it more interesting, let’s take a well knows math example. There is an interesting integer number sequence, called Fibonacci sequence. Each of the number in the sequence is created by sum of two previous numbers, for example:
1, 1, 2, 3, 5, 8, 13, 21, 34, …
So 5 becomes as a sum of 2 and 3, while 8 as a sum of 3 and 5, etc.
We could write a simple function, which would calculate any number in the sequence for a given index, but let’s not use a loop for this. For the sake of keeping the processor busy, we can have a function, which would call itself recursively, like:
With the help of another function I measured, that running fibo(3) – which returns number 2 as it is 3rd in a row – made a total 5 calls of the function:
1. fibo(3) calls:
2. -> fibo(2) calls:
3. -> fibo(1) returns 1
4. -> fibo(0) returns 0
5. -> fibo(1) returns 1
Now, with each subsequent index, the amount of recursive calls grows and time needed to calculate the number takes visibly longer. For example fibo(30), which returns number 832,040 requires almost 2.7 million recursive calls and lasts approx. 22 ms on my laptop, while fibo(40), which returns number 102,334,155 requires 331 million recursive calls and lasts around 2.5 sec.
All right then, let’s try this: we can call fibo(n) in a loop, starting from n=1 until n=45. We’ll start running this in a single thread, and then we’ll go multithreaded.
This is a code for a single threaded test:
On my computer the whole test lasted almost one minute. No wonder, if you’ll check, how much time it needed to calculate indexes between 40 and 45:
fibonnaci number index n = 41 is: 165580141 3212 ms
fibonnaci number index n = 42 is: 267914296 5286 ms
fibonnaci number index n = 43 is: 433494437 8682 ms
fibonnaci number index n = 44 is: 701408733 13764 ms
fibonnaci number index n = 45 is: 1134903170 22251 ms
Total time for single thread is 58478 ms
The last index 45 took 22 seconds to calculate! And from 40 to 45, it took 58.4 seconds.
This is the moment, when one would think about multithreading. Why should each of these calculations wait until the previous one has completed? Can’t we run all of them simultaneously? Well, not all of them, but surely some. How many then?
JXcore allows you to create up to 64 threads, but this amount would make sense only if you would have 64 cores on the board. Having just 4 cores, as my laptop has – it made no difference if I used 4 threads or 16. But let’s see the code first.
Below is the previous code modified for the multithreading:
My computer’s processor has 4 cores, but I have chosen to use just 2 at first run (jxcore.tasks.setThreadCount(2)). This alone was enough to make the sample complete faster – all calculations took just 46 sec.
fibonnaci number index n = 41 is: 165580141 4593 ms threadId 1
fibonnaci number index n = 42 is: 267914296 7390 ms threadId 0
fibonnaci number index n = 43 is: 433494437 12156 ms threadId 1
fibonnaci number index n = 44 is: 701408733 19718 ms threadId 0
fibonnaci number index n = 45 is: 1134903170 25596 ms threadId 1
Total time for 2 threads is 45319 ms
As you might notice, the time for each single fibo(x) call is slightly longer. However, since 2 threads were employed to run the code, and they were utilizing 2 cores instead of 1. As a result, the whole application has finished sooner. Overall, we gained 30% performance from using 2 cores. You might wonder why the single operation took slightly longer to finish. It’s because of my CPU is SIMD based and I bet the result would be much better with a more recent 2013 i7, which is a MIMD powered one (multiple instructions, multiple data).
How long would it take with more than 2 threads? Below are some results:
Total time for 5 threads is 34062 ms
Total time for 8 threads is 33549 ms
Total time for 16 threads is 33301 ms
When looking at those results we come to the conclusion, that setting more threads in the application than the number of processor’s cores, does not bring significant performance gains (at least for SIMD). Also let’s not forget, that those measurements are the results of on-time test runs. Probably running multiple tests and taking an average will produce more consistent measurements for 4, 5, 8 or 16 threads.
The multithreaded sample presented above is simplified for illustration of the subject. It utilizes the threads immediately after application has been started.
My friend have told me, that this is not even realistic example. In a real life, JXcore the application would start, listen for users’ calls and eventually process them in threads. This scenario fits to the application’s upstart concept – the full performance of the application can be experienced not yet right at the start, but in a certain period after that.
Whether the above sample is realistic or not, the truth is, that it may not work on all kind of processors in the same manner. Assigning tasks for the subthreads right from the start of the application may lock only one of threads to be used for all of the tasks, since the other threads didn’t have time to be loaded yet. If that would happen in case of our example, the total duration of the application would not be faster as the single threaded version.
One of the good approaches to make sure, that our tasks will have all of the threads for use (in other words, to make sure, that threads are up-and-running) would be to initiate the threads by running single task at first, and once it’s completed – running the rest of them.
I prepared a modified version of the code illustrating this approach. It’s available here: task_fibonacci.js. You may just copy-paste it to your favorite editor, save as a file and run.