Snap! Websites
An Open Source CMS System in C++
While working on an MP3 decoder/encoder system capable of decoding and encoding MP3 data in parallel, I have had the chance to test that system with a plethora of worker threads (my main server has 64 CPUs).
Along the way, I ran into quite a few surprises.
First of all, some jobs do not make use of much more than 8 CPUs, even though I have 64. That is, whether I run with 64 workers or just 8, the task completes in about the same amount of time. However, when using just 8 CPUs, I can run the command 8 times in parallel and therefore process 8 files simultaneously at full speed.
This first case happens when I deal with smaller audio files, such as files under 1 minute in duration. Being able to run more processes in parallel is actually very useful when a backend server receives many hits and has to process files as close to real time as possible.
I also worked on much larger files (1h to 8h of audio), and those definitely make use of all 64 CPUs for a little while. The fact is that there are points in time when the processing is serialized, so parallelization cannot be applied. As a result, for part of the run, some worker threads just sit around using only 20% to 60% of their CPU.
If you do not have any downtime, i.e. if the CPUs are always used at close to 100%, then using all the worker threads makes a lot of sense: you will definitely decrease the time it takes to process the data. In my case, this was true when there was limited processing between the decoder and the encoder. There, using 64 CPUs was best, reducing a job that takes a minute or two with lame to mere seconds with my tool.
In our software, I was able to parallelize the decoding of MP3. This means I can read one frame of input and send it to worker thread 1, read the next frame and send it to worker thread 2, etc. (In my implementation, I actually send about 1,000 frames to each worker thread, which wastes less time in overhead, but the principle is the same.)
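Here is a minimal sketch of that dispatch idea in C++ (read_frames() and decode_batch() are made-up placeholders for this example, not the functions of my actual tool):

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// stand-ins for the real decoder interfaces
struct frame_batch
{
    std::vector<unsigned char>  data;
    std::size_t                 sequence = 0;
};

// stub: read up to `count` MP3 frames; an empty batch means end of input
frame_batch read_frames(std::size_t count)
{
    static std::size_t calls(0);
    frame_batch batch;
    if(calls < 8)                                // fake input: 8 batches
    {
        batch.data.assign(count * 417, 0);       // ~417 bytes per 128kbps frame
        batch.sequence = calls;
    }
    ++calls;
    return batch;
}

// stub: the CPU-heavy decode of one batch into PCM samples
std::vector<short> decode_batch(frame_batch batch)
{
    return std::vector<short>(batch.data.size());
}

int main()
{
    constexpr std::size_t frames_per_task(1'000); // ~1,000 frames per worker
    std::vector<std::future<std::vector<short>>> tasks;

    for(;;)
    {
        frame_batch batch(read_frames(frames_per_task));
        if(batch.data.empty())
        {
            break;                               // end of input
        }
        // each batch decodes independently on its own thread
        tasks.push_back(std::async(std::launch::async,
                                   decode_batch, std::move(batch)));
    }

    for(auto & t : tasks)
    {
        std::vector<short> pcm(t.get());         // batches may finish out of order
        // ... forward pcm to the next processor ...
    }
}
```

Batching around 1,000 frames per task keeps the scheduling overhead small compared to the decoding work itself.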
For M4A data, I unfortunately have to use the ffmpeg libraries to decode the data, and as a result the tool ends up using only about 14 CPUs (when I time the process, I can see that the CPU usage is about 1400%).
So, thinking I could avoid allocating CPUs that go unused, I changed the number of workers to just 12. The result: it took only 20s of real time instead of 40s to process the same file. But why wouldn't having more threads help the process go as fast as possible?
I did a little bit of research, and the fact is that when using all 64 CPUs, I have 64 workers that wake up often and lock a mutex. This prevents many other threads from doing any work: they are starved, waiting for the lock to become available.
When I reduced the number of worker threads to just 12, I saw a huge increase in CPU usage. The percentage jumped much closer to 100% for most of the processing (early on, the decoding of the m4a data is a little slow to get started).
But that alone does not explain why it would be faster in the final timings (as output by the /usr/bin/time command). After all, 4 CPUs at 25% or 1 CPU at 100% represent the same amount of work, so the total time should remain pretty close (outside of context switches, but these runs are not long enough for those to have much of an effect on the final time).
One thing the process does is create packets and post them for the next processor to handle. To manage that, I use FIFOs within each processor. However, there is only one common command center: a single function that checks for the next processor with a workload that is ready. If one exists, its processing command gets called with that packet.
The issue with all these FIFOs is that each time I add a new packet, I signal one worker thread to make sure the next payload gets processed. That means that with 64 worker threads, instead of just crunching numbers, we end up in that one common command center with its one mutex which gets locked, waited on, signaled... While the lock is in place, all the other worker threads are stuck waiting for their turn to obtain the lock and check for a payload to work on. All of this locking is what makes the software run slower with 64 CPUs than with just 12.
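Reduced to its essence, the pattern looks like the sketch below (a simplification, not my actual pipeline code). Every producer and every idle worker has to go through the same f_mutex, so the lock itself becomes the serialization point:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct packet { std::vector<short> samples; };

// one common "command center": a single mutex guards the FIFO, so all
// producers and all workers contend on the same lock
class command_center
{
public:
    void post(packet p)
    {
        {
            std::lock_guard<std::mutex> lock(f_mutex);
            f_fifo.push_back(std::move(p));
        }
        f_signal.notify_one();      // wake one worker per packet
    }

    bool next(packet & out)
    {
        std::unique_lock<std::mutex> lock(f_mutex);
        f_signal.wait(lock, [this]{ return f_done || !f_fifo.empty(); });
        if(f_fifo.empty())
        {
            return false;           // done and fully drained
        }
        out = std::move(f_fifo.front());
        f_fifo.pop_front();
        return true;
    }

    void finish()
    {
        {
            std::lock_guard<std::mutex> lock(f_mutex);
            f_done = true;
        }
        f_signal.notify_all();
    }

private:
    std::mutex                  f_mutex;
    std::condition_variable     f_signal;
    std::deque<packet>          f_fifo;
    bool                        f_done = false;
};

int main()
{
    command_center center;
    std::vector<std::thread> workers;

    // with 64 workers, every tiny task means another trip through
    // f_mutex; the more workers, the more time is spent waiting on it
    for(int i(0); i < 64; ++i)
    {
        workers.emplace_back([&center]()
            {
                packet p;
                while(center.next(p))
                {
                    // ... process p (decode, encode, etc.) ...
                }
            });
    }

    for(int i(0); i < 10'000; ++i)
    {
        center.post(packet{std::vector<short>(1'152)}); // one MP3 frame of PCM
    }

    center.finish();
    for(auto & w : workers)
    {
        w.join();
    }
}
```

When each packet represents very little work, most of a worker's life is spent inside next() fighting for f_mutex rather than processing samples.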
So here I learned that using more CPUs is great, but only if there is a huge amount of processing to be done, or if a worker thread can wake up only when it has work to do, without having to check a lock common to all threads (an option which is not available in my case).
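For completeness, that alternative would look something like the per-worker queues below, where each thread only ever locks its own mutex (again just a sketch of the general idea; it does not fit my command center design):

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct packet { std::vector<short> samples; };

// one queue and one mutex per worker; producers spread the packets
// round-robin so workers never fight over a single common lock
struct worker_queue
{
    std::mutex                  f_mutex;
    std::condition_variable     f_signal;
    std::deque<packet>          f_fifo;
    bool                        f_done = false;

    void post(packet p)
    {
        {
            std::lock_guard<std::mutex> lock(f_mutex);
            f_fifo.push_back(std::move(p));
        }
        f_signal.notify_one();
    }

    bool next(packet & out)
    {
        std::unique_lock<std::mutex> lock(f_mutex);
        f_signal.wait(lock, [this]{ return f_done || !f_fifo.empty(); });
        if(f_fifo.empty())
        {
            return false;
        }
        out = std::move(f_fifo.front());
        f_fifo.pop_front();
        return true;
    }

    void finish()
    {
        {
            std::lock_guard<std::mutex> lock(f_mutex);
            f_done = true;
        }
        f_signal.notify_all();
    }
};

int main()
{
    constexpr std::size_t worker_count(12);
    std::vector<worker_queue> queues(worker_count);
    std::vector<std::thread> workers;

    for(auto & q : queues)
    {
        workers.emplace_back([&q]()
            {
                packet p;
                while(q.next(p))
                {
                    // ... process p; no other thread touches this queue ...
                }
            });
    }

    for(std::size_t i(0); i < 10'000; ++i)
    {
        queues[i % worker_count].post(packet{std::vector<short>(1'152)});
    }

    for(auto & q : queues)
    {
        q.finish();
    }
    for(auto & w : workers)
    {
        w.join();
    }
}
```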
The audio processing includes resampling and gain adjustments.
The problem with those two processors is that they need to run in serial mode. This means all the packets have to be passed to those processors one at a time, in order. No parallelism is available.
This is a huge bottleneck. Testing with a different file for which I was expecting about the same CPU usage (1200%, as above), the usage dropped to an abysmal 350%. I am now looking at parallelizing this work because we really need things to go a lot faster. It is proving quite a bit more complicated than it looked at first glance, but it is mostly a math problem. For the gain, I need 3 seconds of data before the gain adjustment happens. So in a parallel version, I need to send an additional 3 seconds of audio from before the point in time where I want the data to be adjusted, in order to compute the correct amplitude levels (decibels).
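The overlap trick can be sketched like this (apply_gain() is a hypothetical stand-in for the real gain stage, which needs a 3 second window of history; only the windowing is shown, not the math):

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

constexpr std::size_t sample_rate = 44'100;
constexpr std::size_t lookbehind  = 3 * sample_rate;   // the 3s the gain needs
constexpr std::size_t chunk_size  = 30 * sample_rate;  // 30s per task (arbitrary)

// stand-in for the real gain stage, which measures the last 3 seconds
// of samples to decide the amplification level (decibels)
std::vector<float> apply_gain(std::vector<float> const & in)
{
    return in;
}

// process [start, end), but feed the gain stage up to 3 extra seconds
// of history so its window is already "warmed up" at `start`; the
// output computed for those priming samples is thrown away
std::vector<float> gain_chunk(std::vector<float> const & pcm,
                              std::size_t start, std::size_t end)
{
    std::size_t const primed(start < lookbehind ? 0 : start - lookbehind);
    std::vector<float> in(pcm.begin() + primed, pcm.begin() + end);
    std::vector<float> out(apply_gain(in));
    out.erase(out.begin(),
              out.begin() + static_cast<std::ptrdiff_t>(start - primed));
    return out;
}

int main()
{
    std::vector<float> pcm(2 * 60 * sample_rate);      // 2 minutes of audio
    std::vector<std::future<std::vector<float>>> parts;

    for(std::size_t start(0); start < pcm.size(); start += chunk_size)
    {
        std::size_t const end(std::min(start + chunk_size, pcm.size()));
        parts.push_back(std::async(std::launch::async,
                                   gain_chunk, std::cref(pcm), start, end));
    }

    std::vector<float> result;
    for(auto & p : parts)
    {
        std::vector<float> chunk(p.get());
        result.insert(result.end(), chunk.begin(), chunk.end());
    }
}
```

The cost of the extra 3 seconds per chunk is small as long as the chunks themselves are much larger than the look-behind window.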
Some other points in my graph have a similar serialization issue, though. For example, the output of a parallelized component, such as the MP3 decoder, comes out in any order. I have to serialize that data again to make sure the final output goes from start to finish. This works great and is very fast because there is no processing of the data, just re-ordering.
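That re-ordering step amounts to holding out-of-order packets in a small map keyed by sequence number and releasing them as soon as the next expected one arrives. A minimal sketch (again, not my actual code):

```cpp
#include <cstddef>
#include <map>
#include <vector>

// packets from the parallel workers arrive in any order; a small map
// keyed by sequence number releases them start to finish
class reorderer
{
public:
    // feed one out-of-order result in; get back every packet that is
    // now ready to be emitted, already in the right order
    std::vector<std::vector<short>> push(std::size_t seq, std::vector<short> pcm)
    {
        f_pending.emplace(seq, std::move(pcm));
        std::vector<std::vector<short>> ready;
        for(auto it(f_pending.find(f_next));
            it != f_pending.end();
            it = f_pending.find(f_next))
        {
            ready.push_back(std::move(it->second));
            f_pending.erase(it);
            ++f_next;
        }
        return ready;
    }

private:
    std::size_t                                 f_next = 0; // next to emit
    std::map<std::size_t, std::vector<short>>   f_pending;
};

int main()
{
    reorderer r;
    std::size_t emitted(0);

    // simulate workers finishing out of order: 2, then 0, then 1
    for(std::size_t seq : {2, 0, 1})
    {
        for(auto & pcm : r.push(seq, std::vector<short>(1'152)))
        {
            emitted += pcm.size();   // packets come out in order: 0, 1, 2
        }
    }
}
```

Since no sample data gets touched, only moved, this stage adds almost no overhead compared to the decoding itself.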