Client CPU/GPU scheduling
Prior to version 6.3, the BOINC client assumed that each running application uses 1 CPU. Starting with version 6.3, this assumption is generalized in two ways:
- Apps may use coprocessors (such as GPUs).
- The number of CPUs used by an app may be more or less than one, and it need not be an integer.
For example, an app might use 2 CUDA GPUs and 0.5 CPUs. This information is visible in the BOINC Manager.
The client's scheduler (i.e., the logic that decides which apps to run) has been modified to accommodate this diversity of apps.
The way things used to work
The old scheduling policy is:
- Order runnable jobs by "importance" (determined by whether the job is in danger of missing its deadline, and the long-term debt of its project).
- Run jobs in order of decreasing importance. Skip those that would exceed RAM limits. Keep going until we're running NCPUS jobs (NCPUS being the number of CPUs).
There's a bit more to it than that - e.g., we avoid preempting jobs that haven't checkpointed recently - but that's the basic idea.
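The old policy can be sketched roughly as follows. This is only an illustration, not the client's actual code: the `Job` fields, the numeric `importance` score, and the `schedule_old` function are all hypothetical names, and the checkpoint-related preemption rule is left out.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    importance: float  # hypothetical score; higher means more important
    ram_mb: int        # assumed RAM estimate for the job, in MB


def schedule_old(jobs, ncpus, ram_limit_mb):
    """Sketch of the pre-6.3 policy: every job is assumed to use 1 CPU."""
    running = []
    ram_used = 0
    # Visit jobs in order of decreasing importance.
    for job in sorted(jobs, key=lambda j: j.importance, reverse=True):
        if len(running) >= ncpus:
            break  # we're already running NCPUS jobs
        if ram_used + job.ram_mb > ram_limit_mb:
            continue  # would exceed the RAM limit, so skip it
        running.append(job)
        ram_used += job.ram_mb
    return running


jobs = [
    Job("a", importance=3.0, ram_mb=600),
    Job("b", importance=2.0, ram_mb=700),
    Job("c", importance=1.0, ram_mb=300),
]
chosen = [j.name for j in schedule_old(jobs, ncpus=2, ram_limit_mb=1000)]
print(chosen)
```

With 2 CPUs and a 1000 MB limit, job "b" is skipped because it doesn't fit in RAM alongside "a", and the less important job "c" runs instead.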
How things work in 6.3
The main design goal of the new scheduler is to use all resources. In particular, we try to always use the GPU even if that means overcommitting the CPU. "Overcommitting" means running a set of apps whose demand for CPUs exceeds the actual number of CPUs.
The new policy is:
- Scan the set of runnable jobs in decreasing order of importance.
- If a job uses a resource that's not already fully utilized, and fits in RAM, run it.
Example: suppose we're on a machine with 1 CPU and 1 GPU, and that we have the following runnable jobs (in order of decreasing importance):
1) 1 CPU, 0 GPU
2) 1 CPU, 0 GPU
3) 0.5 CPU, 1 GPU
What should we run? Under the old policy we'd run only job 1, and the GPU would sit idle. This is bad - the GPU is typically 50X faster than the CPU, so we should use it if at all possible.
The new policy will do the following:
- Run job 1.
- Skip job 2 because the CPU is already fully utilized.
- Run job 3 because the GPU is not fully utilized.
So we end up running jobs whose total CPU demand is 1.5 on a machine with 1 CPU, i.e. the CPU is overcommitted. That's OK - the jobs just run slower than they would if running alone.
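Here is a minimal sketch of the new policy applied to this example. It is an illustration under stated assumptions, not the client's real implementation: `Job`, `schedule_new`, and the `ram_mb` field are hypothetical names, and the rule "uses a resource that's not already fully utilized" is applied literally to both CPUs and GPUs (the actual client may treat GPUs more strictly).

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: float      # may be fractional, e.g. 0.5
    gpus: float
    ram_mb: int = 0  # assumed RAM estimate; 0 here to keep the example simple


def schedule_new(jobs, ncpus, ngpus, ram_limit_mb):
    """Sketch of the 6.3 policy; jobs must be in decreasing order of importance."""
    running = []
    cpu_used = 0.0
    gpu_used = 0.0
    ram_used = 0
    for job in jobs:
        if ram_used + job.ram_mb > ram_limit_mb:
            continue  # doesn't fit in RAM
        # Literal reading of the rule: the job qualifies if it uses at
        # least one resource that is not yet fully utilized.
        wants_free_cpu = job.cpus > 0 and cpu_used < ncpus
        wants_free_gpu = job.gpus > 0 and gpu_used < ngpus
        if wants_free_cpu or wants_free_gpu:
            running.append(job)
            cpu_used += job.cpus
            gpu_used += job.gpus
            ram_used += job.ram_mb
    return running, cpu_used, gpu_used


jobs = [
    Job("1", cpus=1, gpus=0),
    Job("2", cpus=1, gpus=0),
    Job("3", cpus=0.5, gpus=1),
]
running, cpu_used, gpu_used = schedule_new(jobs, ncpus=1, ngpus=1, ram_limit_mb=1024)
names = [j.name for j in running]
print(names, cpu_used, gpu_used)
```

The sketch picks jobs 1 and 3, for a total CPU demand of 1.5 on the 1-CPU machine, which matches the walk-through above.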
- A preference to limit the number of GPUs used.
- (More specifically) A preference to specify which GPU(s) can be used.