Server job scheduling
This document describes proposed changes to server scheduling policies:
- Feeder enumeration order
- Of the jobs in shared mem, which to send to a host
- Which app versions to use
- What deadlines to assign
These policies work in conjunction with batch scheduling policies.
Quality of service types
We seek to handle the following types of QoS:
- Non-batch, throughput-oriented, fixed latency bound.
- Batches with long deadlines (greater then the turnaround time of almost all hosts).
- Batches to be completed "as fast as possible" (AFAP), with no a priori deadline.
- Batches with a short deadline.
The goals of the policies include:
- Support the QoS features.
- Avoid assigning "tight" deadlines unnecessarily, because
- Doing so may make it impossible to assign tight deadlines to jobs that actually need it.
- Tight deadlines often force the client to preempt other jobs, which irritates some volunteers.
- Avoid long delays between the completion of a job instance and its validation. These delays irritate volunteers and increase server disk usage.
- Minimize server configuration.
We need a way to identify hosts that can turn around jobs quickly and reliably. Notes:
- This is a property of (host, app version), not host.
- This is not the same as processor speed.
A host may have high turnaround time for various reasons:
- Large min work buffer size.
- Attached to lots of other projects.
- Long periods of unavailability or network disconneciton.
We propose the following. For each app A:
- For each (host, app version) let X be the percentile of turnaround time
- For each (host, app version) let Y be the percentile of "consecutive valid results" (or +infinity if > 10) over all active hosts and all current app versions.
- Let P(H, AV) = min(X, Y)
This will be computed periodically (say, 24 hours) by a utility program.
- When a new app version is deployed, the host_app_version records for the previous version should be copied, on the assumption that hosts reliable for on version will be reliable for the next.
Batch completion estimation
The proposed policies require estimates C(B) of batch completion, I'm not sure exactly how to compute these, but
- it should be based on completed and validated jobs rather than a priori FLOPs estimates.
- it should reflect (host, app version) information (e.g. turnaround and elapsed time statistics) for the hosts that have completed jobs, and for the host population as a whole
- They should be computed by a daemon process, triggered by the passage time and by the validation of jobs in the batch.
- C(B) is different from the "logical end time" of the batch used in batch scheduling.
- For long-deadline batches, C(B) should probably be at least the original delay bound plus the greatest dispatch time of first instance of a job. I.e. if it takes a long time to dispatch the first instances, adjust the deadline accordingly to avoid creating a deadline crunch.
Proposed feeder policy
Proposed enumeration order:
(LET(J) asc, nretries desc)
where LET(J) is the logical end time of the job's batch.
Proposed scheduler policy
For each processor type T (CPU and GPU) we have a "busy time" BT: the time already committed to high-priority jobs. For a given job J we can compute the estimated runtime R. The earliest we can expect to finish J is then BT + R, so that's the earliest deadline we can assign. Call this MD(J, T).
For each app A and processor type T, compute the best app version BAV(A, T) at the start of handling each request.
The rough policy then is:
for each job J in the array, belonging to batch B for each usable app version AV of type T if B is AFAP an there's no estimate yet if P(H, AV) > 50% send J using AV, with deadline BT + R else x = MD(J, T) if x < C(B) send J using AV, with deadline C(B) else if P(H, AV) > 90% send J using AV, with deadline x
Make an initial pass through the array sending only jobs that have a percentile requirement.
- The 50% and 90% can be parameterized
- Retries are not handled differently at this level, although we could add a restriction like sending them only to top 50% hosts
- In the startup case (e.g. new app) no hosts will be high percentile. How to avoid starvation?
- I think that score-based scheduling is now deprecated.
The feasibility and/or desirability of a job may depend
on what other jobs we're sending,
so it doesn't make sense to assign it a score in isolation.
It's simpler to scan jobs and make a final decision for each one.
There are a few properties we need to give priority to:
- Limited locality scheduling
- beta jobs We can handle these in separate passes, as we're doing now.