Changes between Version 19 and Version 20 of CreditNew


Timestamp:
Nov 16, 2009, 1:03:49 PM (15 years ago)
Author:
davea
Comment:

--

  • CreditNew

    v19 v20  
    299299  subsequent jobs will be replicated.
    300300
     301== Error rate, host punishment, and turnaround time estimation ==
     302
     303Unrelated to the credit proposal, but in a similar spirit.
     304
     305Due to hardware problems (e.g. a malfunctioning GPU)
     306a host may have a 100% error rate for one app version
     307and a 0% error rate for another.
     308The same holds for turnaround time.
     309
     310So we'll move the "error_rate" and "turnaround_time"
     311fields from the host table to host_app_version.
     312
     313The host punishment mechanism is designed to deal with malfunctioning hosts.
     314For each host the server maintains '''max_results_day'''.
     315This is initialized to a project-specified value (e.g. 200)
     316and scaled by the number of CPUs and/or GPUs.
     317It's decremented if the client reports a crash
     318(but not if the job was aborted).
     319It's doubled when a successful (but not necessarily valid)
     320result is received.
     321
     322This should also be per-app-version,
     323so we'll move "max_results_day" from the host table to host_app_version.
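
The quota rules above can be sketched as follows. This is only an illustrative sketch, not the server code: the `HostAppVersion` class, the function names, the floor of 1 on decrement, and the ceiling applied when doubling are all assumptions that the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class HostAppVersion:
    """Hypothetical per-(host, app version) record holding the daily quota."""
    host_id: int
    app_version_id: int
    max_results_day: int


def initial_quota(project_value: int, ninstances: int) -> int:
    # Project-specified value (e.g. 200) scaled by the number of CPUs/GPUs.
    return project_value * max(1, ninstances)


def on_crash(hav: HostAppVersion, aborted: bool) -> None:
    # Decrement on a reported crash, but not if the job was aborted.
    # Assumption: the quota never drops below 1.
    if not aborted:
        hav.max_results_day = max(1, hav.max_results_day - 1)


def on_success(hav: HostAppVersion, ceiling: int) -> None:
    # Double on a successful (not necessarily valid) result.
    # Assumption: the quota is capped at its initial (scaled) value.
    hav.max_results_day = min(ceiling, hav.max_results_day * 2)
```

Because the record is keyed by (host, app version), a malfunctioning GPU version can be throttled while a working CPU version on the same host keeps its full quota.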
     324
     325== Cherry picking ==
     326
     327Suppose an application has a mix of long and short jobs.
     328If a client intentionally discards
     329(or aborts, or reports errors from) the long jobs,
     330but completes the short jobs,
     331its host scaling factor will become large,
     332and it will get excessive credit for the short jobs.
     333This is called "cherry picking".
     334
     335The host punishment mechanism
     336doesn't deal effectively with cherry picking.
     337
     338We propose the following mechanism to deal with cherry picking:
     339
     340 * For each (host, app version) maintain "host_scale_time".
     341   This is the earliest time at which host scaling will be applied.
     342 * For each (host, app version) maintain "scale_probation"
     343   (initially true).
     344 * When sending a job to a host,
     345   if scale_probation is true,
     346   set host_scale_time to now+X, where X is the app's delay bound.
     347 * When a job is successfully validated,
     348   if now > host_scale_time,
     349   set scale_probation to false.
     350 * If a job times out or errors out,
     351   set scale_probation to true,
     352   max the scale factor with 1,
     353   and set host_scale_time to now+X.
     354 * When computing claimed credit for a job,
     355   if now < host_scale_time, don't use the host scale factor.
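
The rules in the list above amount to a small state machine per (host, app version). The following is a minimal sketch of it under stated assumptions: the `ScaleState` class, the function names, the initial `host_scale_time` of 0, and the `unscaled_credit` parameter are hypothetical, and "max the scale factor with 1" is implemented literally as `max(scale_factor, 1)`.

```python
from dataclasses import dataclass


@dataclass
class ScaleState:
    """Hypothetical per-(host, app version) cherry-picking state."""
    host_scale_time: float = 0.0   # earliest time host scaling may be applied
    scale_probation: bool = True   # initially true
    scale_factor: float = 1.0      # the host scaling factor


def on_job_sent(s: ScaleState, now: float, delay_bound: float) -> None:
    # While on probation, each dispatch pushes host_scale_time to now+X.
    if s.scale_probation:
        s.host_scale_time = now + delay_bound


def on_job_validated(s: ScaleState, now: float) -> None:
    # A validation after host_scale_time ends probation.
    if now > s.host_scale_time:
        s.scale_probation = False


def on_job_failed(s: ScaleState, now: float, delay_bound: float) -> None:
    # Timeout or error: back on probation, and push host_scale_time out.
    s.scale_probation = True
    s.scale_factor = max(s.scale_factor, 1.0)  # literal reading of the text
    s.host_scale_time = now + delay_bound


def claimed_credit(s: ScaleState, now: float, unscaled_credit: float) -> float:
    # Before host_scale_time the host scale factor is not used.
    if now < s.host_scale_time:
        return unscaled_credit
    return unscaled_credit * s.scale_factor
```

In this sketch a host that fails a job at time t gets unscaled credit until t+X, regardless of how many short jobs it completes in the meantime.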
     356
     357The idea is to apply the host scaling factor
     358only if there's solid evidence that the host is NOT cherry picking.
     359
     360Because this mechanism is punitive to hosts
     361that experience actual failures,
     362we'll make it selectable on a per-application basis (default off).
     363
     364In addition, to limit the extent of cheating
     365(in case the above mechanism is defeated somehow)
     366the host scaling factor will be min'd with a
     367project-wide config parameter (default, say, 3).
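
As a sketch of this cap, the function and parameter names below are hypothetical, and the default of 3 is the tentative value mentioned above.

```python
def capped_host_scale(host_scale: float, max_host_scale: float = 3.0) -> float:
    # Limit the host scaling factor to the project-wide maximum.
    return min(host_scale, max_host_scale)
```

With the default, a host whose computed scaling factor is 10 would be credited as if it were 3, while a factor of 1.5 passes through unchanged.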
     368
    301369== Trickle credit ==
    302370