Changes between Version 21 and Version 22 of CreditNew


Timestamp:
Nov 16, 2009, 4:43:53 PM (15 years ago)
Author:
davea
Comment:

--

  • CreditNew

    v21 v22  
    278278and sets their scaling factor based on the above.
    279279
    280 == Replication and cheating ==
     280== Cheat prevention ==
    281281
    282282Host normalization mostly eliminates the incentive to cheat
     
    285285An exaggerated claim will increase VNPFC*(H,A),
    286286causing subsequent claimed credit to be scaled down proportionately.
     287
    287288This means that no special cheat-prevention scheme
    288289is needed for single replications;
    289 granted credit = claimed credit.
    290 
    291 For jobs that are replicated, granted credit should be
    292 set to the min of the valid results
    293 (min is used instead of average to remove the incentive
    294 for cherry-picking, see below).
    295 
    296 However, there are still some possible forms of cheating.
    297 
    298  * One-time cheats (like claiming 1e304) can be prevented by
    299    capping VNPFC(J) at some multiple (say, 10) of VNPFC^mean^(A).
    300  * Cherry-picking: suppose an application has two types of jobs,
    301   which run for 1 second and 1 hour respectively.
    302   Clients can figure out which is which, e.g. by running a job for 2 seconds
    303   and seeing if it's exited.
    304   Suppose a client systematically refuses the 1 hour jobs
    305   (e.g., by reporting a crash or never reporting them).
    306   Its VNPFC^mean^(H, A) will quickly decrease,
    307   and soon it will be getting several thousand times more credit
    308   per actual work than other hosts!
    309   Countermeasure:
    310   whenever a job errors out, times out, or fails to validate,
    311   set the host's error rate back to the initial default,
    312   and set its VNPFC^mean^(H, A) to VNPFC^mean^(A) for all apps A.
    313   This puts the host to a state where several dozen of its
    314   subsequent jobs will be replicated.
    315 
    316 == Error rate, host punishment, and turnaround time estimation ==
    317 
    318 Unrelated to the credit proposal, but in a similar spirit.
    319 
    320 Due to hardware problems (e.g. a malfunctioning GPU)
    321 a host may have a 100% error rate for one app version
    322 and a 0% error rate for another.
    323 Similar for turnaround time.
    324 
    325 So we'll move the "error_rate" and "turnaround_time"
    326 fields from the host table to host_app_version.
    327 
    328 The host punishment mechanism is designed to deal with malfunctioning hosts.
    329 For each host the server maintains '''max_results_day'''.
    330 This is initialized to a project-specified value (e.g. 200)
    331 and scaled by the number of CPUs and/or GPUs.
    332 It's decremented if the client reports a crash
    333 (but not if the job was aborted).
    334 It's doubled when a successful (but not necessarily valid)
    335 result is received.
    336 
    337 This should also be per-app-version,
    338 so we'll move "max_results_day" from the host table to host_app_version.
     290in this case, granted credit = claimed credit.
     291
     292For jobs that are replicated,
     293granted credit is set to:
     294 * if the host with the larger claim is on scale probation, granted = smaller
     295 * else if larger > 2*smaller, granted = 1.5*smaller
     296 * else granted = (larger+smaller)/2
     297
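The replicated-job rule added in v22 can be sketched as follows. This is only an illustration under stated assumptions: the function name `granted_credit` and its parameters are hypothetical, and it assumes exactly two valid results whose claims are compared.

```python
def granted_credit(claim_a, claim_b, larger_on_probation=False):
    """Return granted credit for a job with two valid results.

    claim_a, claim_b: claimed credit from the two hosts.
    larger_on_probation: True if the host with the larger claim
    is on scale probation (hypothetical flag name).
    """
    smaller, larger = sorted((claim_a, claim_b))
    if larger_on_probation:
        # Distrust the larger claim entirely.
        return smaller
    if larger > 2 * smaller:
        # Claims disagree widely: limit the grant to 1.5x the smaller claim.
        return 1.5 * smaller
    # Claims roughly agree: grant their average.
    return (larger + smaller) / 2
```

For example, claims of 10 and 12 would be granted 11, while claims of 10 and 30 would be granted 15 rather than the average of 20.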
     298However, two kinds of cheating still have to be dealt with:
     299
     300=== One-time cheats ===
     301
     302For example, claiming a PFC of 1e304.
     303This can be mitigated by
     304capping VNPFC(J) at some multiple (say, 20) of VNPFC^mean^(A).
     305If the cap is applied to a job, the host's error rate is reset to its initial value,
     306so it won't do single replication for a while,
     307and scale_probation (see below) is set to true.
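A minimal sketch of the one-time-cheat cap described above. The names `HostState`, `cap_claim`, `CAP_MULTIPLE`, and `INITIAL_ERROR_RATE` are hypothetical, and the initial error rate of 0.1 is an assumed placeholder; the text only gives the multiple (20) as an example.

```python
from dataclasses import dataclass

CAP_MULTIPLE = 20          # "say, 20" in the text
INITIAL_ERROR_RATE = 0.1   # assumed placeholder; the text gives no value


@dataclass
class HostState:
    error_rate: float
    scale_probation: bool = False


def cap_claim(vnpfc_job, vnpfc_mean_app, host):
    """Cap VNPFC(J) at CAP_MULTIPLE * VNPFC^mean^(A).

    If the cap is applied, reset the host's error rate to its initial
    value and put the host on scale probation.
    """
    cap = CAP_MULTIPLE * vnpfc_mean_app
    if vnpfc_job > cap:
        host.error_rate = INITIAL_ERROR_RATE
        host.scale_probation = True
        return cap
    return vnpfc_job
```

With an app mean of 5.0, a claim of 1e304 would be cut to 100.0 and the host flagged, while a claim of 50 would pass through unchanged.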
    339308
    340309== Cherry picking ==
     
    394363In this case segments play the role of jobs in the credit-related DB fields.
    395364
     365== Error rate, host punishment, and turnaround time estimation ==
     366
     367Unrelated to the credit proposal, but in a similar spirit.
     368
     369Due to hardware problems (e.g. a malfunctioning GPU)
     370a host may have a 100% error rate for one app version
     371and a 0% error rate for another.
     372Similar for turnaround time.
     373
     374So we'll move the "error_rate" and "turnaround_time"
     375fields from the host table to host_app_version.
     376
     377The host punishment mechanism is designed to deal with malfunctioning hosts.
     378For each host the server maintains '''max_results_day'''.
     379This is initialized to a project-specified value (e.g. 200)
     380and scaled by the number of CPUs and/or GPUs.
     381It's decremented if the client reports a crash
     382(but not if the job was aborted).
     383It's doubled when a successful (but not necessarily valid)
     384result is received.
     385
     386This should also be per-app-version,
     387so we'll move "max_results_day" from the host table to host_app_version.
     388
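The max_results_day mechanism, kept per app version as proposed above, might look like the sketch below. The class and function names are hypothetical. Two details are assumptions not stated in the text: the value never drops below 1, and doubling is capped at the scaled initial value.

```python
from dataclasses import dataclass


@dataclass
class HostAppVersion:
    quota_limit: int       # project value scaled by processor count
    max_results_day: int   # current daily quota


def new_record(project_quota=200, n_processors=1):
    """Initialize the quota to the project value scaled by CPUs/GPUs."""
    limit = project_quota * n_processors
    return HostAppVersion(quota_limit=limit, max_results_day=limit)


def on_result(rec, outcome):
    """Update the quota for outcome 'crash', 'aborted', or 'success'."""
    if outcome == "crash":
        # Decrement on a reported crash (floor of 1 is an assumption).
        rec.max_results_day = max(1, rec.max_results_day - 1)
    elif outcome == "success":
        # Double on a successful result, valid or not
        # (cap at the initial limit is an assumption).
        rec.max_results_day = min(rec.quota_limit, rec.max_results_day * 2)
    # An aborted job leaves the quota unchanged.
```

Because the record is per app version, a host with a malfunctioning GPU would lose quota only for its GPU app version, not for its CPU one.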
    396389== Job runtime estimates ==
    397390