** resolved ** GPU not being used / CPU High Priority Mode / Einstein

ncoded.com

Joined: 13 Dec 16
Posts: 55
United Kingdom
Message 75636 - Posted: 2 Feb 2017, 22:01:55 UTC
Last modified: 2 Feb 2017, 22:18:32 UTC

Hi,

BOINC 7.6.33 (x64)
Windows 10 Pro (64 bit)

Computer Name: i7-4790-2

ClimatePrediction
GPUGrid *
Asteroids@Home *
Rosetta@Home
World Community Grid
SETI@Home *
Einstein@Home *
Collatz Conjecture *
PrimeGrid *

I am running an i7 which at this time has two GTX 970s; these have been running various projects for the last few weeks, but today one of the GTX 970s stopped being used for data processing.

BOINC is not installed as a service.

Below is the Event log; I have tried restarting BOINC and doing a Windows restart.

02/02/2017 21:46:46 | | Starting BOINC client version 7.6.33 for windows_x86_64
02/02/2017 21:46:46 | | log flags: file_xfer, sched_ops, task
02/02/2017 21:46:46 | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
02/02/2017 21:46:46 | | Data directory: C:\ProgramData\BOINC
02/02/2017 21:46:46 | | Running under account chris
02/02/2017 21:46:48 | | CUDA: NVIDIA GPU 0: GeForce GTX 970 (driver version 376.33, CUDA version 8.0, compute capability 5.2, 4096MB, 3390MB available, 4423 GFLOPS peak)
02/02/2017 21:46:48 | | CUDA: NVIDIA GPU 1: GeForce GTX 970 (driver version 376.33, CUDA version 8.0, compute capability 5.2, 4096MB, 3390MB available, 4170 GFLOPS peak)
02/02/2017 21:46:48 | | OpenCL: NVIDIA GPU 0: GeForce GTX 970 (driver version 376.33, device version OpenCL 1.2 CUDA, 4096MB, 3390MB available, 4423 GFLOPS peak)
02/02/2017 21:46:48 | | OpenCL: NVIDIA GPU 1: GeForce GTX 970 (driver version 376.33, device version OpenCL 1.2 CUDA, 4096MB, 3390MB available, 4170 GFLOPS peak)
02/02/2017 21:46:48 | | OpenCL: Intel GPU 0: Intel(R) HD Graphics 4600 (driver version 20.19.15.4531, device version OpenCL 1.2, 1630MB, 1630MB available, 240 GFLOPS peak)
02/02/2017 21:46:48 | | OpenCL CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz (OpenCL driver vendor: Intel(R) Corporation, driver version 5.2.0.10094, device version OpenCL 1.2 (Build 10094))
02/02/2017 21:46:48 | | Host name: I7-4790-2
02/02/2017 21:46:48 | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz [Family 6 Model 60 Stepping 3]
02/02/2017 21:46:48 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx smx tm2 pbe fsgsbase bmi1 smep bmi2
02/02/2017 21:46:48 | | OS: Microsoft Windows 10: Professional x64 Edition, (10.00.14393.00)
02/02/2017 21:46:48 | | Memory: 15.89 GB physical, 18.26 GB virtual
02/02/2017 21:46:48 | | Disk: 930.96 GB total, 892.30 GB free
02/02/2017 21:46:48 | | Local time is UTC +0 hours
02/02/2017 21:46:48 | | VirtualBox version: 5.0.18
02/02/2017 21:46:48 | | Config: use all coprocessors

Any help would be much appreciated.
ID: 75636 · Report as offensive
ncoded.com

Joined: 13 Dec 16
Posts: 55
United Kingdom
Message 75637 - Posted: 2 Feb 2017, 22:04:58 UTC - in response to Message 75636.  
Last modified: 2 Feb 2017, 22:20:16 UTC

This problem seems to be related to Einstein tasks; basically it will only allow one GPU to run at a time when this project is running a single WU.

If I suspend all Einstein WUs and leave PrimeGrid and SETI WUs enabled, then both GPUs get used.

If I un-suspend the Einstein WUs, suddenly only one GPU gets used again?!

This is the first and only time this behaviour has happened.

Basically, to get two GPUs running, I have to suspend all Einstein WUs.
ID: 75637 · Report as offensive
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 75638 - Posted: 2 Feb 2017, 22:33:15 UTC - in response to Message 75637.  

As far as I know, Einstein likes it to have one CPU core per GPU free. If you don't follow that, it's possible that running two GPUs can totally overwhelm your system with Einstein tasks on them. But best ask about that on their forums.
ID: 75638 · Report as offensive
ncoded.com

Joined: 13 Dec 16
Posts: 55
United Kingdom
Message 75640 - Posted: 2 Feb 2017, 22:41:33 UTC - in response to Message 75638.  
Last modified: 2 Feb 2017, 22:47:18 UTC

Jord wrote:
Einstein likes it to have one CPU core per GPU free

Boinc automatically reduces the amount of threads running to take care of this.

I did think that this might be related, so I dropped CPU usage to 80%; it made no difference.

I have now reset Einstein and loaded new WUs, so once the SETI tasks have completed I will see if this resolves this "BUG".

I have been running two GPU WUs on all these projects for weeks on this GPU rack (i7). Normally it runs 4 GPUs at once with no problem.

If this does resolve the bug, then the only difference has been the number of WUs sitting in the client; I changed from the default amount to 1 day, + 2 extra days - so perhaps this is too much?

** update **

This did not resolve this issue. Only one GPU WU will run when Einstein is running a GPU task.

--

I will now reduce the amount of WUs allowed, delete all WUs sitting in the client, reset all projects, and do a Boinc restart to see if this affects this issue.
ID: 75640 · Report as offensive
ncoded.com

Joined: 13 Dec 16
Posts: 55
United Kingdom
Message 75641 - Posted: 2 Feb 2017, 22:51:53 UTC - in response to Message 75640.  
Last modified: 2 Feb 2017, 22:55:09 UTC

Something very strange is going on.

All projects are set to no new tasks.

And yet after aborting all WUs, new tasks are sent by Einstein; twice it has now done this.

It has now done this a third time, basically so you cannot abort all tasks!

I have now reset every project.

That has now stopped Einstein sending new GPU tasks.
ID: 75641 · Report as offensive
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 75643 - Posted: 2 Feb 2017, 23:00:13 UTC - in response to Message 75640.  

ncoded.com wrote:
(quoting Jord) Einstein likes it to have one CPU core per GPU free

Boinc automatically reduces the amount of threads running to take care of this.

CPU tasks run at low priority.
GPU tasks run at below-normal priority, a little higher than the low priority.
Einstein's tasks don't fully run on the GPU, they use the CPU for a lot of calculations, more specifically the fourier transform as it is too intensive to run on the GPU.
That's why they like to have a complete CPU core free for the GPU tasks, to be able to calculate the fourier transform without having to fight for the CPU with something else.
ID: 75643 · Report as offensive
ncoded.com

Joined: 13 Dec 16
Posts: 55
United Kingdom
Message 75644 - Posted: 2 Feb 2017, 23:04:17 UTC - in response to Message 75641.  
Last modified: 2 Feb 2017, 23:13:57 UTC

** Bug definition and Solution **

There seems to be a bug on some computers (mine is a 4-core i7): if you allow too many WUs to be stored, some projects start acting very strangely.

* They only allow one GPU to run when they are crunching GPU WUs; if you suspend that project's WUs, the client goes back to using more than one GPU.

Please note that this is not because the project is out of cores - this project had been running these work units across two GPUs, using the same number of free cores, for days and days with no problem.

* New tasks are issued after aborting all of a project's WUs, even though 'no new tasks' is set.

Clearly this is a bug, because if you tell the client NOT to issue work units, then it should not.

--

The problem seems to be related to the number of work units stored in the client; I say this because it was the only recent change I had made. Additionally, changing it back and doing a reset 'fixed' the problem, whereas previously a reset alone did not.

I had one day of WUs, plus one additional day of WUs, set in the client.

Normally (I think) it is 0.5 days plus an additional 0.3 - that is what I set it back to.

What may have caused the issue is that I had fully loaded up on WUs; by this I mean:

Suspend all projects except one, and get the full two days. Then suspend this project, un-suspend another and get the full two days' worth of that one. And so on until you have loads of WUs for every project.

--

Anyway, if you have either of these problems, then:

delete all work units
reset each project
change your stored work units back to a more manageable amount
restart the client

That is what fixed these issues for me.

Chris
ID: 75644 · Report as offensive
ncoded.com

Joined: 13 Dec 16
Posts: 55
United Kingdom
Message 75645 - Posted: 2 Feb 2017, 23:15:34 UTC - in response to Message 75644.  
Last modified: 2 Feb 2017, 23:26:43 UTC

Ageless, like I said, Boinc takes care of allocating cores for GPU crunching.

The number of threads that run in the client depends on which GPU work units are running, and how many "cores" they require.

You do not have to do this manually.

Also, these Einstein GPU work units have been running on 2 GPUs for nearly a week now with no problem - so clearly the number of cores that Boinc is allocating is correct for this and other projects.

The proof of this is the fact that two Einstein GPU tasks are now running, with exactly the same CPU tasks (Prime Grid and World community grid) as they did before, but this time with no problem.

I only run the PrimeGrid sub-project PPR CPU, and the WCG sub-project was 'Smash Childhood Cancer' before, as it still is now.
ID: 75645 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 75646 - Posted: 2 Feb 2017, 23:24:57 UTC

Do the people experiencing this problem use app_config.xml files, specifically files which include <max_concurrent> elements? I have identified a bug which has similar effects, and reported it as issue #1677 - after twice trying to get David to investigate the issue through private emails. I demonstrated the effect of the line 130 bug by adding additional debug logging to a private build and sending him the resulting output, but received no reply.
ID: 75646 · Report as offensive
ncoded.com

Joined: 13 Dec 16
Posts: 55
United Kingdom
Message 75647 - Posted: 2 Feb 2017, 23:29:44 UTC - in response to Message 75646.  
Last modified: 2 Feb 2017, 23:31:22 UTC

I am sorry to hear that.

Personally I only use app_config for Collatz; however, no work units for that project were present in this client.

I have never used the <max_concurrent> element on any project.
ID: 75647 · Report as offensive
ChristianB
Volunteer developer
Volunteer tester

Joined: 4 Jul 12
Posts: 321
Germany
Message 75658 - Posted: 3 Feb 2017, 9:03:19 UTC

Two clarifications on how things are working:

ncoded.com wrote:
All projects are set to no new tasks.

And yet after aborting all WUs, new tasks are sent by Einstein; twice it has now done this.

It has now done this a third time, basically so you cannot abort all tasks!

I have now reset every project.

That has now stopped Einstein sending new GPU tasks.

Resetting a project does not abort or cancel the tasks that are currently present on the client. It just deletes all files belonging to the project on the client and connects to the server again. The server will then send you what it calls "lost work", which ignores the no new work setting because it is not new. If you want to abort tasks, you have to use the abort task command in the Manager and make sure this gets reported to the project before resetting.


Jord wrote:
Einstein's tasks don't fully run on the GPU, they use the CPU for a lot of calculations, more specifically the fourier transform as it is too intensive to run on the GPU.
That's why they like to have a complete CPU core free for the GPU tasks, to be able to calculate the fourier transform without having to fight for the CPU with something else.

Actually the opposite is the case. The fast Fourier transform is done on the GPU and the CPU is mainly used to coordinate work on the GPU. Due to some limitations with the Nvidia OpenCL implementation we have to reserve a full CPU core for the current FGRPB1G GPU application. That was not the case with the former applications, which used CUDA rather than OpenCL on Nvidia GPUs.
ID: 75658 · Report as offensive
floyd
Help desk expert

Joined: 23 Apr 12
Posts: 77
Message 75668 - Posted: 3 Feb 2017, 11:18:42 UTC

I'm wondering if this isn't just another case of unnoticed high priority mode. That is very often the reason when BOINC doesn't work as expected - still it does work as designed. In earlier BOINC versions the Manager would indicate when tasks ran in high priority mode but that feature was removed. I really think this was a bad decision and should be reverted.
If in this case the CPU tasks were high priority, BOINC would run a full set of them (8 I think) plus one GPU task even if that takes another full core. With the high priority tasks suspended, enough CPU is available for more GPU support.
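
To make that effect concrete, here is a toy model in Python. It is only a sketch of the behaviour described in this post, not the client's actual scheduling code; the function name, its parameters and the "one GPU task always runs" rule are assumptions taken from this thread.

```python
def gpu_tasks_started(ncpus, high_priority_cpu_tasks, gpus, cpu_per_gpu_task=1.0):
    """Toy model: how many GPU tasks get a CPU core to drive them.

    Assumptions (from the discussion, not from BOINC source code):
    - high priority CPU tasks claim their cores first;
    - each GPU task needs `cpu_per_gpu_task` of a core;
    - one GPU task is started even when no CPU is left.
    """
    free_cpus = max(ncpus - high_priority_cpu_tasks, 0)
    supported = int(free_cpus // cpu_per_gpu_task)
    started = min(gpus, supported)
    # Behaviour reported in the thread: at least one GPU task keeps running.
    if gpus > 0:
        started = max(started, 1)
    return started


# 8 logical CPUs, all taken by high priority CPU tasks, 2 GPUs
print(gpu_tasks_started(8, 8, 2))  # 1
# High priority tasks suspended: both GPUs can be fed
print(gpu_tasks_started(8, 0, 2))  # 2
```

In this model the second GPU comes back as soon as at least two cores are not held by high priority CPU tasks, which matches what was seen when the offending work was suspended.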
ID: 75668 · Report as offensive
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 75678 - Posted: 3 Feb 2017, 13:22:07 UTC - in response to Message 75658.  

ChristianB wrote:
Actually the opposite is the case. The fast Fourier transform is done on the GPU and the CPU is mainly used to coordinate work on the GPU. Due to some limitations with the Nvidia OpenCL implementation we have to reserve a full CPU core for the current FGRPB1G GPU application. That was not the case with the former applications, which used CUDA rather than OpenCL on Nvidia GPUs.

Ah, all right, thanks for the clarification and correction.
ID: 75678 · Report as offensive
Juha
Volunteer developer
Volunteer tester
Help desk expert

Joined: 20 Nov 12
Posts: 801
Finland
Message 75697 - Posted: 3 Feb 2017, 20:40:29 UTC - in response to Message 75668.  

floyd wrote:
I'm wondering if this isn't just another case of unnoticed high priority mode.


Pretty likely yes.

ncoded.com wrote:
I changed from the default amount to 1 day, + 2 extra days


Rosetta has tasks that they really need back fast, and until yesterday those tasks had a two-day deadline. Setting the minimum cache size to one day is enough to trigger high priority for those tasks. Yesterday they bumped the deadline for those tasks to three days.
ID: 75697 · Report as offensive
ncoded.com

Joined: 13 Dec 16
Posts: 55
United Kingdom
Message 75703 - Posted: 4 Feb 2017, 6:37:53 UTC
Last modified: 4 Feb 2017, 6:39:39 UTC

Hi,

Thanks for the information and discussion..

Could someone answer the following questions in layman's terms, as it's still not clear what is going on?

1) Why did this happen?
2) What will stop it happening again?
3) What does "unnoticed high priority mode" mean?

Just to clarify:

    No Rosetta WUs had been issued on this machine/client.
    The machines/clients that had Rosetta WUs with the same cache size worked fine.
    When we said we 'aborted the WUs' we meant just that; we selected the WUs and clicked the abort button (3 times).


About the only difference between all the machines was that this was a 4-core i7, and all the others were 14-16 core Xeons.

This client was running on an i7 GPU rack which has between 2 and 6 GTX 970s. At this time just two GPUs were being used.

Thanks,

Chris

ID: 75703 · Report as offensive
floyd
Help desk expert

Joined: 23 Apr 12
Posts: 77
Message 75706 - Posted: 4 Feb 2017, 11:44:23 UTC - in response to Message 75703.  

What is high priority mode?
BOINC has an internal high priority mode (not to be confused with the OS-level priority) that is activated when tasks are in danger of missing their deadline. Resources are assigned to high priority tasks first, and those tasks are not interrupted by the usual task switch. One problem with that is that BOINC no longer tells you when high priority is active. Many people who don't know about it, or who misinterpret the effects, assume a malfunction.
It is possible that a GPU is idle because no CPU is available to run a task on it. Somehow BOINC seems to always run one GPU task though. I don't know the reason behind that; it doesn't seem intentional to me.
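
The idea of "in danger of missing the deadline" can be sketched with a small projection in Python. This is a deliberately simplified first-come-first-served model with made-up task names and hours; the real client's simulation is more elaborate, so treat it only as an illustration of the principle.

```python
import heapq


def tasks_at_risk(tasks, ncpus):
    """Return names of tasks projected to finish after their deadline.

    `tasks` is a list of (name, remaining_hours, hours_until_deadline),
    processed in list order on `ncpus` identical cores. This is a
    simplified projection, not BOINC's own algorithm.
    """
    # Each heap entry is the time at which one core becomes free.
    cores = [0.0] * ncpus
    heapq.heapify(cores)
    at_risk = []
    for name, remaining, deadline in tasks:
        start = heapq.heappop(cores)      # earliest free core
        finish = start + remaining
        heapq.heappush(cores, finish)
        if finish > deadline:
            at_risk.append(name)
    return at_risk


# Hypothetical queue: three long tasks ahead of one with a short deadline.
queue = [
    ("cpu_a", 10.0, 72.0),
    ("cpu_b", 10.0, 72.0),
    ("cpu_c", 10.0, 72.0),
    ("short_deadline", 6.0, 12.0),
]
print(tasks_at_risk(queue, ncpus=2))  # ['short_deadline']
```

With two cores the last task would only finish after 16 hours against a 12 hour deadline, so it is the kind of task that would be pushed ahead in high priority mode; with four cores nothing is at risk.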

Why did it happen?
As mentioned, high priority is activated when a task could miss the deadline. Possible reasons are
(a) A task is assigned too shortly before its deadline. Rosetta is an example of this case. There are tasks that were assigned just 2 (now 3) days before the deadline, and if you don't run a very small cache this will be too close unless they are pushed ahead. There were complaints about the effects we see here and I've experienced it myself.
(b) Tasks take longer than expected. I'll take Einstein as an example because I think that's where your problem is in this case. You have some pretty fast GPUs and I think you'll be running single tasks on them. That means you're doing many fast tasks. Watch the expected run times. They'll be going down and BOINC will fetch more and more tasks to keep your cache filled. Unfortunately the same happens for CPU tasks and those are not that fast at all. When one of them finishes, the expected run times for the whole lot jump up and BOINC suddenly notices it has more work than it can handle. BANG, trouble.
(c) You can't run as many tasks as expected. (Again taking Einstein as an example and your 4 CPU cores, though the startup message says 8.) You define your cache size in days of work and in the process of translating that to tasks, BOINC assumes 4 CPU cores available, not taking into account that in normal operation you'll run at least 2 GPU tasks and those permanently block 2 CPU cores, leaving only 2 available. So your cache of CPU tasks actually lasts twice as long as expected. And if you run 2 tasks on each of your 2 GPUs, that means NO free CPU cores. Likewise if you have more GPUs. In any case, your cache is probably larger than you think, and certainly larger than BOINC thinks.
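
The arithmetic behind case (c) can be written out as a short Python sketch. The function and parameter names are made up for illustration, and the model assumes work is fetched for all cores while GPU tasks each hold one core permanently, as described above.

```python
def cache_duration_days(cache_days, assumed_cores, cores_reserved_for_gpu):
    """Days the CPU cache really lasts if work was fetched for
    `assumed_cores` but only some of them are free to run CPU tasks."""
    free_cores = assumed_cores - cores_reserved_for_gpu
    if free_cores <= 0:
        return float("inf")  # no core left: CPU tasks never get to run
    core_days_fetched = cache_days * assumed_cores
    return core_days_fetched / free_cores


# 1 day cache, 4 cores assumed, 2 cores blocked by GPU tasks
print(cache_duration_days(1.0, 4, 2))  # 2.0
# 2 tasks on each of 2 GPUs: no free CPU cores at all
print(cache_duration_days(1.0, 4, 4))  # inf
```

So a nominal one day cache behaves like a two day cache in the first case, and CPU work simply piles up in the second.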

What will stop it happening again?
I don't think there's anything you can do to guarantee it won't happen again, but you can adjust your settings to reduce the effects. First, make your cache as small as possible. It's only meant for occasional network outages or project downtimes. With a reliable ISP and some backup projects you should be fine with a day or two. Set "Store up to an additional" to a low value to avoid fetching large chunks of work. That could reduce the effects of case (a) above. Personally, I try to avoid running both CPU and GPU tasks for the same project on a single machine to avoid case (b). That's the most likely one, the one with the largest effects, and I think it is what we are seeing here, triggered at Einstein it seems.
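
For anyone who prefers to see where the two cache settings end up: as far as I know, local preferences are kept in a file named global_prefs_override.xml in the BOINC data directory, with the elements work_buf_min_days and work_buf_additional_days. Please treat those names as an assumption to check against the BOINC documentation. The Python sketch below only builds the XML text and does not touch any file; the values are the ones mentioned earlier in this thread, not a recommendation.

```python
import xml.etree.ElementTree as ET


def build_cache_override(min_days, additional_days):
    """Build XML text for a local preferences override that sets the
    work cache size. Element names are assumed, see the note above."""
    root = ET.Element("global_preferences")
    # "Store at least N days of work"
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:.2f}"
    # "Store up to an additional N days of work"
    ET.SubElement(root, "work_buf_additional_days").text = f"{additional_days:.2f}"
    return ET.tostring(root, encoding="unicode")


xml_text = build_cache_override(0.5, 0.3)
print(xml_text)
```

Setting the same two values through the computing preferences in the Manager is the simpler route; the sketch is only meant to show which two numbers matter.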
ID: 75706 · Report as offensive
ncoded.com

Joined: 13 Dec 16
Posts: 55
United Kingdom
Message 75709 - Posted: 4 Feb 2017, 15:24:16 UTC
Last modified: 4 Feb 2017, 15:30:36 UTC

Hi Floyd, et al.

Thank you so much.

Your answer makes complete sense; putting it like this is helpful not only to people like us who are fairly new to Boinc, but also to those without the technical experience and qualifications that we and others have.

Well, we are glad to see that we pretty much worked out what was causing the issue, the cache size, even though our analysis of it being a bug rather than a "feature" was obviously incorrect.

Although personally I would say that tasks being re-issued when you have aborted them and set the project to no new tasks would seem to be somewhat of a bug; but I will leave that to others to decide.

In terms of the resolution, this is what we did once we guessed what was causing the problem: we reduced our cache size back to normal levels.

We mainly run CPU tasks only on projects which cannot use the GPU, and hence never really have both CPU and GPU tasks running on a single project.

In terms of our definition of cores, I guess we are used to calling cores, cores - and threads, threads; but I see that Windows and Boinc seem to call threads (logical) cores. So you are correct: the i7 has 8 "cores", and the other boxes have 28/32 "cores".

The reason we increased the cache size is that sometimes Boinc is not very good at working out how long a WU has left, and some projects do not seem to report that quickly - which on machines with very many cores means that sometimes the client runs out of WUs, although only when it is crunching on a single project - which is sometimes required to try and balance our %'s in terms of processing. Also, sometimes we lose our net connection for a while, which causes the same problem.

Personally, we actually manage the deadlines ourselves and think it would be better for Boinc not to start messing with priorities (due to deadlines), especially not stopping GPUs in favour of CPUs, which of course can do only a tiny fraction of the data processing that GPUs can.

Anyway, thank you again for your help, and to other people for their analysis.

I have now changed the thread title to mark it as resolved, and make it more descriptive of problem and resolution.
ID: 75709 · Report as offensive


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.