Misassignment of OpenCL tasks to GPUs

Alexander

Joined: 28 May 10
Posts: 52
Austria
Message 53691 - Posted: 17 Apr 2014, 20:17:35 UTC

Hi,
I have a problem with this host: http://einstein.phys.uwm.edu/show_host_detail.php?hostid=10283382
The problem came up when Einstein released OpenCL tasks for ATI and for Nvidia.
When two tasks are running, one ATI and one Nvidia, both tasks run on the ATI card.
Full story here: http://einstein.phys.uwm.edu/forum_thread.php?id=10707

Reading the top 20 lines of the messages, I see that both GPUs are numbered 0. I thought that this might be the source of the problem, but Jim posts that this is correct, as far as he knows.

If someone has an idea what could be wrong, rest assured your advice is appreciated.

Alexander

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15482
Netherlands
Message 53699 - Posted: 18 Apr 2014, 9:15:30 UTC - in response to Message 53691.  
Last modified: 18 Apr 2014, 9:16:52 UTC

First off, the numbering. This is correct. In theory, you could install 64 Nvidia GPUs, 64 AMD GPUs and 64 Intel GPUs in your computer and each would still have a reasonably unique number, because the devices are numbered separately per vendor, starting at 0. Where you see just GPU 0, GPU 1, GPU 2 etc., you're overlooking what it says in front of that:
CUDA: NVIDIA GPU 0
OpenCL: NVIDIA GPU 0
OpenCL: AMD/ATI GPU 0

This also makes things easy when you use the various exclusion/ignore options in cc_config.xml.
<ignore_intel_dev>, <ignore_nvidia_dev> and <ignore_ati_dev> take that number.
<exclude_gpu> requires that you state both the brand of GPU and its number.
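
As a rough illustration of how those options are written out, here is a minimal cc_config.xml sketch. The device numbers and the project URL are assumptions chosen for the example, not values taken from Alexander's host; the URL has to match the project URL exactly as the client shows it. You would normally use one of the two options, not both.

```xml
<cc_config>
  <options>
    <!-- Assumed example: hide AMD/ATI GPU 0 from BOINC for all projects -->
    <ignore_ati_dev>0</ignore_ati_dev>

    <!-- Assumed example: keep one project off Nvidia GPU 0 only -->
    <exclude_gpu>
      <url>http://einstein.phys.uwm.edu/</url>
      <type>NVIDIA</type>
      <device_num>0</device_num>
    </exclude_gpu>
  </options>
</cc_config>
```

Because the number is interpreted per brand, a 0 inside <ignore_ati_dev> and a 0 inside an Nvidia <exclude_gpu> refer to two different cards.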

Now then, Nvidia OpenCL work running on the AMD GPU. It's difficult to see this from just the screen grabs you made, as they don't show what is happening in the client. It could just as well be that one AMD OpenCL task runs on the built-in GPU and that the Nvidia OpenCL task has fallen back to the CPU. The image of GPU-Z doesn't show that.

But BOINC has an easy way to check that: it can show which task runs on which hardware, using the <cpu_sched_debug> and <coproc_debug> flags in cc_config.xml.
If you update that host to 7.3.15, you can use the built-in diagnostics window to set these flags easily. Aside from the default <task>, <sched_ops> and <file_xfer> flags, it's best to set only the <cpu_sched_debug> and <coproc_debug> flags, so that when you post the output, we can still read it without too much trouble.
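
For anyone who would rather edit the file by hand than use the diagnostics window, a minimal sketch of the matching cc_config.xml section could look like this. It assumes the file holds nothing else; if you already have an <options> section, keep it alongside <log_flags> inside the same <cc_config> element.

```xml
<cc_config>
  <log_flags>
    <!-- The three default flags -->
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <!-- The two extra flags for seeing which task gets which device -->
    <cpu_sched_debug>1</cpu_sched_debug>
    <coproc_debug>1</coproc_debug>
  </log_flags>
</cc_config>
```

The client has to re-read its config files (or be restarted) before the extra lines show up in the log. Set both flags back to 0 afterwards, as they are quite chatty.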

The output will look something like this:
18/04/2014 11:12:16 |  | [cpu_sched_debug] Request CPU reschedule: periodic CPU scheduling
18/04/2014 11:12:16 |  | [cpu_sched_debug] schedule_cpus(): start
18/04/2014 11:12:16 | SETI@home | [cpu_sched_debug] scheduling 06ap09ad.23882.179910.438086664207.12.124_1 (coprocessor job, FIFO) (prio -1.000000)
18/04/2014 11:12:16 | SETI@home | [cpu_sched_debug] reserving 1.000000 of coproc ATI
18/04/2014 11:12:16 |  | [cpu_sched_debug] enforce_run_list(): start
18/04/2014 11:12:16 |  | [cpu_sched_debug] preliminary job list:
18/04/2014 11:12:16 | SETI@home | [cpu_sched_debug] 0: 06ap09ad.23882.179910.438086664207.12.124_1 (MD: no; UTS: yes)
18/04/2014 11:12:16 |  | [cpu_sched_debug] final job list:
18/04/2014 11:12:16 | SETI@home | [cpu_sched_debug] 0: 06ap09ad.23882.179910.438086664207.12.124_1 (MD: no; UTS: yes)
18/04/2014 11:12:16 | SETI@home | [coproc] ATI instance 0; 1.000000 pending for 06ap09ad.23882.179910.438086664207.12.124_1
18/04/2014 11:12:16 | SETI@home | [coproc] ATI instance 0: confirming 1.000000 instance for 06ap09ad.23882.179910.438086664207.12.124_1
18/04/2014 11:12:16 | SETI@home | [cpu_sched_debug] scheduling 06ap09ad.23882.179910.438086664207.12.124_1
18/04/2014 11:12:16 |  | [cpu_sched_debug] using 0.04 out of 2 CPUs
18/04/2014 11:12:16 | SETI@home | [cpu_sched_debug] 06ap09ad.23882.179910.438086664207.12.124_1 sched state 2 next 2 task state 1
18/04/2014 11:12:16 |  | [cpu_sched_debug] enforce_run_list: end

And it will do that every minute.

Alexander
Joined: 28 May 10
Posts: 52
Austria
Message 53709 - Posted: 18 Apr 2014, 19:58:32 UTC - in response to Message 53699.  
Last modified: 18 Apr 2014, 19:59:40 UTC

Thanks, Jord, for the advice.
I upgraded BOINC Manager and ran two work units; the results are posted in long form in the Einstein forum.

The short version is:
2014-04-18 19:34:45.1114 (4808) [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S6CasA_1.08_windows_x86_64__GWopencl-nvidia-Beta.exe'.
Activated exception handling...
command line: projects/einstein.phys.uwm.edu/einstein_S6CasA_1.08_windows_x86_64__GWopencl-nvidia-Beta.exe --skyRegion=(6.1237713,1.0264572) --refTime=960541454.5 --Freq=993.4000000000001 --FreqBand=0.05 --dFreq=5.3519e-07 --f1dot=-2.71657332393e-09 --f1dotBand=7.76163806836e-11 --df1dot=8.2281e-12 --gammaRefine=90 --f2dot=9.664e-19 --f2dotBand=2.21688997516e-17 --df2dot=1.9328e-18 --gamma2Refine=60 --computeLV --LVuseAllTerms=0 --LVrho=2.7564e+17 --LVlX=0.000168379,0.000168379 --nCand1=3000 --SortToplist=3 --recalcToplistStats=1 -o ../../projects/einstein.phys.uwm.edu/h1_0993.20_S6Directed__S6CasAf40a_993.4Hz_1319_0_0 --printCand1 --semiCohToplist --ephemE=../../projects/einstein.phys.uwm.edu/earth_09_11 --ephemS=../../projects/einstein.phys.uwm.edu/sun_09_11 --segmentList=../../projects/einstein.phys.uwm.edu/seglist-CasAf40.dat --Dterms=8 --DataFiles1=..\..\projects\einstein.phys.uwm.edu\h1_0993.20_S6Directed;..\..\projects\einstein.phys.uwm.edu\l1_0993.20_S6Directed;..\..\projects\einstein.phys.uwm.edu\h1_0993.25_S6Directed;..\..\projects\einstein.phys.uwm.edu\l1_0993.25_S6Directed;..\..\projects\einstein.phys.uwm.edu\h1_0993.30_S6Directed;..\..\projects\einstein.phys.uwm.edu\l1_0993.30_S6Directed;..\..\projects\einstein.phys.uwm.edu\h1_0993.35_S6Directed;..\..\projects\einstein.phys.uwm.edu\l1_0993.35_S6Directed;..\..\projects\einstein.phys.uwm.edu\h1_0993.40_S6Directed;..\..\projects\einstein.phys.uwm.edu\l1_0993.40_S6Directed;..\..\projects\einstein.phys.uwm.edu\h1_0993.45_S6Directed;..\..\projects\einstein.phys.uwm.edu\l1_0993.45_S6Directed;..\..\projects\einstein.phys.uwm.edu\h1_0993.50_S6Directed;..\..\projects\einstein.phys.uwm.edu\l1_0993.50_S6Directed;..\..\projects\einstein.phys.uwm.edu\h1_0993.55_S6Directed;..\..\projects\einstein.phys.uwm.edu\l1_0993.55_S6Directed;..\..\projects\einstein.phys.uwm.edu\h1_0993.60_S6Directed;..\..\projects\einstein.phys.uwm.edu\l1_0993.60_S6Directed;..\..\projects\einstein.phys.uwm.edu\h1_0993.65_S6Directed;..\..\projects\einstein.phys.uwm.edu\l1_0993.65_S6Directed --device 0
2014-04-18 19:34:45.3766 (4808) [debug]: Flags: LAL_NDEBUG, OPTIMIZE, HS_OPTIMIZATION, X64, SSE, SSE2, GNUC X86 GNUX86
2014-04-18 19:34:45.3766 (4808) [debug]: Set up communication with graphics process.
Code-version: %% LAL: 6.10.0.1 (CLEAN 14312d5a9fafa5b46fc6ccc57a08bdfab14361f1)
%% LALApps: 6.12.0.1 (CLEAN 14312d5a9fafa5b46fc6ccc57a08bdfab14361f1)

2014-04-18 19:34:45.3922 (4808) [normal]: Using OpenCL platform provided by: Advanced Micro Devices, Inc.
2014-04-18 19:34:45.3922 (4808) [normal]: Using OpenCL device "Spectre" by: Advanced Micro Devices, Inc.


Is it OK for you to continue this over at Einstein?

http://einstein.phys.uwm.edu/forum_thread.php?id=10707

Alexander

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15482
Netherlands
Message 53711 - Posted: 18 Apr 2014, 20:51:43 UTC - in response to Message 53709.  
Last modified: 18 Apr 2014, 20:52:58 UTC

I had seen that already, but dismissed it as Einstein is using an older back-end version for their forums and task lists. It may be a misread by their science application, an artifact of sorts. But best ask them.

I've scoured the log and can only see that the Nvidia OpenCL task runs on the Nvidia GPU, while the AMD OpenCL task runs on the AMD. No switching between them, nor both switching to the same piece of hardware. That I can see from the log, that is.

Start to finish I see:
18.04.2014 19:44:21 | Einstein@Home | [coproc] NVIDIA instance 0; 1.000000 pending for h1_0993.20_S6Directed__S6CasAf40a_993.4Hz_1319_0
18.04.2014 19:44:21 | Einstein@Home | [coproc] ATI instance 0; 1.000000 pending for h1_0993.30_S6Directed__S6CasAf40a_993.5Hz_1331_1
18.04.2014 19:44:21 | Einstein@Home | [coproc] NVIDIA instance 0: confirming 1.000000 instance for h1_0993.20_S6Directed__S6CasAf40a_993.4Hz_1319_0
18.04.2014 19:44:21 | Einstein@Home | [coproc] ATI instance 0: confirming 1.000000 instance for h1_0993.30_S6Directed__S6CasAf40a_993.5Hz_1331_1

...

18.04.2014 21:22:12 | Einstein@Home | [coproc] NVIDIA instance 0; 1.000000 pending for h1_0993.20_S6Directed__S6CasAf40a_993.4Hz_1319_0
18.04.2014 21:22:12 | Einstein@Home | [coproc] ATI instance 0; 1.000000 pending for h1_0993.30_S6Directed__S6CasAf40a_993.5Hz_1331_1
18.04.2014 21:22:12 | Einstein@Home | [coproc] NVIDIA instance 0: confirming 1.000000 instance for h1_0993.20_S6Directed__S6CasAf40a_993.4Hz_1319_0
18.04.2014 21:22:12 | Einstein@Home | [coproc] ATI instance 0: confirming 1.000000 instance for h1_0993.30_S6Directed__S6CasAf40a_993.5Hz_1331_1

However, please point out the following at Einstein. When you go further down in the log, the task on the Nvidia card seems to change without warning.
Going from
18.04.2014 21:22:12 | Einstein@Home | [coproc] NVIDIA instance 0; 1.000000 pending for h1_0993.20_S6Directed__S6CasAf40a_993.4Hz_1319_0
18.04.2014 21:22:12 | Einstein@Home | [coproc] NVIDIA instance 0: confirming 1.000000 instance for h1_0993.20_S6Directed__S6CasAf40a_993.4Hz_1319_0
to
18.04.2014 21:25:25 | Einstein@Home | [coproc] NVIDIA instance 0; 1.000000 pending for h1_0993.25_S6Directed__S6CasAf40a_993.45Hz_1329_0
18.04.2014 21:25:25 | Einstein@Home | [coproc] NVIDIA instance 0: confirming 1.000000 instance for h1_0993.25_S6Directed__S6CasAf40a_993.45Hz_1329_0

That's the only weird thing I can find in your log. It doesn't go far enough to see which task eventually uploads. Only the AMD task shows.

Now, here's the clincher: it shouldn't matter what hardware the tasks run on, they're both OpenCL and the OpenCL applications at Einstein are exactly the same ones, whether you run them on the Nvidia, AMD or Intel GPU.

Alexander
Joined: 28 May 10
Posts: 52
Austria
Message 53712 - Posted: 18 Apr 2014, 21:04:27 UTC - in response to Message 53711.  

Hi Jord,

thank you for your help, let's see what we can make of this information.

Alexander

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.