[[PageOutline]]

= Trouble-shooting a BOINC server =

== Trouble-shooting tools ==

=== Log files ===

Each server component (scheduler, feeder, transitioner, etc.) has its own log file. These files are in the '''log_HOSTNAME''' subdirectory of the project directory. Most error conditions are reported in the log files.

If you're interested in the history of a particular job, grep for `WU#12345` or `RESULT#12345` (where 12345 represents the ID) in the log files. The [HtmlOps html/ops pages] also provide an interface for this.

To control the verbosity of the log files:
 * Scheduler: set the desired [ProjectOptions#Logging logging options].
 * File upload handler: set [ProjectOptions#misc fuh_debug_level].
 * Daemons: pass the command-line argument `-d N` (1 = least verbose, 4 = most verbose).

If you run server components with '''-d 4''', their database queries will be logged. This is useful for tracking down database-level problems.

=== Examining the database ===

The [wiki:HtmlOps admin web interface] provides a web-based interface for browsing your project's database. You can also use MySQL tools such as:
 * The [http://dev.mysql.com/doc/refman/5.0/en/mysql.html mysql interpreter]. The '[http://dev.mysql.com/doc/refman/5.0/en/show-processlist.html show processlist;]' query is useful for diagnosing DB performance problems.
 * [http://jeremy.zawodny.com/mysql/mytop/ mytop]: like 'top' for MySQL; shows running queries.
 * [http://www.phpmyadmin.net/ phpMyAdmin]: general-purpose web interface to MySQL.

=== Examining shared memory ===

The command
{{{
bin/show_shmem
}}}
will print a textual summary of the contents of the shared-memory structure that caches jobs and information about applications.

== Trouble-shooting the job pipeline ==

 * Are workunits (jobs) getting created correctly? Examine the database to see. If you're using a work generator, check its log file.
 * Are results (job instances) getting created? Examine the database to see. If you don't see results, check the transitioner log file.
 * Are jobs getting into shared memory? Use show_shmem (see above). You should see jobs. If not, check the feeder log file.
 * Is the scheduler sending jobs? If not, check its log file, preferably with the following log flags:
   * : show details of app version selection
   * : show details of job assignment
   * : show details of quota enforcement
 * Are clients processing jobs correctly? Check the status and stderr output of completed jobs.
 * Are output files getting uploaded? Check the file upload handler log file.
 * Are jobs getting validated? Check the validator log file.
 * Are jobs getting assimilated? Check the assimilator log file.

== Debugging the scheduler ==

If the scheduler is acting incorrectly or crashing, and you like mucking around in C++ source code, you can run it under a debugger like `gdb`. The scheduler is a CGI program; it reads a request from stdin and writes a reply to stdout. So you can debug it as follows:

 * Copy the "scheduler_request_X.xml" file from a client to the machine running the scheduler. (X = your project URL)
 * Run the scheduler under the debugger, giving it this file as stdin, i.e.:
{{{
gdb cgi
(set a breakpoint if desired)
r < scheduler_request_X.xml
}}}
 * You may have to doctor the database as follows to keep the scheduler from rejecting the request:
{{{
update host set rpc_seqno=0, rpc_time=0 where hostid=N
}}}

As an alternative to this, edit `sched/handle_request.cpp`, and put a call to `debug_sched("debug_sched");` just before `sreply.write(fout, sreq);`. Then, after recompiling, touch a file called 'debug_sched' in the project root directory. This will cause transcripts of all subsequent scheduler requests and replies to be written to the `cgi-bin/` directory with separate small files for each request. The file names are `sched_request_H_R` and `sched_reply_H_R` where H=hostid and R=rpc sequence number. This can be turned off by deleting the 'debug_sched' file.

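Because each request and reply goes into its own small file, the `cgi-bin/` directory can fill up quickly. The following is a minimal sketch, not part of BOINC, of how the transcripts might be grouped by host and ordered by RPC sequence number using only the `sched_request_H_R` / `sched_reply_H_R` naming scheme described above. The function names `pair_transcripts` and `transcripts_for_host` are hypothetical, and the script assumes it is run from the project root.

```python
import re
from collections import defaultdict
from pathlib import Path

# Naming scheme from the text: sched_request_H_R and sched_reply_H_R,
# where H is the host ID and R is the RPC sequence number.
TRANSCRIPT_RE = re.compile(r"^sched_(request|reply)_(\d+)_(\d+)$")


def pair_transcripts(names):
    """Group transcript file names by (hostid, rpc_seqno).

    Returns a dict mapping (hostid, seqno) to a dict with the keys
    "request" and/or "reply"; names that do not match are ignored.
    """
    pairs = defaultdict(dict)
    for name in names:
        m = TRANSCRIPT_RE.match(name)
        if m is None:
            continue  # some other file in cgi-bin/, e.g. the scheduler binary
        kind, hostid, seqno = m.group(1), int(m.group(2)), int(m.group(3))
        pairs[(hostid, seqno)][kind] = name
    return dict(pairs)


def transcripts_for_host(pairs, hostid):
    """Return [(seqno, files)] for one host, ordered by sequence number."""
    return sorted(
        (seqno, files) for (h, seqno), files in pairs.items() if h == hostid
    )


if __name__ == "__main__":
    # Assumption: the current directory is the project root.
    cgi_dir = Path("cgi-bin")
    names = [p.name for p in cgi_dir.iterdir()] if cgi_dir.is_dir() else []
    for (hostid, seqno), files in sorted(pair_transcripts(names).items()):
        print(hostid, seqno, files.get("request", "-"), files.get("reply", "-"))
```

Sequence numbers are compared as integers, so RPC 12 sorts after RPC 3 rather than before it as it would in a plain directory listing.
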
To get core files for scheduler crashes, uncomment the following line in `sched/sched_main.cpp` and recompile:
{{{
#define DUMP_CORE_ON_SEGV 1
}}}
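
When following up on any of the problems above, it often helps to collect everything the log files say about one job. As a complement to grepping for `WU#12345` or `RESULT#12345` by hand (see Log files), here is a small sketch of a helper that scans every file in a log directory for one job ID. It is not a BOINC tool; the function names `find_job_lines` and `search_log_dir` are hypothetical, and the directory argument is whatever your '''log_HOSTNAME''' directory is called.

```python
import re
import sys
from pathlib import Path


def find_job_lines(lines, kind, job_id):
    """Return the lines mentioning KIND#ID, where KIND is "WU" or "RESULT".

    The (?!\\d) lookahead stops WU#123 from also matching WU#1234,
    which a plain substring grep would do.
    """
    if kind not in ("WU", "RESULT"):
        raise ValueError("kind must be 'WU' or 'RESULT'")
    pattern = re.compile(rf"{kind}#{int(job_id)}(?!\d)")
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def search_log_dir(log_dir, kind, job_id):
    """Scan every regular file in log_dir; return (file name, line) pairs."""
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
        with open(path, errors="replace") as f:
            hits.extend((path.name, line) for line in find_job_lines(f, kind, job_id))
    return hits


if __name__ == "__main__":
    # Usage (hypothetical script name): python find_job.py log_HOSTNAME WU 12345
    if len(sys.argv) == 4:
        for name, line in search_log_dir(sys.argv[1], sys.argv[2], sys.argv[3]):
            print(f"{name}: {line}")
```

Printing the file name next to each hit shows which component (transitioner, validator, and so on) logged the line, which makes it easier to see where in the pipeline a job stalled.
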