Wednesday, May 20, 2009

Solaris CAT

Sanity Checks

Solaris CAT will run some sanity checks upon startup to look for hints of known problems. These will appear in a section just before the prompt appears, similar to:

sanity checks: settings...vmem...cpu...sysent...clock...misc...done

which is an example of a warning-free startup. A startup that reports items worth attention looks like:
sanity checks: settings...vmem...cpu...sysent...clock...misc...

WARNING: console output stopped by ctrl-s (1027 bytes pending)
WARNING: 1 pending softints (softlevel1 queued on CPU1)

done

Two types of messages will appear:

  1. Issues which have been observed to cause problems in other systems or cores will begin with WARNING:.
  2. Interesting discoveries which are unlikely to cause problems will be preceded by NOTE:.
By default, all checks mentioned below will appear as warnings unless indicated by (NOTE).
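
If you save the startup text to a file, the two message types can be separated by their prefixes. The following is a minimal sketch, not part of Solaris CAT: the function name `split_sanity_messages` is hypothetical, and it assumes each message sits on its own line beginning with `WARNING:` or `NOTE:`, as in the example above.

```python
def split_sanity_messages(startup_text):
    """Split sanity-check output into WARNING and NOTE message lists.

    Hypothetical helper; assumes one message per line, prefixed as in
    the sample startup output.
    """
    warnings, notes = [], []
    for line in startup_text.splitlines():
        line = line.strip()
        if line.startswith("WARNING:"):
            warnings.append(line[len("WARNING:"):].strip())
        elif line.startswith("NOTE:"):
            notes.append(line[len("NOTE:"):].strip())
    return warnings, notes


sample = """sanity checks: settings...vmem...cpu...sysent...clock...misc...

WARNING: console output stopped by ctrl-s (1027 bytes pending)
WARNING: 1 pending softints (softlevel1 queued on CPU1)

done"""

warnings, notes = split_sanity_messages(sample)
print(len(warnings), "warnings,", len(notes), "notes")
```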

These are intended to point you in a good direction to begin an investigation. They are not necessarily related to the problem being investigated and may, in fact, be wholly irrelevant.

There are two scatenv settings which can be used to control the operation of the sanity checks:

  • sanity_check: Turning this setting off will disable all of the sanity checks.
  • sanity_note: Turning this setting off will disable only the NOTE information.


/etc/system settings

General checks performed on entries in this file are as follows:
set mod:var=val

  • mod not loaded (NOTE)
  • mod loaded or NULL, and var doesn't exist
  • mod loaded or NULL, and var exists but is not STT_OBJECT
  • mod loaded or NULL, and var exists but is not STB_GLOBAL or STB_LOCAL
  • mod:var seen more than once with different val
  • mod:var seen more than once with same val (NOTE)
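
The last two checks can be illustrated outside of Solaris CAT. This is a rough sketch under stated assumptions: the regular expression and the function `find_repeated_settings` are hypothetical, only simple `set mod:var=val` and `set var=val` lines are recognized, and values are compared as text (so `0x400` and `1024` would count as different).

```python
import re
from collections import defaultdict

# Hypothetical pattern for "set [mod:]var=val"; other /etc/system
# directives (forceload, exclude, ...) are ignored.
SET_LINE = re.compile(r"^\s*set\s+(?:([^\s:=]+):)?([^\s:=]+)\s*=\s*(\S+)")


def find_repeated_settings(lines):
    """Report settings that appear more than once.

    Same value every time -> NOTE; differing values -> WARNING.
    """
    seen = defaultdict(list)
    for line in lines:
        if line.lstrip().startswith("*"):  # '*' starts a comment in /etc/system
            continue
        match = SET_LINE.match(line)
        if match:
            mod, var, val = match.groups()
            seen[(mod, var)].append(val)

    findings = []
    for (mod, var), vals in seen.items():
        if len(vals) < 2:
            continue
        name = f"{mod}:{var}" if mod else var
        severity = "NOTE" if len(set(vals)) == 1 else "WARNING"
        findings.append((severity, name, vals))
    return findings


example = [
    "* local tuning",
    "set maxusers=40",
    "set ufs:ufs_HW=0x400",
    "set maxusers=40",
    "set ufs:ufs_HW=2048",
]
for finding in find_repeated_settings(example):
    print(finding)
```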

Checks are also made for the following problems with specific settings (some are intentionally redundant):

  • msginfo_msseg set to > 32767
  • ngroups_max set to <>
  • sq_max_size == 0 or > 100
  • rlim_fd_cur > 1024 on Solaris <>
  • rlim_fd_max > 1024 on Solaris <>
  • lwp_default_stksize not a multiple of _pagesize
  • sd_max_xfer_size set
  • ssd_max_xfer_size set
  • desfree != lotsfree/2
  • minfree != desfree/2
  • throttlefree != minfree
  • cachefree != lotsfree or lotsfree * 2 (Solaris 7 and 8)
  • dyncachefree != lotsfree or cachefree (Solaris 7 and 8)
  • lotsfree < desfree
  • desfree < minfree
  • minfree < throttlefree
  • cachefree < lotsfree
  • ufs_LW > ufs_HW
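
The relationships among the paging thresholds in the list above can be expressed as a small table of predicates. This is only a sketch: `check_paging_thresholds` is a hypothetical name, the values are assumed to be page counts, and the halving is assumed to use integer division. The cachefree, dyncachefree, and ufs checks are left out.

```python
def check_paging_thresholds(lotsfree, desfree, minfree, throttlefree):
    """Return the labels of the threshold relationships that do not hold.

    Hypothetical helper mirroring the list above; integer division for
    the "/2" comparisons is an assumption.
    """
    checks = [
        ("desfree != lotsfree/2", desfree != lotsfree // 2),
        ("minfree != desfree/2", minfree != desfree // 2),
        ("throttlefree != minfree", throttlefree != minfree),
        ("lotsfree < desfree", lotsfree < desfree),
        ("desfree < minfree", desfree < minfree),
        ("minfree < throttlefree", minfree < throttlefree),
    ]
    return [label for label, failed in checks if failed]


consistent = check_paging_thresholds(
    lotsfree=1024, desfree=512, minfree=256, throttlefree=256)
tuned = check_paging_thresholds(
    lotsfree=1024, desfree=2048, minfree=256, throttlefree=256)
print(consistent)
print(tuned)
```

With desfree raised above lotsfree, several findings fire at once, which shows why some of the checks are intentionally redundant.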


clock-related sanity checks

All of these are hints that the kernel clock may not be advancing:
  • panic_hrtime, hres_last_tick, or hrtime_base behind by more than 10 minutes
  • clock_pend or clock_reruns nonzero
  • cyclics pending for more than 10 minutes (Solaris >= 8)
  • callouts which expired more than 1 second ago


cpu structure sanity checks

The list of CPUs on the system is checked for various conditions which could indicate areas of interest:
  • a count of CPUs which are offline
  • CPUs which have cpu_intr_actv set
  • CPUs whose cpu_base_spl is greater than 0
  • CPUs whose last_swtch was more than 15 seconds ago
  • CPUs which have a thread on the processor using more than 90% CPU
  • CPUs which have a pinned thread using more than 90% CPU
  • CPUs which have more than 5 threads in their dispatch queues
  • CPUs which have threads on their dispatch queue whose t_disp_queue is set to a different CPU
  • CPUs which have an implementation number different than that of the first CPU
  • CPUs which have a clock speed different than that of the first CPU (NOTE)
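
Three of the checks above (dispatch queue length, implementation number, and clock speed) compare simple per-CPU values and can be sketched as follows. The `Cpu` record and its field names are hypothetical stand-ins, not the kernel's cpu structure, and the message wording is illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    # Hypothetical, simplified record; not the kernel's cpu_t layout.
    cpu_id: int
    impl: int
    clock_mhz: int
    dispq_threads: int


def check_cpus(cpus):
    """Compare every CPU against the first one and flag long dispatch queues."""
    findings = []
    if not cpus:
        return findings
    first = cpus[0]
    for cpu in cpus:
        if cpu.dispq_threads > 5:
            findings.append(("WARNING", cpu.cpu_id, "more than 5 threads in dispatch queue"))
        if cpu.impl != first.impl:
            findings.append(("WARNING", cpu.cpu_id, "implementation differs from first CPU"))
        if cpu.clock_mhz != first.clock_mhz:
            findings.append(("NOTE", cpu.cpu_id, "clock speed differs from first CPU"))
    return findings


cpus = [
    Cpu(cpu_id=0, impl=0x15, clock_mhz=1200, dispq_threads=0),
    Cpu(cpu_id=1, impl=0x15, clock_mhz=1050, dispq_threads=7),
]
for finding in check_cpus(cpus):
    print(finding)
```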


memory-related sanity checks

  • availrmem <= tune.t_minarmem
  • freemem < throttlefree (page_create() throttled)
  • avefree < minfree (hard swapping)
  • avefree < desfree && freemem <= desfree (soft swapping)
  • avefree < lotsfree (paging)
  • avefree < dyncachefree (paging fs pages)
  • kernel cage checks on Solaris >= 7
    • kcage_freemem < kcage_lotsfree (NOTE)
    • kcage_freemem < kcage_desfree
    • kcage_freemem < kcage_minfree
    • kcage_freemem < kcage_throttlefree
    • kcage_needfree > 0
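
The non-cage comparisons above amount to a series of threshold tests, each mapped to the state named in parentheses. A minimal sketch, assuming the values have already been read from the dump into a dictionary (the function name and dictionary keys are hypothetical; the dyncachefree and kernel cage checks are omitted):

```python
def classify_memory_pressure(m):
    """Return the memory conditions that apply, in the order listed above.

    `m` is a hypothetical dict of page counts keyed by variable name.
    """
    findings = []
    if m["availrmem"] <= m["t_minarmem"]:
        findings.append("availrmem at or below tune.t_minarmem")
    if m["freemem"] < m["throttlefree"]:
        findings.append("page_create() throttled")
    if m["avefree"] < m["minfree"]:
        findings.append("hard swapping")
    if m["avefree"] < m["desfree"] and m["freemem"] <= m["desfree"]:
        findings.append("soft swapping")
    if m["avefree"] < m["lotsfree"]:
        findings.append("paging")
    return findings


snapshot = {
    "availrmem": 5000, "t_minarmem": 25,
    "freemem": 280, "avefree": 300,
    "lotsfree": 1024, "desfree": 512,
    "minfree": 256, "throttlefree": 256,
}
print(classify_memory_pressure(snapshot))
```

Because the thresholds are nested, a system under heavy pressure will usually trip several of these at once.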


miscellaneous sanity checks

  • device in use for more than 1 filesystem or swap
  • rootdir is NULL (intentional panic)
  • coredump size doesn't match that calculated from the dumphdr (incomplete or corrupt coredump)
  • DF_LIVE set in dumphdr dump_flags (live coredump)
  • DF_COMPLETE not set in dumphdr dump_flags (incomplete coredump)
  • kernelbase not expected value (corrupt coredump)
  • max_nprocs - nproc <= 0 (ran out of processes)
  • nproc > 90% of max_nprocs (running out of processes)
  • symbol bunyip_vnodeops present (bunyip module loaded) (NOTE)
  • sysent or sysent32 table entries which have sy_call or sy_callc with modules other than those in the following list (system call interceptor code loaded):
    • "genunix"
    • "unix"
    • "pipe"
    • "nfs"
    • "doorfs"
    • "msgsys"
    • "shmsys"
    • "semsys"
    • "kaio"
    • "pset"
    • "cpc"
    • "c2audit"
    • "sysacct"
    • "inst_sync"
    • "srmlimitsys"
    • "rpcmod"
    • "samsys"
    • "samsys64"
    • "autofs"
    • "portfs"
  • processor sets created (NOTE)
  • ndd parameter ip_icmp_err_interval set to 0
  • init process is a zombie
  • disk commands pending
  • pending softints
  • count of pages retired due to errors
  • syncq service threads active (Solaris 8 only) (hung streams)
  • console output stopped by ctrl-s
  • system_taskq with active threads
  • callouts with high bit set in c_runtime
  • ptl1 panics where TL[N] tt is 0x68, TL[N] tpc is a stx (64b) or stw (32b), and the address being stored to is in the onproc thread's redzone (probable stack overflow) (Solaris 9 and above only)
  • vmem checks on Solaris >= 8
    • vmem arenas with threads asleep on the vm_cv (threads waiting for allocations)
  • rmap checks on Solaris <>
    • any of kernelmap, kernelmap32, or kobj_map with:
      • m_want == 1 (thread waiting for an entry)
      • m_free == m_size and m_want == 1 (out of entries in map)
