Degradation

Latest revision as of 19:31, 7 November 2011

We get accustomed to a certain level of performance from our 400, then unexpectedly it is as if someone has put on the brakes. Whoa, why is it so sluggish right now? Here are links to tips and techniques you can use when this happens, to figure out what is going on and fix it.

Tips & Techniques

These tips can also help when you experience what might be called a slide into oblivion, in which the 400 seems to run steadily more slowly over time. What the heck is going on, and how do we fix it?

  • Backup Save Restore might be worth a review before any major changes get made.
  • BPCS Files can have millions of records, going back a year or more, while 90% of the users really only need the last few weeks' worth when they run inquiries against those files. There are enhancements available to archive the older records into another library, so they are not part of the standard inquiry but remain available for reports when needed. Check the BPCS-L archives for discussion of alternative third-party solutions for this.
  • Check Disk Space Health
  • DSPJOBTBL, preferably at several benchmark checkpoints so you can compare.
  • DSPMSG QCFGMSGQ = If you create this message queue, IBM will send it messages about hardware problems. If a workstation has gone flaky, it can connect, disconnect, and reconnect repeatedly, leaving a string of unwanted garbage.
  • DSPMSG QSYSMSG = If you create this message queue, IBM will send it some messages about very bad stuff, such as the cache battery on your disk I/O adapter going flaky and needing to be replaced.
  • DSPMSG QSYSOPR = Is there some problem right now awaiting a response? Do you know what a runaway job is?
  • Kill Jobs Preparation
  • Locks and Deadly Locks
  • Manuals that can be helpful in this scenario:
    • Work Management
  • Performance Tuning may be needed.
  • Remote Printer Hung
  • SYSCMDUSNO = one of the CLP/400 examples. This CLP/400 program lists bad stuff that's recently been going on, such as:
    • CPF4058 = Here's a file with significant growth, better do something before it explodes.
      • See the DSPFD discussion of how to deal with files in trouble, under File Commands (http://wiki.midrange.com/index.php/Category:Commands#File_Commands)
    • CPI1479 = Your 400 has become over-taxed with interactive activity. Your choices include:
      • Grin and Bear it
      • Bare your company wallet to IBM
      • Check TIMES this is happening (my first choice)
        • If it happens at the same time each day, and that time coincides with shift change or lunch break, suggest to some co-workers that if they sign on or off a few minutes before or after shift change, it might go faster.
        • If it happens at the same time each day, and that time is in the middle of people's work day, then use DSPLOG to see what kind of tasks typically run at that hour.
        • If you identify a particular program that seems like it might be the culprit, take a look at which files it accesses and how. Perhaps there is a poorly designed join of some humongous files.
      • Analyse interactive tasks to see if any can be moved to JOBQ (my second choice)
        • Teach co-workers how to send Query to JOBQ.
      • Do something that can get you in big trouble with IBM
      • Downsize the company
      • Update your resume
  • Sub System abuse such as a batch job running in Interactive mode
  • Consider the merits of Temporary Logicals (http://wiki.midrange.com/index.php/DB2#Temporary_Access_Paths)
  • WHO BAD = CLP/400 program to display which jobs are using 3% of system resources or more. You can customize where you set the cut-off.
  • WRKPRB = Get a list of recent events that IBM categorizes as hardware problems
  • 400 101 could be reviewed in case of any common misconceptions
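
The "check the TIMES this is happening" advice for CPI1479 above comes down to counting occurrences by hour of day. Here is a minimal sketch of that counting in Python, run off the 400 on a workstation. It assumes you have already collected the message timestamps by hand (for example from DSPLOG output). The function name, the timestamp format, and the sample data are all hypothetical.

```python
from collections import Counter
from datetime import datetime


def busiest_hours(timestamps, top=3):
    """Count occurrences per hour of day and return the most frequent hours."""
    hours = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for ts in timestamps
    )
    return hours.most_common(top)


# Hypothetical CPI1479 timestamps, typed in by hand from the history log.
sample = [
    "2011-11-01 07:58:12",
    "2011-11-01 12:01:40",
    "2011-11-02 07:59:03",
    "2011-11-03 07:57:45",
    "2011-11-03 15:30:00",
]

for hour, count in busiest_hours(sample):
    print(f"{hour:02d}:00  {count} occurrence(s)")
```

If one hour clearly dominates, that is the hour to investigate with DSPLOG, as described in the list above.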

Related Troubleshooting

Other problems may call for many of the same problem-solving tools.

  • An error occurs while a job on a job queue is running, and nobody notices right away. Other jobs pile up on the JOBQ until someone who is used to jobs going into the queue and completing within a predictable time asks what is going on. By then there is a large backlog of waiting jobs, and we need to resequence them in addition to dealing with the hung job.

Disk cache battery

The I/O adapters for the disk drives contain a battery that keeps the adapter's cache memory powered up in case of an unexpected power failure. That's important because of the single-level store architecture of IBM i; some objects may only exist in memory at any given moment. If this battery gets old, the system, for safety, bypasses the cache and writes directly to disk. This will degrade performance.

IBM i issues a message when this happens: CPPEA13 - *Attention* Contact your hardware service provider. When you report this to IBM hardware service, they'll send over a CE with a battery and you'll have to power the system down to replace it. This is inconvenient. It is better to know how much time is left so you can schedule the downtime.

The normal way to see the battery status is to run STRSST, go to Hardware Service Manager, and work with the IOA resources that have batteries. Alternatively, use STRSST, Display/Alter/Dump.

IBM has released PTFs [1] that let a user with *SERVICE authority call a program to produce a report of battery status:

IOA cache battery PTFs

OS version   PTF
V5R4         SI40403
6.1          SI40404
7.1          SI40406

These PTFs can be applied *IMMED, and they apply to the operating system rather than the LIC. Once they are applied, run the program with CALL QSYS/QSMBTTCC. This will produce a report similar to:

RUNNING MACRO: BATTERYINFO                   -LIST -ALL                        
***LIST OF ALL RESOURCES THAT HAVE CACHE***                                    
                                                       CONCURRENTLY   CAN BE   
RESOURCE   SERIAL          TYPE      FRAME   CARD      MAINTAINABLE   SAFELY   
NAME       NUMBER          MODEL     ID      POSITION  BATTERY PACK   REPLACED 
DC06       2D-92nnnnn      572F-001  3C02    C1        YES            NO       
DC09       2D-92nnnnn      575C-001  3C02    C2        YES            NO       
RUNNING MACRO: BATTERYINFO                   -LIST -WARN                       
***LIST OF ALL RESOURCES THAT HAVE CACHE                                       
   WITH THE ESTIMATED TIME TO WARNING IN DAYS***                               
                                                       EST. TIME     EST. TIME 
RESOURCE   SERIAL          TYPE      FRAME   CARD      TO WARNING    TO ERROR  
NAME       NUMBER          MODEL     ID      POSITION  (IN DAYS)     (IN DAYS) 
DC06       2D-92nnnnn      572F-001  3C02    C1           719           810    
DC09       2D-92nnnnn      575C-001  3C02    C2           719           810    

This is followed by details for each unit.
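
Once the report is in a text file, a short script can pull out the days-to-warning figures so the downtime can be scheduled ahead of time. Here is a minimal sketch in Python. It assumes the report text has been copied off the system into a string and that the -LIST -WARN section is laid out exactly like the sample above. The function names, dictionary keys, and the 90-day threshold are hypothetical choices, not part of the IBM tooling.

```python
def parse_warn_section(report_text):
    """Return one dict per adapter from the '-LIST -WARN' section of the report."""
    rows = []
    in_warn = False
    for line in report_text.splitlines():
        if line.startswith("RUNNING MACRO:"):
            # Each macro header starts a new section; only -LIST -WARN is parsed.
            in_warn = "-LIST -WARN" in line
            continue
        if not in_warn:
            continue
        fields = line.split()
        # Data rows have 7 columns and end with two whole numbers (days).
        if len(fields) == 7 and fields[5].isdigit() and fields[6].isdigit():
            rows.append({
                "resource": fields[0],
                "type_model": fields[2],
                "days_to_warning": int(fields[5]),
                "days_to_error": int(fields[6]),
            })
    return rows


def needs_scheduling(rows, threshold_days=90):
    """List resources whose estimated time to warning is within the threshold."""
    return [r["resource"] for r in rows if r["days_to_warning"] <= threshold_days]


# Sample text in the layout shown above (serial numbers masked as in the report).
sample_report = """\
RUNNING MACRO: BATTERYINFO                   -LIST -WARN
***LIST OF ALL RESOURCES THAT HAVE CACHE
   WITH THE ESTIMATED TIME TO WARNING IN DAYS***
                                                       EST. TIME     EST. TIME
RESOURCE   SERIAL          TYPE      FRAME   CARD      TO WARNING    TO ERROR
NAME       NUMBER          MODEL     ID      POSITION  (IN DAYS)     (IN DAYS)
DC06       2D-92nnnnn      572F-001  3C02    C1           719           810
DC09       2D-92nnnnn      575C-001  3C02    C2           719           810
"""

rows = parse_warn_section(sample_report)
for row in rows:
    print(row["resource"], row["days_to_warning"], row["days_to_error"])
print("Schedule soon:", needs_scheduling(rows))
```

With the sample figures, neither adapter falls within the 90-day threshold, so the final list is empty. Raise or lower threshold_days to match how far ahead you need to book downtime.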

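References

[1] Dawn May, "i Can … Display the Status of Your IOA Cache Batteries", i Can, 19 Jul 2010. http://ibmsystemsmag.blogs.com/i_can/2010/07/i-can-display-the-status-of-your-ioa-cache-batteries.html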