Difference between revisions of "Degradation"
MrDolomite (talk | contribs) m (→Disk cache battery: link to QSMBTTCC)
Latest revision as of 19:31, 7 November 2011
We grow accustomed to a certain level of performance from our 400, and then, unexpectedly, it is as if someone has put on the brakes. Whoa, why is it so sluggish right now? Here are links to tips and techniques you can use when this happens, to figure out what is going on and fix it.
Tips & Techniques
These tips can also help when you experience what might be called a slide into oblivion, in which the 400 seems to get steadily slower over time. What the heck is going on, and how do we fix it?
- Backup Save Restore might be worth a review before any major changes get made.
- BPCS Files can have millions of records, going back a year or more, while 90% of the users really only need the last few weeks' worth when they run inquiries against those files. There are enhancements available to archive the older stuff into another library, so that it is not part of the standard inquiry but is still available for reports when needed. Check the BPCS-L archives for discussion of alternative third-party solutions for this.
- Check Disk Space Health
- DSPJOBTBL, preferably at several benchmark checkpoints.
- DSPMSG QCFGMSGQ = If you create this message queue, IBM will send it messages about hardware problems. If a workstation has gone flaky, it may connect, disconnect, reconnect, and send a string of unwanted garbage.
- DSPMSG QSYSMSG = If you create this message queue, IBM will send it messages about very bad stuff, such as the cache battery on your disk I/O adapter going flaky and needing to be replaced.
- DSPMSG QSYSOPR = Is some problem awaiting a response right now? Do you know what a runaway job is?
- Kill Jobs Preparation
- Locks and Deadly Locks
- Manuals that can be helpful in this scenario:
  - Work Management
- Performance Tuning may be needed.
- Remote Printer Hung
- SYSCMDUSNO = one of the CLP/400 examples. This CLP/400 program lists bad stuff that has recently been going on, such as:
  - CPF4058 = Here's a file with significant growth; better do something before it explodes.
    - See the DSPFD discussion of how to fix files in trouble, under File Command examples of 400 Commands (http://wiki.midrange.com/index.php/Category:Commands#File_Commands).
  - CPI1479 = Your 400 has become over-taxed with interactive activity. Your choices include:
    - Grin and bear it
    - Bare your company wallet to IBM
    - Check the TIMES this is happening (my first choice)
      - If it happens at the same time each day, and that time coincides with a shift change or lunch break, suggest to co-workers that signing on or off a few minutes before or after the shift change might go faster.
      - If it happens at the same time each day, in the middle of people's work day, use DSPLOG to see what kinds of tasks typically run at that hour.
      - If you identify a particular program that looks like the culprit, take a look at the files it accesses, and how. Perhaps there is a poorly designed join of some humongous files.
    - Analyse interactive tasks to see if any can be moved to JOBQ (my second choice)
      - Teach co-workers how to send Query to JOBQ.
    - Do something that can get you in big trouble with IBM
    - Downsize the company
    - Update your resume
- Sub System abuse, such as a batch job running in interactive mode
- Consider the merits of Temporary Logicals (http://wiki.midrange.com/index.php/DB2#Temporary_Access_Paths)
- WHO BAD = CLP/400 program to display which jobs are using 3% of system resources or more. You can customize where you set the cut-off.
- WRKPRB = Get a list of recent events that IBM categorizes as hardware problems
- 400 101 is worth reviewing in case of any common misconceptions
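The "Check the TIMES" advice under CPI1479 above boils down to asking whether the message keeps showing up in the same hour on different days. Here is a minimal sketch of that check, written in Python so it can be run off the system. It assumes you have already copied the CPI1479 timestamps out of the history log by hand; the timestamp format, the `recurring_hours` name, and the `min_days` cut-off are all hypothetical choices for illustration, not anything DSPLOG or SYSCMDUSNO provides.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample: times at which CPI1479 was logged, typed in by hand.
# The "YYYY-MM-DD HH:MM:SS" layout is an assumption, not the DSPLOG format.
sample_stamps = [
    "2011-11-01 12:02:10",
    "2011-11-01 12:40:00",
    "2011-11-02 12:05:41",
    "2011-11-02 15:30:00",
    "2011-11-03 12:01:07",
]


def recurring_hours(timestamps, min_days=3):
    """Return the hours of the day in which the message appeared
    on at least `min_days` different calendar days."""
    days_by_hour = defaultdict(set)
    for stamp in timestamps:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # Track distinct days per hour so two hits on one day count once.
        days_by_hour[moment.hour].add(moment.date())
    return sorted(
        hour for hour, days in days_by_hour.items() if len(days) >= min_days
    )


if __name__ == "__main__":
    print(recurring_hours(sample_stamps))  # prints [12]
```

An hour that comes back from this is the one to look at with DSPLOG, or to compare against shift changes and lunch breaks. An empty list suggests the overload is not tied to the clock, and the other choices in the list are more likely to help.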
Related Troubleshooting
Other problems call for many of the same problem-solving tools.
- An error occurs in the execution of a job on a job queue, and nobody notices it immediately. Other jobs pile up on the JOBQ until someone who is accustomed to stuff going into the queue and completing in a predictable time asks a question. By then there is a huge pile of jobs waiting, and we need to alter their sequence in addition to dealing with the hung job.
Disk cache battery
The I/O adapters for the disk drives contain a battery that keeps the adapter's cache memory powered up in case of an unexpected power failure. That's important because of the single-level store architecture of IBM i; some objects may only exist in memory at any given moment. If this battery gets old, the system, for safety, bypasses the cache and writes directly to disk. This will degrade performance.
IBM i issues a message when this happens: CPPEA13 - *Attention* Contact your hardware service provider. When you report this to IBM hardware service, they'll send over a CE with a battery and you'll have to power the system down to replace it. This is inconvenient. It is better to have an idea of how much time is left, so you can schedule the downtime.
The usual way to see the battery status is through STRSST: either use Hardware Service Manager and work with the IOA resources that have batteries, or use Display/Alter/Dump.
IBM has released PTFs [1] that enable a user with *SERVICE authority to call a program that produces a report of battery status:
IOA cache battery PTFs
OS version | PTF |
---|---|
V5R4 | SI40403 |
6.1 | SI40404 |
7.1 | SI40406 |
These PTFs can be applied *IMMED, and they apply to the operating system, not the LIC. Once they are applied, run the program with CALL QSYS/QSMBTTCC. This will produce a report similar to:
RUNNING MACRO: BATTERYINFO -LIST -ALL
***LIST OF ALL RESOURCES THAT HAVE CACHE***
                                                 CONCURRENTLY  CAN BE
RESOURCE  SERIAL      TYPE      FRAME  CARD      MAINTAINABLE  SAFELY
NAME      NUMBER      MODEL     ID     POSITION  BATTERY PACK  REPLACED
DC06      2D-92nnnnn  572F-001  3C02   C1        YES           NO
DC09      2D-92nnnnn  575C-001  3C02   C2        YES           NO
RUNNING MACRO: BATTERYINFO -LIST -WARN
***LIST OF ALL RESOURCES THAT HAVE CACHE
WITH THE ESTIMATED TIME TO WARNING IN DAYS***
                                                 EST. TIME     EST. TIME
RESOURCE  SERIAL      TYPE      FRAME  CARD      TO WARNING    TO ERROR
NAME      NUMBER      MODEL     ID     POSITION  (IN DAYS)     (IN DAYS)
DC06      2D-92nnnnn  572F-001  3C02   C1        719           810
DC09      2D-92nnnnn  575C-001  3C02   C2        719           810
This is followed by details for each unit.
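If you save that report as text, the "time to warning" figures can be checked automatically so the battery swap gets scheduled before CPPEA13 ever appears. The following is a rough sketch in Python, assuming the report has the whitespace-separated layout shown above. The names `parse_warn_section` and `batteries_due_within` and the sample text are made up for illustration; nothing here is part of the PTF or of QSMBTTCC itself.

```python
from typing import NamedTuple

# Sample text modeled on the report above (layout is an assumption).
SAMPLE_REPORT = """\
RUNNING MACRO: BATTERYINFO -LIST -ALL
***LIST OF ALL RESOURCES THAT HAVE CACHE***
DC06 2D-92nnnnn 572F-001 3C02 C1 YES NO
DC09 2D-92nnnnn 575C-001 3C02 C2 YES NO
RUNNING MACRO: BATTERYINFO -LIST -WARN
***LIST OF ALL RESOURCES THAT HAVE CACHE
WITH THE ESTIMATED TIME TO WARNING IN DAYS***
RESOURCE SERIAL TYPE FRAME CARD TO WARNING TO ERROR
NAME NUMBER MODEL ID POSITION (IN DAYS) (IN DAYS)
DC06 2D-92nnnnn 572F-001 3C02 C1 719 810
DC09 2D-92nnnnn 575C-001 3C02 C2 719 810
"""


class BatteryEstimate(NamedTuple):
    resource: str
    type_model: str
    days_to_warning: int
    days_to_error: int


def parse_warn_section(report: str) -> list:
    """Pick the data rows out of the '-LIST -WARN' part of the report."""
    estimates = []
    in_warn_section = False
    for line in report.splitlines():
        if "BATTERYINFO" in line:
            # Each macro banner starts a new section; only -WARN has day counts.
            in_warn_section = "-WARN" in line
            continue
        fields = line.split()
        # A data row has seven columns and ends in two whole numbers.
        if (in_warn_section and len(fields) == 7
                and fields[5].isdigit() and fields[6].isdigit()):
            estimates.append(BatteryEstimate(
                fields[0], fields[2], int(fields[5]), int(fields[6])))
    return estimates


def batteries_due_within(report: str, days: int) -> list:
    """Resource names whose estimated time to warning is `days` or less."""
    return [e.resource for e in parse_warn_section(report)
            if e.days_to_warning <= days]


if __name__ == "__main__":
    print(batteries_due_within(SAMPLE_REPORT, 90))
```

With the sample figures, no adapter is within 90 days of a warning, so nothing is reported. The cut-off is yours to choose, depending on how far ahead you need to book downtime.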
References
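- [1] Dawn May, "i Can … Display the Status of Your IOA Cache Batteries", i Can, 19 Jul 2010. http://ibmsystemsmag.blogs.com/i_can/2010/07/i-can-display-the-status-of-your-ioa-cache-batteries.html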