Some top causes for unplanned outages

I read a blog from Mr. Gavin Soorma and find it vary useful to protect database from unplanned outage. Just record it here…

Some top causes for unplanned outages

I have quickly thought about some of the events which had caused unplanned downtime at my past organisation. Idea is to share the same and ensure our current monitoring infrastructure is taking into account aspects discussed below – better to learn from other’s experiences than to experience it yourself.

  • Listener log file grew > 2GB in a Red Hat environment. This prevented user sessions from connecting even though the listener was running. Not sure if there is an OS file limit in AIX, but if logging is enabled, a good practice will be to truncate the log on a regular basis.
  • Controlfile was multiplexed over three mount points and one of the mount points got 100% full. Error thrown by the application as well as alert log is ORA-0600 with a lot of arguments. Will not specifically indicate a disk space issue when looked up on Metalink.
  • Temporary Tablespaces created using large size for the tempfiles. At creation time, Oracle will not check if sufficient disk space is available or not. Problem will manifest itself when some operation like a sort or index build will start using temporary tablespace disk space and disk space is not adequate.
  • MAXEXTENTS value reached for tables and indexes. Application outage occurred when new tables and indexes were created and developer hard coded a very small value for MAXEXTENTS instead of leaving it to the default value which in UNLIMITED.
  • Disk holding archive log suddenly got filled due to a spurt in redo activity caused by a month end or year end batch job. Good practice is to have an automated RMAN job which will backup and delete archive log files if a certain disk space usage threshold is crossed.
  • Mount point holding Oracle software binaries got full. This will prevent any SQL*PLUS connections directly to the host machine. Since the trace file and core dump directories are usually on the same mount point as the ORACLE_HOME, we should ensure that effective daily housekeeping jobs are in place to not only clean up old trace files but also to rename the alert log on a daily basis so that the alert log only has entries for one day. The alert log is normally visited when a problem occurs and having the alert log have entries for the past 6 months or take a long time for the file to open using vi because it has grown very large is not a good practice.
  • Procedures or Packages got INVALID because some DDL activity was done on another object with dependencies. A good practice is to ensure that no INVALID objects exist in the database and to run a job which polls the database for any objects that have got invalidated. This will also serve from an auditing viewpoint as it may identify if anyone is doing an unauthorised modifications to the database.
  • Data Guard log shipping terminated because of some activity on the Primary database like password change for SYS, datafile added or resized on the Primary database, but the corresponding mount point on the standby machine did not have adequate space to accommodate the resize. Best practice is to ensure that the disk space allocation and mount point mapping is the same on both Primary as well as Standby machines.
  • Public database links are used by the application and the password of the user account used in the database link was changed as part of a monthly password change. Best practice is to never change the password for the user account that is used to connect via the database link.
  • Limit for parameters SESSIONS or PROCESSES which are defined in the init,ora was crossed. The only way to fix this is to kill sessions at the OS level or change the parameters and restart the database. A good practice would be to have an alerting mechanism which warns you if the threshold set is being reached. That way we can be proactive and kill any idle or inactive sessions so that this upper limit is not reached.
  • Planned outage for a data movement exercise was extended because Import of data failed because of lack of tablespace free space in target database.Consequently the whole exercise had to be repeated. Best practice is to use the RESUMABLE parameter while doing such operations so that they can be resumed from the point of failure and also to write a Database Event Trigger which can alert the DBA in case such an event happens.
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: