One of the most common issues with Oracle databases that result in major business issues happen when database resources are overwhelmed with work. When this happens, some or all of the following can happen:
- Response time can increase to the point where no SQL operations are completing from any connected sessions,
- The database and the server or servers on which it is running become completely unresponsive,
- If an application connection pool is not in use, often this will cause session counts to increase to the limit,
- The database will refuse to allow new connections from applications, clients, and even scheduled jobs in the database, bringing work to a standstill.
- Backups can be affected which can affect uptime, and even standby databases can fall behind.
From a business application standpoint, the database – which is frequently the most critical piece of datacenter infrastructure – is dead. If the application is a website for example, it completely stops working.
Often, the failure happens very quickly, the behaviour above is a complete suprise, and efforts are strenuously employed to get it back and working right away by restarting database clients and even the database.
What is the root cause?
There are basically three classes of root causes to the issue above; a given problem will always be for at least one of these reasons, and sometimes all of them:
- The application workload has increased and the query execution plans have not changed, or
- The application workload is the same as it was previously, but something else has changed that causes the database to shatter under the load, or
- The underlying application has changed, or has been “upgraded”.
It’s important to collect performance metrics regularly from the application and database tiers so that the determination of which of the above root causes can quickly be determined. It’s also important to have a log of when applications change, or are upgraded, for the same reason.
Note that it doesn’t matter how capable a database is; it has a finite limit on the amount of work it can do. Any database can be brought to its knees with a trivial amount of effort, even if it’s an X-4 Exadata with 64 nodes.
Let’s assume that it’s clear from application metrics and database SQL metrics that the root cause is #1 – how can this be prevented?
Preventing database meltdown before application loads increase
What can be done to insure that the above never happens? At the core, the problem is that the CPU, memory, disk, I/O, and network resources assigned to Oracle are not up to the task – either in the quality or quantity or both; this is what is limiting the scalability of the application. Admittedly, sometimes business forecasts fall far short of what actually happens in real-life, especially when databases providing services for applications connected to the internet are concerned; for this reason, there needs to be some extra capacity available “just in case”. The first idea – requiring no code changes – is to insure that the database has enough of these resources for ‘peak’ times.
Idea: Insuring enough database resources are allocated
However, these resources cost money, and often it’s not a trivial amount; ideally, the resources needed for a ‘peak’ time are involved to reduce cost. As Oracle licenses databases primarily based on CPU, be warned that adding CPU to an existing Oracle database can be expensive in licensing. Upgrading storage to for example flash will also not be cheap, and sometimes to add memory an entire new machine needs to be acquired.
There are a few things that make it difficult, but not impossible, to predict what application load will cause a database to start to experience these symptoms. The tricky thing is that mathematically, Oracle database resources can be expressed using ‘queuing theory’ (see Craig Shallahamer Forecasting Oracle Performance). When application load is graphed against database response-time, nearly always the graph is not a straight line; it starts out as a straight line, and then there’s a “knee curve” where a small additional amount of application load results in a very large amount of degradation of response time. In other words, given a good testing setup, the application load can be increased little by little, and not suffer much response time degradation until a point. Beyond that point (where the “knee” of the curve is), even tiny increases in application load result in huge increases of response time. So, what is important, is having some idea of where this ‘knee’ occurs, so that additional database resources can be dedicated to the database before the actual peak load occurs.
Of course, sometimes additional scalability can be accomplished by reducing the database resources necessary for a given atomic application server operation.
Idea: better application management of load
For many architectures, the database application server (or application server pool) can be a point where the load on database usage can be throttled. This can be done at a simple level using database connection pools, and by having code that will just refuse to begin a new process if the database will be overwhelmed, usually providing a helpful error message to the client. Unfortunately, in my experience this is sort of difficult to get correct, and requires high-load testing to insure it will work as planned, but it’s a great solution.
Idea: Increasing scalability by improving application code
An alternate to the above is to profile the application to figure out what operations are using the most database resources, and then concentrating on changing application code to require less of them. The database resources that can be retrieved easily for a given application operation can include:
- CPU used,
- logical I/Os,
- Physical I/Os,
- Redo generated,
- block changes,
- memory used
Ideally, when doing quality assurance, these can be tracked over the life of the product. This can result in ‘red flags’ early-on as applications are improved to insure that they are not making things worse.
Ideas of last resort
Let’s assume that none of the above solutions are possible. There are a few processes that can be turned-off to improve performance, but they all have downsides, some severe.
- If redo logs are duplexed, deleting 1 of the duplexed sets will cut in half the amount of redo writes. However, this affects uptime and recoverability; if the one remaining log in a redo log group is lost, the database can’t be recovered.
- If there are scheduled statistics-gathering, auto-sql-tuning, or monitoring jobs, these can be turned off. However, this affects monitoring, sql execution plans, and the ongoing tuning of SQL statements.
- Backups can be de-scheduled; however, this can affect uptime and recoverability possibly leaving the database without a backup,
- If the database is in archived redo log mode (ARCHIVELOGMODE), turning that off can help. Of course, this means that the database isn’t recoverable any more to a given point in time.
- If the database is hitting a ‘number-of-processes’/’number-of-sessions’ limit, increasing the PROCESSES and/or SESSIONS parameters, and decreasing the buffer cache (or MEMORY_TARGET or SGA_TARGET) can sometimes help.
Given a service that may see peaks of activity, it’s important to be able to estimate how much activity will cause database failure and to take steps before it happens. It is not an easy thing to do – to really understand when a production-sized database will fall down, a production database (on the same production hardware) needs to be spun-up, and load testing code needs to be created carefully. If there is a hardware limitation (“No New Hardware!!”), then improving the efficiency of application code with regards to database usage is quite important, as is adding the ability of the application to gracefully throttle database usage.