The week was dominated by turning. The database server was pegged at 100% CPU all week. I have tuned the offending queries to within an inch of their lives. The customer wants to know, why is this happening? And I’m asking the business end, what is driving more users to be active on website.
Early in my career I learned about the top command. With the top command I learn about a few things.
CPU is pegged at 100%. And I this example the CPU has been pegged all day long. The user experience (UE) is suffering.
ldap is the top process.
Oracle has all the other processes.
load averages: 90.29, 92.29, 76.555 13:48:45
390 processes: 297 sleeping, 76 running, 17 on cpu
CPU states: 0.0% idle, 95.7% user, 4.3% kernel, 0.0% iowait, 0.0% swap
Memory: 32G real, 1285M free, 28G swap in use, 5354M swap free
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1069 ds5user 45 32 0 0K 0K run 31:26 3.04% ns-slapd
5416 oracle 11 23 0 0K 0K run 4:14 1.15% oracle
4322 oracle 11 22 0 0K 0K run 7:59 1.13% oracle
4916 oracle 11 22 0 0K 0K run 4:41 1.11% oracle
5937 oracle 11 23 0 0K 0K run 1:05 1.10% oracle
5494 oracle 11 22 0 0K 0K run 2:13 1.09% oracle
4910 oracle 11 22 0 0K 0K run 4:59 1.08% oracle
Lets dig into a few of the top oracle processes and see the query that is running.
SQL> @pid 5416
old 10: and spid = &1
new 10: and spid = 5416
SPID SID SERIAL# USERNAME SQL_ID
———— ———- ———- ———- ————-
5416 953 55 IFSUSER butxjxuvw2m3y
Here is the pid.sql query I use.
SQL> l
1 select p.spid, s.sid, s.serial#, s.username, s.sql_id from v$process p, v$session s where s.paddr = p.addr and s.status = ‘ACTIVE’ and spid = &1
SQL>
So, sql_id butxjxuvw2m3y is my top CPU consumer. I want to see what the query is and what the plan for the query is.
SQL_ID butxjxuvw2m3y
——————–
select .. from VINBOX this_ where (this_.STATUS<>:1 and this_.RECEIVED>=:2 and (this_.ASSIGNEE=:3 or (this_.RESO in (:4, :5, :6, :7, :8, :9, :10, :11, :12, :13, :14) and (this_.ASSIGNEE is null)))) order by lower(this_.RECEIVED) desc
Plan hash value: 3079052936
———————————————————————————————————————–
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
———————————————————————————————————————–
| 0 | SELECT STATEMENT | | | | | 6300 (100)| |
| 1 | PX COORDINATOR | | | | | | |
| 2 | PX SEND QC (ORDER) | :TQ10001 | 873 | 122K| | 6300 (8)| 00:00:13 |
| 3 | SORT ORDER BY | | 873 | 122K| 312K| 6300 (8)| 00:00:13 |
| 4 | PX RECEIVE | | 1 | 77 | | 3 (0)| 00:00:01 |
| 5 | PX SEND RANGE | :TQ10000 | 1 | 77 | | 3 (0)| 00:00:01 |
| 6 | TABLE ACCESS BY INDEX ROWID | INSTANCE | 1 | 77 | | 3 (0)| 00:00:01 |
| 7 | NESTED LOOPS | | 873 | 122K| | 6270 (8)| 00:00:13 |
| 8 | NESTED LOOPS | | 871 | 58357 | | 3831 (11)| 00:00:08 |
| 9 | PX BLOCK ITERATOR | | | | | | |
| 10 | INDEX FAST FULL SCAN | FORM_IDX06 | 871 | 36582 | | 1185 (33)| 00:00:03 |
| 11 | TABLE ACCESS BY INDEX ROWID| CUSTOM_FORM | 1 | 25 | | 3 (0)| 00:00:01 |
| 12 | INDEX RANGE SCAN | CUSTOM_FORM_IDX03 | 1 | | | 2 (0)| 00:00:01 |
| 13 | INDEX RANGE SCAN | INSTANCE_IDX03 | 1 | | | 2 (0)| 00:00:01 |
———————————————————————————————————————–
I did this for a few of the top processes and found everyone is clicking on the inbox link. Now this has been tuned in the past; so how am I going to deal with this? Well first thing, as the app sits right now, a person can click on the link over and over again. We are putting in a change to disable the link until the inbox comes up. This will save me from users that keep clicking the link when they don’t get their inbox fast enough. The next thing I need to do is move through the query and see where I can optimize it.
Now let’s look to see what the user load is on oracle.
SQL> get users
select username, sql_id, count(*) from v$session where status = ‘ACTIVE’ group by username, sql_id order by 1,3
SQL> /
USERNAME SQL_ID COUNT(*)
———- ————- ———-
APPUSER 9vztbhnd0r47b 1
=================SNIP==============
APPUSER 4c176cpxqvyrs 5
APPUSER 24mdr8u4u6vyz 10
APPUSER 3dkmsrzyjbuj5 44
RLOCKARD 5a921h5xhxxcg 1
8820rg9jt4c14 2
25
18 rows selected.
There are two versions of inbox running right now totaling 54 instances. Well inbox is fairly static during the day; can we use a materialized view? There is a cost, what type of refresh should we be using? 1 hour, 30 minutes, 5 minutes.
There will come a time when the hardware you have in place will not support the user load. In the past month we have seen a 300% increase in the number of queries hitting the database and a 100% increase in the user base.
The customer wants a breakdown by date/hour/user/sql_id/exec count to see what is hammering the system. from this report I can look at date, time of day and user and find out what query is getting executed and how many times. This is one of those reports that is very verbose so I spool it to a file and bring it into Excel to present to the customer. I also ran the same report without sql_id to show date/hour/user/sql count to the database. Both of these report will give you a week worth of data to help you drill down into usage times.
set linesize 256
set pagesize 10000
select to_char(trunc((sample_time),’HH’),’MMDDHH24:MI’), username, sql_id, count(*) from DBA_HIST_ACTIVE_SESS_HISTORY h, dba_users u where h.user_id = u.user_id and u.username = ‘IFSUSER’ group by to_char(trunc((sample_time),’HH’),’MMDDHH24:MI’), username, sql_id order by 1,2;
081709:00 APPUSER 0ut13pyz2862b 1090
081709:00 APPUSER 624p6ufu0vf67 1590
081709:00 APPUSER 0887h37jcvrq0 1677
081709:00 APPUSER 3dkmsrzyjbuj5 14854
Well the sql_id 3dkmsrzyjbuj5 is executing a lot so I wanted to dig into that to see what it is. No surprise, it’s the inbox query … AGAIN.
We have ordered more hardware to put the database on. it’s bigger, more CPUs’, more memory, yada yada yada. When the hardware is in, we will migrate this database from Solaris / Oracle 10g to Linux/Oracle 11G.