Life of an Oracle DBA: Grumps, Data Guard, upgrades, and tuning, oh my

There are very few things that will get my dander up. One is wasting my time; the other is "Not My Job." Last week I experienced both; it took a bit, but the person who told me it was not their job wound up doing what needed to be done. This caused quite a bit of my time to be wasted. This gets back to my mantra: "You may be asked to do things that are outside of your job description; if so, embrace it and do the best job you can." You will make yourself more valuable to your employer, customers, or teammates.

I also experienced a problem with Data Guard that I had not seen before.

ORA-00342: archived log does not have expected resetlogs SCN 137433238
ORA-00334: archived log: '/var/opt/data/sor01/arch/1_191_768933256.dbf'

This was weird. This ORA error has to do with a different incarnation of the database. What caused the issue? My hunch: someone did an OPEN RESETLOGS.
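A quick way to confirm that hunch is to compare the incarnation history on both sides with v$database_incarnation (a generic sketch; run it on the primary and the standby and compare):

select incarnation#, resetlogs_change#, resetlogs_time, status
from v$database_incarnation
order by incarnation#;

If the primary shows a newer incarnation than the one the standby is recovering under, someone has indeed done a resetlogs.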

I compared the last log sequence generated on the primary with the last sequence received and applied on the standby.

primary> select thread#, max(sequence#) "Last Primary Seq Generated"
         from v$archived_log val, v$database vdb
         where val.resetlogs_change# = vdb.resetlogs_change#
         group by thread# order by 1;

   THREAD# Last Primary Seq Generated
---------- --------------------------
         1                        847

standby> select thread#, max(sequence#) "Last Standby Seq Received"
         from v$archived_log val, v$database vdb
         where val.resetlogs_change# = vdb.resetlogs_change#
         group by thread# order by 1;

   THREAD# Last Standby Seq Received
---------- -------------------------
         1                       847

But wait, there's more:

standby> select thread#, max(sequence#) "Last Standby Seq Applied"
         from v$archived_log val, v$database vdb
         where val.resetlogs_change# = vdb.resetlogs_change#
         and val.applied in ('YES','IN-MEMORY')
         group by thread# order by 1;

   THREAD# Last Standby Seq Applied
---------- ------------------------
         1                      190

Okay, I see where the standby database wants log 191. But still, what is the real issue? I needed more information to figure out what the problem was.

What is the SCN stored in the control file?

select db_unique_name, to_char(current_scn, '999999999999999999') as Standby_Current_SCN
from v$database;

DB_UNIQUE_NAME                 STANDBY_CURRENT_SCN
------------------------------ -------------------
<redacted>                               199126920

I still had not found the root cause and had wasted a lot of time trying to get the archive logs to apply. So I decided to take an incremental backup of the primary database starting from the oldest SCN applied to the standby database.

standby> select min(to_char(fhscn, '99999999999999999999')) as SCN_to_be_used from X$KCVFH;

SCN_TO_BE_USED
---------------------
199126921

So now I have a place to start my incremental backup. But I prefer to live on the conservative side of life. These SCNs are not matching up very well, so I'm going to take the lower of the two SCNs for my backup.

RMAN> BACKUP INCREMENTAL FROM SCN 199126920 DATABASE FORMAT '/var/opt/backup/ForStandby_%U' tag 'FORSTANDBY';

After the backup is done, scp the backup pieces to the standby site and do a database restore. But wait, port 22 is not open, so I am going to have to rely on SAN replication. I'll let you know how the restore goes; the files are still replicating.
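For reference, once the backup pieces land on the standby, the usual roll-forward looks something like this (a sketch; the path matches the backup command above, and since the incarnations diverged a fresh standby control file from the primary will likely be needed as well):

-- on the standby, with managed recovery stopped
RMAN> CATALOG START WITH '/var/opt/backup/ForStandby_';
RMAN> RECOVER DATABASE NOREDO;

-- then restart managed recovery
SQL> alter database recover managed standby database using current logfile disconnect;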

Migrating 10g on Solaris to 11g on Linux

There are three small databases that needed to be upgraded from 10g on Solaris to 11g on Linux. Because they were small, I decided the easiest way to do the upgrade was to use Data Pump. The next set has a lot of data and is moving to a RAC environment, so I will experiment with transportable tablespaces.
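For a small schema-based move, the Data Pump round trip is short enough to sketch here (the directory object, schema, and file names are hypothetical):

# on the 10g Solaris source
expdp system directory=dp_dir schemas=app dumpfile=app.dmp logfile=app_exp.log

# copy app.dmp to the 11g Linux target, then
impdp system directory=dp_dir schemas=app dumpfile=app.dmp logfile=app_imp.log

Data Pump dump files are platform independent, so the Solaris-to-Linux endian change is a non-issue here; it only matters for the transportable tablespace route, where the datafiles may need an RMAN CONVERT.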

Tuning Oracle

We have a web-based OLTP application whose database was running at 100% CPU. Logins were slowing down and users were backing up. In short, it was an extreme-terror kind of day. Looking at the Oracle database, I found that most of the users were backing up on the same query. My first question: what triggered hundreds of users to start executing the same query at the same time, and to keep at it all day? The second question: what can I do to speed things up?
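To see the pile-up, a quick count of active sessions by statement tells the story (a generic sketch):

select sql_id, count(*) as sessions
from v$session
where status = 'ACTIVE' and sql_id is not null
group by sql_id
order by 2 desc;

One sql_id at the top of that list with hundreds of sessions behind it is about as unambiguous as it gets.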

It turns out our 75,000+ user base had received a message telling them to do something, so that explained my first question. The second question was a bit harder to address. There was one index that was getting a fast full index scan for each query, and it started showing up in v$session_longops. It is a compressed index, and when I looked at it closely I found the columns were ordered from higher to lower cardinality. Prefix compression pays off when the low-cardinality columns lead, so reversing the column order would let the index compress properly. And because this index mostly sees fast full scans, shrinking it would reduce the number of blocks every scan has to read.

SQL> drop index <schema>.<index_name>;

SQL> create index <schema>.<index_name> on <schema>.<table_name> (lowest_cardinality, next_lowest, highest_cardinality) tablespace indx compress 2;

A generic example of the compressed create (sorry, I can't give specifics, NDA and all ya' know):

SQL> create index app.cust_indx on app.custs (sex, eye_color, cust_id) tablespace indx compress 2;
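If you want Oracle's own opinion on how far to compress, VALIDATE STRUCTURE populates INDEX_STATS with a recommendation (a sketch; it takes a lock, so run it off-hours):

SQL> analyze index app.cust_indx validate structure;
SQL> select opt_cmpr_count, opt_cmpr_pctsave from index_stats;

OPT_CMPR_COUNT is the prefix column count Oracle suggests for COMPRESS, and OPT_CMPR_PCTSAVE is the estimated space saving.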

By changing the order of the columns in the compressed index, the number of blocks needed to store the index decreased by 50%. The load finally returned to quasi-normal and requests were getting serviced in a timely manner. It looks like we have maxed out the Oracle database server. Thank goodness we are going to be upgrading it from a Solaris Oracle 10g single instance to Oracle 11g RAC on Linux.

The life of an Oracle DBA

So you want to be an Oracle DBA?  What does a DBA do? What does my week look like?

Frequently my wife asks me what my days are like and just what I do at work. Well, I'm going to start telling you. My normal answer is that a DBA's job is extreme boredom punctuated with moments of stark terror. Lately there has been more terror than boredom.

So this week started off well enough. I was in the middle of recovering a standby database when I discovered there were gaps in the archive logs. I spent most of the weekend trying to get the standby database restored, then realized I would have to ship the archive logs down from the primary site. The problem centered on control_file_record_keep_time being set too low: the archive logs had aged out of the control file.
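The parameter fix itself is simple enough (a sketch; the right value depends on your backup retention):

SQL> show parameter control_file_record_keep_time

SQL> alter system set control_file_record_keep_time = 30 scope = both;

Of course, that only protects the future; the records that had already aged out were gone, so I had to do a couple of things.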

1)      Extract the archive logs from ASM. This can be tricky, but then I discovered the package dbms_file_transfer, which allows you to pull files out of ASM so you can deal with them as real files. All I had to do was create two directory objects: one for the ASM source and a second for the target (see the sketch after this list).

2)      The second part of the problem was getting the files down to the standby database. Well, port 22 is closed, so I have to get the SAN administrator to replicate the files, then get the unix admin to mount the LUN, then go through change control. Bottom line: someone else needed to transfer some files, so I was asked to defer transferring the data until the other person is done. I hope the transfer will be complete by Monday so I can finish fixing the standby database. So far, over a week to fix something that should have taken me less than one day.
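The directory objects from step 1 look something like this (a sketch; the ASM path is hypothetical):

SQL> create or replace directory archive_source as '+FRA/PRIMDB/ARCHIVELOG/2012_06_15';
SQL> create or replace directory archive_target as '/var/opt/stage/arch';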

The following code extracted the logs I needed onto the file system.

set serveroutput on

declare
  x     number := 214;
  fname varchar2(256);
begin
  -- copy archive log sequences 215 through 440 out of ASM
  while x < 440
  loop
    x := x + 1;
    -- archive log file name format: thread_sequence_resetlogsid.dbf
    fname := '1_' || to_char(x) || '_773923456.dbf';
    dbms_output.put_line(fname);
    -- ARCHIVE_SOURCE is the ASM directory object, ARCHIVE_TARGET is the file system one
    dbms_file_transfer.copy_file('ARCHIVE_SOURCE', fname, 'ARCHIVE_TARGET', fname);
  end loop;
end;
/

Next issue for the week, and this is what I really love to do: tune the database. I got a call from an ole' friend whose query was taking a while to execute. He is a sharp unix admin and knows enough about Oracle and SQL to generate the reports that he needs. His email was short and sweet: here's the code, it's slow, fix it. I spent about an hour rewriting his query and took his report from executing in over two days down to less than two minutes. It is important to understand access paths and how the optimizer works.
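Step one in a rewrite like that is always seeing what the optimizer is actually doing (a generic sketch):

SQL> explain plan for <the slow query>;
SQL> select * from table(dbms_xplan.display);

Things like a full table scan sitting inside a nested loop tend to jump right out of the plan.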

Part of my job is also dealing with external customers. Some customers are software vendors that provide services to my customer's customers. (My customer is a very small government agency; their customers are banks and other government agencies.) I don't know how it happened, but if the help desk cannot answer the question or address the issue, another friend and I get the call based on the problem. All data transfer issues go to me. All organization issues go to me. Custom reports go to me. Korn shell scripts go to my friend. Thank goodness for small favors.

So this week I spent a lot of time on conference calls with large external customers. Their issues are simple: we have to get the information securely, accurately, and reliably. I'm happy to say the issues are getting addressed; most of them stem from the processes that all the organizations have in place. I sometimes have to remind myself to respect the process. But there are times when the process is broken and needs to be addressed.

Oh, that brings up another problem: I'm not a LAN engineer and I don't have access to set up VPN tunnels. So if there is an issue setting up a VPN tunnel, I send it on to someone else. Ya' need to understand that. Well, one of the vendors called me and said that one of their customers had been having problems with their VPN tunnel for months. I sent it on to our LAN engineer and asked him to look into the problem. I got a terse email back saying "all I can do is open a ticket with <redacted>." That kinda rubs me the wrong way. Pick up the phone, call the customer, find out what the problem is. This does a few things. One, it lets the customer know they are not being ignored. Two, you gain an understanding of what the problems are. That way, when someone asks what's going on with <redacted>, you can answer with some degree of intelligence.

And that brings up another issue. A couple weeks ago I found out about a customer who has been having troubles since last January. It took quite a bit, but I pulled together all the resources to address the issue the same day. Yet the problem is still dragging on. I emailed <redacted>'s network engineer again today for a status, and he says, "working on it, I'll let you know." Well, there are a lot of stressed-out people who want the problem fixed. I'm not going to say <redacted>'s network engineer does not know what they are doing, but I will say one of our top engineers has serious questions regarding this person's ability to address the issue. No one wants him to look bad in front of his boss, so I'm trying to be subtle and have one of our senior engineers help him out.

Where was I going? Oh yea', the life of a DBA. You will find you get tasked with things that seem outside of your core skills. Embrace them and do the best job you can.

Today I was asked to install Oracle 11.2.0.3 on a blade for our development group. I got started after the unix group stood up the blade, and the problems just kept coming. Packages were not installed; accounts were not set up correctly. Even logged in as oracle, I could not copy the Oracle binaries to the blade. I'm still not done; I'm hoping the unix group will finish installing the required packages and make the changes to /etc/system so I can finish setting up this database. Am I frustrated with this? Yea', I am. Normally I can have a database set up and running in very short order, but having to go back and forth is irritating. There is a baseline install that gets done for blades. Perhaps I will ask the unix group for a copy of that document and make some edits to include setting up for Oracle databases. That way I know what to expect when I get the server.

Other than that, there were many meetings. Change control is an important meeting, and you really need to understand what is going on and what the impact of changes is. There is the daily MMM, which some folks call the morning stand-up meeting. The most important meeting had to do with the technology refresh. We discussed configuring RAC for one of our critical systems to deal with a surge in processing that happens once a year. Our old configuration has three servers: one is the primary database server, a second is a standby database located in another state, and the third is a standby database in the rack with the primary database. We set it up that way because I would always prefer to have an extra set of data files in my back pocket in case something happened. This will be changing to configure RAC for performance.

Of course, there was the request to get a cold backup of a production database off the standby database to refresh a test environment. I've cloned databases with RMAN, but now I'm going to figure out if I can create a clone off a standby database.
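If it works the way I hope, it should look much like a normal active duplication, just with RMAN's target connection pointed at the standby instead of the primary (a sketch I have not tested yet; the instance names and connect strings are hypothetical):

$ rman target sys@standby_db auxiliary sys@test_db

RMAN> DUPLICATE TARGET DATABASE TO testdb FROM ACTIVE DATABASE NOFILENAMECHECK;

I'll report back on whether the standby turns out to be a legal source.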