2007/07/26

Session-level statspack

Session-level statspack
Filed under: Performance — tanelp @ 3:23 pm

Statspack is a useful tool for easily gathering and reporting some Oracle’s historical workload statistics. However it has its limitations and problems:

One of them that in past it used to record only Oracle statistics, measured from inside Oracle. This made Oracle and people using Oracle ignore other crucial statistics like OS workload below Oracle. For example statspack may show you heavy latch contention as the performance problem, while looking at vmstat output one would see that the server just has been heavily overloaded with numerous other jobs (like multiple backup, export and compress jobs overrunning their run-window) and the latch contention is just a symptom of CPU starvation.

This has been somewhat addressed by adding V$OSSTAT to Oracle 10g and statspack now gathers this info as well. Now we need people to start looking into those stats.

Another problem with statspack is which is described in Dan Fink’s blog, about cursor-level CPU accounting limitations before 10gR2.

However neither of those problems are the main limitations of statspack. The main problem is that statspack samples it’s statistic snapshots from system level views like V$SYSSTAT and V$SYSTEM_EVENT. If every single session in your database is performing exactly the same (kind) of operation, then system level aggregates might be quite OK to get an overview what’s going on in the database. But from the moment you start having different kind of sessions (e.g. OLTP vs batch vs data feed vs report) in your database, then system level aggregates used by statspack do heavily distort what the reality looks like for specific sessions.

There’s another gotcha - by default statspack excludes the “SQL*Net message from client” wait event from main report section, stating that it is not important to look into idle events. However by that we will throw away an important piece of information when diagnosing end-to-end performance, the end user experience. The bad thing is that even if we did include the SQL*Net wait event in our reports (by deleting the corresponding row from STATS$IDLE_EVENT), then we would still have no idea how many of those gazillions of system-wide SQL*Net client wait seconds do belong to our specific session or user.
Statspack doesn’t measure session-experience, it just gives you a system-wide aggregate, which cannot be translated back to individual session statistics (just as you can not convert a hash value back to original value - most of the information is just lost!).

How to get session-level detailed overview of database performance, with historical reporting capability then?

Oracle 10g addresses this partially with ASH and AWR and DBMS_MONITOR’s selective statistic sampling ( V$SERV_MOD_ACT_STATS and V$CLIENT_STATS )

While ASH and AWR are both awesome products, they are not usable in Standard Edition nor before 10g ( you may want to check out Kyle Hailey’s ASH simulator instead ) and for using them you need to pay a separate license fee.

So let me introduce my attempt to solve those problems ( NB! Free stuff!!! ;):
The poor-man’s version of Automatic Workload Repository: It’s called… um… Semi-Automatic Workload Repository :)
The clever-man’s addition to statspack, which takes session-level snapshots: In other words, Sesspack.

The idea is very simple, just take snapshots of V$SESSION_EVENT and some V$SESSTAT (and V$SESS_TIME_MODEL in 10g) statistics and store those in a repository just like statspack does. Snapshots are taken using sesspack.snap_xyz procedures. As a parameter to these procedures I can pass a specific SID, a group of SIDs, a Oracle db username a OS-username or whatever filtering fields can be found from V$SESSION.

Once you have taken your snapshots when running the workload, you can run a report which calculates statistic and wait event deltas between snapshots and that it.

A demo?

The simplest one is to just snap my own session details:

SQL> exec sesspack.snap_me;

PL/SQL procedure successfully completed.

SQL> select avg(length(text)) from dba_source;

AVG(LENGTH(TEXT))
-----------------
114.01304

SQL> exec sesspack.snap_me;

PL/SQL procedure successfully completed.

SQL> @list

Snapshot Snapped
Snap ID Snapshot Time Taken By Mode Sessions
-------- -------------------- -------------------- ---------------------------------------- --------
1 2007-06-24 14:36:47 PERFSTAT SNAP_ME: PERFSTAT 1
2 2007-06-24 14:36:52 PERFSTAT SNAP_ME: PERFSTAT 1


The list.sql has shown that I have two performance snapshots, with ID 1 and 2, so I’ll run a delta report between them with script srs.sql (SAWR Report by SessionID):

SQL> set tab off
SQL> @srs 1 2

SNAPSHOT_BEGIN SNAPSHOT_END DUR_SEC DUR_MIN
----------------- ----------------- ---------- ----------
20070624 14:36:47 20070624 14:36:52 5 .083333333

ms/ ms in Waits in Avg Wait
SID AUDSID Event Name % Total sec snapshot snapshot ms
---------- ---------- ---- --------------------------------------------- ------------ ------- ----------- --------- --------
146 130096 db file scattered read |####### | 676 3378 338 10
CPU Usage |### | 260 1300 0 1300
db file sequential read |# | 44 220 6 37
SQL*Net message from client | | 1 3 3 1
log file sync | | 0 1 1 1
SQL*Net message to client | | 0 0 3 0

6 rows selected.


Another example which should bring the usefulness of sesspack out better:

I dirtied some buffer cache buffers using the delete command, then ran a alter system checkpoint between two snapshots of all background process activity (sesspack.snap_bg).

Tanel@Prod01> delete t;

50705 rows deleted.

Tanel@Prod01> exec sesspack.snap_bg;

PL/SQL procedure successfully completed.

Tanel@Prod01> alter system checkpoint;

System altered.

Tanel@Prod01> exec sesspack.snap_bg;

PL/SQL procedure successfully completed.

Tanel@Prod01> @list

Snapshot Snapped
Snap ID Snapshot Time Taken By Mode Sessions
-------- -------------------- -------------------- ---------------------------------------- --------
1 2007-06-24 14:36:47 PERFSTAT SNAP_ME: PERFSTAT 1
2 2007-06-24 14:36:52 PERFSTAT SNAP_ME: PERFSTAT 1
5 2007-06-24 14:57:49 TANEL SNAP_BG: 14
6 2007-06-24 14:57:55 TANEL SNAP_BG: 14

Tanel@Prod01> @ srs 5 6

SNAPSHOT_BEGIN SNAPSHOT_END DUR_SEC DUR_MIN
----------------- ----------------- ---------- ----------
20070624 14:57:49 20070624 14:57:55 6 .1

ms/ ms in Waits in Avg Wait
SID AUDSID Event Name % Total sec snapshot snapshot ms
------- ---------- ---- --------------------------------------------- ------------ ------- ----------- --------- --------
160 0 rdbms ipc message |##########| 1000 5999 6 1000

161 0 rdbms ipc message |##### | 500 3000 1 3000

162 0 rdbms ipc message |######### | 833 5000 1 5000

165 0 rdbms ipc message |##### | 496 2974 4 744
control file parallel write |## | 159 953 11 87
control file sequential read |# | 57 341 26 13
direct path write | | 0 2 12 0
direct path read | | 0 0 12 0

166 0 rdbms ipc message |##########| 998 5988 3 1996
log file parallel write | | 1 4 2 2

167 0 rdbms ipc message |######### | 818 4911 4 1228
db file parallel write |## | 123 737 93 8

168 0 rdbms ipc message |##########| 1002 6010 2 3005

169 0 rdbms ipc message |##########| 1000 6000 2 3000

170 0 pmon timer |##########| 1000 6000 2 3000

15 rows selected.


An overview of what all background processes were doing at that time :)
If you ever had to diagnose performance issues in a large-scale variable-workload environment with lots of completely different activity going on, you already see the value of session-level statistic snapshots.

When you read the readme, you see that there are various options for taking selective snapshots, like running:

sesspack.snap_sid(57);
sesspack.snap_sid(’150,165,177,202′);
sesspack.snap_orauser(’REPORT_USER’);
sesspack.snap_program(’sqlplus.exe’);
sesspack.snap_machine(’nycprd0005118′);

Also it is possible to snap based on completely custom conditions, using whatever SQL you write to return list of SID’s:
sesspack.snap_sidlist_internal(’select sid from v$session where state=”ACTIVE” and last_call_et > 300′);
sesspack.snap_sidlist_internal(’select sid from v$session where logon_time > sysdate - 1/48′);


Note that there are few crucial issues which need resolving in the snapshot engine (for example, waits are not recorded in V$SESSION_EVENT before event ends or times out, this is a problem with long waits exceeding snapshot interval).

The major work yet to be done is building easy reporting and visualizing capability as this is critical for any useful performance product.
I have experimented with Oracle Application Express for providing a convenient GUI and charting mechanism. Even though APEX charting is still not flexible enough for complex performance data visualization, nevertheless it looks promising! I will release it in version 0.06 perhaps, once rest of the crucial base features have been implemented. I’ll keep you posted.

It all is plain PL/SQL, you can download the package from here: http://www.tanelpoder.com/files/sesspack_0.04.zip

All feedback is appreciated, enjoy!

No comments: