Single-column Primary Keys and Histograms in Oracle

For every timeless adage there is at least one mistaken thought that can become ‘truth’ simply by virtue of being repeated. Oracle is no exception to this phenomenon especially when single-column Primary Key indexes are involved. The conventional ‘wisdom’ discounts the use of histograms on such an index as unnecessary, and this presumes that the primary key has no ‘holes’ and has a continuous range of data. Very often reality flies in the face of that assumption; even with triggers and sequences, it is highly unlikely to find an unbroken, gap-free data set where that primary key is concerned. Absent a histogram, Oracle is left to its own devices and presumptions which can take an execution path ‘into the weeds’. Let’s look at what Oracle expects to find, what Oracle can find and how a histogram can help.

Oracle expects data sets to be uniform in distribution when no other information is available; create a table, load some data, leave off indexes and statistics, and Oracle will generate execution plans based on the assumption of uniform data distribution — keys are not clustered, have no gaps or holes and just generally ‘play nice’. This, of course, does not reflect the real world which is why Oracle provides ways to generate statistics on table data with and without histograms. In the unique case of a single-column, single-valued key a histogram is likely not necessary, but in many other cases involving single-column keys, especially primary keys, histograms may provide tremendous help in generating a reasonable execution plan. Three of the more ‘popular’ histogram types are frequency, height-balanced and hybrid, the latter available in versions 12.x and later. Histograms basically draw a ‘picture’ of the data that Oracle can use to better assess execution steps and produce acceptable plans. Since examples are usually good let’s build one for this situation and see how query execution changes based on changing information.

Histograms are based, basically, on data distribution. Oracle records such statistical data as it scans through the table so it can be used later to determine the best route of data access. If a table is created and populated with 1,000 rows containing random numeric values the resulting ‘map’ should produce no ‘hot spots’ where one or a handful of values dominate the data distribution. Repeating values in a key will behave differently, as that same 1,000 rows populated with a limited set of values (1 through 22, for example) is guaranteed to produce repeating, non-unique keys. Now let’s look at how a primary key is defined and why a histogram could be very helpful. A primary key is:

  • Unique
  • Has no NULL values

which guarantees that each and every key will be populated with usable data. What it does NOT guarantee is that the set of key values:

  • Is sequential
  • Is gapless
  • Has a uniform difference between adjacent key values

That last statement gives rise to the condition of key clustering, groups of values that produce small differences within the group but may experience large ‘jumps’ between adjacent sets of grouped values, making the data anything but uniformly distributed across the range of key values. For example, if in our example table the key is defined as my_key_num, a number, a very distinct possibility exists that gaps (holes) can be created even when no records are deleted simply due to the nature of how a cached sequence behaves. For this example, a sequence is created that starts with the number 1, increases by 1 for each request for a new value and each instantiation of the sequence (one for each user session using that sequence) caches 20 values. Now presume that 12 user sessions are using that sequence; the first session to fetch data from that sequences fetches the first 20 values, the next session in line allocates the next 20 values, and so on until all 12 sessions have values cached. There is no possible way sequence values allocated in that manner will ever be ‘in order’, and the actual data may look more like:

	1,41,71,21,31,81,91,61,111,101,121...

than

	1,2,3,4,5,6,7,8,9,10,11,...

since sessions have not been allocated ‘overlapping’ sequence values. In a RAC environment the ORDERED directive can be included with the sequence that should produce ordered sequence values across sessions but that also requires that caching not be used. Even THEN, failed inserts and rollbacks cause sequences to ‘lose’ values, creating gaps. So unless this is a single-user system there is no possibility of truly sequential, gap free primary key values.

Building and populating a table based on the above description is where this example begins:


SQL>
SQL> create table pk_hist_test
  2  as
  3  with testdata as (
  4  	     select  --+ materialize
  5  		     rownum myid
  6  	     from dual
  7  	     connect by
  8  		     level <= 1e4
  9  )
 10  select
 11  	     rownum  myid
 12  from
 13  	     testdata	     v1,
 14  	     testdata	     v2
 15  where
 16  	     rownum <= 1e4
 17  ;

Table created.

SQL>
SQL>
SQL> column max_id_1 new_value max1
SQL> column max_id_2 new_value max2
SQL> column max_id_3 new_value max3
SQL>
SQL> select min(myid), max(myid) max_id_1
  2  from pk_hist_test;

 MIN(MYID)   MAX_ID_1
---------- ----------
         1      10000

SQL>

The table has 10,000 rows in numeric order and no gaps. Let’s add data, but put a large gap in the key sequence:


SQL> insert into pk_hist_test select myid + 1e6 from pk_hist_test;

10000 rows created.

SQL>
SQL> select min(myid), max(myid) max_id_2
  2  from pk_hist_test;

 MIN(MYID)   MAX_ID_2
---------- ----------
         1    1010000

SQL>

Let’s do that again:


SQL>
SQL> insert into pk_hist_test select myid + 1e7 from pk_hist_test;

20000 rows created.

SQL>
SQL> select min(myid), max(myid) max_id_3
  2  from pk_hist_test;

 MIN(MYID)   MAX_ID_3
---------- ----------
         1   11010000

SQL>

Add a primary key to the table:


SQL>
SQL> alter table pk_hist_test add constraint pk_hist_test_pk primary key(myid);

Table altered.

SQL>

Oracle has generated the following basic statistics on the data, based on the primary key index:


SQL>
SQL> column column_name format a35
SQL> set linesize 150
SQL>
SQL> select
  2  	     column_name, sample_size,
  3  	     num_distinct, num_nulls, density,
  4  	     histogram, num_buckets
  5  from
  6  	     user_tab_cols
  7  where
  8  	     table_name = 'PK_HIST_TEST'
  9  order by
 10  	     column_name
 11  ;

COLUMN_NAME                         SAMPLE_SIZE NUM_DISTINCT  NUM_NULLS    DENSITY HISTOGRAM       NUM_BUCKETS
----------------------------------- ----------- ------------ ---------- ---------- --------------- -----------
MYID                                      10000        10000          0      .0001 NONE                      1

SQL>

Notice no histogram exists; querying the data to see what Oracle predicts for the number of returned rows:


SQL>
SQL> set autotrace on
SQL> select
  2  	     *
  3  from    pk_hist_test
  4  where
  5  	     myid between 10001 and 1010017
  6  ;

      MYID
----------
   1000001
   1000002
   1000003
   1000004
   1000005
   1000006
   1000007
   1000008
   1000009
   1000010
   1000011
...
   1009998
   1009999
   1010000

10000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1838269289

------------------------------------------------------------------------------------
| Id  | Operation        | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |                 |     1 |     4 |     2   (0)| 00:00:01 |
|*  1 |  INDEX RANGE SCAN| PK_HIST_TEST_PK |     1 |     4 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("MYID">=10001 AND "MYID"<=1010017)


Statistics
----------------------------------------------------------
         17  recursive calls
          0  db block gets
        703  consistent gets
         23  physical reads
          0  redo size
     199741  bytes sent via SQL*Net to client
       7878  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
      10000  rows processed

SQL> set autotrace off

The query returned 10,000 rows but the optimizer only expected 1 row to be returned. Now create a histogram on the data and try the same query again:


SQL>
SQL> begin
  2  	     dbms_stats.gather_table_stats(
  3  		     ownname	      => user,
  4  		     tabname	      =>'PK_HIST_TEST',
  5  		     estimate_percent => null
  6  	     );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL>
SQL> set autotrace on
SQL> select
  2  	     *
  3  from    pk_hist_test
  4  where
  5  	     myid between 10001 and 1010017
  6  ;

      MYID
----------
   1000001
   1000002
   1000003
   1000004
   1000005
   1000006
   1000007
   1000008
   1000009
   1000010
   1000011
...
   1010000

10000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

------------------------------------------------------------------------------------
| Id  | Operation         | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                |  9960 | 49800 |    20   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| PK_HIST_TEST   |  9960 | 49800 |    20   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("MYID"<=1010017 AND "MYID">=10001)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        690  consistent gets
          0  physical reads
          0  redo size
     199741  bytes sent via SQL*Net to client
       7878  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      10000  rows processed

SQL> set autotrace off

The histogram switched the query from an index scan to a full table scan due to the volume of data returned. Notice also the better estimate on the number of rows returned by the query; an estimate aided by the existence of the histogram. Looking at the statistics now we see that a HYBRID histogram exists on the data, created because the number of ‘buckets’ is far less than the number of distinct values:


SQL>
SQL> column column_name format a35
SQL> set linesize 150
SQL>
SQL> select
  2  	     column_name, sample_size,
  3  	     num_distinct, num_nulls, density,
  4  	     histogram, num_buckets
  5  from
  6  	     user_tab_cols
  7  where
  8  	     table_name = 'PK_HIST_TEST'
  9  order by
 10  	     column_name
 11  ;

COLUMN_NAME                         SAMPLE_SIZE NUM_DISTINCT  NUM_NULLS    DENSITY HISTOGRAM       NUM_BUCKETS
----------------------------------- ----------- ------------ ---------- ---------- --------------- -----------
MYID                                      40000        40000          0    .000025 HYBRID                  254

SQL>

Contrary to popular belief, creating a histogram on a primary key is not such an outlandish thought. Even though the index access was ‘lost’, we were expecting to see the plan is better suited to the data since a large portion of the table data is returned. Also contrary to popular belief — table scans are not ‘evil’ as they can reduce the overall I/O when large data volumes are involved.

Primary keys, even though they are comprised of unique, non-null data, can benefit from a histogram; it pays to remember that unless your data is in a vacuum (which it isn’t) gaps and clusters in primary key data will occur and histograms can help the optimizer by providing a more accurate view of the data distribution within the key. This leads to better execution plans and possibly faster data access, which should make both the DBA team and the end-users happy.

See all articles by David Fitzjarrell

David Fitzjarrell
David Fitzjarrell
David Fitzjarrell has more than 20 years of administration experience with various releases of the Oracle DBMS. He has installed the Oracle software on many platforms, including UNIX, Windows and Linux, and monitored and tuned performance in those environments. He is knowledgeable in the traditional tools for performance tuning – the Oracle Wait Interface, Statspack, event 10046 and 10053 traces, tkprof, explain plan and autotrace – and has used these to great advantage at the U.S. Postal Service, American Airlines/SABRE, ConocoPhilips and SiriusXM Radio, among others, to increase throughput and improve the quality of the production system. He has also set up scripts to regularly monitor available space and set thresholds to notify DBAs of impending space shortages before they affect the production environment. These scripts generate data which can also used to trend database growth over time, aiding in capacity planning. He has used RMAN, Streams, RAC and Data Guard in Oracle installations to ensure full recoverability and failover capabilities as well as high availability, and has configured a 'cascading' set of DR databases using the primary DR databases as the source, managing the archivelog transfers manually and montoring, through scripts, the health of these secondary DR databases. He has also used ASM, ASMM and ASSM to improve performance and manage storage and shared memory.

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles