Wednesday, December 24, 2008

_realfree_heap_pagesize_hint - Assessing the impact on Solaris and Linux

The _realfree_heap_pagesize_hint parameter in 10g provides a mechanism by which process private memory (PGA) can use larger memory page sizes and thus reduce TLB/TSB misses. The parameter is set in bytes.

This is especially important for data warehousing, where a session can consume a significant amount of anonymous memory and the workarea is in many cases larger than the SGA.

I wrote about TLB/TSB misses from an Oracle perspective in an earlier blog post here:

http://dsstos.blogspot.com/2008/11/assessing-tlbtsb-misses-and-page-faults.html

 
This parameter is designed for the Solaris platform only; however, it partially works on Linux too, and probably behaves the same way on other platforms.

With this hint set,

  1. memory extents within the realfree heap are carved into _realfree_heap_pagesize_hint-sized chunks, and
  2. each chunk is then mapped, via a memcntl(2) call, onto an OS page of _realfree_heap_pagesize_hint size (provided that page size is a valid choice on the platform).

For example, a 16MB extent would be carved up into 4MB chunks, and each 4MB chunk would be mapped to an individual 4MB OS memory page (if _realfree_heap_pagesize_hint = 4M).
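
To make that pattern concrete, here is a minimal sketch in C of the reserve-then-commit behavior (my own illustration, not Oracle's code; the 16MB/4MB sizes and the anonymous-mapping flags are assumptions, and the real traces below show the mappings backed by a file descriptor instead):

/* Sketch only: reserve a 16MB extent up front, then commit it in
 * 4MB chunks, mimicking the mmap pattern seen in the truss output
 * later in this post.  Sizes and flags are illustrative. */
#include <sys/mman.h>
#include <stdio.h>
#include <stddef.h>

#define EXTENT_SIZE (16 * 1024 * 1024)  /* one realfree heap extent          */
#define CHUNK_SIZE  ( 4 * 1024 * 1024)  /* _realfree_heap_pagesize_hint = 4M */

int main(void)
{
    size_t off;

    /* Step 1: reserve the whole extent with no backing store (PROT_NONE,
     * MAP_NORESERVE), as in the first mmap of each extent in the traces. */
    char *ext = mmap(NULL, EXTENT_SIZE, PROT_NONE,
                     MAP_PRIVATE | MAP_ANON | MAP_NORESERVE, -1, 0);
    if (ext == MAP_FAILED) { perror("mmap reserve"); return 1; }

    /* Step 2: commit the extent chunk by chunk with MAP_FIXED remaps. */
    for (off = 0; off < EXTENT_SIZE; off += CHUNK_SIZE) {
        if (mmap(ext + off, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0) == MAP_FAILED) {
            perror("mmap commit");
            return 1;
        }
        /* Step 3 (Solaris only): a memcntl(MC_HAT_ADVISE) call would follow
         * here to ask for a CHUNK_SIZE OS page; see the sketch further down. */
    }
    return 0;
}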

Solaris:


Solaris supports four page sizes on the UltraSPARC IV+ platform (8K default, 64K, 512K and 4M). The default setting for _realfree_heap_pagesize_hint is 65536, i.e. 64K.
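
As an aside, you can confirm which page sizes a given system supports with the Solaris getpagesizes(3C) library call; a trivial sketch (my own, for illustration) follows. On an UltraSPARC IV+ box it should report 8K, 64K, 512K and 4M.

/* List the page sizes supported by the running kernel/CPU, using the
 * standard Solaris getpagesizes(3C) library call. */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t sizes[16];
    int i, n;

    n = getpagesizes(sizes, 16);   /* returns the number of supported sizes */
    for (i = 0; i < n; i++)
        printf("supported page size: %lu KB\n", (unsigned long)(sizes[i] / 1024));
    return 0;
}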

In order to test this parameter, I did a sort on an un-indexed table with approximately 3.8 million rows. The average row length was ~243 bytes and the table was approximately 1GB in size. I chose such a big table partly to see how memory utilization changed with different page sizes.

_realfree_heap_pagesize_hint at 65536 (Default)

This setting implies that when a session requests anonymous memory, Oracle will use 64K pages. However, this did not seem to be true: with a setting of 65536, only 8K pages were used.

I ran truss against the shadow process while it was doing the sort, and this is what I observed.

-----------CUT-------------------

19167/1: 5.5795 mmap(0x00000000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, 3080192) = 0xFFFFFFFF7A5F0000
19167/1: 5.5796 mmap(0xFFFFFFFF7A5F0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A5F0000
19167/1: 5.5813 mmap(0xFFFFFFFF7A600000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A600000
19167/1: 5.5829 mmap(0xFFFFFFFF7A610000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A610000
19167/1: 5.5846 mmap(0xFFFFFFFF7A620000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A620000
19167/1: 5.5863 mmap(0xFFFFFFFF7A630000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF7A630000

------------------CUT-------------------------------------

As you can see, a 2M extent was reserved with MAP_NORESERVE and then committed in 64K chunks. However, there is no accompanying memcntl(2) request asking the OS to back the chunks with 64K pages. This is also confirmed by pmap and trapstat.

trapstat not showing usage of any 64K pages.




pmap output showing anon pages using 8k page size.




Changing the _realfree_heap_pagesize_hint to 512K

Changing the hint to 512K shows that it indeed requests 512K pages from the OS.

------------CUT-------------

19277/1: 14.6646 mmap(0x00000000, 4718592, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, 7864320) = 0xFFFFFFFF79780000
19277/1: 14.6647 munmap(0xFFFFFFFF79B80000, 524288) = 0
19277/1: 14.6648 mmap(0xFFFFFFFF79780000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF79780000
19277/1: 14.6649 memcntl(0xFFFFFFFF79780000, 524288, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7EC0, 0, 0) = 0
19277/1: 14.6909 mmap(0xFFFFFFFF79800000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF79800000
19277/1: 14.6910 memcntl(0xFFFFFFFF79800000, 524288, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7F80, 0, 0) = 0

---------------CUT-----------------------

As you can see, a memcntl(2) call is issued to request that the OS back each chunk with 512K pages. This is also corroborated by trapstat and pmap.
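
For reference, the fourth argument of that memcntl(2) call (0xFFFFFFFF7FFF7EC0 in the trace) is a pointer to a memcntl_mha structure carrying the requested page size. A minimal sketch of how such a call is made, assuming the documented Solaris MC_HAT_ADVISE interface, looks like this:

/* Sketch of a Solaris page-size advice call, as seen in the truss output
 * above.  MHA_MAPSIZE_VA asks the HAT layer to back the given virtual
 * address range with pages of the requested size. */
#include <sys/types.h>
#include <sys/mman.h>

int advise_pagesize(caddr_t addr, size_t len, size_t pagesize)
{
    struct memcntl_mha mha;

    mha.mha_cmd      = MHA_MAPSIZE_VA;
    mha.mha_flags    = 0;
    mha.mha_pagesize = pagesize;    /* e.g. 524288 for 512K, 4194304 for 4M */

    return memcntl(addr, len, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
}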

trapstat output showing TLB/TSB misses for 512K pages.




pmap output for anon pages showing 512K pages being used.




Changing the _realfree_heap_pagesize_hint to 4M

Changing the hint to 4M also shows that the page size being requested is 4M.

Truss output -

-------------------CUT-----------------------

18995/1: 34.0445 mmap(0x00000000, 20971520, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 8, 390070272) = 0xFFFFFFFF53000000
18995/1: 34.0447 munmap(0xFFFFFFFF54000000, 4194304) = 0
18995/1: 34.0448 mmap(0xFFFFFFFF53000000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 8, 0) = 0xFFFFFFFF53000000
18995/1: 34.0449 memcntl(0xFFFFFFFF53000000, 4194304, MC_HAT_ADVISE, 0xFFFFFFFF7FFF7EE0, 0, 0) = 0


-----------------CUT-------------------------


Trapstat output confirming usage of 4M pages for anon memory



And finally pmap output.




So we now know that this works as expected, except at the default setting of 64K. How does this affect performance?

  1. With larger page sizes, each TLB/TSB entry maps more memory, so more virtual-to-physical translations are covered and TLB/TSB misses are reduced.
  2. Larger settings also reduce the number of mmap requests and hence the CPU time spent in system mode. For example, a 4M extent takes 64 mmap calls at the default 64K chunk size, but only one mmap call with a 4M setting (see the back-of-the-envelope numbers after this list).
  3. Memory requests can therefore be satisfied significantly faster with larger page sizes.
  4. However, with larger pages one would expect memory utilization to go up as well. Since memory is requested in whole pages (8K, 512K or 4M), some memory can be wasted through round-up.
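
To put rough numbers on points 1 and 2, here is a small back-of-the-envelope calculation. The 512-entry TLB is an assumed figure for illustration only (actual UltraSPARC TLB/TSB geometry varies), and the 8K page size shown for the 64K hint reflects the behavior observed above.

/* Back-of-the-envelope numbers for points 1 and 2 above.  The 512-entry
 * TLB size is an assumption for illustration; real TLB/TSB geometry
 * varies by chip. */
#include <stdio.h>

int main(void)
{
    const long tlb_entries = 512;              /* assumed */
    const long extent      = 4L * 1024 * 1024; /* one 4M PGA extent */

    /* hint value, and the OS page size actually observed for that hint
     * (8K for the 64K default, per the truss/pmap output above) */
    const long hint[]  = { 65536, 524288, 4194304 };
    const long pgsz[]  = {  8192, 524288, 4194304 };
    const char *name[] = { "64K (default)", "512K", "4M" };
    int i;

    for (i = 0; i < 3; i++)
        printf("%-14s hint: %3ld mmap calls per 4M extent, TLB reach = %4ld MB\n",
               name[i],
               extent / hint[i],
               tlb_entries * (pgsz[i] / 1024) / 1024);
    return 0;
}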

In order to check for memory wastage, I queried v$sql_workarea_active along with the session pga/uga memory statistics to see how much memory was consumed with each page size setting. By sizing the PGA and setting _smm_max_size appropriately, I ensured that the sort completed optimally, in memory, without spilling to disk.

With the default setting of 64K

Time taken to complete - 30-32 seconds
Workarea Memory used - 1085.010 MB
session pga memory - 1102.92 MB
session uga memory - 1102.3 MB

With 512K
Time taken to complete - 24-28 seconds
Workarea Memory used - 1085.010 MB
session pga memory - 1103.73 MB
session uga memory - 1102.2 MB

With 4M

Time taken to complete - 24-27 seconds
Workarea Memory used - 1085.010 MB
session pga memory - 1112.2 MB
session uga memory - 1103.99 MB

Looking at the above stats, for the same sort operation requiring roughly 1GB of workarea, PGA usage is a fraction higher (~1%) with the larger page sizes. This may matter for very big sorts or when multiple sessions run simultaneously, especially with parallel operations, so there is always a chance of ORA-4030 errors if you do not configure your instance appropriately.

Theoretically the timings should improve because of the smaller number of mmap operations and the reduced TLB/TSB misses. All in all, it probably makes sense to use this feature to enable larger page sizes for data warehousing workloads.

On Linux

On Solaris the _realfree_heap_pagesize_hint works well, since four different page sizes (8K, 64K, 512K and 4M) are supported and can be allocated dynamically. On Linux, however, only two page sizes are supported (4K and 2M). The 2M page size can be allocated only as hugepages, which are used for the SGA; hugepages cannot be used for process private memory.

So on Linux, setting _realfree_heap_pagesize_hint to a larger value only results in _realfree_heap_pagesize_hint-sized chunks within extents; the chunks are not mapped onto physical memory pages of the same size. This still reduces the number of mmap requests and is therefore better than the default.
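
To illustrate the hugepage restriction, here is a small sketch (my own illustration, nothing Oracle does in the realfree heap): on 2.6-era kernels, 2M pages are reachable only through explicit interfaces such as SysV shared memory with SHM_HUGETLB, which is how a hugepage-backed SGA is allocated, whereas the plain private mmap() calls in the strace output below always get 4K pages.

/* Sketch: 2M hugepages on Linux require an explicit interface such as
 * SysV shared memory with SHM_HUGETLB (or hugetlbfs).  There is no
 * equivalent flag for the private anonymous mmap() chunks that back the
 * realfree heap, so those stay on 4K pages. */
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000    /* kernel value from <linux/shm.h> */
#endif

int main(void)
{
    /* A hugepage-backed *shared* segment; this is the kind of allocation
     * the SGA can use when hugepages are configured. */
    int shmid = shmget(IPC_PRIVATE, 4 * 1024 * 1024,
                       IPC_CREAT | SHM_HUGETLB | 0600);
    if (shmid == -1)
        perror("shmget(SHM_HUGETLB)");   /* fails unless hugepages are set up */
    else
        shmctl(shmid, IPC_RMID, NULL);   /* clean up the test segment */
    return 0;
}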

With the default setting of 64K

------------CUT-------------

mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0xf1) = 0xb70f1000
mmap2(0xb70f1000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb70f1000
mmap2(0xb7101000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb7101000

-------CUT-----------

As you can see from above, 64K chunks are requested.

Changing to 4M

----------CUT-----------
mmap2(NULL, 16777216, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x36f1) = 0xb2af1000
mmap2(0xb2af1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb2af1000
mmap2(0xb2ef1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb2ef1000
mmap2(0xb32f1000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb32f1000

---------CUT---------
As you can see from the above, with a setting of 4M the chunks are 4M in size; however, there is no request for a 4M OS page, since that is not feasible on Linux.

Changing to 8M

I was curious to see how this would play out when changing to 8M.

--------CUT-------------
mmap2(NULL, 16777216, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x5af1) = 0xb02f1000
mmap2(0xb02f1000, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb02f1000
mmap2(0xb0af1000, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xb0af1000

--------CUT--------------

The chunks are now 8M in size. I noticed the same behavior on Solaris too, except that no memcntl call was issued to request an OS page size (8M is not a valid page size there).



