refcnt(5)                                                          refcnt(5)
NAME
Memory Reference Counters - Analysis of memory access patterns
DESCRIPTION
The Origin 2000 and Origin 200 hardware provides memory reference counters
to help application programmers tune their algorithms for optimal
performance on a NUMA system. These counters reveal the exact memory
reference patterns of an application or a specific algorithm. With this
information, the programmer can optimize the application's data layout
and give the Operating System specific memory placement hints, maximizing
cache utilization and memory access locality and thereby achieving the
best memory access performance.
Note that this is an Origin 2000/200 capability only, and does not apply
to other Origin platforms.
IMPLEMENTATION
Hardware Reference Counters
Origin 2000 and Origin 200 systems provide a set of counters for every 4
KB hardware page of memory. The number of counters per set depends on the
number of nodes in the system. On systems with 64 or fewer nodes (that
is, up to 128 processors), a counter set has one counter per node, and
each counter counts the references issued to the page by that node; the
application programmer can therefore tell exactly how many references
each node in the system has issued to a page. On systems with more than
64 nodes, a counter set has one counter for every 8 nodes, and each
counter counts the references issued to the page by its group of 8 nodes.
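The sizing rule above can be written as a small function. This is a minimal sketch: counters_per_set() is a hypothetical helper, not an IRIX interface, and rounding up when the node count is not a multiple of 8 is an assumption the text does not address.

```c
#include <assert.h>

/*
 * Hypothetical helper (not an IRIX API): number of counters in one
 * counter set on a system with `nodes` nodes.
 */
int counters_per_set(int nodes)
{
    assert(nodes > 0);
    if (nodes <= 64)
        return nodes;           /* one counter per node */
    /* One counter per group of 8 nodes; rounding up for a partial
     * last group is an assumption. */
    return (nodes + 7) / 8;
}
```

For the 8-node system in the example that follows, counters_per_set(8) is 8; a 128-node system would have 16 counters per set.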
Note that a hardware page is not the same as a base software page (or
just page). A hardware page defines the granularity at which the hardware
performs reference counting and other hardware operations; a base
software page is the smallest unit of memory that user processes can map
via the Translation Look-aside Buffer or the page tables. On Origin 2000
and Origin 200 systems, a hardware page, and therefore the memory
reference counting granularity, is 4 KB, and a base software page is 16
KB, so each base page spans 4 hardware pages and 4 counter sets.
For example, consider an 8-node (16-CPU) Origin 2000 system with the
memory configuration shown in the table below. For each node, the table
shows the number of hardware pages (equal to the number of counter sets),
the total number of counters, and the number of base software pages.
For this configuration of 8 nodes, each counter set has 8 counters (one
per node).
Memory Configuration

Module  Slot  Memory   Hardware Pages  Counter Sets  Total Counters  Base Software Pages
              [bytes]  (Mem/4 KB)      (1/HW page)   (8*Sets)        (Mem/16 KB)
1       n1    512M     128K            128K          1024K           32K
1       n2    256M     64K             64K           512K            16K
1       n3    256M     64K             64K           512K            16K
1       n4    512M     128K            128K          1024K           32K
2       n1    256M     64K             64K           512K            16K
2       n2    64M      16K             16K           128K            4K
2       n3    64M      16K             16K           128K            4K
2       n4    256M     64K             64K           512K            16K
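Every column of the table follows from the two page sizes and the number of counters per set. The sketch below recomputes one row; compute_node_counts() and struct node_counts are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define HW_PAGE_SIZE   (4u * 1024u)   /* reference counting granularity */
#define BASE_PAGE_SIZE (16u * 1024u)  /* base software page size */

/* Derived per-node figures: one row of the table above. */
struct node_counts {
    uint64_t hw_pages;       /* hardware pages = counter sets */
    uint64_t total_counters; /* counter sets * counters per set */
    uint64_t base_pages;     /* 16 KB base software pages */
};

/* Compute the table columns for a node with `mem_bytes` of memory on a
 * system whose counter sets hold `counters_per_set` counters. */
struct node_counts compute_node_counts(uint64_t mem_bytes,
                                       unsigned counters_per_set)
{
    struct node_counts c;
    assert(mem_bytes % BASE_PAGE_SIZE == 0);
    c.hw_pages = mem_bytes / HW_PAGE_SIZE;
    c.total_counters = c.hw_pages * counters_per_set;
    c.base_pages = mem_bytes / BASE_PAGE_SIZE;
    return c;
}
```

For a 256 MB node with 8 counters per set this yields 64K hardware pages, 512K counters and 16K base pages, matching the corresponding rows of the table.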
The width of each counter also depends on the system configuration. On
systems with more than 16 nodes (32 CPUs), the counters are 19 bits wide
(maximum count 0x7ffff). On smaller systems, the counter width depends on
the type of directory SIMMs installed in the machine: with STANDARD SIMMs
the counters are 11 bits wide (maximum count 0x7ff); with PREMIUM SIMMs
they are 19 bits wide.
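The width rule and the resulting maximum count can be sketched as follows. The enum and function names are hypothetical, and treating a system with exactly 16 nodes like the smaller systems is an assumption, since the text only covers "more than" and "less than" 16 nodes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the two directory SIMM types. */
enum simm_type { SIMM_STANDARD, SIMM_PREMIUM };

/* Width in bits of a hardware reference counter.  Treating exactly
 * 16 nodes like the smaller systems is an assumption. */
unsigned hw_counter_bits(int nodes, enum simm_type simms)
{
    assert(nodes > 0);
    if (nodes > 16)
        return 19;
    return simms == SIMM_PREMIUM ? 19 : 11;
}

/* Largest count a counter of the given width can hold before it pegs. */
uint32_t hw_counter_max(unsigned bits)
{
    assert(bits > 0 && bits < 32);
    return (UINT32_C(1) << bits) - 1;
}
```

An 8-node system with STANDARD SIMMs gets hw_counter_max(11) = 0x7ff (2047); any system with more than 16 nodes gets 0x7ffff (524287).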
Software Extended Reference Counters
The hardware counters peg (saturate) when they reach their maximum count.
This is a problem for the 11-bit counters, which peg after only 0x7ff
(2047) references to a page from one node. To let application
programmers keep track of memory references beyond this small number,
Cellular IRIX provides Software Extended Memory Reference Counters.
The Extended Counters are implemented as an array of 32-bit counters that
closely mirror the hardware counters, extending their maximum count to
2^32 - 1. The hardware counters are set up to raise an interrupt when
they reach a threshold close to their maximum count. When the operating
system receives this interrupt, it adds the current hardware count to the
corresponding software extended counter and resets the hardware counter
to 0. This update is performed for the complete counter set: on an
overflow interrupt, the operating system updates not only the overflowing
counter but also every other counter in its set.
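The update procedure can be modeled in user space to show why no references are lost and why the whole set is flushed at once. This is a simulation, not kernel code: struct counter_set, record_reference() and overflow_interrupt() are hypothetical, and the threshold of 64 below the 11-bit maximum is an assumed value for "close to the maximum count".

```c
#include <assert.h>
#include <stdint.h>

#define NODES        8               /* counters per set (8-node system) */
#define HW_MAX       0x7ffu          /* 11-bit hardware counter maximum */
#define HW_THRESHOLD (HW_MAX - 64u)  /* assumed interrupt threshold */

/* One counter set: simulated hardware counters plus 32-bit mirrors. */
struct counter_set {
    uint32_t hw[NODES];   /* simulated hardware counters */
    uint32_t ext[NODES];  /* software extended counters */
};

/* Simulated overflow interrupt: fold every counter in the set into its
 * extended mirror and reset the hardware counter to 0. */
static void overflow_interrupt(struct counter_set *s)
{
    for (int i = 0; i < NODES; i++) {
        s->ext[i] += s->hw[i];
        s->hw[i] = 0;
    }
}

/* Simulate one reference to the page issued by node `node`. */
void record_reference(struct counter_set *s, int node)
{
    assert(node >= 0 && node < NODES);
    if (s->hw[node] < HW_MAX)
        s->hw[node]++;               /* hardware pegs at its maximum */
    if (s->hw[node] >= HW_THRESHOLD)
        overflow_interrupt(s);
}
```

After any number of references, ext[i] + hw[i] equals the number of references from node i, and a flush triggered by one node also moves the pending counts of every other node in the set.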
INTERFACE
Enabling Reference Counting
To enable reference counting for a section of virtual memory within an
application, the programmer can use a Policy Module (mmci(5)) with the
migration policy set to "MigrationRefcnt".
Hardware Reference Counters
The hardware reference counters for a section of an address space can be
accessed using procfs (proc(4)). The ioctl command code used for this
purpose is PIOCGETSN0REFCNTRS. The third argument specifies both the
virtual address range whose counters are requested and the buffer into
which the system copies the counter values. This argument is of type
sn0_refcnt_args_t, as defined in <sys/SN/hwcntrs.h>:
typedef struct sn0_refcnt_args {
    caddr_t           vaddr;
    long              len;
    sn0_refcnt_buf_t *buf;
} sn0_refcnt_args_t;
The first field vaddr is the base of the virtual address space range, the
field len is the corresponding length in bytes, and the field buf is a
pointer to a user buffer where the system will store the counter values
and additional information. This buffer is an array of elements of type
sn0_refcnt_buf_t, where each element corresponds to the counter
information associated with one hardware page:
typedef struct sn0_refcnt_buf {
    sn0_refcnt_set_t refcnt_set;
    __uint64_t       paddr;
    __uint64_t       page_size;
    cnodeid_t        cnodeid;
} sn0_refcnt_buf_t;
The field refcnt_set contains the counter set for the corresponding
hardware page of the range passed via sn0_refcnt_args, paddr is the
address of the physical page backing that virtual address, page_size is
the page size used to map it, and cnodeid is the home node of the
physical page, expressed as a Compact Node Identifier, which can be
mapped back to a node name with the topology(1) command. The type of
refcnt_set, sn0_refcnt_set_t, is defined as:
typedef struct sn0_refcnt_set {
    refcnt_t   refcnt[SN0_REFCNT_MAX_COUNTERS];
    __uint64_t flags;
} sn0_refcnt_set_t;
The field refcnt is the actual set of counters (one counter per node, or
one per group of 8 nodes on systems with more than 64 nodes), and flags
is a state vector reserved for future use. The counters in refcnt are
ordered by Compact Node Identifier, also known as cnodeid (numa(5)).
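A caller of PIOCGETSN0REFCNTRS typically needs two pieces of arithmetic: how many sn0_refcnt_buf_t elements to allocate for a range (one per 4 KB hardware page the range overlaps), and how to turn one page's counters and cnodeid into a measure of remote traffic. The sketch below shows both with hypothetical helper names. It assumes refcnt_t values fit in a uint32_t and, for the remote ratio, one counter per node (64 or fewer nodes); the IRIX-specific ioctl call itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HW_PAGE_SIZE 4096u   /* 4 KB reference counting granularity */

/* Number of sn0_refcnt_buf_t elements for [vaddr, vaddr + len): one per
 * hardware page the range overlaps.  Whether the kernel requires vaddr
 * to be page aligned is not stated; this counts every page touched. */
size_t refcnt_buf_entries(uintptr_t vaddr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = vaddr / HW_PAGE_SIZE;
    uintptr_t last = (vaddr + len - 1) / HW_PAGE_SIZE;
    return (size_t)(last - first + 1);
}

/* Fraction of a page's references issued by nodes other than its home
 * node.  `refcnt` holds the page's counters (refcnt_set.refcnt) and
 * `home` is the page's cnodeid; assumes one counter per node. */
double remote_fraction(const uint32_t *refcnt, int ncounters, int home)
{
    uint64_t total = 0;
    assert(home >= 0 && home < ncounters);
    for (int i = 0; i < ncounters; i++)
        total += refcnt[i];
    if (total == 0)
        return 0.0;
    return (double)(total - refcnt[home]) / (double)total;
}
```

On IRIX, a tool would presumably allocate refcnt_buf_entries(vaddr, len) elements, fill in an sn0_refcnt_args_t with vaddr, len and the buffer, and pass it as the third ioctl argument on the process's procfs file. Pages with a high remote fraction are candidates for placement hints or data layout changes.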
Software Extended Reference Counters
The extended reference counters for a section of an address space can be
accessed through procfs (proc(4)) with essentially the same interface
described above for the hardware reference counters. The ioctl command
code used for this purpose is PIOCGETSN0EXTREFCNTRS, which differs from
the hardware counter command only by the prefix EXT before the word
REFCNTRS. The third argument, a sn0_refcnt_args_t structure, and the
returned array of sn0_refcnt_buf_t elements are exactly as described
above for PIOCGETSN0REFCNTRS, including the sn0_refcnt_set_t layout and
the ordering of the counters by cnodeid. The only difference is that each
refcnt_set receives the software extended counts instead of the raw
hardware counts.
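Because the overflow handler adds the hardware count into the extended counter and then resets the hardware counter, a tool that reads both kinds of counters can estimate the up-to-date count of each counter as their sum. The sketch below assumes that PIOCGETSN0EXTREFCNTRS returns only the counts already folded in by the overflow handler (the caveat that the extended counters may lag the hardware counters suggests this, but it is an assumption) and that refcnt_t values fit in a uint32_t. combine_counts() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the current count for each counter of one page by adding
 * the extended value (ext, from PIOCGETSN0EXTREFCNTRS) to the hardware
 * value (hw, from PIOCGETSN0REFCNTRS).  The two reads are not atomic,
 * and a pegged hardware counter has already lost references, so the
 * result is an estimate. */
void combine_counts(const uint32_t *ext, const uint32_t *hw,
                    int ncounters, uint64_t *out)
{
    assert(ncounters >= 0);
    for (int i = 0; i < ncounters; i++)
        out[i] = (uint64_t)ext[i] + hw[i];
}
```

The 64-bit result avoids overflow when an extended counter close to 2^32 - 1 is combined with a non-zero hardware count.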
Memory Mapped Software Extended Reference Counters
The extended reference counters can also be accessed by mapping them into
an application's virtual address space with mmap(2). This interface is
intended for performance tools that provide a global system view rather
than a per-process view.
The interface is provided by a device driver: each node in an Origin
system has a refcnt device that represents the node's reference counters.
The reference counter devices for an 8-node system are:
/hw/module/2/slot/n1/node/refcnt
/hw/module/2/slot/n2/node/refcnt
/hw/module/2/slot/n3/node/refcnt
/hw/module/2/slot/n4/node/refcnt
/hw/module/1/slot/n1/node/refcnt
/hw/module/1/slot/n2/node/refcnt
/hw/module/1/slot/n3/node/refcnt
/hw/module/1/slot/n4/node/refcnt
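The device names follow a regular pattern, /hw/module/<module>/slot/n<slot>/node/refcnt, so a tool that walks every node can build them programmatically. refcnt_device_path() below is a hypothetical helper; it assumes node boards sit in slots named n<number>, as in the listing, which may not hold for every configuration.

```c
#include <assert.h>
#include <stdio.h>

/* Build the refcnt device path for the node board in slot n<slot> of
 * module <module>.  Returns the path length on success, or -1 if the
 * buffer is too small or formatting fails. */
int refcnt_device_path(char *buf, size_t size, int module, int slot)
{
    assert(buf != NULL);
    int n = snprintf(buf, size, "/hw/module/%d/slot/n%d/node/refcnt",
                     module, slot);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

For module 2, slot n3 this produces "/hw/module/2/slot/n3/node/refcnt", the third entry in the list above.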
To map the counters of a node, open the node's refcnt device and then,
using the resulting file descriptor, obtain information about the
counters with ioctl(fd, RCB_INFO_GET, &rcbinfo), where rcbinfo is of type
rcb_info_t, defined in <sys/SN/hwcntrs.h>:
typedef struct rcb_info {
    __uint64_t rcb_len;                 /* total refcnt buffer len in bytes */
    int        rcb_sw_sets;             /* number of sw counter sets in buffer */
    int        rcb_sw_counters_per_set; /* sw counters per set -- numnodes */
    int        rcb_sw_counter_size;     /* sizeof(refcnt_t) -- size of sw cntr */
    int        rcb_base_pages;          /* number of base pages in node */
    int        rcb_base_page_size;      /* sw base page size */
    __uint64_t rcb_base_paddr;          /* base physical address for this node */
    int        rcb_cnodeid;             /* cnodeid for this node */
    int        rcb_granularity;         /* hw page size used for counter sets */
    uint       rcb_hw_counter_max;      /* max hwcounter count (width mask) */
    int        rcb_diff_threshold;      /* current node differential threshold */
    int        rcb_abs_threshold;       /* current node absolute threshold */
    int        rcb_num_slots;           /* physmem slots */
} rcb_info_t;
Physical memory in a node is not always contiguous, so additional
information is needed to locate the counters associated with a physical
page in the counter buffer. Physical memory within a node is divided into
a number of contiguous sections called "slots". The slot configuration
for a node can be obtained with ioctl(fd, RCB_SLOT_GET, slotconfig),
where slotconfig is of type rcb_slot_t, defined in
<sys/SN/hwcntrs.h>:
typedef struct rcb_slot {
    __uint64_t base;   /* Base physical address for slot */
    __uint64_t size;   /* Size of slot in bytes */
} rcb_slot_t;
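With the rcb_info_t and slot information in hand, a tool still has to find the counter set for a particular physical page inside the mapped buffer. The sketch below assumes a compact layout: one counter set per rcb_granularity bytes of memory, with the sets for each slot stored back to back in slot order and nothing stored for the holes between slots. This layout is inferred from the description above rather than documented, so treat it as a hypothesis to verify against the driver. counter_set_offset() and struct slot_desc are hypothetical names; the last three arguments correspond to rcb_granularity, rcb_sw_counters_per_set and rcb_sw_counter_size.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the two fields of rcb_slot_t. */
struct slot_desc {
    uint64_t base;   /* base physical address of the slot */
    uint64_t size;   /* size of the slot in bytes */
};

/* ASSUMED layout: one counter set per `granularity` bytes of memory,
 * slots stored back to back in order, no entries for holes.
 * Returns the byte offset of the counter set for `paddr` within the
 * mapped buffer, or -1 if `paddr` is not in any slot of this node. */
int64_t counter_set_offset(const struct slot_desc *slots, int nslots,
                           uint64_t paddr, uint64_t granularity,
                           int counters_per_set, int counter_size)
{
    uint64_t preceding = 0;   /* bytes of memory in earlier slots */
    assert(granularity > 0);
    for (int i = 0; i < nslots; i++) {
        if (paddr >= slots[i].base && paddr - slots[i].base < slots[i].size) {
            uint64_t set = (preceding + (paddr - slots[i].base)) / granularity;
            return (int64_t)(set * (uint64_t)counters_per_set
                                 * (uint64_t)counter_size);
        }
        preceding += slots[i].size;
    }
    return -1;
}
```

On IRIX, the slot array would presumably come from RCB_SLOT_GET (with rcb_num_slots entries) and the buffer from mapping rcb_len bytes of the refcnt device with mmap(2); the exact mmap arguments the driver expects are not described here.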
CAVEATS
When enabled, the reference counters can consume a considerable amount
of memory for the per-node reference tables.
The reference counters are not virtualized: they are associated with
physical pages, not virtual pages. If a process starts paging, or its
pages migrate, the counter set associated with a given virtual page
changes.
The extended memory reference counters may be out of sync with the
hardware reference counters by up to the hardware reference counter
maximum count (2047 for 11-bit counters and 524287 for 19-bit counters).
SEE ALSO
For more information, see numa(5), mmci(5), proc(4), migration(5), sn(1),
nstats(1)