INTRO_SHMEM(3)                                                  INTRO_SHMEM(3)
NAME
intro_shmem - Introduction to shared memory access routines
DESCRIPTION
The shared memory access (SHMEM) routines provide low-latency, high-
bandwidth communication for use in highly parallelized scalable
programs. The routines in the SHMEM application programming interface
(API) provide a programming model for exchanging data between
cooperating parallel processes. The resulting programs are similar in
style to Message Passing Interface (MPI) programs. The SHMEM API can
be used either alone or in combination with MPI routines in the same
parallel program.
A SHMEM program is SPMD (single program, multiple data) in style. The
SHMEM processes, called processing elements or PEs, all start at the
same time, and they all run the same program. Usually the PEs perform
computation on their own subdomains of the larger problem, and
periodically communicate with other PEs to exchange information on
which the next computation phase depends.
The SHMEM routines minimize the overhead associated with data transfer
requests, maximize bandwidth, and minimize data latency. Data latency
is the period of time that starts when a PE initiates a transfer of
data and ends when a PE can use the data.
SHMEM routines support remote data transfer through put operations,
which transfer data to a different PE, get operations, which transfer
data from a different PE, and remote pointers, which allow direct
references to data objects owned by another PE. Other operations
supported are collective broadcast and reduction, barrier
synchronization, and atomic memory operations. An atomic memory
operation is an atomic read-and-update operation, such as a
fetch-and-increment, on a remote or local data object.
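As an illustration of these operation classes, the following C sketch
(the variable names are illustrative only, and a job with at least two
PEs is assumed) combines a put, a get, and an atomic
fetch-and-increment:
    #include <stdio.h>
    #include <mpp/shmem.h>

    static long counter;        /* symmetric: remotely accessible on every PE */
    static long buf[4];         /* symmetric target of the put */

    int main(void)
    {
        long local[4] = { 10, 20, 30, 40 };
        long copy[4];

        start_pes(0);
        if (_my_pe() == 0) {
            shmem_long_put(buf, local, 4, 1);  /* put: write 4 longs into buf on PE 1 */
            shmem_long_finc(&counter, 1);      /* atomic fetch-and-increment on PE 1  */
        }
        shmem_barrier_all();                   /* make the put visible everywhere     */
        shmem_long_get(copy, buf, 4, 1);       /* get: read buf back from PE 1        */
        if (_my_pe() == 1)
            printf("counter on PE 1 is %ld\n", counter);
        return 0;
    }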
SHMEM Routines
This section lists the significant SHMEM message-passing routines.
* PE queries:
C/C++ only: _num_pes(3I), _my_pe(3I)
Fortran only: NUM_PES(3I), MY_PE(3I)
* Elemental data put routines:
C/C++ only: shmem_double_p, shmem_float_p, shmem_int_p,
shmem_long_p, shmem_short_p
* Block data put routines:
C/C++ and Fortran: shmem_put32, shmem_put64, shmem_put128
C/C++ only: shmem_double_put, shmem_float_put,
shmem_int_put, shmem_long_put,
shmem_short_put
Fortran only: shmem_complex_put, shmem_integer_put,
shmem_logical_put, shmem_real_put
* Elemental data get routines:
C/C++ only: shmem_double_g, shmem_float_g, shmem_int_g,
shmem_long_g, shmem_short_g
* Block data get routines:
C/C++ and Fortran: shmem_get32, shmem_get64, shmem_get128
C/C++ only: shmem_double_get, shmem_float_get,
shmem_int_get, shmem_long_get,
shmem_short_get
Fortran only: shmem_complex_get, shmem_integer_get,
shmem_logical_get, shmem_real_get
* Strided put routines:
C/C++ and Fortran: shmem_iput32, shmem_iput64, shmem_iput128
C/C++ only: shmem_double_iput, shmem_float_iput,
shmem_int_iput, shmem_long_iput,
shmem_short_iput
Fortran only: shmem_complex_iput, shmem_integer_iput,
shmem_logical_iput, shmem_real_iput
* Strided get routines:
C/C++ and Fortran: shmem_iget32, shmem_iget64, shmem_iget128
C/C++ only: shmem_double_iget, shmem_float_iget,
shmem_int_iget, shmem_long_iget,
shmem_short_iget
Fortran only: shmem_complex_iget, shmem_integer_iget,
shmem_logical_iget, shmem_real_iget
* Point-to-point synchronization routines:
C/C++ only: shmem_int_wait, shmem_int_wait_until,
shmem_long_wait, shmem_long_wait_until,
shmem_longlong_wait,
shmem_longlong_wait_until, shmem_short_wait,
shmem_short_wait_until
Fortran: shmem_int4_wait, shmem_int4_wait_until,
shmem_int8_wait, shmem_int8_wait_until
* Barrier synchronization routines:
C/C++ and Fortran: shmem_barrier_all, shmem_barrier
* Atomic memory fetch-and-operate (fetch-op) routines:
C/C++ and Fortran: shmem_swap
* Reduction routines:
C/C++ only: shmem_int_and_to_all, shmem_long_and_to_all,
shmem_longlong_and_to_all,
shmem_short_and_to_all,
shmem_double_max_to_all,
shmem_float_max_to_all, shmem_int_max_to_all,
shmem_long_max_to_all,
shmem_longlong_max_to_all,
shmem_short_max_to_all,
shmem_double_min_to_all,
shmem_float_min_to_all, shmem_int_min_to_all,
shmem_long_min_to_all,
shmem_longlong_min_to_all,
shmem_short_min_to_all,
shmem_double_sum_to_all,
shmem_float_sum_to_all, shmem_int_sum_to_all,
shmem_long_sum_to_all,
shmem_longlong_sum_to_all,
shmem_short_sum_to_all,
shmem_double_prod_to_all,
shmem_float_prod_to_all,
shmem_int_prod_to_all,
shmem_long_prod_to_all,
shmem_longlong_prod_to_all,
shmem_short_prod_to_all, shmem_int_or_to_all,
shmem_long_or_to_all,
shmem_longlong_or_to_all,
shmem_short_or_to_all, shmem_int_xor_to_all,
shmem_long_xor_to_all,
shmem_longlong_xor_to_all,
shmem_short_xor_to_all
Fortran only: shmem_int4_and_to_all, shmem_int8_and_to_all,
shmem_real4_max_to_all,
shmem_real8_max_to_all,
shmem_int4_max_to_all, shmem_int8_max_to_all,
shmem_real4_min_to_all,
shmem_real8_min_to_all,
shmem_int4_min_to_all, shmem_int8_min_to_all,
shmem_real4_sum_to_all,
shmem_real8_sum_to_all,
shmem_int4_sum_to_all, shmem_int8_sum_to_all,
shmem_real4_prod_to_all,
shmem_real8_prod_to_all,
shmem_int4_prod_to_all,
shmem_int8_prod_to_all, shmem_int4_or_to_all,
shmem_int8_or_to_all, shmem_int4_xor_to_all,
shmem_int8_xor_to_all
* Broadcast routines:
C/C++ and Fortran: shmem_broadcast32, shmem_broadcast64
* Generalized barrier synchronization routine:
C/C++ and Fortran: shmem_barrier
* Cache management routines:
C/C++ and Fortran: shmem_udcflush, shmem_udcflush_line
* Byte-granularity block put and get routines:
C/C++ and Fortran: shmem_putmem and shmem_getmem
Fortran only: shmem_character_put and shmem_character_get
* Collect routines:
C/C++ and Fortran: shmem_collect32, shmem_collect64,
shmem_fcollect32, shmem_fcollect64
* Atomic memory fetch-and-operate (fetch-op) routines:
C/C++ only: shmem_double_swap, shmem_float_swap,
shmem_int_cswap, shmem_int_fadd,
shmem_int_finc, shmem_int_swap,
shmem_long_cswap, shmem_long_fadd,
shmem_long_finc, shmem_long_swap,
shmem_longlong_cswap, shmem_longlong_fadd,
shmem_longlong_finc, shmem_longlong_swap
Fortran only: shmem_int4_cswap, shmem_int4_fadd,
shmem_int4_finc, shmem_int4_swap,
shmem_int8_swap, shmem_real4_swap,
shmem_real8_swap, shmem_int8_cswap
* Atomic memory operation routines:
Fortran only: shmem_int4_add, shmem_int4_inc
* Remote memory pointer function:
C/C++ and Fortran: shmem_ptr
* Reduction routines:
C/C++ only: shmem_longdouble_max_to_all,
shmem_longdouble_min_to_all,
shmem_longdouble_prod_to_all,
shmem_longdouble_sum_to_all
Fortran only: shmem_real16_max_to_all,
shmem_real16_min_to_all,
shmem_real16_prod_to_all,
shmem_real16_sum_to_all
* Accessibility query routines:
C/C++ and Fortran: shmem_pe_accessible, shmem_addr_accessible
Symmetric Data Objects
Consistent with SHMEM's SPMD programming style is the concept of
symmetric data objects, which are arrays or variables that exist with
the same size, type, and relative address on all PEs. Another term
for symmetric data objects is "remotely accessible data objects." In
the interface definitions for SHMEM data transfer routines, one or
more of the parameters are typically required to be symmetric or
remotely accessible.
The following kinds of data objects are symmetric:
* Fortran data objects in common blocks or with the SAVE attribute.
These data objects must not be defined in a dynamic shared object
(DSO).
* Non-stack C and C++ variables. These data objects must not be
defined in a DSO.
* Fortran arrays allocated with shpalloc(3F)
* C and C++ data allocated by shmalloc(3C)
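The distinction can be seen in the following C sketch (the names and
sizes are illustrative, not taken from the lists above):
    #include <mpp/shmem.h>

    long global_buf[100];            /* non-stack C variable: symmetric   */
    static double static_buf[100];   /* file-scope static data: symmetric */

    int main(void)
    {
        long stack_buf[100];         /* automatic (stack) variable: NOT symmetric */
        long *heap_buf;

        start_pes(0);

        /* Symmetric heap data is symmetric provided every PE performs
           the same allocation. */
        heap_buf = (long *) shmalloc(100 * sizeof(long));

        /* global_buf, static_buf, and heap_buf may appear as the remotely
           accessed argument of SHMEM data transfer routines; stack_buf may
           be used only as the local argument. */

        shfree(heap_buf);
        return 0;
    }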
Collective Routines
Some SHMEM routines, for example, shmem_broadcast(3) and
shmem_float_sum_to_all(3), are classified as collective routines
because they distribute work across a set of PEs. They must be called
concurrently by all PEs in the active set defined by the PE_start,
logPE_stride, PE_size argument triplet. The following man pages
describe the SHMEM collective routines:
* shmem_and(3)
* shmem_barrier(3)
* shmem_broadcast(3)
* shmem_collect(3)
* shmem_max(3)
* shmem_min(3)
* shmem_or(3)
* shmem_prod(3)
* shmem_sum(3)
* shmem_xor(3)
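As a rough sketch of a collective call, the following C fragment sums
one integer across all PEs; the active set triplet (0, 0, _num_pes())
selects every PE, and the pWrk and pSync sizes use the constants
defined in <mpp/shmem.h> (see shmem_sum(3) for the exact requirements):
    #include <stdio.h>
    #include <mpp/shmem.h>

    /* Every argument passed to a collective routine must be symmetric. */
    static int  src, dst;
    static int  pWrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];
    static long pSync[_SHMEM_REDUCE_SYNC_SIZE];

    int main(void)
    {
        int i;

        start_pes(0);
        for (i = 0; i < _SHMEM_REDUCE_SYNC_SIZE; i++)
            pSync[i] = _SHMEM_SYNC_VALUE;
        src = _my_pe();
        shmem_barrier_all();    /* ensure pSync is initialized on all PEs */

        /* Active set: PE_start = 0, logPE_stride = 0, PE_size = _num_pes().
           Every PE in the active set must make this call concurrently.   */
        shmem_int_sum_to_all(&dst, &src, 1, 0, 0, _num_pes(), pWrk, pSync);

        printf("PE %d: sum of PE numbers = %d\n", _my_pe(), dst);
        return 0;
    }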
Using the Symmetric Work Array, pSync
Multiple pSync arrays are often needed if a particular PE calls a
SHMEM collective routine twice without intervening barrier
synchronization. Problems would occur if some PEs in the active set
for call 2 arrive at call 2 before processing of call 1 is complete by
all PEs in the call 1 active set. You can use shmem_barrier() or
shmem_barrier_all(3) to perform a barrier synchronization between
consecutive calls to SHMEM collective routines.
There are two special cases:
* The shmem_barrier(3) routine allows the same pSync array to be used
on consecutive calls as long as the active PE set does not change.
* If the same collective routine is called multiple times with the
same active set, the calls may alternate between two pSync arrays.
The SHMEM routines guarantee that a first call is completely
finished by all PEs by the time processing of a third call begins
on any PE.
Because the SHMEM routines restore pSync to its original contents,
multiple calls that use the same pSync array do not require that pSync
be reinitialized after the first call.
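A minimal C sketch of the alternating-pSync technique described above
(the data sizes are arbitrary; the broadcast arguments follow
shmem_broadcast(3)):
    #include <mpp/shmem.h>

    static long src[8], dst[8];

    /* Two pSync arrays used on alternate calls, so that call n+1 cannot
       interfere with call n when no barrier separates them. */
    static long pSync[2][_SHMEM_BCAST_SYNC_SIZE];

    int main(void)
    {
        int i, iter;

        start_pes(0);
        for (i = 0; i < _SHMEM_BCAST_SYNC_SIZE; i++)
            pSync[0][i] = pSync[1][i] = _SHMEM_SYNC_VALUE;
        shmem_barrier_all();   /* pSync initialized everywhere before first use */

        for (iter = 0; iter < 10; iter++)
            /* Same routine, same active set, no intervening barrier:
               alternate between the two work arrays. */
            shmem_broadcast64(dst, src, 8, 0, 0, 0, _num_pes(),
                              pSync[iter % 2]);

        return 0;
    }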
ENVIRONMENT VARIABLES
This section describes the variables that specify the environment
under which your SHMEM programs will run. On IRIX, these also affect
the way 64-bit MPI programs will run. Environment variables have
predefined values. You can change some variables to achieve
particular performance objectives.
Barrier Related Environment Variables
The default behavior of the SHMEM barrier can be modified using the
following environment variables:
SMA_BAR_COUNTER (IRIX systems only)
Specifies the use of a simple counter barrier algorithm.
Default: Enabled for jobs with PE counts less than 64
SMA_BAR_DISSEM (IRIX systems only)
Specifies the use of the alternate barrier algorithm, the
dissemination/butterfly, within the shmem_barrier_all(3)
function. This alternate algorithm provides better performance
on jobs with larger PE counts.
Default: Enabled for jobs with PE counts of 64 or higher
SMA_NO_FETCHOP (IRIX systems only)
Disables the use of hardware "fetchops" in the barrier algorithm.
Default: Not enabled
Symmetric Heap Related Environment Variables
The default behavior of the symmetric heap can be modified using the
following environment variables:
SMA_SYMMETRIC_SIZE (Also available on SGI Altix 3000 systems)
Specifies the size, in bytes, of the symmetric heap memory per
PE.
Default: On IRIX systems, 67108864 bytes (64 MB) per PE. On Altix
systems, the total machine memory divided by the number of processors
on the system.
SMA_SYMMETRIC_SHATTR_OFF (IRIX systems only)
Starting with IRIX 6.5.18, the symmetric heap makes use of system
V shared memory segments with shared page tables to reduce kernel
memory requirements. Setting this environment variable will
disable the use of this feature.
Default: Not enabled for IRIX 6.5.18 and higher
SMA_SYMMETRIC_PREATTACH (IRIX systems only)
Starting with IRIX 6.5.18, the symmetric heap is implemented with
system V shared memory segments. To minimize kernel resource
requirements, these segments are normally attached only when
necessary. Setting this environment variable might lead to some
improvement in runtime performance at the expense of longer job
startup and shutdown times.
Default: Not enabled for IRIX 6.5.18 and higher
SMA_SYMMETRIC_METHOD (IRIX systems only)
Allows for controlling the method used to implement the symmetric
heap. This environment variable can be set to one of the
following values:
Value Action
mmap With this setting, a shared
/dev/zero mapping is used to
implement the symmetric heap. This
is the default method for IRIX
6.5.17 and older releases.
sysv With this setting, system V shared
memory segments with the shared
page table attribute are used to
implement the symmetric heap. This
is the default method for IRIX
6.5.18 and higher.
Static Cross Mapping Related Environment Variables
The default behavior of the SHMEM static cross mapping procedure can
be modified using the following environment variables:
SMA_STATIC_PREATTACH (IRIX systems only)
Starting with IRIX 6.5.2, static cross mapping is implemented
primarily using system V shared memory segments. To minimize
kernel resource requirements, these segments are normally
attached only when necessary. Setting this environment variable
might lead to some improvement in runtime performance at the
expense of significantly increased startup and shutdown times for
high PE count jobs.
Default: Not enabled for IRIX 6.5.2 and higher
SMA_STATIC_SHATTR_OFF (IRIX systems only)
Starting with IRIX 6.5.15, system V shared memory segments with
shared page tables are used to cross map static memory when
possible. The static section must be significantly larger than
32 MBytes in order to use shared page tables. Setting this
environment variable disables the use of this feature.
Default: Not enabled for IRIX 6.5.15 and higher
SMA_STATIC_METHOD (IRIX systems only)
Allows for controlling the method used to implement the static
cross mapping. This environment variable can be set to one of
the following values:
Value Action
mmap With this setting, memory mapped
files are used to implement the
static cross mapping. This is the
default for early IRIX 6.5
releases.
sysv With this setting, System V shared
memory segments are used to implement
the static cross mapping. This is the
default method for IRIX 6.5.2 and higher.
SMA_PREATTACH (IRIX systems only)
Setting this shell variable is equivalent to setting both
SMA_SYMMETRIC_PREATTACH and SMA_STATIC_PREATTACH.
Default: Not enabled for IRIX 6.5.2 and higher
SMA_SHATTR_OFF (IRIX systems only)
Setting this shell variable is equivalent to setting both
SMA_SYMMETRIC_SHATTR_OFF and SMA_STATIC_SHATTR_OFF.
Default: Not enabled for IRIX 6.5.2 and higher
Debugging Related Environment Variables
Several environment variables are available to assist in debugging
SHMEM applications:
SMA_COREFILE (IRIX systems only)
Setting this environment variable causes the SHMEM library to
generate a corefile if an error is encountered at job startup.
Default: Not enabled
SMA_DEBUG (Also available on SGI Altix 3000 systems)
Prints out copious data at job startup and during job execution
about SHMEM internal operations.
Default: Not enabled
SMA_DBX (IRIX systems only)
Specifies the PE number to be debugged. If you set SMA_DBX to n,
PE n prints a message during program startup, describing how to
attach to it with the DBX debugger. PE n sleeps for seven
seconds. If you set SMA_DBX to n,s, PE n will sleep for s
seconds.
Default: Not enabled
SMA_INFO (Also available on SGI Altix 3000 systems)
Prints information about environment variables that can control
libsma execution.
Default: Not enabled
SMA_MALLOC_DEBUG
Activates debug checking of the symmetric heap. With this
variable set, the symmetric heap is checked for consistency upon
each invocation of a symmetric heap related routine. Setting
this variable significantly increases the overhead associated
with symmetric heap management operations.
Default: Not enabled
SMA_STATIC_VERBOSE (IRIX systems only)
Prints out information relevant to the static cross mapping
procedure at job startup.
Default: Not enabled
SMA_SYMMETRIC_VERBOSE (IRIX systems only)
Prints out information relevant to the symmetric heap
initialization at job startup.
Default: Not enabled
SMA_VERBOSE (IRIX systems only)
Prints out additional information relevant to the SHMEM startup
procedure.
Default: Not enabled
SMA_VERSION (Also available on SGI Altix 3000 systems)
Prints the libsma library release version.
Default: Not enabled
Memory Placement Related Environment Variables
On non-uniform memory access (NUMA) systems, such as Origin series
systems, SHMEM start-up processing ensures that the process associated
with a SHMEM PE executes on a processor near the memory associated
with that PE.
On Altix systems, the available MPI memory placement environment
variables should be used.
The following environment variables allow you to control the placement
of the SHMEM application on the system:
Variable Description
PAGESIZE_DATA (IRIX systems only)
Specifies the desired page size in kilobytes for
program data areas. You must specify an integer
value. On Origin series systems, supported values
include 16, 64, 256, 1024, and 4096.
SMA_DPLACE_INTEROP_OFF (IRIX systems only)
Disables a SHMEM/dplace interoperability feature
available beginning with IRIX 6.5.13. By setting
this variable, you can obtain the behavior of
SHMEM with dplace on older releases of IRIX. By
default, this variable is not enabled.
SMA_DSM_CPULIST (IRIX systems only)
Specifies a list of CPUs on which to run a SHMEM
application. To ensure that processes are linked
to CPUs, this variable should be used in
conjunction with SMA_DSM_MUSTRUN.
For an explanation of the syntax for this
environment variable, see the section entitled
"Using a CPU List."
SMA_DSM_MUSTRUN (IRIX systems only)
Enforces memory locality for SHMEM processes. Use
of this feature ensures that each SHMEM process
will get a CPU and physical memory on the node to
which it was originally assigned. This variable
has been observed to improve program performance
on IRIX systems running release 6.5.7 and earlier,
when running a program on a quiet system. With
later IRIX releases, under certain circumstances,
setting this variable is not necessary.
Internally, this feature directs the library to
use the process_cpulink(3) function instead of
process_mldlink(3) to control memory placement.
SMA_DSM_MUSTRUN should not be used when the job is
submitted to miser (see miser_submit(1)) because
program hangs may result. By default, this
variable is not enabled.
The process_cpulink(3) function is inherited
across process fork(2) or sproc(2). For this
reason, when using mixed SHMEM/OpenMP
applications, it is recommended either that this
variable not be set, or that _DSM_MUSTRUN also be
set (see p_environ(5)).
SMA_DSM_OFF (IRIX systems only)
When set to any value, deactivates
processor-memory affinity control. When set,
SHMEM processes run on any available processor,
regardless of whether it is near the memory
associated with that process.
SMA_DSM_PPM (IRIX systems only)
When set to an integer value, specifies the number
of processors to be mapped to every memory. The
default is 2 on Origin 2000 systems. The default
is 4 on Origin 3000 systems.
SMA_DSM_TOPOLOGY (IRIX systems only)
Specifies the shape of the set of hardware nodes
on which the PE memories are allocated. Set this
variable to one of the following values:
Value Action
cube A group of memory nodes that form a
perfect hypercube. NPES/SMA_DSM_PPM
must be a power of 2. If a perfect
hypercube is unavailable, a less
restrictive placement will be used.
cube_fixed A group of memory nodes that form a
perfect hypercube.
NPES/SMA_DSM_PPM must be a power of
2. If a perfect hypercube is
unavailable, the placement will
fail, disabling NUMA placement.
cpucluster Any group of memory nodes. The
operating system attempts to place
the group members close to one
another, taking into account nodes
with disabled processors. (Default
for IRIX 6.5.11 and higher).
free Any group of memory nodes. The
operating system attempts to place
the group members close to one
another. (Default for IRIX 6.5.10
and earlier releases).
SMA_DSM_VERBOSE (IRIX systems only)
When set to any value, writes information about
process and memory placement to stderr.
Using a CPU List
On IRIX systems you can manually select CPUs to use for a SHMEM
application by setting the SMA_DSM_CPULIST shell variable. This is
treated as a comma- and/or hyphen-delimited ordered list, specifying a
mapping of SHMEM processes to CPUs. The shepherd process is not
included in this list.
Examples:
Value CPU Assignment
8,16,32 Place three SHMEM processes on CPUs 8, 16,
and 32.
32,16,8 Place SHMEM process rank zero on CPU 32,
rank one on CPU 16, and rank two on CPU 8.
8-15,32-39 Place the SHMEM processes 0 through 7 on CPUs
8 to 15. Place the SHMEM processes 8 through
15 on CPUs 32 to 39.
39-32,8-15 Place the SHMEM processes 0 through 7 on CPUs
39 to 32. Place the SHMEM processes 8
through 15 on CPUs 8 to 15.
Note that the process rank is the value returned by _my_pe(3I). CPUs
are associated with the cpunum values given in the hardware
graph (see hwgraph(4)).
The number of processors specified must equal the number of SHMEM
processes (excluding the shepherd process) that will be used. If an
error occurs in processing the CPU list, the default placement policy
is used.
Using dplace(1) on IRIX Systems
The environment variables described previously allow you to map SHMEM
processes and memories with hardware processors and nodes. The
dplace(1) command, which is available on Origin series systems, can
give you additional control over application placement.
Perform the following steps to use the dplace(1) command with SHMEM
programs:
* Create file placefile with these contents:
threads $NPES + 1
memories ($NPES +1)/2 in topology cube
distribute threads 1:$NPES across memories
* Execute your program with NPES set to the number of PEs. For
example, to run with 4 PEs, invoke your program this way:
env NPES=4 dplace -place placefile a.out
NOTES
Installing SHMEM
The SHMEM software is packaged with the Message Passing Toolkit (MPT)
software product. Installation instructions are in the release notes
(relnotes) that accompany the toolkit. On IRIX systems, type relnotes
mpt to view them. On SGI Altix 3000 systems, see the README.relnotes
file, which can be found by typing rpm -ql sgi-mpt | grep
README.relnotes.
Compiling SHMEM Programs
The SHMEM routines reside in libsma.so.
The following sample command lines compile programs that include SHMEM
routines:
* IRIX systems:
cc -64 c_program.c -lsma
CC -64 cplusplus_program.c -lsma
f90 -64 -LANG:recursive=on fortran_program.f -lsma
f77 -64 -LANG:recursive=on fortran_program.f -lsma
* IRIX systems with Fortran 90 version 7.2.1 available:
f90 -64 -LANG:recursive=on -auto_use shmem_interface
fortran_program.f -lsma
* SGI Altix 3000 systems:
cc c_program.c -lsma
f77 fortran_program.f -lsma
efc fortran_program.f -lsma
The shmem_interface module is intended for use only with the -auto_use
option. This module provides compile-time checking of interfaces.
The keyword=arg actual argument format is not supported for SHMEM
subroutines defined in the shmem_interface procedure interface module.
The IRIX N32 ABI, selected by the -n32 compiler option, is also
supported by SHMEM, but is recommended only for small process counts
and program memory sizes, due to the limitation in the size of virtual
addresses imposed by the N32 ABI. The use of the N64 ABI, selected by
the -64 compiler option, is recommended for most SHMEM programs.
Running SHMEM Programs
On IRIX systems, SHMEM programs are run with the NPES environment
variable set to the number of processes desired, as in the following
example:
env NPES=32 ./a.out
On SGI Altix 3000 systems, SHMEM is layered on MPI infrastructure.
Programs are started with an mpirun command, as in the following
examples:
mpirun -np 32 ./a.out
mpirun hostA, hostB -np 16 ./a.out
SHMEM Support for SGI Altix 3000
On Altix systems, the SHMEM API is supported for SHMEM programs that
run on a single host, as well as SHMEM programs that span multiple
partitions connected via NUMAlink. SHMEM functions can be used to
communicate with processes running on the same or different
partitions.
On Altix, SHMEM support is layered on MPI infrastructure. The MPI
memory mapping feature, which is enabled by default, is required for
SHMEM support on Altix. In addition, the xpmem kernel module must be
installed on the system to support SHMEM. The xpmem module is
released with the OS.
SHMEM programs on Altix are started with an mpirun command, which
determines the number of processing elements (PEs) to launch. The
SHMEM program can still call the start_pes routine to initialize the
PEs, but the actual number of PEs created is determined by the -np
option on the mpirun command line.
MPI Interoperability
SHMEM routines can be used in conjunction with MPI message passing
routines in the same application. Programs that use both MPI and
SHMEM should call MPI_Init and MPI_Finalize but omit the call to the
start_pes routine. SHMEM PE numbers are equal to the MPI rank within
the MPI_COMM_WORLD communicator. Note that this precludes the use of
SHMEM routines between processes in different MPI_COMM_WORLDs. MPI
processes started using the MPI_Comm_spawn function, for example,
cannot use SHMEM routines to communicate with their parent MPI
processes.
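A small mixed MPI/SHMEM sketch (the variable names are illustrative,
and at least two ranks are assumed) might look like this:
    #include <stdio.h>
    #include <mpi.h>
    #include <mpp/shmem.h>

    static long flag;    /* symmetric target */

    int main(int argc, char *argv[])
    {
        int rank, np;

        MPI_Init(&argc, &argv);   /* no start_pes() call in a mixed MPI/SHMEM program */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        /* The SHMEM PE number equals the MPI rank in MPI_COMM_WORLD. */
        if (rank == 0 && np > 1)
            shmem_long_p(&flag, 1L, 1);   /* put one long into PE (rank) 1 */

        shmem_barrier_all();
        if (rank == 1)
            printf("flag on rank %d is %ld\n", rank, flag);

        MPI_Finalize();
        return 0;
    }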
On IRIX clustered systems, or when an MPI job involves more than one
executable file, you can use SHMEM to communicate only with processes
running on the same host and running the same executable file. Use
the shmem_pe_accessible function to determine if a remote PE is
accessible via SHMEM communication from the local PE.
On Altix partitioned systems, when running with a single executable
file, you can use SHMEM to communicate with processes running on the
same or different partitions. Use the shmem_pe_accessible function to
determine if a remote PE is accessible via SHMEM communication from
the local PE.
When running an MPI application involving multiple executable files on
Altix, one can use SHMEM to communicate with processes running from
the same or different executable files, provided that the
communication is limited to symmetric data objects. It is important
to note that static memory, such as a Fortran common block or C global
variable, is symmetric between processes running from the same
executable file, but is not symmetric between processes running from
different executable files. Data allocated from the symmetric heap
(shmalloc or shpalloc) is symmetric across the same or different
executable files. Use the shmem_addr_accessible function to determine
if a local address is accessible via SHMEM communication from a remote
PE.
Note that on Altix, the shmem_pe_accessible function returns TRUE only
if the remote PE is a process running from the same executable file as
the local PE, indicating that full SHMEM support (static memory and
symmetric heap) is available.
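The two query routines can be combined as in the following sketch (the
data names and sizes are illustrative only):
    #include <mpp/shmem.h>

    static double data[64];   /* static data: symmetric only across PEs
                                 running the same executable file      */

    int main(void)
    {
        int pe;
        double *heap;

        start_pes(0);
        heap = (double *) shmalloc(64 * sizeof(double));  /* symmetric heap */

        for (pe = 0; pe < _num_pes(); pe++) {
            if (!shmem_pe_accessible(pe))
                continue;                    /* PE not reachable via SHMEM */

            /* Only target the static array if its address is usable in
               SHMEM operations involving this remote PE. */
            if (shmem_addr_accessible(data, pe))
                shmem_double_put(data, heap, 64, pe);
        }
        shmem_barrier_all();
        shfree(heap);
        return 0;
    }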
When using SHMEM within MPI, use the MPI memory placement environment
variables if non-default memory placement options are required.
SHMEM and Thread Safety
None of the SHMEM communication routines, including shmem_ptr, should
be considered thread safe. When used in a multithreaded
environment, the programmer should take steps to ensure that multiple
threads in a PE cannot simultaneously invoke SHMEM communication
routines.
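One simple way to do this is to serialize all SHMEM calls made within a
PE behind a single lock, as in the sketch below (a POSIX threads mutex
is used for illustration; any mutual exclusion mechanism would do):
    #include <pthread.h>
    #include <mpp/shmem.h>

    static long target;       /* symmetric */
    static pthread_mutex_t shmem_mtx = PTHREAD_MUTEX_INITIALIZER;

    /* Serialize SHMEM communication calls made by threads of one PE. */
    static void locked_long_put(long *dst, long *src, size_t n, int pe)
    {
        pthread_mutex_lock(&shmem_mtx);
        shmem_long_put(dst, src, n, pe);
        pthread_mutex_unlock(&shmem_mtx);
    }

    int main(void)
    {
        long value = 42;

        start_pes(0);
        if (_my_pe() == 0 && _num_pes() > 1)
            locked_long_put(&target, &value, 1, 1);
        shmem_barrier_all();
        return 0;
    }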
SHMEM and Cache Coherency
The SHMEM library was originally developed for systems whose memory
architectures provided only limited cache coherency. On those architectures,
it is at times necessary to handle cache coherency within the
application. This is not required on IRIX or Altix systems because
cache coherency is handled by the hardware.
The SHMEM cache management functions were retained for ease in porting
from these legacy platforms. However, their use is no longer
required.
Note that cache coherency does not imply memory ordering, particularly
with respect to put operations. In cases in which the ordering of put
operations is important, one must use either the memory ordering
functions shmem_fence or shmem_quiet, or one of the various barrier
functions.
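The classic case is a payload put followed by a flag put to the same
PE, as in this sketch (illustrative names; at least two PEs assumed):
    #include <mpp/shmem.h>

    static long data[128];    /* symmetric payload   */
    static long ready;        /* symmetric flag word */

    int main(void)
    {
        long payload[128];
        long one = 1;
        int  i;

        start_pes(0);
        for (i = 0; i < 128; i++)
            payload[i] = i;

        if (_my_pe() == 0 && _num_pes() > 1) {
            shmem_long_put(data, payload, 128, 1); /* deliver the payload        */
            shmem_fence();                         /* order the two puts to PE 1 */
            shmem_long_put(&ready, &one, 1, 1);    /* only then raise the flag   */
        }
        if (_my_pe() == 1)
            shmem_long_wait(&ready, 0);            /* wait until ready changes from 0 */
        shmem_barrier_all();
        return 0;
    }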
SHMEM Program Start-up and Memory Usage
On IRIX, starting with SHMEM 4.0 (distributed as part of MPT 1.7)
substantial changes have been made to the procedures for making static
memory remotely accessible (static cross mapping) and for managing the
symmetric heap. These changes impact the way in which SHMEM
applications interact with IRIX.
To reduce application startup times, certain procedures that were
previously done at job startup time are now deferred until required by
the application's SHMEM communication requests. Thus, the first
invocation of a communication request for a particular PE might be
relatively slow compared to subsequent requests associated with this
PE. If this behavior is undesirable, the SMA_STATIC_PREATTACH
environment variable can be set. If the symmetric heap is employed by
the application, the SMA_SYMMETRIC_PREATTACH environment variable can
also be set.
A second consequence of delaying these procedures until required by
the application is the apparent resident set size of a PE. This is
related to the manner in which IRIX accounts for memory usage
associated with system V shared memory objects. When page table
sharing is enabled for the object, a process attaching to the object
is charged only for a fraction of the memory actually associated with
the object. However, if page tables are not shared, the process'
resident set size increases by the number of pages currently
associated with the shared memory object, regardless of whether this
process has accessed these pages. If no pages have yet been faulted
in at the time of attachment, there is no significant increase in the
resident set size of the attaching process.
As a consequence of this accounting procedure, a PE might appear to
have a very large resident set size when the following conditions
occur:
* Data in the static region of the application is used as the target
of a shmem communication routine.
* The target PE has initialized a substantial portion of its static
region.
* Because of size or alignment constraints (for example, the static
region is smaller than 32 MB), the static region cannot use shared
page tables.
* A many-to-many or all-to-all communication pattern is used.
Although the resident set size for each PE can grow to be very large
when all four of these conditions are met, this should generally not
be a problem. If for some reason, this apparent large resident set
size is undesirable, the SMA_STATIC_PREATTACH environment variable can
be set. However, this might substantially increase job startup time.
Older versions of SHMEM used large mapped files to render static memory
remotely accessible on IRIX systems. This is also the case for this
version of SHMEM if the SMA_STATIC_METHOD environment variable is set
to mmap. The result is that enough file space must be available in
/var/tmp to accommodate a file of size npes * staticsz, where npes is
the number of PEs and staticsz is the size of the program's static
data area. Static data includes Fortran common blocks and C/C++
static data.
If a SHMEM program's memory requirements exceed available file space
in /var/tmp, a SHMEM run-time error message is generated. You can use
the TMPDIR environment variable to select a directory in a file system
with sufficient file space.
To minimize SHMEM program start-up time, use symmetric memory
allocated by the SHPALLOC(3F) or shmalloc(3C) routines instead of
static memory. Memory allocated by these routines does not require a
corresponding file space allocation in /var/tmp. This avoids problems
when file space is low. Starting with IRIX 6.5.18, using symmetric
heap memory is also preferred as shared page tables can be used. This
reduces kernel memory requirements and avoids some of the potential
problems with PE resident set size discussed above.
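For example, replacing a large static communication buffer with a
symmetric heap allocation might look like this (a sketch only; the
buffer size is arbitrary):
    #include <mpp/shmem.h>

    int main(void)
    {
        long *buf;

        start_pes(0);

        /* A symmetric heap allocation is remotely accessible but needs
           no /var/tmp file space or static cross mapping at startup. */
        buf = (long *) shmalloc(1024 * sizeof(long));

        /* ... use buf as the source or target of SHMEM transfers ... */

        shfree(buf);
        return 0;
    }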
An additional consequence of techniques used by SHMEM to render static
memory remotely accessible relates to logical swap requirements for
SHMEM jobs. With the current static cross mapping procedure, the
logical swap reservation required for an NPES SHMEM job is
approximately 2 * npes * staticsz where staticsz is the size of the
program's static data area. Additional logical swap reservation might
be required for the symmetric heap.
EXAMPLES
Example 1. The following Fortran SHMEM program directs all PEs to sum
simultaneously the numbers in the VALUES variable across all PEs:
      PROGRAM REDUCTION
      REAL VALUES, SUM
      COMMON /C/ VALUES
      REAL WORK
      INTEGER MY_PE, NUM_PES      ! PE query functions return INTEGER

      CALL START_PES(0)
      VALUES = MY_PE()
      CALL SHMEM_BARRIER_ALL      ! Synchronize all PEs
      SUM = 0.0
      DO I = 0, NUM_PES()-1
        CALL SHMEM_REAL_GET(WORK, VALUES, 1, I)   ! Get next value
        SUM = SUM + WORK                          ! Sum it
      ENDDO
      PRINT *, 'PE ', MY_PE(), ' COMPUTED SUM=', SUM
      CALL SHMEM_BARRIER_ALL
      END
Example 2. The following C SHMEM program transfers an array of 10
longs from PE 0 to PE 1:
#include <stdio.h>
#include <mpp/shmem.h>

int main(void)
{
    long source[10] = { 1, 2, 3, 4, 5,
                        6, 7, 8, 9, 10 };
    static long target[10];

    start_pes(0);
    if (_my_pe() == 0) {
        /* put 10 elements into target on PE 1 */
        shmem_long_put(target, source, 10, 1);
    }
    shmem_barrier_all();    /* sync sender and receiver */
    if (_my_pe() == 1)
        printf("target[0] on PE %d is %ld\n", _my_pe(), target[0]);
    return 0;
}
SEE ALSO
dplace(1)
The following man pages also contain information on SHMEM routines.
See the specific man pages for implementation information.
shmem_add(3), shmem_and(3), shmem_barrier(3), shmem_barrier_all(3),
shmem_broadcast(3), shmem_cache(3), shmem_collect(3), shmem_cswap(3),
shmem_fadd(3), shmem_fence(3), shmem_finc(3), shmem_get(3),
shmem_iget(3), shmem_inc(3), shmem_iput(3), shmem_lock(3),
shmem_max(3), shmem_min(3), shmem_my_pe(3), shmem_or(3),
shmem_prod(3), shmem_put(3), shmem_quiet(3), shmem_short_g(3),
shmem_short_p(3), shmem_sum(3), shmem_swap(3), shmem_wait(3),
shmem_xor(3), shmem_pe_accessible(3), shmem_addr_accessible(3),
start_pes(3), shmalloc(3C), shpalloc(3F), MY_PE(3I), NUM_PES(3I)
For information on using SHMEM routines with message passing routines,
see the Message Passing Toolkit: MPI Programmer's Manual.