Study of ELF loading and relocs

Modified Aug 3/99
Pat Beirne

Copyright Pat Beirne 1999
All rights reserved


In the old days, applications were built by compiling many .c files into .o files. These files often had inter-related references that weren't resolved at compile time. The information on these references is stored within the .o files in a reloc (relocation) object.

Later, at link time, the linker would merge all the .o files, building a table of where symbols are ultimately located. Then the linker would run through the set of relocs, filling them in.

A reloc consists of three parts:
    where in the file the fixup is to be made
    the symbol involved in the fixup
    the algorithm used to make the fixup

The most interesting part of this paper is the last element. The algorithm can be as simple as "use the symbol memory location; store it in binary" (R_386_32). Or it may be more complicated, such as "calculate the distance from here to the symbol, divide by 4, subtract 2 and add the result to the 3 lower bytes" (R_ARM_PC26).

These relocs are scattered through the .o files, and are used at link time to create the correct binary executable file. Once all the relocs are resolved, the linker has pretty well done its job.

At least this is the way things used to work, in the days of static linking.

With the introduction of run-time linking, the designers of the ELF format decided that relocs are a suitable entity to hold run-time resolution information. So now we have executable files which still have relocs in them, even after linking.

However, new algorithms are required to signal how these fixups are to be done. Hence the introduction of a new family of reloc numbers (i.e. algorithms).

Relocs and Memory Space

What is a reloc? Binary executables often need certain bits of information fixed up before they execute. ELF binaries carry a list of relocs which describe these fixups. Each reloc contains:
    the address in the binary that is to get the fixup (offset)
    the algorithm to calculate the fixup (type)
    a symbol (string and object len)
At fixup time, the algorithm uses the offset and symbol, along with the value currently in the file, to calculate a new value to be deposited into memory. See Appendix B.

One of the targets of the ELF binary system is a separation of code and data. The code of apps and libraries is marked read-only and executable. The data is marked read-write, and not-executable.

The code is read-only so that multiple processes can use the code, having loaded the code into memory only once. Each process has its own page tables, mapping the code into its own memory. The code is never modified, and appears identical in each process space. Naturally, the code must be position independent; each process can load the app into a different address.

The code segment is allowed to contain constant pointers and strings (.rodata).

The data segment is read-write and is mapped into each process space differently. [In Linux, each data segment is loaded from the same base mmap, but it is marked copy-on-write; after the first write, each process has its own copy of the data.] Therefore, run-time relocs can only modify locations in the data segment.

This half-and-half nature of ELF binaries leads us to an interesting design point. Some of the relocs that we wish to make are in the data segment. These are easy to do: we can add relative offsets, or write absolute addresses with no problem. But the fixups in the code area are more difficult. The ELF reloc design forces us to make the code relocs "bounce off" an entry in the data area, known as the GOT (global offset table).

In other words, if code needs to refer to a global object, it instead refers to an entry in the GOT[], and at run-time, the GOT entry is fixed-up to point to the intended data. In this manner, the code space need never be fixed-up at run time. If the code needs to refer to a local object, it refers to it "relative to the &GOT[0]"; this too is position independent. NOTE 1

If the code needs to jump to a subroutine in a different module, the linker creates an array of jump-stubs, called the PLT (procedure linkage table). These jump-stubs jump indirect, using an entry in the GOT[] to implement the far call.

Finally, ELF implements run time linking by deferring function resolution until the function is called. This means that calls to library functions go through a fixup process the first time that they are called.

The rest of this paper explains the operation of these concepts.

NOTE 1: Relative (GOTOFF) code is made "relative to the start of the GOT table". Instead, it could have been made "relative to the load address of the module", which would have been cleaner in my opinion. But there are reasons that some architectures chose the former, so we'll stick with it.

Reloc Design

Relocs are used in many places in the design cycle:
  1. in .o files intended for executables
  2. in .o files intended for shared libraries
  3. in executables
  4. in shared libraries (.so files)
1) Object files need to be able to reference external symbols. In modern architectures, we can usually get away with:
1-i) relative, from "here" to a symbol (R_*_PC32), used for branches
1-ii) absolute, to a symbol (R_*_32) NOTE 2

2) Object files which are going to be part of a library are a little different. For one thing, they must be compiled as PIC code, using the -fpic flag. Next, there must be a distinction between local data/functions and global data/functions. Finally, relocs in the code/.rodata sections must use GOT-based relocs, because the code/.rodata area of the final library file cannot be modified at run time. The relocs are:

in code:
2-i) reference to local symbol: use the relative distance from the GOT to the local symbol (R_*_GOTOFF); these relocs can exist in the code area, because they will be fully resolved at link time
2-ii) reference to a global symbol: create an entry in the GOT and let the run-time system deposit the symbol's address into the GOT for us (R_*_GOT32)
2-iii) In addition, relative calls to subroutines (R_*_PC32) can be used.

in data:
2-iv) reference to symbol (R_*_32) [NOTE: symbols which are global have a reloc that references the symbol by name; symbols which are local can have a reloc that simply references the section number, and have a section-offset contained in the reloc. See NOTE 2]

3) Executables need to be able to refer to global data (such as errno) as if there is only one copy. ELF systems do this by copying global symbols down into the application .bss space. Then the executable and all the libraries point to this single copy. To realize this, we need relocs:
3-i) reach into a library to a symbol and copy down the data into our own .bss space (R_*_COPY)
3-ii) pointer to global data (R_*_GLOB_DAT)
3-iii) pointer to library function (R_*_JMP_SLOT)
Notice that all of these relocs must modify only the data section of the executable; the code section is read-only! All the relocs from the .o file have either been resolved, or mutated into one of the above 3.

4) Shared libraries are the most complex. By the time the library is linked, all the R_*_GOTOFF relocs (from the .o files) are resolved.
4-i) All the R_*_GOT32 relocs are resolved, pointing at GOT entries. At link time, these GOT entries get relocs of their own, pointing to the global data/function. (R_*_GLOB_DAT/R_*_JMP_SLOT respectively).
4-ii) There will be times when local data structures need to hold absolute pointers to local data. Put the module-relative address of the symbol in the library; at run-time, add the module-load address to it (R_*_RELATIVE)

Again, notice that all of these relocs must modify only the data section of the library; the code section is read-only!

When the linker creates 3) and 4) above, the linker actually creates code and data that was not explicit in the .o files. There is a .plt section created in the code segment, which is an array of function stubs used to handle the run-time resolution of library calls. There is a .got section created in the data segment, which holds pointers to global symbols. Both of these synthetic sections are "helpers" to the code segment, since the code segment cannot be modified at run-time.

To make all this happen, the object files must contain information about whether a symbol is global or local, function or data, and the object size. (The old a.out scheme did not require all this extra info.)

NOTE 2 At this point, I'll mention that global relocs must necessarily involve all three aspects of a reloc: the target offset, the type (algorithm), and the symbol name.

However, if the symbol is local, and can be fixed in memory with respect to a memory "section", the object file is allowed to drop the symbol name, and replace it with a section-plus-offset.

For instance, in this ARM code

        .section .text
         mov     r0, r0     @sample code 
.L2:     bl      _do_something
         ldr     r6, .L3     @this code needs a reloc!
         mov     r0, r0 
.L4:    .word    Lextern 
.L3:    .word   .L2       @this read-only data needs a reloc

The code on the 3rd line (the call) needs to be fixed up, but that's easy, since it's a PC relative fixup.

If the .o file has no idea where Lextern is, it must necessarily create a reloc which refers to the symbol Lextern.

.L4     .word   0         R_ARM_32        Lextern

The word at .L3 needs a fixup as well. If the .o file can determine the location of a local symbol, such as .L2, then it is allowed to replace the symbol with a section-plus-offset. The offset is stored in the reloc target address, and the section is an entry in the reloc symbol table.

.L3     .word   4         R_ARM_32        .text
This reduces the number of symbols in the symbol table, making run-time linking easier.
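
The section-plus-offset fixup above can be sketched in C. This is only an illustration, assuming a Rel-style reloc where the offset (4 in the example) is already sitting in the target word, and assuming the host has the same byte order as the target. The function name and the flat image buffer are hypothetical, not part of any real linker.

```c
#include <stdint.h>
#include <string.h>

/*
 * Resolve an absolute 32-bit reloc (R_ARM_32 / R_386_32 style) whose
 * "symbol" is a section. The offset into the section is already stored
 * in the target word, so the fixup is "add the section address to what
 * is there". All names here are hypothetical.
 */
void fixup_section_plus_offset(uint8_t *image, uint32_t r_offset,
                               uint32_t section_addr)
{
    uint32_t word;
    memcpy(&word, image + r_offset, sizeof word);   /* offset stored in place */
    word += section_addr;                           /* final address of the local symbol */
    memcpy(image + r_offset, &word, sizeof word);
}
```

If .text ends up at 0x8000, the word at .L3 (which held 4) becomes 0x8004, the address of .L2.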

NOTE 3 Notice that the R_*_GOTOFF and R_*_GOT32 relocs include an offset from &GOT[0], which is usually about halfway through the module. The R_ARM_RELATIVE relocs, on the other hand, contain an offset from the beginning of the module. Why? Tradition.

Jump Tables

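As much as possible, ELF dynamic linking defers the resolution of jump/call addresses until the last minute. The technique is shaped by the constraints described earlier: the code segment is read-only and position independent, so only the data segment can be fixed up at run time. Within those constraints, there are three steps involved in a typical far jump: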
  1. start in the code
  2. go through the PLT
  3. using a pointer from the GOT
[Before explaining how this works, I should tell you that the GOT entries that are used for PLT execution are preloaded to default addresses.]

1) In the code:

        call    function_call_n
This is typical code using the relative jump or call. The target is an entry in the PLT. Note that this call is identical to a normal call.

2) In the PLT: The PLT is a synthetic area, created by the linker. It exists in both executables and libraries. It is an array of stubs, one per imported function call.

On i386 architecture, this code looks like:

PLT[n+1]: jmp    *GOT[n+3]
          push   #n        @push n as a signal to the resolver
          jmp    PLT[0]
A subroutine call to PLT[n+1] will result in an indirect jump through GOT[n+3]. When first invoked, GOT[n+3] points back to PLT[n+1]+6, which is the push/jmp sequence. Going through PLT[0], the resolver uses the argument on the stack to determine 'n' and resolves the symbol 'n'. The resolver code then repairs GOT[n+3] to point directly at the target subroutine, so later calls skip the resolver.

The first PLT entry is slightly different, and is used to form a trampoline to the fixup code.

PLT[0]: push    GOT[1]     @identifies the calling module
        jmp     *GOT[2]    @points to resolver()
Flow is directed to the resolver routine. 'n' is already on the stack, and the contents of GOT[1] get pushed on top of it. This way the resolver (located in the dyn-loader library) can determine which library is asking for its service.
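
The lazy fixup can be modelled in plain C with a function pointer standing in for the GOT slot. This is a toy sketch, not how the dyn-loader is written: every name in it is hypothetical, the symbol lookup is reduced to a hard-wired assignment, and there is only one imported function.

```c
/* Toy model of one lazily bound import; all names are hypothetical. */
typedef int (*func_t)(int);

int resolver_calls = 0;                    /* counts trips through the slow path */

int real_target(int x) { return x + 1; }   /* stands in for the library function */

int lazy_stub(int x);                      /* plays the push/jmp PLT[0] path */

/* The GOT slot for the import, preloaded to point back at the stub. */
func_t got_slot = lazy_stub;

/* Resolver path: "look up" the symbol, repair the GOT slot, finish the call. */
int lazy_stub(int x)
{
    resolver_calls++;
    got_slot = real_target;                /* the repair of GOT[n+3] */
    return got_slot(x);
}

/* PLT[n+1]: always an indirect jump through the GOT slot. */
int plt_call(int x)
{
    return got_slot(x);
}
```

The first plt_call() goes through lazy_stub() and patches got_slot; every later call lands directly in real_target(), and the calling code never changes.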

3) In the GOT: The GOT (global offset table) contains helper pointers for both PLT fixups and GOT fixup. The first 3 entries are special/reserved. The next M entries belong to the PLT fixups. The next D entries belong to various data fixups.

The GOT is a synthetic area, created by the linker. It exists in both executables and libraries.

When the GOT is first set up, all the GOT entries relating to PLT fixups point back into the PLT, at the push/jmp half of their own stub, which in turn leads to PLT[0].

The special entries in the GOT are
       GOT[0] = address of this module's dynamic section
       GOT[1] = pointer used by the dyn-loader to identify this module (an entry in its linked list of loaded modules)
       GOT[2] = pointer to the fixup/resolver code, located in the dyn-loader library
followed by
        GOT[3] .... GOT[3+M-1] = indirect function call helpers, one per imported function
        GOT[3+M] ...... GOT[end] = indirect pointers for global data references, one per imported global

Remember that each library and executable gets its own PLT and GOT array.
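
The index arithmetic of the PLT and GOT layout can be written down as a few C helpers. This is a sketch for i386 only. The 16-byte stub size and the 4-byte GOT slot size are assumptions that agree with the spacing seen in the Appendix B listing, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed i386 layout constants; see the lead-in for where they come from. */
enum {
    GOT_RESERVED    = 3,    /* GOT[0..2] are special */
    GOT_SLOT_SIZE   = 4,    /* one dword per GOT entry */
    PLT_ENTRY_SIZE  = 16,   /* jmp (6) + push (5) + jmp (5) bytes */
    PLT_PUSH_OFFSET = 6     /* the push follows the 6-byte indirect jmp */
};

/* GOT index used by the stub of imported function n (n counts from 0). */
uint32_t got_index_for_import(uint32_t n) { return GOT_RESERVED + n; }

/* Address of that GOT slot. */
uint32_t got_slot_addr(uint32_t got_base, uint32_t n)
{
    return got_base + GOT_SLOT_SIZE * got_index_for_import(n);
}

/* Address of PLT[n+1], the stub of imported function n. */
uint32_t plt_stub_addr(uint32_t plt_base, uint32_t n)
{
    return plt_base + PLT_ENTRY_SIZE * (n + 1);
}

/* Value preloaded into the GOT slot: the push/jmp half of the stub. */
uint32_t got_initial_value(uint32_t plt_base, uint32_t n)
{
    return plt_stub_addr(plt_base, n) + PLT_PUSH_OFFSET;
}
```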

Memory & Load Addresses

In a typical Linux system, the addresses 0-bfff.ffff (3 gigs) are available for the user program space.

Executable binary files include header information that indicates a load address. Libraries, because they are position-independent, don't need a load address, but contain a 0 in this field.

Start      Len   Usage
0          4k    zero page
0000.1000  128M  not used
0800.0000  896M  app code/data space,
                 followed by small-malloc() space
4000.0000  1G    mmap space:
                 library load space,
                 large-malloc() space
8000.0000  1G    stack space,
                 working back from BFFF.FFE0

The kernel has a preferred location for mmap data objects, at 0x4000.0000. Since the libraries are loaded by mmap, they end up here.

The library that most of us are using for malloc (GLIBC) handles small mallocs by calling sys_brk(), which extends the data area after the app, at 0x0800.0000+sizeof(app). Large mallocs are realized by creating a mmap, so these end up in the pool at 0x4000.0000.

As the mmap pool grows upward, the stack grows downward. Between them, they share 2G bytes.

The shared library design usually has the app loading first, then the loader notices that it needs support, and loads the dyn-loader library (usually from /lib/) at 0x4000.0000. Other libraries are loaded after it. You can see where libraries will load by using the utility ldd
        ldd foo_app

There is a diagnostic case where the app is invoked by
        /lib/ foo_app foo_arg ....
In this case, the dyn-loader is loaded as an app. Since it was built as a library, it tries to load at 0. [In ArmLinux, this is forbidden, so the kernel pushes it up to 0x1000.] Once it loads, it reads its argv[1] and loads the foo_app at its preferred location (0x0800.0000). Other libraries are loaded up in the mmap area. So, in this case, the user memory map appears as
Start      Len   Usage
0          128M  dyn-loader,
                 followed by small-malloc() space
0800.0000  896M  app code/data space
4000.0000  1G    mmap space:
                 lib space,
                 large-malloc() space
8000.0000  1G    stack space,
                 working backward from BFFF.FFE0

Notice that the small malloc space is much smaller in this case, but this is supposed to be for load-testing and diagnostics, so it's not too bad.

Please, if you need more text, let me know.

Appendix A: Relocs in i386

Here is some analysis of the i386 design:


in .o files; these are the old relocs......
Reloc Meaning
R_386_32 simply deposit the absolute memory address of "symbol" into a dword
R_386_PC32 determine the distance from this memory location to the "symbol", then add it to the value currently at this dword; deposit the result back into the dword
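
A minimal C sketch of these two fixups follows. It assumes the usual Rel convention that the value already in the dword acts as an addend, and that the host byte order matches the i386 target. The function names and the flat image buffer are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Unaligned-safe dword access into the image buffer. */
static uint32_t read32(const uint8_t *p) { uint32_t v; memcpy(&v, p, 4); return v; }
static void write32(uint8_t *p, uint32_t v) { memcpy(p, &v, 4); }

/* R_386_32: symbol address plus whatever is already stored in the dword. */
void apply_r_386_32(uint8_t *image, uint32_t r_offset, uint32_t sym_addr)
{
    uint8_t *target = image + r_offset;
    write32(target, sym_addr + read32(target));
}

/* R_386_PC32: distance from the dword itself to the symbol, plus the
 * value already stored there. image_addr is the address the image will
 * occupy at run time, so image_addr + r_offset is "here". */
void apply_r_386_pc32(uint8_t *image, uint32_t image_addr,
                      uint32_t r_offset, uint32_t sym_addr)
{
    uint8_t *target = image + r_offset;
    uint32_t here = image_addr + r_offset;
    write32(target, sym_addr + read32(target) - here);
}
```

For a call instruction the stored value is typically -4, because the processor measures the branch from the end of the 4-byte operand rather than from its start.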

These four were introduced with dynamic libraries; they are found only in .o files which are going to be part of a library (PIC code):
R_386_GOT32 this reloc is going to persist through the link stage; the linker should mutate this into a R_386_GLOB_DAT in the library
R_386_GOTPC determine the distance from here to the _GLOBAL_OFFSET_TABLE_ (&GOT[0]) and deposit the difference as a dword into this location (does not involve any other symbol); used in the function prolog to calculate &GOT[0]
R_386_GOTOFF determine the distance from the _GLOBAL_OFFSET_TABLE_ to the (local) "symbol"; store said distance in the dword at this location; this is fully resolved at link time, so no reloc remains in the library
R_386_PLT32 create a new entry in the PLT[] and GOT[]; determine the distance from here to the PLT[] entry and store that distance as a dword at this location; at final link, rename the reloc to a R_386_JMP_SLOT, keeping the same "symbol" and point it at the GOT[] entry
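
The values the static linker deposits for these four relocs can be written as one-line C functions. The formulas are the ones I understand the System V i386 ABI supplement to give; the function and parameter names are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/*
 * Notation (all 32-bit addresses or offsets):
 *   S   = address of the symbol
 *   A   = addend, the value already stored in the dword
 *   P   = address of the dword being fixed up ("here")
 *   GOT = &GOT[0]
 *   G   = byte offset of the symbol's slot within the GOT
 *   L   = address of the symbol's PLT stub
 */
uint32_t calc_r_386_got32(uint32_t G, uint32_t A)                { return G + A; }
uint32_t calc_r_386_gotpc(uint32_t GOT, uint32_t A, uint32_t P)  { return GOT + A - P; }
uint32_t calc_r_386_gotoff(uint32_t S, uint32_t A, uint32_t GOT) { return S + A - GOT; }
uint32_t calc_r_386_plt32(uint32_t L, uint32_t A, uint32_t P)    { return L + A - P; }
```

None of the four results depends on where the module is finally loaded: each is either a difference of two addresses in the same module or an offset within the GOT. That is why they can live in the read-only code segment.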

Executable files that are built "static" have no relocs in them. They run standalone.

In executable files which are intended to run with shared libraries......
R_386_JMP_SLOT at dynamic link time, deposit the address of "symbol" (a subroutine) into this dword 
R_386_COPY read a string of bytes from the "symbol" address and deposit a copy into this location; the "symbol" object has an intrinsic length
i.e. move initialized data from a library down into the app data space

Dynamic library files also have R_386_JMP_SLOT relocs, plus
R_386_GLOB_DAT at load time, deposit the address of "symbol" into this dword; the "symbol" is in another module
this reloc is, in a sense, the complement of the R_386_COPY above
R_386_RELATIVE at dynamic link time, read the dword at this location, add it to the run-time start address of this module; deposit the result back into this dword
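
The run-time fixups described above are small enough to sketch in C. This is an illustration only: the module is modelled as a byte buffer, the symbol lookup is assumed to have happened already, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* R_386_RELATIVE: read the dword, add the module's run-time start
 * address, deposit the result back. No symbol is involved. */
void apply_r_386_relative(uint8_t *module, uint32_t load_addr, uint32_t r_offset)
{
    uint32_t v;
    memcpy(&v, module + r_offset, sizeof v);
    v += load_addr;
    memcpy(module + r_offset, &v, sizeof v);
}

/* R_386_GLOB_DAT and R_386_JMP_SLOT: deposit the resolved address of the
 * symbol into the dword (a GOT entry); the old contents are discarded. */
void apply_r_386_glob_dat(uint8_t *module, uint32_t r_offset, uint32_t sym_addr)
{
    memcpy(module + r_offset, &sym_addr, sizeof sym_addr);
}
```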

Note that R_386_32 relocs can appear in libraries as well. These must be executed carefully!

R_386_COPY and R_386_GLOB_DAT can be considered complements of each other. Suppose you have a global data object defined in a dynamic library. The library will have the binary version of the object in its .data space. When the application is built, the linker puts a R_386_COPY reloc in there to copy the data down to the application's .bss space. In turn, the library never references the original global object; it references the copy that is in the application data space, through a corresponding R_386_GLOB_DAT. Weird, huh? After loading and copying, the original data (from the library) is never used; only the copy (in the app data space).
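
The copy-down arrangement can be mimicked in a single C file. This is a toy model, not loader code: the three variables are hypothetical stand-ins for the library's .data object, the application's .bss slot and the library's GOT entry.

```c
#include <string.h>

/* Hypothetical stand-ins for the two modules. */
char lib_cPub = 'x';      /* initialized object in the library's .data  */
char app_cPub;            /* space reserved in the application's .bss   */
char *lib_got_cPub;       /* the library's GOT entry for cPub           */

/* What the dyn-linker does at load time, in miniature. */
void load_time_fixups(void)
{
    memcpy(&app_cPub, &lib_cPub, sizeof app_cPub);  /* R_386_COPY in the app */
    lib_got_cPub = &app_cPub;                       /* R_386_GLOB_DAT in the library */
}

/* Library code reaches cPub through its GOT entry, so it sees the app's copy. */
char lib_read_cPub(void)
{
    return *lib_got_cPub;
}
```

After load_time_fixups(), a write to app_cPub by the application is visible to lib_read_cPub(), while lib_cPub itself is never touched again.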

To make the whole dynamic linking operation happen, the linker introduces several "synthetic" constructs into the target when you build an app or a library:
.got == &GOT[0]          Global Offset Table: a small section of data memory where run-time fixups are made; there is only one of these per-app or per-library
_GLOBAL_OFFSET_TABLE_    a pointer to the .got
.plt == &PLT[0]          Procedure Linkage Table: a small section of code memory which helps the run-time resolution work properly

The compiler can signal to the assembler that it wants to trigger one of the above constructs by:
implicit func   i386 syntax         ARM syntax
.got pointer    var@GOT(%ebx)       var(GOT)
.got data       var@GOTOFF(%ebx)    var(GOTOFF)
.plt jump       func@PLT            func(PLT)

Note that the C/C++ programmer does not allocate this memory; it is created by, and used by the linker.

To make the job of the linker a bit easier, the relocs are clustered together in the app-file or the library-file.
.rel.bss section contains all the R_386_COPY relocs
.rel.plt section contains all the R_386_JMP_SLOT relocs; these modify the first half of the GOT elements
a further section contains all the R_386_GLOB_DAT relocs; these modify the second half of the GOT elements
a final section contains all the R_386_32 and R_386_RELATIVE relocs

Appendix B: Typical reloc list

Relocs are usually of type Rel:
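#include <stdint.h>
struct Rel {
    uint32_t r_offset;   /* address of the fixup target */
    uint32_t r_info;     /* upper 24 bits: symbol index; low 8 bits: reloc type */
};
/* the reloc type is one of the R_*_ numbers defined for the architecture */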

Here is an excerpt from the reloc list for my version of /usr/bin/dir

Relocation section '' at offset 0xb6c contains 1 entries:
  Offset    Info  Type            Symbol's Value  Symbol's Name
  08054748  00106 R_386_GLOB_DAT        00000000  __gmon_start__
Relocation section '.rel.bss' at offset 0xb74 contains 8 entries:
  Offset    Info  Type            Symbol's Value  Symbol's Name
  08054800  04405 R_386_COPY            08054800  __ctype_tolower
  08054804  00605 R_386_COPY            08054804  stdout                   
  08054808  03505 R_386_COPY            08054808  stderr                   
  0805480c  01905 R_386_COPY            0805480c  __ctype_toupper          
  08054810  01105 R_386_COPY            08054810  _nl_msg_cat_cntr         
  08054814  00905 R_386_COPY            08054814  __ctype_b                
  08054818  01405 R_386_COPY            08054818  optarg                   
  0805481c  02205 R_386_COPY            0805481c  optind
Relocation section '.rel.plt' at offset 0xbb4 contains 58 entries:
  Offset    Info  Type            Symbol's Value  Symbol's Name
  08054660  00e07 R_386_JUMP_SLOT       08048dc4  readlink                 
  08054664  03c07 R_386_JUMP_SLOT       08048dd4  getgrnam                 
  08054668  02407 R_386_JUMP_SLOT       08048de4  ferror                   
  0805466c  04107 R_386_JUMP_SLOT       08048df4  strchr                   
  08054670  01007 R_386_JUMP_SLOT       08048e04  __overflow               
  08054674  04507 R_386_JUMP_SLOT       08048e14  __register_frame_info    
  08054678  01f07 R_386_JUMP_SLOT       08048e24  _obstack_begin           
  0805467c  02b07 R_386_JUMP_SLOT       08048e34  fnmatch                  
  08054680  02907 R_386_JUMP_SLOT       08048e44  localtime                
  08054684  02f07 R_386_JUMP_SLOT       08048e54  strcmp
When the reloc algorithm is invoked, it has direct access to:
    the reloc target location
    the symbol name
and, indirectly, it has access to:
    the data currently at the target location (sometimes called the addend)
    the object length, through the symbol description
The algorithm must make all its calculations based on these 4 pieces of data.

Some architectures, like the M68k, use a different reloc, called Rela, which has one extra parameter, called an addend. This makes the relocs 12 bytes each, instead of 8. The Rel is just as flexible (in my opinion :)
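
The Info column in the listing above packs the symbol index and the reloc type into one word. A short C sketch of the unpacking follows; the helper names are hypothetical, but the split mirrors the ELF32_R_SYM and ELF32_R_TYPE macros found in <elf.h>, and the enum values are the i386 reloc numbers that also appear in the listings in this paper.

```c
#include <stdint.h>

/* i386 reloc type numbers. */
enum {
    R_386_NONE = 0, R_386_32 = 1, R_386_PC32 = 2, R_386_GOT32 = 3,
    R_386_PLT32 = 4, R_386_COPY = 5, R_386_GLOB_DAT = 6,
    R_386_JMP_SLOT = 7, R_386_RELATIVE = 8, R_386_GOTOFF = 9,
    R_386_GOTPC = 10
};

/* Upper 24 bits of Info: index into the symbol table. */
uint32_t rel_sym_index(uint32_t r_info) { return r_info >> 8; }

/* Low 8 bits of Info: the reloc type, i.e. the algorithm. */
uint32_t rel_type(uint32_t r_info) { return r_info & 0xffu; }
```

For example, the Info value 04405 shown for __ctype_tolower splits into symbol index 0x44 and type 5, which is R_386_COPY.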

Example code

This appendix will show C code which triggers the new relocs.
Suppose we have this library code:

typedef struct {
    char* p;
    char (*f)(int);
} _st;

char fPub(int a) {return 'a';}
static char fLocal(int a) {return 'b';}

static char cLocal;
char cPub;

_st a[] = { {&cLocal,   // 1
    fLocal},            // 2
    {&cPub,             // 3
    fPub} };            // 4

int foo(int a) {        // 5
    return fPub(a)      // 6
        + fLocal(a)     // 7
        + (int) &cPub   // 8
        + cPub          // 9
        + (int) &cLocal // 10
        + cLocal;       // 11
}
When the compiler builds the .o files, lines 1 and 2 are marked as needing a full 32 bit address; R_386_32 relocs are generated. But the address can be determined locally, so the symbols can be dropped and offsets used instead. Lines 3 & 4 will also generate a R_386_32 reloc, requesting a full absolute address, to be associated with the symbols "cPub" and "fPub".

Line 5 publishes a function foo() as a public symbol. Since it can be called from outside, and it needs to be position independent (-fpic), it needs to generate a local reference to the GOT. Early in the prolog in foo(), the compiler will generate something like:
    mov     &GOT[0], %ebx
so that the rest of the subroutine has a reference to &GOT[0] for further processing. Note that line 5 requires &GOT[0], which itself requires a reloc: R_386_GOTPC, meaning "the distance from here to the GOT". This extra reloc is the overhead of each public function that is compiled -fpic.

Line 6 will trigger a R_386_PLT32 reloc, using the symbol "fPub". Line 7 also generates the same reloc, against the symbol "fLocal". (This reloc will disappear at the final link.)

Line 8 requires the address of a public object. This object location has to be flexible at run time, so a R_386_GOT32 reloc is used. Later, at link time, this will create an address slot in the GOT[].

Line 9 requires the "contents of" the same object. The object contents are fetched by using the address contained in the GOT[] entry. [Note that compiler is smart enough to use a single reloc to realize both line 8 and line 9]

Lines 10 & 11 require the address and contents of a local object. We don't know exactly where this object is going to be at run time, but we do know that it is local, and we can state its position relative to the &GOT[0]. So a pair of R_386_GOTOFF relocs are generated. [Again, the compiler may merge lines 10 and 11 into a single reloc, but it didn't when I built this example.]

[Because of the structure of the linker, full name resolution isn't checked until a link is made with an executable. In other words, if your library has unresolved references, you won't find out about it until you try to make an app using your library.]

To summarize, the .o file contains the following relocs:
in the data section:

Relocation section '' at offset 0x470 contains 4 entries:
  Offset    Info  Type            Symbol's Value  Symbol's Name
  00000000  00401 R_386_32              00000000  .bss                     
  00000004  00201 R_386_32              00000000  .text                    
  00000008  00c01 R_386_32              00000001  cPub                     
  0000000c  00a01 R_386_32              00000000  fPub
and in the code section:
Relocation section '.rel.text' at offset 0x440 contains 6 entries:
  Offset    Info  Type            Symbol's Value  Symbol's Name
  00000028  00e0a R_386_GOTPC           00000000  _GLOBAL_OFFSET_TABLE_    
  00000031  00a04 R_386_PLT32           00000000  fPub                     
  0000003a  00604 R_386_PLT32           0000000c  fLocal                   
  00000049  00c03 R_386_GOT32           00000001  cPub                     
  00000057  00409 R_386_GOTOFF          00000000  .bss                     
  0000005e  00409 R_386_GOTOFF          00000000  .bss

Later, when the linker is called for the final link stage, the relocs at lines 1 & 2 will mutate into R_386_RELATIVE relocs. The reloc target location is loaded with the offset of the information. At run-time, the dyn-linker will add the module address to the offset and deposit the result back into the reloc target. The symbol is not used in this case.

Lines 3 & 4 remain as R_386_32 relocs, and will ask the dyn-linker for the full 32 bit absolute address to be deposited into the reloc target.

The reloc triggered by line 5 is fixed up fully, and does not appear in the library.

The reloc in line 6 will cause the linker to add a PLT entry, and a corresponding GOT entry. The latter gets a R_386_JUMP_SLOT reloc, using the symbol "fPub". [The code generated at line 6 appears to be a subroutine call into the PLT entry.]

The reloc at line 7 can be fully resolved by the final linker stage, so it is transformed into a direct call to fLocal().

The relocs at lines 8 and 9 will cause the linker to add a GOT entry, which will hold &cPub. The GOT entry gets marked with a R_386_GLOB_DAT reloc, asking the dyn-linker for the full 32 bit absolute address.

The relocs at line 10 & 11 can be fully resolved at final link time. They turn into "find the data at &GOT[0] plus this offset", so no reloc is required.

As you can see, the 10 relocs in the .o file turn into 6 in the library (two R_386_RELATIVE, two R_386_32, one R_386_JUMP_SLOT and one R_386_GLOB_DAT). Also, the PLT gets a new entry, and the GOT gets two new entries.

Now suppose we have an executable which references the library:
extern char fPub(int);      // declared to match the library
extern char cPub;
int main() {
        return fPub(123)    // 1
        + cPub;             // 2
}
When the .o file is created, there is a R_386_PC32 generated for "fPub" and a R_386_32 generated for the "cPub".

When the executable is created, the R_386_PC32 from line 1 will cause an entry in the PLT, and the code will call into the PLT. At the same time, the linker will create an entry in the GOT, which the PLT will jump through. The GOT entry will get a R_386_JUMP_SLOT reloc, using the symbol "fPub".

The data reference in line 2 will cause a local copy of the global cPub to be created in the data space of the app. The data reference at line 2 is changed to point to this new global data, and the reloc is resolved. This new global gets a R_386_COPY reloc, using the symbol "cPub". The symbol has certain properties, including the fact that it references data, and that it has a length of 1 byte. At run time, the dyn-linker will find the symbol cPub in one of the libraries and copy the 1 byte down from the library into the app data space. The dyn-linker will then publish that latter address as the address of "cPub".

Request for comments

I think the above is exhaustive for the i386. Please let me know if I missed something, or if the presentation should be changed. Pat Beirne