Thread Local Storage in Object Files

I was rummaging through some x86-64 assembly generated by Clang and Intel's C compiler (ICC) when I noticed something strange. In the Clang-generated code, there are a large number of loads that reference a _DYNAMIC section.

6a1:	lea 0x2002d7(%rip),%rdi        # 200980 <_DYNAMIC+0x1a8>

It wasn't the assembly that bothered me so much as the comment. What is the _DYNAMIC section doing there? I had never seen anything like it before. It was especially confusing because these instructions didn't seem to occur in the code compiled by ICC.

At first, I thought it was simply a case of position-independent code (PIC) because the load was offset from the instruction pointer (%rip). I don't remember explicitly enabling position-independent code, but I thought maybe Clang compiles PIC by default. But something else struck me as odd. The ICC-generated code had loads relative to %rip as well, but no comment. Maybe I just forgot to build the binary with debug information, and that would add the _DYNAMIC comment?
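As a quick sanity check on the %rip-relative arithmetic, here is a minimal sketch in C. The helper name `rip_relative_target` is made up for illustration. A %rip-relative operand is resolved against the address of the next instruction, so the instruction length matters. The 8-byte length used in the usage note is an assumption inferred from the numbers in the disassembly (a plain lea with a 32-bit displacement is 7 bytes, so 8 would imply a one-byte prefix).

```c
#include <stdint.h>

/* Target of a %rip-relative operand: the displacement is added to the
 * address of the NEXT instruction, i.e. instruction address + length.
 * insn_len is an input here because it cannot be read off the
 * disassembly text alone. */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len, int32_t disp)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

With the values from the lea at 0x6a1, `rip_relative_target(0x6a1, 8, 0x2002d7)` gives 0x200980, which matches the address in the objdump comment.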

I recompiled the binary with debug information using ICC, but that didn't do it either. I suspected that Thread Local Storage (TLS) was involved because I had run into it before, so I set out to discover what was going on. Let's take some simple C code with a thread-local variable and see what happens.

#include <stdio.h>

__thread int _threadInt;

int foo(void) {
    printf("Thread int is: %d\n", _threadInt);
    return 0;
}

Clang generates the following assembly code:

 push   %rax
 mov    %fs:0xfffffffffffffffc,%esi
 mov    $0x400658,%edi
 xor    %al,%al
 callq  4003f0 <printf@plt>
 xor    %eax,%eax
 pop    %rdx
 retq   
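To make the `%fs:0xfffffffffffffffc` operand in the listing above less mysterious, here is a small sketch in C. It assumes x86-64 Linux with GCC or Clang, where the word at `%fs:0` holds the thread pointer itself. The names `thread_int`, `thread_pointer`, `thread_int_offset`, and `listing_offset` are invented for this illustration. It shows that the operand is just -4 read as a signed number, and that a thread-local variable in the executable lives at a negative offset from the thread pointer.

```c
#include <stdint.h>

static __thread int thread_int = 42;

/* Assumption: x86-64 Linux, where %fs:0 contains the thread pointer. */
static uintptr_t thread_pointer(void)
{
    uintptr_t tp;
    __asm__ ("movq %%fs:0, %0" : "=r" (tp));
    return tp;
}

/* Signed distance from the thread pointer to this thread's copy of
 * thread_int. For TLS that belongs to the executable it is negative. */
int64_t thread_int_offset(void)
{
    return (int64_t)((uintptr_t)&thread_int - thread_pointer());
}

/* The operand from the listing, reinterpreted as a signed 64-bit value. */
int64_t listing_offset(void)
{
    return (int64_t)UINT64_C(0xfffffffffffffffc);
}
```

Here `listing_offset()` returns -4, which fits a single 4-byte int placed directly below the thread pointer. The exact value of `thread_int_offset()` depends on what other thread-local data the program and its libraries contain.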

Here, the second instruction is the most important: the mov from %fs is the thread-local variable access. The generated assembly reads the variable at a fixed negative offset from the thread pointer in %fs, which is the local-exec storage model; the storage models are a topic for another blog post. The fifth instruction is the call to printf. Here the assembly makes sense: we just load the thread-local variable and pass it as an argument to printf (the x86-64 System V calling convention passes the first few integer arguments in registers). But there's still no load from _DYNAMIC! What's going on? I read more about the DYNAMIC section in ELF files [1], trying to figure out where the thread-local variable is allocated. Let's take a look at the DYNAMIC section in the object file:

objdump -x main.o 

Dynamic Section:
  NEEDED               libc.so.6
  NEEDED               ld-linux-x86-64.so.2
  INIT                 0x0000000000000568
  FINI                 0x0000000000000718
  GNU_HASH             0x00000000000001b8
  STRTAB               0x0000000000000360
  SYMTAB               0x00000000000001f8
  STRSZ                0x00000000000000b6
  SYMENT               0x0000000000000018
  PLTGOT               0x0000000000200990
  PLTRELSZ             0x0000000000000060
  PLTREL               0x0000000000000007
  JMPREL               0x0000000000000508
  RELA                 0x0000000000000478
  RELASZ               0x0000000000000090
  RELAENT              0x0000000000000018
  VERNEED              0x0000000000000438
  VERNEEDNUM           0x0000000000000002
  VERSYM               0x0000000000000416
  RELACOUNT            0x0000000000000001
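The same entries can also be read from inside a running program. The following is a minimal sketch, assuming Linux with glibc, where `<link.h>` provides the `ElfW()` macro and the linker defines the `_DYNAMIC` symbol at the start of the dynamic section. The function name `count_dynamic_tag` is made up for this example, and the weak declaration is there only so that a statically linked build (which has no dynamic section) still links.

```c
#include <link.h>    /* ElfW(); also pulls in <elf.h> for the DT_* tags */
#include <stddef.h>

/* _DYNAMIC marks the start of the dynamic section. Declared weak so the
 * symbol may be absent (address NULL) in a statically linked program. */
extern ElfW(Dyn) _DYNAMIC[] __attribute__((weak));

/* Count the entries that carry a given tag, stopping at the DT_NULL
 * terminator. Returns 0 when there is no dynamic section at all. */
size_t count_dynamic_tag(long tag)
{
    size_t n = 0;

    if (_DYNAMIC == NULL)
        return 0;
    for (const ElfW(Dyn) *d = _DYNAMIC; d->d_tag != DT_NULL; ++d) {
        if (d->d_tag == tag)
            ++n;
    }
    return n;
}
```

Calling `count_dynamic_tag(DT_NEEDED)` should then correspond to the number of NEEDED lines that objdump prints for the same binary.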

The dynamic section lists what the dynamic linker needs at load time. We need libc for printf, and there is a symbol table (the SYMTAB entry), but still nothing here that tells us how the thread-local variable is accessed. In addition, the generated code doesn't have a comment indicating a load from the DYNAMIC section. But this gave me an idea: what if we made the thread-local variable extern so that its address would have to be resolved later? And since it should be in the dynamic section, let's make the code a shared library. Let's write some code:

clang -fpic main.c -O -shared -o main.o

#include <stdio.h>

extern __thread int _threadInt;

int foo(void) {
    printf("Thread int is: %d\n", _threadInt);
    return 0;
}

Clang now generates the following assembly code:

 push   %rax
 lea    0x20031f(%rip),%rdi        # 2009a8 <_DYNAMIC+0x1a8>
 callq  598 <__tls_get_addr@plt>
 mov    (%rax),%esi
 lea    0xac(%rip),%rdi        # 746 <_fini+0xe>
 xor    %al,%al
 callq  578 <printf@plt>
 xor    %eax,%eax
 pop    %rdx
 retq
 nopw   %cs:0x0(%rax,%rax,1)

Alright, finally! We have a reference to the DYNAMIC section again! Why did this happen? Compilers can choose between several thread-local storage access models. If the compiler can determine that the thread-local variable is defined and referenced only within the executable, a level of indirection can be optimized away, hence the lack of a _DYNAMIC access. By declaring the variable extern and building a shared library with -fpic, we prevented the compiler from performing this optimization. So now that we finally have a reference to _DYNAMIC, what is it actually referencing? There is something interesting about the third instruction: a call to __tls_get_addr@plt.

This is one method of accessing thread-local storage [2]. The code calls a function, reached through the Procedure Linkage Table (PLT), that returns the address of the thread-local variable. How the PLT works is another long blog post. Essentially, the lea before the call passes the address of a small table entry that describes the variable, and the dynamic linker patches that entry through relocations at load time. Code then accesses the variable through the pointer that __tls_get_addr returns. Then I thought, oh, let's take a look at the dynamic relocation entries in the object file:

readelf -r main.o

Relocation section '.rela.dyn' at offset 0x478 contains 6 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
0000002007f8  000000000008 R_X86_64_RELATIVE                    00000000002007f8
000000200990  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
000000200998  000400000006 R_X86_64_GLOB_DAT 0000000000000000 _Jv_RegisterClasses + 0
0000002009a0  000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize + 0
0000002009a8  000700000010 R_X86_64_DTPMOD64 0000000000000000 _threadInt + 0
0000002009b0  000700000011 R_X86_64_DTPOFF64 0000000000000000 _threadInt + 0

Aha! There it is: the R_X86_64_DTPMOD64 entry for _threadInt. The comment in the disassembled code names an address relative to _DYNAMIC, 0x2009a8. When we look at the dynamic relocation section, we see our thread-local variable at exactly that offset.
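The Info column in the readelf output packs two fields into one 64-bit value, and decoding it by hand confirms which relocation types are involved. This is a small sketch using the `ELF64_R_SYM` and `ELF64_R_TYPE` macros and the `R_X86_64_*` constants from `<elf.h>` on Linux. The wrapper names `reloc_type` and `reloc_symbol_index` are invented for this example.

```c
#include <elf.h>
#include <stdint.h>

/* Relocation type: the low 32 bits of r_info. */
uint32_t reloc_type(uint64_t r_info)
{
    return (uint32_t)ELF64_R_TYPE(r_info);
}

/* Index into the dynamic symbol table: the high 32 bits of r_info. */
uint32_t reloc_symbol_index(uint64_t r_info)
{
    return (uint32_t)ELF64_R_SYM(r_info);
}
```

For the two _threadInt rows, 0x000700000010 decodes to symbol index 7 with type R_X86_64_DTPMOD64, and 0x000700000011 decodes to the same symbol with type R_X86_64_DTPOFF64.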

Whew, at least now I know what the _DYNAMIC reference is for. It refers to a location patched through the dynamic relocation section in an ELF file, not to the dynamic shared library section. Another blog post will have to be written to describe how relocation works and what each thread-local variable access optimization strategy does.
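To tie the two relocations back to the __tls_get_addr call, here is a toy model in C of the lookup described in the ELF TLS document [2]. It is only a sketch, not glibc's implementation: the `toy_dtv` structure, the bounds check, and the zero-based module numbering are simplifications made up for illustration. The two-word argument layout (`ti_module`, `ti_offset`) follows the specification, and the comments note which relocation would fill each word.

```c
#include <stddef.h>

/* The two-word argument whose address the lea loads into %rdi. */
typedef struct {
    unsigned long ti_module;  /* module ID, patched via R_X86_64_DTPMOD64 */
    unsigned long ti_offset;  /* offset within that module's TLS block,
                                 patched via R_X86_64_DTPOFF64 */
} toy_tls_index;

/* Hypothetical stand-in for one thread's dynamic thread vector (DTV). */
typedef struct {
    size_t  module_count;
    char  **blocks;  /* blocks[m]: start of module m's TLS block */
} toy_dtv;

/* Toy lookup: pick the calling thread's block for the module, then add
 * the variable's offset. Returns NULL for an unknown module ID. */
void *toy_tls_get_addr(const toy_dtv *dtv, const toy_tls_index *ti)
{
    if (ti->ti_module >= dtv->module_count)
        return NULL;
    return dtv->blocks[ti->ti_module] + ti->ti_offset;
}
```

In this model each thread would own its own `toy_dtv`, so the same `toy_tls_index` yields a different address per thread, which is the point of the extra indirection.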

References:
[1] Executable and Linkable Format
[2] ELF Handling for Thread Local Storage
[3] Eli Bendersky's Position Independent Code