DYLD — Do You Like Death? (XI)

19 min read · May 26, 2024

The lifecycle of a Dynamic Loader from its creation to its termination.

This is the eleventh and last article in the series about debugging Dyld-1122 and analyzing its source code. We will learn how Dyld loads dependent dylibs, binds them all together, returns the address of main(), calls it, and finally terminates.

Please note that this analysis may contain some errors as I am still learning and working on it alone. No one has checked it for mistakes. Please let me know in the comments or contact me through my social media if you find anything.

Let’s go!

WORKING MAP

As last time, we begin our journey by decompiling Dyld using Hopper.

hopper -e '/usr/lib/dyld'

We are in the dyld`start analysing the Memory Manager. In the fourth article, I introduced pseudo-code, which you can see below:

Based on this, we finished creating the allocator later used as a memory pool for setting the Global state of the process, which consists of two types of states: fixed — ProcessConfig, and dynamic — RuntimeState.

ProcessConfig can be accessed through RuntimeState as its property.

The state object of RuntimeState class created in the last episode is like an API for querying process-related data (threads or loaded Mach-Os).

In the repository, the class is even named APIs, and it inherits from RuntimeState.
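To make the relationship between these three types easier to picture, here is a minimal sketch. Only the names ProcessConfig, RuntimeState and APIs come from the Dyld sources; every field and method below is a hypothetical stand-in, so treat it as an illustration and not as the real class layout.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Fixed state: computed once at launch and not changed afterwards.
struct ProcessConfig {
    std::string mainExecutablePath; // hypothetical field
};

// Dynamic state: changes while the process runs (for example, loaded images).
struct RuntimeState {
    const ProcessConfig&     config;       // fixed state reachable as a property
    std::vector<std::string> loadedImages; // hypothetical field

    explicit RuntimeState(const ProcessConfig& c) : config(c) {}
};

// The query layer: inherits everything from RuntimeState.
struct APIs : RuntimeState {
    using RuntimeState::RuntimeState;

    // Hypothetical query, not a real Dyld API.
    std::size_t imageCount() const { return loadedImages.size(); }
};
```

With this shape, anything that holds an APIs object can reach both the dynamic data and, through config, the fixed data.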

In the last episode, we analyzed the ExternallyViewableState which holds information about the loaded images. When initialized, it only stores info about Dyld and the executable image, but now we are going to run the prepare function that will load the rest of the dependent images (dylibs):

Dyld GitHub repository:

Start: prepare in dyld-1122.1 — dyldMain.cpp#1252
End: exit in dyld-1122.1 — dyldMain.cpp#1272

LLDB breakpoints:

# Start - dyld`start+1828
settings set target.env-vars DYLD_IN_CACHE=0
br set -n start -s dyld -R 1828
# Just before call main - dyld`start+2356
br set -n start -s dyld -R 2356
# Just before call exit - dyld`start+2432
br set -n start -s dyld -R 2432

This is the last article for now. However, I may write about the things I omitted in this series in the future.

START — prepare

Before executing the prepare function, we set some values in registers:

The x1 stores the pointer to the beginning of the Dyld image, which is our second argument — MachOAnalyzer, while the first argument APIs is stored in the x0 register, and we can double-check it by inspecting instructions:

When we step into the prepare function, we may observe that it contains twice as much code (4080 instructions 😰) as we saw in dyld::start:

Just to recap, here is the line from the code repository we are running:

The source code repository contains the corresponding code between lines 482–944. Based on the comment, this is the last straight in our Dyld review:

We can omit the code between lines 484–516 as it is only compiled in the context of EnclaveKit initialization. We should start the analysis at line 517:

We can also double-check our assumption in the debugger, as the first instructions we run into call kdebug_trace_dyld_enabled:

Here, we will also not go into details of the kdebug system. I will go back to it in another series about XNU debugging. However, it is usually off, so we will perform a jump here to line 524 where we execute the simulator check:

In the lldb, after returning from the kdebug_trace_dyld_enabled with value 0 in x0 register, there is a CBZ instruction, and we are jumping to +132:

Then another jump is performed to +204, to isSimulatorPlatform:

This function checks if we are running our executable in the context of any of the simulator platforms shown below:

As we are not running the process in the context of the simulator platform, we will jump over most of the code to line 563. Through this jump, we will also check if the program we are executing is built for the simulator in line 538 and if logging of environment variables is enabled in the line 554:

If we run in the simulator context, it ensures the program is correctly configured to run on a simulator platform with the appropriate DYLD_ROOT_PATH.
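The simulator check itself boils down to a membership test. The sketch below assumes a simplified Platform enum with three simulator variants; the enum and its members are hypothetical, and the real list of platforms should be taken from the Dyld sources.

```cpp
// Hypothetical platform enum; the authoritative list lives in the Dyld sources.
enum class Platform {
    macOS,
    iOS,
    tvOS,
    watchOS,
    iOSSimulator,
    tvOSSimulator,
    watchOSSimulator,
};

// Returns true when the platform is one of the simulator variants.
bool isSimulatorPlatform(Platform p) {
    switch (p) {
        case Platform::iOSSimulator:
        case Platform::tvOSSimulator:
        case Platform::watchOSSimulator:
            return true;
        default:
            return false;
    }
}
```

For a regular macOS process the function returns false, which is why we skip most of the simulator-related code.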

state.initializeClosureMode()

It sets up RuntimeState to handle PrebuiltLoaders from the dyld cache. The function logic is well explained in PrebuiltLoaderSet_Policy.md.

PrebuiltLoaders are optimized representations of dynamic libraries used by Dyld to speed up application launch times. If the application is run for the first time, there is also JustInTimeLoader.

The above document about the PrebuiltLoaderSet Policy is very informative and further explains how it worked with the dyld3 and dyld2 versions:

However, now, with the current version of dyld4, we always have only two options to rely on, and they are JustInTimeLoaders or PrebuiltLoaders:

PrebuiltLoaderSet for dyld4 is like Dyld Closure for dyld3.

The dyld4 policy point summarizes when Dyld Closures are not used and explains how the DYLD_USE_CLOSURES works for the current version of dyld4:

There is also one more constraint for PrebuiltLoaderSet:

Regarding DYLD_USE_CLOSURES, there is a comment in the code:

There is also a tool for creating Dyld Closures called dyld_closure_util. Its source code is in the repository. However, it is not trivial to compile it outside Apple's internal environment, and I gave up on it:

The initializeClosureMode is called from the state object because the RuntimeState contains a Loader object that tracks each loaded Mach-O:

The PrebuiltLoader and JustInTimeLoader are subclasses of Loader:

Its code can be found in the repository between lines 90–355.

We can also read about Loaders in another place in the documentation:

Further about PrebuiltLoader:

Finally, about the JustInTimeLoader:

The code responsible for all of this is between lines 2670–2842:

It starts by initializing some variables and then validates the header of the PrebuiltLoaderSet from the dyld cache in line 2677:

The source code of the validHeader logic is shown below. In our case, it returns a true value:

The hasValidMagic checks if PrebuiltLoaderSet->magic is equal to kMagic:

We can find kMagic in the source code repository or by reading the decompiled code while debugging in lldb (0x9a66106073703464):
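One observation about that value: its low 32 bits, 0x73703464, are the ASCII bytes "sp4d". The sketch below decodes them and models the comparison. Treating the magic as a 32-bit field is my assumption here (the article only quotes the 64-bit value seen in lldb), and LoaderSetHeader and fourCharCode are hypothetical helpers.

```cpp
#include <cstdint>
#include <string>

// Value quoted above, as read from the decompiled code in lldb.
constexpr std::uint64_t kObservedValue = 0x9a66106073703464ULL;

// Decodes a 32-bit value into its four ASCII bytes, most significant first.
std::string fourCharCode(std::uint32_t v) {
    std::string s;
    for (int shift = 24; shift >= 0; shift -= 8)
        s.push_back(static_cast<char>((v >> shift) & 0xFF));
    return s;
}

// Hypothetical stand-in for the header; only the magic field is modelled.
struct LoaderSetHeader {
    std::uint32_t magic;
};

// The check itself is a plain equality test against the expected constant.
bool hasValidMagic(const LoaderSetHeader& h, std::uint32_t expected) {
    return h.magic == expected;
}
```

Calling fourCharCode on the low half of kObservedValue yields "sp4d"; what the upper 32 bits represent is not covered by this sketch.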

After checking if the magic is valid, we execute dontUsePrebuiltForApp:

This function determines whether prebuilt loaders should be disabled based on Dyld Environment Variables and executable load commands:

After this check, we fall into another else if where we search the cache for PrebuiltLoader for the program using findLaunchLoaderSet:

If the cachePBLS was not found, and the main executable path starts with /System/, it attempts to find a PrebuiltLoaderSet using the cd-hash:

As we are not running the program from /System/ directory, we are not executing code in lines 2707–2716 and move forward to 2717:

The hasLaunchLoaderSetWithCDHash function is a simple wrapper that calls findLaunchLoaderSetWithCDHash and checks if it returns a non-null pointer:

The findLaunchLoaderSetWithCDHash function constructs a path using the provided cdHashString, ensures it is neither null, nor too long to prevent buffer overflow and then attempts to find a prebuilt loader set corresponding to this path using findLaunchLoaderSet:

# Example path after executing DyldSharedCache::hasLaunchLoaderSetWithCDHash
/cdhash/3302ae16a5eda1cf7daab75ce63b94274674ec8b
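The path construction described above can be sketched as follows. The function name buildCDHashPath, its signature and the caller-supplied buffer are my own assumptions; only the "/cdhash/" prefix, the null check and the length check come from the description.

```cpp
#include <cstddef>
#include <cstring>

// Builds "/cdhash/<cdHashString>" into `out`.
// Returns false when the input is null or the result would not fit.
bool buildCDHashPath(const char* cdHashString, char* out, std::size_t outSize) {
    static const char prefix[] = "/cdhash/";
    if (cdHashString == nullptr)
        return false;
    const std::size_t prefixLen = sizeof(prefix) - 1;
    const std::size_t hashLen   = std::strlen(cdHashString);
    if (prefixLen + hashLen + 1 > outSize) // +1 for the terminating NUL
        return false;
    std::memcpy(out, prefix, prefixLen);
    std::memcpy(out + prefixLen, cdHashString, hashLen + 1); // copies the NUL too
    return true;
}
```

The resulting string would then be handed to the regular path-based lookup (findLaunchLoaderSet in the real code).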

If a PrebuiltLoaderSet was found, isOsProgram is set to true and we execute allowOsProgramsToSaveUpdatedClosures. Otherwise, we are dealing with a 3rd party app and execute allowNonOsProgramsToSaveUpdatedClosures:

The allowOsProgramsToSaveUpdatedClosures function blocks local closure files from overriding closures in the dyld cache:

The allowNonOsProgramsToSaveUpdatedClosures blocks 3rd party apps from saving closures depending on several conditions:

Saving is disallowed on macOS for iPad apps running on Apple Silicon macOS when the executable does not have a CDHash (unsigned).
Saving is allowed on iOS, tvOS, and watchOS platforms.

In our case (a 3rd party app on macOS), the closure will not be saved.

Then, there is a code block related to DYLD_USE_CLOSURES logic:

After that, there is code related to loading closure from disk, but in the case of macOS — it is only for system applications. I will not analyze it here.

To summarize, initializeClosureMode ensures that dyld can use prebuilt closures when they are available and valid, to optimize application startup, or fall back to just-in-time loading, which builds such closures for use in subsequent program launches. In the case of 3rd party apps on macOS, this code ensures the closure will not be saved to disk.

Just-in-time

We are returning from initializeClosureMode. The following lines, 564–568, process a set of prebuilt loaders if they were initialized and retrieve the main loader (at index 0). Then, pre-allocate memory for all images.

There is no mainSet for us. This code will not run for 3rd party apps on macOS.

The condition that follows will be executed, as there is no mainLoader if there is no mainSet, so the mainLoader == nullptr is true:

The reserve function here comes from Linker Standard Library. The argument to reserve specifies the number of elements, not the number of bytes. So, it is preparing space for 512 elements of state.loaded type.

The function lsl::bit_ceil(newCapacity) is used to find the smallest power of two that is greater than or equal to the given newCapacity.

The state.loaded is a container of pointers to Loader objects, and each pointer is 8 bytes wide. So this allocates 512*8 == 4096 bytes using reserveExact:
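The arithmetic can be checked with a small sketch. bitCeil below is a hand-written stand-in for lsl::bit_ceil (I did not copy the real implementation), and reservedBytes is a hypothetical helper that only reproduces the calculation.

```cpp
#include <cstdint>

// Smallest power of two that is >= n (returns 1 for n == 0).
// Assumes the result fits in 64 bits.
std::uint64_t bitCeil(std::uint64_t n) {
    std::uint64_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

// Bytes needed for `count` elements of `elemSize` bytes each,
// after rounding the element count up to a power of two.
std::uint64_t reservedBytes(std::uint64_t count, std::uint64_t elemSize) {
    return bitCeil(count) * elemSize;
}
```

Since 512 is already a power of two, bitCeil(512) stays 512, and reservedBytes(512, 8) gives the 4096 bytes mentioned above. A request for 513 elements would be rounded up to 1024.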

After this allocation, we have Diagnostics buildDiag (line 573):

It looks like this zeroes out the memory we just allocated at x0+0x270:

After all these preparations, we create the JIT Loader. The function computes the slice offset, checks if the binary file exists, creates a loader instance based on the provided parameters, and returns a pointer to it:

A slice here is a single architecture Mach-O from Fat binary mapped to memory within the Loader::getOnDiskBinarySliceOffset function.

The core functionality here lies inside JustInTimeLoader::make, which is too long to paste here. Here are the key points of what the function does:

After initializing the JIT Loader, we are setting it within the RuntimeState and notifying the debugger about it:

The setMainLoader function primarily updates the mainExecutableLoader field in the RuntimeState object with the provided loader pointer:

Additionally, it performs logging related to the main executable, such as logging loaded libraries and segment mappings, if logging is enabled:

So overall, we initialized the JIT Loader here and set it in RuntimeState. It will be used later for loading dependent libraries and applying fixups.

Image loading

Image (dylib) loading begins with the STACK_ALLOC_OVERFLOW_SAFE_ARRAY macro. It allocates a stack array to hold pointers to Loader objects, with an initial capacity of 16. This array will track all images.

In line 591, we are adding the mainLoader to the topLevelLoaders array, and from line 592 to 630, we are first loading inserted libraries:

Then, we set some properties and start to recursively load everything needed by the main executable and inserted dylibs (640–680):

The core functionality here lies within loadDependents function.
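The recursive idea can be sketched as a walk over a dependency table that records each image only once. This is a simplified model under my own assumptions: DependencyMap, the string-based image names and the depth-first order are hypothetical, and the real loadDependents works on Loader objects and does much more (path resolution, error handling and so on).

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical dependency table: image path -> paths it links against.
using DependencyMap = std::map<std::string, std::vector<std::string>>;

// Depth-first walk that records each image once, in the order it is first reached.
void loadDependents(const std::string& image,
                    const DependencyMap& deps,
                    std::set<std::string>& seen,
                    std::vector<std::string>& loadOrder) {
    if (!seen.insert(image).second)
        return; // already loaded, which also stops cycles
    loadOrder.push_back(image);
    auto it = deps.find(image);
    if (it == deps.end())
        return; // leaf image with no recorded dependencies
    for (const std::string& dep : it->second)
        loadDependents(dep, deps, seen, loadOrder);
}
```

Starting the walk from the main executable (and from every inserted dylib) leaves loadOrder with every image needed by the process, each listed exactly once.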

We can also observe how the notifyDebuggerLoad works in lldb by inspecting the image list before and after the function was executed:

There is also notifyDtrace. Dylibs can have DOF sections that contain info about static user probes for dtrace. It finds and registers any such sections:

DOF stands for DTrace Object Format.

Finally, we have code that identifies non-cached dylib loaders and registers them in the state's permanent list using addPermanentRanges:

Using a stack-allocated array (STACK_ALLOC_ARRAY) is efficient regarding memory allocation and deallocation since it avoids heap allocation.
By identifying loaders not part of the dyld cache and adding them to permanent ranges, the system ensures they are retained in memory.

Overall, we loaded all images necessary to run the app in this step.

Fixups

Before we do fixups, there is a code for setting up a weakDefMap for a runtime state, a mechanism used to manage and resolve weak symbols in dynamically loaded libraries (dylibs) before any actual binding occurs:

Before handling fixups, buildInterposingTables sets up tables for interposing functions in non-cached dylibs:

Interposing allows a program to override existing functions in shared libraries with custom implementations. This can be blocked by AMFI.
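Conceptually, an interposing table is a list of (replacement, replacee) pairs that is consulted whenever a pointer is bound. The sketch below shows that idea only; InterposeTuple, applyInterposing and the two stand-in functions are hypothetical names of mine, and the real tables are built from the Mach-O images.

```cpp
#include <vector>

// Function pointer type used by this sketch.
using Fn = int (*)();

// One interposing entry: binds to `replacee` are redirected to `replacement`.
struct InterposeTuple {
    Fn replacement;
    Fn replacee;
};

// Returns the address a bind should use after consulting the table.
Fn applyInterposing(const std::vector<InterposeTuple>& table, Fn target) {
    for (const InterposeTuple& t : table) {
        if (t.replacee == target)
            return t.replacement;
    }
    return target; // not interposed
}

// Two stand-in functions used only to obtain distinct addresses.
int originalOpen() { return 1; }
int myOpen() { return 2; }
```

With a table containing the pair (myOpen, originalOpen), every bind that targets originalOpen ends up pointing at myOpen, while all other targets stay untouched.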

After that, applying fixups begins. The code responsible for that first starts a ScopedTimer to measure the time taken for applying fixups and acquires a DyldCacheDataConstLazyScopedWriter for the dyld cache data patching:

Then, we handle strong overrides of weak definitions with a function handleStrongWeakDefOverrides that identifies dylibs with weak definitions, searches for strong overrides in those dylibs, and applies fixups:

A strong symbol is just a symbol defined without any additional attribute, or one that explicitly uses the default visibility attribute:

int strong_symbol = 42;                                                 // plain definition
int strong_default_symbol __attribute__((visibility("default"))) = 42;  // explicit default visibility

While a weak symbol can be defined like this:

int weak_symbol __attribute__((weak)) = 42;
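The effect of a strong override can be modelled as a selection over all definitions of one symbol name. This is a sketch under my own assumptions: the Definition record and chooseDefinition are hypothetical, and the rule "first strong wins, otherwise first weak in load order" is a simplification of what handleStrongWeakDefOverrides and the weakDefMap achieve together.

```cpp
#include <string>
#include <vector>

// Hypothetical record of one definition of a symbol found in a loaded image.
struct Definition {
    std::string image;
    bool        isWeak;
};

// Picks the definition every image should bind to:
// the first strong definition if one exists, otherwise the first weak one.
// Returns nullptr when the list is empty.
const Definition* chooseDefinition(const std::vector<Definition>& defsInLoadOrder) {
    const Definition* firstWeak = nullptr;
    for (const Definition& d : defsInLoadOrder) {
        if (!d.isWeak)
            return &d;
        if (firstWeak == nullptr)
            firstWeak = &d;
    }
    return firstWeak;
}
```

If libA.dylib and libC.dylib define the symbol weakly and libB.dylib defines it strongly, all three end up using the copy from libB.dylib.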

After handling strong overrides of weak symbols, we iterate over each loaded loader to apply fixups using applyFixups (the core logic lives here). If any error occurs during fixups, execution halts and the fixup error is reported.

There is also applyCachePatches function for handling any patches in dyld cache (only if dylib overrides something there):

There is also something called singleton patching in Dyld Shared Cache performed by a function doSingletonPatching:

From the code, it looks like it only applies to the Obj-C code. Here is the structure:

At last, we applyInterposingToDyldCache if used:

However, it does not count into the timing of applying fixups. So, we can conclude that singleton patching is the last thing in the fixup process:

After all these fixups, the dependent libraries of our executable are loaded, and their symbols are resolved and relocated, so the program is ready to go.

Libdyld.dylib

The lines between 734–761 do not concern us, as they apply to PrebuiltLoaders and we are using JustInTimeLoader:

Similarly, we can skip lines 763–796, as they apply to kdebug, which is off. If it were on, it would notify kdebug about each image load:

So the first thing we actually do is check whether libdyld.dylib exists; its loader was set in JustInTimeLoader::applyFixups.

After that, we wire up libdyld.dylib to dyld. The code first gets the load address of libdyld by calling loadAddress on the libdyldLoader (801).

Then it looks for the __dyld4 section within the __DATA segment of libdyld.dylib (803), and if it is not found there, it searches the __AUTH segment (806).

If it cannot be found, the loading is halted:
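The lookup with its fallback can be sketched like this. The Section struct and both helper functions are hypothetical; the real code walks the Mach-O load commands of libdyld.dylib instead of a vector.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, flattened view of one section of a mapped image.
struct Section {
    std::string   segment;
    std::string   name;
    std::uint64_t address;
};

// Returns the first section matching segment and name, or nullptr.
const Section* findSection(const std::vector<Section>& sections,
                           const std::string& segment,
                           const std::string& name) {
    for (const Section& s : sections) {
        if (s.segment == segment && s.name == name)
            return &s;
    }
    return nullptr;
}

// Looks in __DATA first and falls back to __AUTH.
// A nullptr result corresponds to the case where loading is halted.
const Section* findDyld4Section(const std::vector<Section>& sections) {
    if (const Section* s = findSection(sections, "__DATA", "__dyld4"))
        return s;
    return findSection(sections, "__AUTH", "__dyld4");
}
```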

Then, we establish a connection between the libdyld.dylib and the runtime state of the program by providing access to the global APIs through the libdyld4Section:

We also allow external code and components to access information about all loaded images in the process by providing a pointer to the allImageInfos field from libdyld4Section using storeProcessInfoPointer:

Next, we initialize program variables (vars) in the runtime state (state) based on information retrieved from libdyld.dylib:

There is one thing I do not understand. While debugging, I could not find the C code in the repository corresponding to the below instructions.

__chkstk_darwin

After setting state.vars, we may observe blraa x16, x17 instruction:

This jumps to the below code, which branches to __chkstk_darwin:

Going further, we branch to __chkstk_darwin_probe:

The code below shows the disassembled __chkstk_darwin_probe. While debugging, this executes instructions +0, +4, +8 and then jumps to +32:

+0: Compares the value in register x9 (stack size?) with the immediate 0x1 shifted left by 12 bits, which gives 0x1000 (4096). This check likely verifies whether the stack size is at least 4096 bytes.
+4: Moves the stack pointer (sp) value into register x10.
+8: Branches (b.lo) to instruction +32 if the comparison at +0 indicates that the value in x9 is lower than 4096.
+32: Subtracts the value in x9 from the value in x10.
+36: Loads a byte from memory at the new address pointed to by x10.

So, it is like probing (checking read access using +36 instruction) to see if we can access the stack at [x10] which holds this value:

This value holds state.vars = &libdyld4Section->defaultVars so it seems like it checks if the variables are readable?
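To reason about which addresses get touched, here is a small model of the steps listed above. It assumes that x9 carries a byte count, as guessed earlier, and that the instructions between +12 and +28 (which I did not walk through) probe one 4096-byte page per iteration; both are assumptions, and probedAddresses is a hypothetical helper, not code from Dyld or libSystem.

```cpp
#include <cstdint>
#include <vector>

// Models the probe: returns every address that would be read for a
// request of `size` bytes below the stack pointer `sp`.
std::vector<std::uint64_t> probedAddresses(std::uint64_t sp, std::uint64_t size) {
    const std::uint64_t kPage = 0x1000; // 4096, the constant compared at +0
    std::vector<std::uint64_t> probes;
    std::uint64_t x10 = sp;   // +4: mov x10, sp
    std::uint64_t x9  = size;
    // Assumed loop (+12 to +28, not analyzed above): one read per full page.
    while (x9 >= kPage) {
        x10 -= kPage;
        probes.push_back(x10);
        x9 -= kPage;
    }
    x10 -= x9;              // +32: sub x10, x10, x9
    probes.push_back(x10);  // +36: load a byte from [x10]
    return probes;
}
```

For a request smaller than 4096 bytes, which matches the path we saw in the debugger (+0, +4, +8, then +32), the model produces a single read at sp minus the requested size.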

partitionDelayLoads

Moreover, after executing __chkstk_darwin we run into the function below. Unlike __chkstk_darwin, I could find its definition in the Dyld source code repository. However, I could not find where it is called in dyldMain.cpp:

The partitionDelayLoads code can be seen in DyldRuntimeState.cpp between lines 525–566. Its main purpose is to get the Loaders marked as delay-init, which can now be initiated.

If a loader in delayLoaded is no longer delayed, it is moved to loaded.
If a loader in loaded is now delayed, it is moved to delayLoaded.
The undelayedLoaders vector is populated with loaders that were initially marked as delayed but are no longer delayed.

This function ensures that the dylibs are initialized in the correct order.
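The three rules above can be sketched as a simple re-partitioning of two lists. LoaderInfo and its delayed flag are hypothetical stand-ins of mine; the real function works on Loader pointers and computes the delayed status itself.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a loader; `delayed` says whether it is delay-init right now.
struct LoaderInfo {
    std::string path;
    bool        delayed;
};

// Re-sorts loaders between the two lists and reports the ones that left the delayed list.
void partitionDelayLoads(std::vector<LoaderInfo>& loaded,
                         std::vector<LoaderInfo>& delayLoaded,
                         std::vector<LoaderInfo>& undelayedLoaders) {
    std::vector<LoaderInfo> nowLoaded, nowDelayed;
    // Loaders in `loaded` that became delayed move to the delayed list.
    for (const LoaderInfo& l : loaded)
        (l.delayed ? nowDelayed : nowLoaded).push_back(l);
    // Loaders in `delayLoaded` that are no longer delayed move to `loaded`.
    for (const LoaderInfo& l : delayLoaded) {
        if (l.delayed) {
            nowDelayed.push_back(l);
        } else {
            nowLoaded.push_back(l);
            undelayedLoaders.push_back(l);
        }
    }
    loaded.swap(nowLoaded);
    delayLoaded.swap(nowDelayed);
}
```

After the call, undelayedLoaders holds exactly the loaders that still need their initializers to run.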

DYLD_JUST_BUILD_CLOSURE

Before moving forward, there is a block of code that is not executed under normal circumstances on macOS for 3rd party apps. It is shown below:

It handles the creation and serialization of prebuilt loader sets.

After that, there is a check for the DYLD_JUST_BUILD_CLOSURE variable, which is used for prewarming. If it is set, execution halts here:

I must return to this piece of code, which is very interesting because of serialization and saving closure mechanisms.

I skipped over some code that is executed here but changes nothing for us:

Prepare main

The last thing we do in prepare function is to prepare the program's entry point. The logic here is to decide whether to use LC_MAIN or LC_THREAD: