What Every Developer Should Know About PDBs

Category: DevOps, Programming

Tags: #DevOps, #.NET

Published: August 3, 2023 Reading Time: 11 min

This is a post in the series Mastering PDBs and Debugging. The posts in this series include:

Aug 3, 2023 - What Every Developer Should Know About PDBs
Aug 10, 2023 - Understanding Symbol Servers
Aug 17, 2023 - An Introduction to SourceLink
Aug 24, 2023 - Understanding .NET Debug vs Release
Sep 7, 2023 - Forcing .NET Into Debug Mode

A decade ago the great Bugslayer, John Robbins, wrote his brilliant article, “PDB Files: What Every Developer Must Know”. Since then, a lot has changed (including .NET’s transformation to open source). It seemed like an ideal time to re-examine the topic and update the details he provided for 2023.

Despite the many years, one thing hasn’t changed about PDB files: they remain just as important as your source code. Failing to properly maintain PDB files has cost many companies hundreds of thousands of dollars in lost time resolving issues on their production servers. Having the right PDBs available can make all the difference in resolving critical bugs. This is even more true when using native compilation.

Despite more than 20 years of .NET, the myth that having PDBs available with your release code creates a risk still exists. To understand why PDBs are important (and present no risk whatsoever), we must first explore what a PDB contains.

What is a PDB?

Most developers know that Program Database (PDB) files are something related to debugging. At the time of the original paper, John observed that the documentation about their purpose and impact was scattered (if you could find anything at all). This made them much harder to understand. Thankfully, when .NET became an open source project, that situation improved significantly. Now, the details of PDBs and Portable Executables (PE) are documented and discoverable.

There are actually two kinds of PDBs:

Windows (Microsoft v7)
This is the original, proprietary PDB format that Microsoft used for many years, also called multi-stream format v7 (MSF 7). It was the standard for .NET Framework applications on Windows. Originally optimized for debugging native code on Windows, it is complex and poorly documented, and the files could be larger than the related executable. The format was later extended for .NET, increasing its complexity. It was intended to be written or read only on Windows. The format is considered obsolete for managed code and is not recommended. It is selected by setting the DebugType property to full or pdbonly in the MSBuild project file (.csproj). Microsoft has published the source code for the format on GitHub.
Portable
This is the modern, cross-platform PDB version. Because of the challenges with the original format, the Roslyn team decided to make a compact format that worked on all platforms and was optimized for .NET. This resulted in smaller files with less redundancy, and it quickly became the default PDB type for .NET. It is selected by setting the DebugType property to portable (or not specifying a value). Both source code and a public specification are available.

The Portable PDB specification defines two types of PDBs:

Standalone
A standalone .pdb file is generated beside the assembly (.exe/.dll) it represents.
Embedded
The PDB is embedded directly within the assembly rather than generated as a standalone file. Setting the DebugType property to embedded enables this format. It increases the size of the executable slightly, but simplifies the process for distributing the debug symbols. Because redundant data is eliminated, the final additional content is smaller than generating a standalone PDB. This is generally the recommended configuration.
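
As a minimal sketch of how these formats are selected, the project file below sets the DebugType property. The target framework and the overall project layout are assumptions made for illustration; only the DebugType values come from the descriptions above.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <!-- Assumed target framework, for illustration only -->
    <TargetFramework>net8.0</TargetFramework>
    <!-- embedded: the portable PDB is stored inside the .dll/.exe.
         Other values described above:
           portable        standalone portable .pdb beside the assembly
           full / pdbonly  legacy Windows PDB -->
    <DebugType>embedded</DebugType>
  </PropertyGroup>
</Project>
```

Switching between the standalone and embedded forms is a one-line change, so it is reasonable to try both and compare the output sizes for your own project.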

Deploying executables with embedded symbols does not prevent developers from removing the symbols for deployments to resource-constrained devices. To reduce file sizes (especially for constrained devices), .NET 5 shipped with improvements to the trimming support. Embedded and standalone PDB symbols are also trimmed to match the executables when creating trimmed, self-contained executables. Setting the property TrimmerRemoveSymbols to true removes all symbols. This affects both the application and its dependencies. This property is automatically set to true when the property DebuggerSupport is set to false.
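
The trimming properties can be sketched as follows. The runtime identifier, target framework, and output type are placeholder assumptions; TrimmerRemoveSymbols is the property named above, and PublishTrimmed is the standard switch that turns trimming on.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <!-- Assumed values, for illustration only -->
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <RuntimeIdentifier>linux-arm64</RuntimeIdentifier>
    <!-- Trimming applies to self-contained publishes -->
    <SelfContained>true</SelfContained>
    <PublishTrimmed>true</PublishTrimmed>
    <!-- Remove all symbols from the application and its dependencies.
         Leave this unset to keep symbols trimmed to match the assemblies. -->
    <TrimmerRemoveSymbols>true</TrimmerRemoveSymbols>
  </PropertyGroup>
</Project>
```

Removing symbols this way is best reserved for the constrained-device case, since it gives up the stack trace and dump analysis benefits the symbols provide.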

Embedded portable PDBs are generally recommended over standalone PDBs unless the final size becomes prohibitive for distributing the application. If you’re building production systems or deploying to the cloud, this approach is highly recommended. Having the symbols present enables you to capture complete stack traces or mini-dumps of memory for debugging complex issues. I can’t tell you how many times this has resolved difficult-to-detect stability errors! If you’re building a self-contained single-file executable, then embedded PDBs (and .NET 6 or later) must be used to get proper stack traces.

The standalone portable PDB is often recommended when distributing via NuGet (via a symbol package, .snupkg). This approach relies on a symbol server to download the symbols when they are needed for debugging. The idea is that it makes the extra file size “pay to play” – only developers will need to download these resources. It assumes that the majority of the time developers will not care to step into the code and debug it. This approach does require some minimal additional configuration so that the debugger knows about the symbol server.
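
A minimal sketch of a library project that produces a symbol package alongside its NuGet package is shown below. The target framework is an assumption; IncludeSymbols and SymbolPackageFormat are the packaging properties that request a .snupkg.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <!-- Assumed target framework, for illustration only -->
    <TargetFramework>net8.0</TargetFramework>
    <!-- Standalone portable PDB, kept out of the main package -->
    <DebugType>portable</DebugType>
    <!-- Ask "dotnet pack" to produce a symbol package as well -->
    <IncludeSymbols>true</IncludeSymbols>
    <SymbolPackageFormat>snupkg</SymbolPackageFormat>
  </PropertyGroup>
</Project>
```

With these settings, packing the project should yield both a .nupkg and a matching .snupkg, and the symbol package can then be published to a symbol server.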

There’s one more feature that impacts modern PDBs: native compilation. The Native Ahead-of-Time (AOT) compiler can be used to produce self-contained native applications. Because the code is converted from IL to native code, it also generates a corresponding native symbol file to enable debugging the native executable. To do this, it relies on a portable PDB being available on the local file system (embedded or standalone). It cannot use a symbol server or rely on NuGet symbol packages.

When deploying NativeAOT applications, Microsoft strongly recommends including the symbol file to enable profiling and debugging. This is also essential when working with crash dumps. Without it, there is no easy way to associate a stack trace back to the original code. With .NET 8 and higher, the symbols are stored using native formats: in a .dSYM folder on macOS, a .pdb file on Windows, and a .dbg file on Linux. For Unix-like platforms, the StripSymbols property can be set to false to embed the symbols into the compiled binary rather than storing them in a separate file.
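
The NativeAOT settings described above might look like the following sketch. The output type and target framework are assumptions; PublishAot enables native compilation, and StripSymbols is the property named above.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <!-- Assumed values, for illustration only -->
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <!-- Compile to native code when publishing -->
    <PublishAot>true</PublishAot>
    <!-- Unix-like platforms: keep the debug symbols inside the native
         binary instead of writing a separate .dbg or .dSYM output -->
    <StripSymbols>false</StripSymbols>
  </PropertyGroup>
</Project>
```

Whichever option you choose, the point is the same: the native symbols need to be archived with the build so they are available when a crash dump arrives.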

What’s inside a PDB?

There’s a misconception that originated with the original PDB format. The format originated as a way to support debugging unmanaged (C/C++) code. In that context, it actually made it substantially easier to decompile the code and recreate the source. Because of this, many developers incorrectly believe that the PDB contains everything needed to decompile the source code. With .NET, nothing could be further from the truth.

.NET assemblies and executables already contain the symbol metadata natively. This allows the framework to support dynamic reflection, but it also makes decompilation easy. This is the basis behind tools such as JustDecompile, ILSpy, Reflector and dotPeek. Since the metadata already exists in the binary, portable PDB files don’t need to carry this information. In short, including the PDB does not make your code less resistant to decompilation.

The PDB essentially contains references from method names to sequence points. Sequence points map the generated Intermediate Language (IL) code for each method to the related rows and columns in the source code file. By making the start of one sequence point the end of the previous one, the size of the file is minimized without losing functionality. The source code file can be referenced either by an absolute path or a relative path. A graphical view of some of the contents of a PDB file is shown in the figure below:

PDB Metadata Contents — Figure 1. Simplified view of PDB contents

The PDB also contains a GUID called the Module Version ID (MVID). The matching compiled assembly will always contain the same GUID. This provides a mechanism for ensuring that the PDB and the assembly represent the same code and the same build. This is why it is critically important that you generate and preserve the PDBs at compile time. If the two files don’t match, the debugger will give a “PDB does not match image” error or notify you “a matching symbol file was not found”. Attempting to recompile the code to create new PDBs generates a new unique identifier, preventing them from matching the original code (see this article for the reasoning). As an additional layer of protection, the assembly can contain a hash of the PDB file to ensure that the file is the original, unaltered version created during the build.

Newer versions of .NET support deterministic compilation, to ensure that compilation generates the same MVID given the same source code. This does not eliminate the need to create PDBs at the same time as the code! Deterministic compilation does not guarantee byte-for-byte identical files (which impacts the hash) or that PDBs created at a later date will work with existing binaries.
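
For reference, deterministic compilation is controlled by project properties such as the ones sketched below. The target framework is an assumption. Deterministic is typically already enabled by default in SDK-style projects, and ContinuousIntegrationBuild is commonly set only on build servers so that local source paths are normalized; treat the exact combination here as one reasonable setup, not a requirement.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <!-- Assumed target framework, for illustration only -->
    <TargetFramework>net8.0</TargetFramework>
    <!-- Same inputs produce the same MVID -->
    <Deterministic>true</Deterministic>
    <!-- Always produce the PDB in the same build as the binary -->
    <DebugType>embedded</DebugType>
  </PropertyGroup>
  <!-- Hypothetical convention: pass -p:CI=true on the build server -->
  <PropertyGroup Condition="'$(CI)' == 'true'">
    <ContinuousIntegrationBuild>true</ContinuousIntegrationBuild>
  </PropertyGroup>
</Project>
```

Even with this configuration, the PDB should be produced and stored by the same build that produced the shipped binary.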

When a PDB is not generated, the Roslyn compiler will perform additional optimizations that eliminate some code paths and local variable usage. As a result, the code won’t align with PDBs created by a future build, even with deterministic compilation. In effect, the compiled code can no longer be matched to a PDB.

This creates a common misconception that skipping the PDB generation results in smaller, faster code. That’s not entirely accurate, and it’s a dangerous mistake. The just-in-time compiler will make the same optimization at runtime, so the resulting native code is equivalent. The binary is a few bytes smaller, but you’ve lost the ability to use mini-dumps, gather stack traces, or debug effectively.

Application performance

Another common misconception is that PDBs reduce performance, but this is not the case. The origin of this myth has some basis in fact. Prior to C# 6.0 (2015), Windows PDBs were used and the compiler implemented additional behaviors when they were generated. Using a full PDB would emit the DebuggableAttribute into the binary, preventing the just-in-time compiler from optimizing the code fully. This impacted the performance and size of the executing code; using pdbonly avoided this issue. This is no longer the case.

A PDB is generally only loaded during a debugging session or when an exception is thrown (to resolve the stack trace). If the PDB is embedded in an assembly or executable, then the embedded contents are used. If not, the runtime (or debugger) will look for a PDB file with a name that matches the assembly being evaluated. It will compare the MVID in the assembly (.exe or .dll) to its associated .pdb. If the MVID matches, the PDB is loaded into memory. The PE can also embed a hash of the symbol file. If this exists, the PDB is hashed and the results compared to determine if the symbols are unaltered before loading. If either the hash check or MVID check fails, the PDB is ignored; debuggers may present a warning about the mismatch.

As a next step, the runtime can look for the CodeView Debug Directory Entry in the assembly. This will contain the original build path to the PDB. The runtime will then attempt to locate the file from that location. If it is found, then it will compare the MVID (and hash) to determine whether the PDB matches.

If there is an active debug session, the PDB cannot be resolved locally, and the debugger supports symbol servers, then an additional step occurs. The debugger will attempt to retrieve the PDB from the symbol server cache; it will then query each configured symbol server, stopping at the first match. It will cache the returned PDB. Then, it loads the file into memory. This process can take some time to complete if a large number of PDBs are required. It’s important to understand that symbol downloads only happen as part of a connected debugging session.

The performance of executables with embedded PDBs is also generally not impacted. This is because executables are loaded as memory-mapped files. This provides a way of accessing a file as if it were entirely in memory. When a memory location is accessed, the corresponding parts of the file can be dynamically loaded into memory. As a result, segments of the executable, including the embedded PDB, are loaded only when they are needed.

Embedded source

Sometimes the build process automatically generates source code files. These can be as simple as automatic versioning in an AssemblyInfo.cs file or as complex as a code generator responsible for generating classes to support serialization. To step into these files while debugging, the files must be available to the debugger. Since these files are not included in source control, they need to be made available in a different way to support debugging.

Setting the property EmbedUntrackedSources to true in the project causes these generated files to be embedded in the PDB. This makes them available to the debugger for stepping through the code. If the value is not specified or false, then these build-generated files will not be embedded (and will not be available when debugging). This allows you to control whether or not this source code is included in the PDB.
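
A minimal sketch of this setting is shown below. The target framework and the choice of an embedded PDB are assumptions for illustration; EmbedUntrackedSources is the property described above.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <!-- Assumed values, for illustration only -->
    <TargetFramework>net8.0</TargetFramework>
    <DebugType>embedded</DebugType>
    <!-- Embed build-generated files that are not tracked by source
         control, so the debugger can step into them -->
    <EmbedUntrackedSources>true</EmbedUntrackedSources>
  </PropertyGroup>
</Project>
```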

Distribution

Since PDBs don’t include the actual source code, we need a way to be able to associate the files. As we saw earlier, the PDB contains a reference to the source files, but not the contents. Although it is possible to embed the contents of all the source files into the PDBs, most companies avoid this for good reason. To make PDBs fully usable, we need a way to link together the PDB and the source code used for the build, allowing us to step directly into that code.

In the next posts, we’ll examine this topic in depth. We’ll cover the original approach (symbol servers) and the modern approach (SourceLink).

Until then, Happy DevOp’ing!