PE File Parser

1. Context

I have been doing a lot of malware analysis recently, but I realize I do not know much about the overall structure of a PE file. It has been really annoying to look up what each component is and where it sits in memory every time I need it.

I have already implemented an ELF parser (similar to Linux’s readelf) for an OS I wrote in Rust, and I have learned a ton about ELF files with that project.

So I have decided to write another parser, and this time I want to learn about the Windows PE file format! I’ll document everything here so I don’t have to look it up on Google later!

Here is a broad view of what a PE file looks like before we dive into each component.

[Figure: overview of the PE file layout]

2. MS-DOS Header

At the very beginning of a PE file is the MS-DOS/Real-Mode Header. This header has been around since version 2 of the MS-DOS operating system.

This header always occupies the first 64 bytes of any PE file, and it contains these components:

Out of all of these, e_lfanew is the most important field for us: it holds the file offset of the next important part of the PE file, the PE file header.
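To make this concrete, here is a minimal Python sketch of reading the MS-DOS header and pulling out e_lfanew. The field order follows IMAGE_DOS_HEADER from winnt.h; the function name read_e_lfanew is just something I made up for illustration.

```python
import struct

# IMAGE_DOS_HEADER, 64 bytes, little-endian:
#   14 WORDs: e_magic, e_cblp, e_cp, e_crlc, e_cparhdr, e_minalloc, e_maxalloc,
#             e_ss, e_sp, e_csum, e_ip, e_cs, e_lfarlc, e_ovno
#    4 WORDs: e_res[4]
#    2 WORDs: e_oemid, e_oeminfo
#   10 WORDs: e_res2[10]
#    1 LONG : e_lfanew (file offset of the PE signature)
DOS_HEADER = struct.Struct("<30Hl")

IMAGE_DOS_SIGNATURE = 0x5A4D  # the bytes "MZ" read as a little-endian WORD


def read_e_lfanew(data: bytes) -> int:
    """Return e_lfanew from the MS-DOS header (hypothetical helper name)."""
    if len(data) < DOS_HEADER.size:
        raise ValueError("file is too small to hold an MS-DOS header")
    fields = DOS_HEADER.unpack_from(data, 0)
    if fields[0] != IMAGE_DOS_SIGNATURE:
        raise ValueError("missing MZ magic")
    return fields[-1]  # e_lfanew is the last field of the struct
```

Since e_lfanew is the last 4 bytes of a 64-byte header, it always lives at file offset 0x3C.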

3. MS-DOS Real-Mode Stub Program

Before getting to the PE file header, there is a stub (a small program) sitting between it and the MS-DOS header.

When the executable is loaded under MS-DOS, this stub runs instead of the actual application.

Usually, this stub does nothing more than output a message saying that the program can’t be run because the OS is not compatible. This keeps the executable backward compatible in a limited sense: it can still be launched under MS-DOS, but instead of executing the real program (or crashing), it notifies the user that their current OS is not supported.

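This stub sits right after the MS-DOS header and is usually only a few dozen bytes long. Its exact size depends on the linker, so the reliable way to skip past it is to follow e_lfanew rather than assume a fixed length.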
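As a small Python sketch, the stub can be carved out as everything between the end of the 64-byte MS-DOS header and the offset in e_lfanew. This is an assumption on my part: some linkers put extra data in that gap, so the slice may contain more than just the stub. The name extract_dos_stub is hypothetical.

```python
import struct

DOS_HEADER_SIZE = 64
E_LFANEW_OFFSET = 0x3C  # e_lfanew is the last 4 bytes of the MS-DOS header


def extract_dos_stub(data: bytes) -> bytes:
    """Return the bytes between the MS-DOS header and the PE signature."""
    (e_lfanew,) = struct.unpack_from("<l", data, E_LFANEW_OFFSET)
    if e_lfanew < DOS_HEADER_SIZE or e_lfanew > len(data):
        raise ValueError("e_lfanew points outside the expected range")
    return data[DOS_HEADER_SIZE:e_lfanew]
```

Searching the returned bytes for the "cannot be run in DOS mode" text is a quick sanity check that the slice really is the stub.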


4. PE file Signature

After the MS-DOS stub, we can find the PE file signature. This is similar to the magic bytes MZ in the MS-DOS header, but here it is the four bytes 50 45 00 00, or PE\0\0. This is IMAGE_NT_SIGNATURE, which has the value 0x00004550 when read as a little-endian DWORD.

This signature is the starting point of the PE file header, and it can be found at the file offset stored in the e_lfanew field of the MS-DOS header.

Starting with Windows and OS/2 executables, files needed this signature to specify which OS they were intended for.
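A quick way to check both magic values is sketched below in Python. The function name has_pe_signature is made up for this post; the offsets are the ones described above.

```python
import struct

E_LFANEW_OFFSET = 0x3C          # location of e_lfanew inside the MS-DOS header
PE_SIGNATURE = b"PE\x00\x00"    # IMAGE_NT_SIGNATURE as it appears on disk


def has_pe_signature(data: bytes) -> bool:
    """Return True if data starts with MZ and e_lfanew points at PE\\0\\0."""
    if len(data) < 64 or data[:2] != b"MZ":
        return False
    (e_lfanew,) = struct.unpack_from("<l", data, E_LFANEW_OFFSET)
    if e_lfanew < 0:
        return False
    # Slicing past the end of the buffer yields a short slice, so this is safe.
    return data[e_lfanew:e_lfanew + 4] == PE_SIGNATURE
```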

5. PE File Header

Right after the PE file signature is the PE file header. This can be located at

    imageBase + imageDOSHeader->e_lfanew + SIZE_OF_NT_SIGNATURE

where SIZE_OF_NT_SIGNATURE is 4 bytes.

The file header is located at this address as a struct of size 20 bytes containing these fields:

A useful entry in the PE file header is the NumberOfSections field. To parse and extract the sections in a PE file, we must know how many section headers and section bodies there are, and this field tells us.
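Here is a minimal Python sketch of unpacking the 20-byte file header. The field names mirror IMAGE_FILE_HEADER from winnt.h; the FileHeader tuple and parse_file_header function are names I picked for illustration.

```python
import struct
from typing import NamedTuple

SIZE_OF_NT_SIGNATURE = 4


class FileHeader(NamedTuple):
    machine: int
    number_of_sections: int
    time_date_stamp: int
    pointer_to_symbol_table: int
    number_of_symbols: int
    size_of_optional_header: int
    characteristics: int


# WORD, WORD, DWORD, DWORD, DWORD, WORD, WORD = 20 bytes
FILE_HEADER = struct.Struct("<HHIIIHH")


def parse_file_header(data: bytes, e_lfanew: int) -> FileHeader:
    """Unpack IMAGE_FILE_HEADER, which sits right after the PE signature."""
    offset = e_lfanew + SIZE_OF_NT_SIGNATURE
    return FileHeader(*FILE_HEADER.unpack_from(data, offset))
```

The size_of_optional_header field is worth keeping around too, since it is needed later to find where the section headers start.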

6. PE file Optional Header

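The Optional Header is a struct right after the File Header. It is typically 224 bytes in a 32-bit (PE32) image and 240 bytes in a 64-bit (PE32+) image, and its actual size is stored in the SizeOfOptionalHeader field of the File Header. It can be located at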

    imageBase + imageDOSHeader->e_lfanew + SIZE_OF_NT_SIGNATURE + sizeof(IMAGE_FILE_HEADER)

The Optional Header contains meaningful info about the executable image, and it is divided into two parts: the standard fields and the NT additional fields.

The standard fields: fields that are related to the Common Object File Format (COFF) used by most UNIX executables.

Windows NT additional fields: fields added to the Windows NT PE file to support much of the Windows NT process behavior:
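The layout of these fields differs slightly between 32-bit and 64-bit images, and the Magic field at the start of the Optional Header says which one we have. Below is a small Python sketch that reads Magic, AddressOfEntryPoint and ImageBase for both variants. The offsets come from IMAGE_OPTIONAL_HEADER32/64; the function name is hypothetical.

```python
import struct

PE32_MAGIC = 0x10B       # IMAGE_NT_OPTIONAL_HDR32_MAGIC
PE32_PLUS_MAGIC = 0x20B  # IMAGE_NT_OPTIONAL_HDR64_MAGIC


def parse_optional_header_basics(opt: bytes) -> dict:
    """Read a few fields from the raw bytes of the Optional Header."""
    (magic,) = struct.unpack_from("<H", opt, 0)
    # AddressOfEntryPoint is at offset 16 in both layouts.
    (entry_point,) = struct.unpack_from("<I", opt, 16)
    if magic == PE32_MAGIC:
        # PE32 has BaseOfData at offset 24, then a 4-byte ImageBase at 28.
        (image_base,) = struct.unpack_from("<I", opt, 28)
    elif magic == PE32_PLUS_MAGIC:
        # PE32+ drops BaseOfData and widens ImageBase to 8 bytes at offset 24.
        (image_base,) = struct.unpack_from("<Q", opt, 24)
    else:
        raise ValueError(f"unexpected optional header magic: {magic:#x}")
    return {"magic": magic, "entry_point": entry_point, "image_base": image_base}
```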

For DataDirectory, these are the directory entries:

Each IMAGE_DATA_DIRECTORY struct contains the size and relative virtual address (RVA) of the directory. To locate a directory, you use its RVA to determine which section the directory is in. Once you have found the section containing that directory, its section header can be used to find the exact file offset of the data directory.
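As a sketch, the DataDirectory array can be read in Python like this. The 16 entry names follow the IMAGE_DIRECTORY_ENTRY_* constants in winnt.h (with the prefix dropped), and the offsets of NumberOfRvaAndSizes are 92 for PE32 and 108 for PE32+. The function name and the dict it returns are my own choices for illustration.

```python
import struct

DIRECTORY_NAMES = [
    "EXPORT", "IMPORT", "RESOURCE", "EXCEPTION", "SECURITY", "BASERELOC",
    "DEBUG", "ARCHITECTURE", "GLOBALPTR", "TLS", "LOAD_CONFIG",
    "BOUND_IMPORT", "IAT", "DELAY_IMPORT", "COM_DESCRIPTOR", "RESERVED",
]

# Offset of NumberOfRvaAndSizes inside the Optional Header, keyed by Magic.
COUNT_OFFSET = {0x10B: 92, 0x20B: 108}


def parse_data_directories(opt: bytes) -> dict:
    """Map directory name -> (rva, size) from the raw Optional Header bytes."""
    (magic,) = struct.unpack_from("<H", opt, 0)
    if magic not in COUNT_OFFSET:
        raise ValueError(f"unexpected optional header magic: {magic:#x}")
    count_offset = COUNT_OFFSET[magic]
    (count,) = struct.unpack_from("<I", opt, count_offset)
    count = min(count, len(DIRECTORY_NAMES))  # never trust the count blindly
    first_entry = count_offset + 4            # the array follows the count
    directories = {}
    for i in range(count):
        # Each IMAGE_DATA_DIRECTORY is two DWORDs: VirtualAddress, Size.
        rva, size = struct.unpack_from("<II", opt, first_entry + 8 * i)
        directories[DIRECTORY_NAMES[i]] = (rva, size)
    return directories
```

An entry of (0, 0) means the file does not have that directory.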

7. Sections

Below the Optional Header is the section table, followed by the PE file sections themselves. Each section contains a part of the content of the file, such as code, data, or resources.

Each section has a header storing information pointing to a body, and the body stores the raw data of that specific section.

The section headers are right after the Optional Header, and each header is 40 bytes with no padding in between. In Windows, the struct for this is IMAGE_SECTION_HEADER.

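Since no particular section is guaranteed to sit at a particular index (.text does not have to be the first header, for example), we locate a section by its name or by the address range it covers rather than by a fixed position.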
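Here is a minimal Python sketch of reading the section table and looking a section up by name. The struct layout follows IMAGE_SECTION_HEADER; the names SectionHeader, parse_section_headers and find_section are made up for this post. The table offset is assumed to be e_lfanew + 4 + 20 + SizeOfOptionalHeader, and count is NumberOfSections.

```python
import struct
from typing import List, NamedTuple, Optional


class SectionHeader(NamedTuple):
    name: str
    virtual_size: int
    virtual_address: int
    size_of_raw_data: int
    pointer_to_raw_data: int
    characteristics: int


# Name[8], VirtualSize, VirtualAddress, SizeOfRawData, PointerToRawData,
# PointerToRelocations, PointerToLinenumbers, NumberOfRelocations (WORD),
# NumberOfLinenumbers (WORD), Characteristics = 40 bytes
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")


def parse_section_headers(data: bytes, table_offset: int, count: int) -> List[SectionHeader]:
    """Read `count` back-to-back section headers starting at table_offset."""
    sections = []
    for i in range(count):
        (raw_name, vsize, vaddr, raw_size, raw_ptr,
         _reloc_ptr, _line_ptr, _n_reloc, _n_line, chars) = SECTION_HEADER.unpack_from(
            data, table_offset + i * SECTION_HEADER.size)
        # Names are NUL-padded to 8 bytes and not necessarily NUL-terminated.
        name = raw_name.rstrip(b"\x00").decode("ascii", errors="replace")
        sections.append(SectionHeader(name, vsize, vaddr, raw_size, raw_ptr, chars))
    return sections


def find_section(sections: List[SectionHeader], name: str) -> Optional[SectionHeader]:
    """Return the first section with the given name, or None."""
    return next((s for s in sections if s.name == name), None)
```

Keep in mind that section names are just labels, so a packed or malicious file can call its sections anything it likes.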

8. IAT - Import Address Table

Usually when I perform malware analysis and reverse engineering, I tend to care about the IAT because it tells me which functions the executable requires from each DLL. The loader needs this information to fill in the jmp thunk table so the program can make API calls.

Let’s assume we don’t know where the IAT is in the image yet. The way to find it is pretty simple.

First, we need to find the data directory corresponding to the imported functions. There are two candidates: IMAGE_DIRECTORY_ENTRY_IAT and IMAGE_DIRECTORY_ENTRY_IMPORT. From what I understand, the two are related but not the same: IMAGE_DIRECTORY_ENTRY_IMPORT points to the array of IMAGE_IMPORT_DESCRIPTOR structs, while IMAGE_DIRECTORY_ENTRY_IAT points to the table of addresses that the loader fills in. IMAGE_DIRECTORY_ENTRY_IAT is also much less well-documented, so we will be using IMAGE_DIRECTORY_ENTRY_IMPORT for this.

Second, we need to get the relative virtual address (the VirtualAddress field) of that directory entry. We’ll call this importVA.

Third, we check to see which section the data of this directory entry will be in. This is a simple math check. Let’s call the section’s virtual address sectionVA, and its virtual size sectionVSize.

Then,

  importVA >= sectionVA and importVA < sectionVA + sectionVSize

tells us whether the data of the directory entry is contained in that section.

Now that we have found the section, we can get the address of the first IMAGE_IMPORT_DESCRIPTOR struct in the file. These structs contain information about the imported functions. section.PointerToRawData is an offset from the start of the file, so the address is:

  pFirstImportDescriptor = baseAddress + section.PointerToRawData + (importVA - sectionVA)

where baseAddress is the address of the buffer the file was read into.
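The section lookup and the address math can be combined into one helper. This is a minimal Python sketch that works with file offsets instead of pointers; the Section tuple and rva_to_file_offset are hypothetical names, and the check uses the section's virtual size as described above.

```python
from typing import NamedTuple, Optional, Sequence


class Section(NamedTuple):
    name: str
    virtual_address: int      # sectionVA
    virtual_size: int         # sectionVSize
    pointer_to_raw_data: int  # file offset of the section body


def rva_to_file_offset(rva: int, sections: Sequence[Section]) -> Optional[int]:
    """Translate an RVA into a file offset, or None if no section contains it."""
    for s in sections:
        # Half-open range: the section's own start address counts as inside.
        if s.virtual_address <= rva < s.virtual_address + s.virtual_size:
            return s.pointer_to_raw_data + (rva - s.virtual_address)
    return None
```

For example, with a section at RVA 0x3000 whose body starts at file offset 0xC00, an importVA of 0x3010 maps to file offset 0xC10.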

After that, we can loop from this address, filling in one IMAGE_IMPORT_DESCRIPTOR struct at a time to read information about the imported functions. The array ends with a descriptor whose fields are all zero.
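Here is a rough Python sketch of that loop, stopping at the all-zero descriptor and collecting the DLL name from each entry. The struct layout follows IMAGE_IMPORT_DESCRIPTOR; list_imported_dlls and read_c_string are made-up names, and rva_to_offset is assumed to be any callable that maps an RVA to a file offset using the section headers.

```python
import struct
from typing import Callable, List

# OriginalFirstThunk, TimeDateStamp, ForwarderChain, Name, FirstThunk (5 DWORDs)
IMPORT_DESCRIPTOR = struct.Struct("<IIIII")


def read_c_string(data: bytes, offset: int) -> str:
    """Read a NUL-terminated ASCII string starting at offset."""
    end = data.index(b"\x00", offset)
    return data[offset:end].decode("ascii", errors="replace")


def list_imported_dlls(data: bytes, first_descriptor_offset: int,
                       rva_to_offset: Callable[[int], int]) -> List[str]:
    """Walk the import descriptor array and return the imported DLL names."""
    names = []
    offset = first_descriptor_offset
    while offset + IMPORT_DESCRIPTOR.size <= len(data):
        fields = IMPORT_DESCRIPTOR.unpack_from(data, offset)
        if not any(fields):
            break  # an all-zero descriptor terminates the array
        name_rva = fields[3]  # the Name field is an RVA to the DLL name string
        names.append(read_c_string(data, rva_to_offset(name_rva)))
        offset += IMPORT_DESCRIPTOR.size
    return names
```

The OriginalFirstThunk and FirstThunk fields are RVAs as well, and following them in the same way leads to the individual function names and the address table entries.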

9. Wrapping up

This is just a quick note on how to parse and understand the different parts of a PE file.

I have certainly not covered everything there is about this topic, but I have learned a ton of new things about this file type while writing this blog post!

Feel free to check out the PE parser I wrote here if you need any clarification. I was working on it as I was writing this blog post!