Stories
Slash Boxes
Comments

SoylentNews is people

posted by NCommander on Tuesday September 20 2016, @01:00PM   Printer-friendly
from the now-you-can-be-1337-by-knowing-what-a-far-call-is dept.

The Retro-Malware series is an experiment on original content for SoylentNews, written in the hopes to motivate people to subscribe to the site and help grow our resources. The previous article talked a bit about the programming environment imposed by DOS and 16-bit Intel segmented programming; it should be read before this one.

Before we get into this installment, I do want to apologize for the delay into getting this article up. A semi-unexpected cross-country drive combined with a distinct lack of surviving programming documentation has made getting this article written up take far longer than expected. Picking up from where we were before, today we're going to look into Terminate-and-Stay Resident programming, interrupt chaining, and get our first taste of how DOS handles conventional memory. Full annotated code and binaries are available here in the retromalware git repo.

In This Article

  • What Are TSRs
  • Interrupt Handlers And Chaining
  • Calling Conventions
  • Walking through an example TSR
  • Help Wanted

As usual, check past the break for more. In addition, if you are a licensed ham operator or have ham radio equipment, I could use your help, check the details at the end of this article.

[Continues...]

What Are TSRs?

For anyone who used DOS regularly, TSRs (short for Terminate and Stay Resident) were likely a source of both fun and frustration. Originally appearing in DOS 2.0, TSRs, as the name suggests, are programs that exit but leave some part of their code around in memory. TSRs are primarily used to provide device drivers, extended APIs, or hooks that other applications can take advantage of. At the same time, they also could be used (as we will be doing) to install invisible hooks to modify, change, or log system behaviors. In that sense, they can be considered broadly equivalent to extensions on classic Mac OS. The BIOS could be considered a special type of TSR as it's always available in memory to provide services to the operating system and applications.

From a technical perspective, a TSR is any application that executes int 21h with the right options. Ralph Brown's interrupt guide has this to say on DOS's API for TSRs:

DOS 2+ - TERMINATE AND STAY RESIDENT

AH = 31h
AL = return code
DX = number of paragraphs to keep resident

Return:
Never

Notes: The value in DX only affects the memory block containing the PSP; additional
memory allocated via AH=48h is not affected. The minimum number of paragraphs
which will remain resident is 11h for DOS 2.x and 06h for DOS 3.0+. Most TSRs can
save some memory by releasing their environment block before terminating
(see #01378 at AH=26h,AH=49h). Any open files remain open, so one should
close any files which will not be used before going resident; to access a file
which is left open from the TSR, one must switch PSP segments first (see AH=50h)

Well, for most people, I suspect that is as clear as mud. Let me try and explain it a bit better. Essentially, when a program flags to DOS that it wants to TSR, DOS simply leaves the amount of memory marked in DX alone, and marks those paragraphs (which are 16 bytes) as 'in use' so neither it nor any other well-behaving programs will attempt to use them. No relocation or copying is done as part of this process; the memory is simply marked dead, and left 'as is' as we'll see below.

This is problematic for a number of reasons. As I mentioned in the previous article, when in real mode, Intel processors can only access up to 1 MiB of memory, and it's the area where all applications, drives and device address space needs to squeeze into. Of this, only 640 kiB are normally available to applications (which is known as conventional memory). If a TSR is too large, or too many are loaded, its is very easy to run out of RAM to do anything useful with the machine. To make matters worse, DOS provides absolutely no mechanism to manage or uninstall TSRs. (Once an application is resident, it's staying there unless it is specifically designed to unhook itself and free itself from memory.) Combine that with the fact that there's no 'official' way of doing so in the DOS APIs.

This quickly lead to an era where you might need a specific boot floppy for a given application so as to have its TSRs available (such as mouse or network drivers) — and nothing else — so that there would be enough conventional memory left to fit everything in. While several third-party efforts tried to standardize TSR installation/removal — such as TesSeRact — none of them became a true de-facto standard. Furthermore, it is very possible for a TSR removal to leave memory in a fragmented state which could break other applications. An entire cottage industry of memory optimizers quickly sprang up which could load TSRs into high memory.

At this point, you may be wondering "If TSRs are so miserable, why use them?". The answer, unfortunately, is that it is the only way on DOS to provide any sort of extended functionality. DOS has no concept of shared libraries or multitasking; it was TSR or bust. This brings us to our next topic: interrupt handling.

Interrupt Handlers

While I touched on interrupts in the previous article, I didn't go into too much detail. Interrupts, simply put, are special signals sent to the processor to tell it to stop what it's doing and do something else immediately. These interrupts can be generated by either hardware or software. Interrupts essentially operate like this:

  • Processor is doing work.
  • Interrupt occurs
  • Processor saves location and jumps to interrupt handler
  • Interrupt handler runs
  • Interrupt handler finishes, and returns to the original task

When an interrupt occurs, the processor looks at the Interrupt Vector Table (IVT) located at 0x0 to determine where it needs to jump to handle that interrupt. The function that handles an interrupt is known as an Interrupt Service Routine (ISR). Assuming there is a valid handler address in the IVT, the processor does a far call to the IVT and immediately continues execution. A 'bare bones' interrupt handler looks something like this:

previous_hook_offset: dw 0
previous_hook_segment: dw 0

hook:
	; When we come into an interrupt, only the
	; code segment and instruction pointer are preserved
	; for us. It's the responsibility of the handler to
	; preserve this information.

	; This is saved on the application's local stack, which is fine
	; for now (FreeDOS does the same thing internally) as long as
	; we're not putting any large items on it. We'll look at setting
	; up a local stack later.

	pushf ; Save flags
	pusha ; push all general registers to the stack

	; Setup segments
	push ds
	push es

	; For interrupt handlers, CS=DS normally, and SS either points at:
	; the application stack (aka, whatever was running before we were)
	; or at a local stack setup by the TSR.
	; On x86, it's not possible to directly copy from one segment register
	; to another, so we'll use AX as a scratch:
	mov ax, cs
	mov ds, ax
	mov es, ax

	; Let's add a "hello world" hook:

	; NOTE: Normally it's a bad idea to call DOS interrupts in a TSR
	; because DOS itself is not re-entrant. However, as in this example,
	; we've hooked the unused 0x66, which DOS does not call out of the box,
	; which means we'll never be in this ISR while we're in DOS.
	; If this were real code, we would have to check the INDOS flag for sanity.
	mov ah, 9
	mov dx, hello_world_str
	int 0x21

	; DOS compatability "quirk?". On DOSBox (which I initially tested this on)
	; there's a default entry in the IVT for all interrupts in F000:xxxx.
	; Documentation suggestions that this is also the default behavior
	; for MS-DOS though I can't confirm it.
	;
	; FreeDOS, on the other hand, leaves unused INTs initialized to 0000:0000
	; so blindly far calling it causes a fault. So we need to check if the
	; segment is 0000, and skip chaining if that's the case

	cmp word [previous_hook_segment], 0x0000
	je skip_chain

	; Chain to other TSRs
	pushf ; pushf is required because iret expects to pop flags
	call far [previous_hook_offset]

	skip_chain:

	; We're done, restore to previous state
	pop es
	pop ds
	popa
	popf

	; To return from an interrupt, we use the special iret instruction
	iret

Quite a bit of code for not doing much. As the code comments explain, the interrupt handler has to preserve any information in the registers it wants to use. For this example, we just save everything with a pushf instruction followed by pusha instruction, which puts the FLAGS register followed by all the general purpose registers (AX-DX, SI, DP, BP, SP) on the stack. Preserving flags in an ISR is extremely important since FLAGS is where things like comparison results are stored; if you corrupt FLAGS, it's completely possible that an application evaluates an "if" statement the wrong way and becomes a source of hard-to-impossible-to-find bugs.

ISRs are somewhat notorious in that they appear deceptively easy to code, and absolutely disastrous if you get it wrong. One of the major things to be aware of is that it's possible for an interrupt to be interrupted. For example, if your interrupt handler is running, and someone taps on the keyboard, the keyboard handler will preempt you. Depending on what you're doing, this might not be a problem, or it could "lock up" the computer. ISRs can turn interrupts on and off with the sti/cli instructions, but an all-too-common bug is forgetting to turn interrupts back on. Raymond Chen, a developer at Microsoft, wrote an entire chapter in his book "The Old New Thing" dedicated to the things that stupid applications do that Windows had to patch around — such as forgetting how to handle interrupts.

The second consequence is that ISRs should be reentrant. For those who are not hugely familiar with computer programming, reentrancy is the ability for a subroutine to be interrupted, then called again safely. For example, if you're listening to keyboard events, it's possible that two events can come at the same time and the second event preempts the first one. Bad Things(tm) happen if you have non-reentrant ISRs. The only reason this is a 'should' vs. a 'must' is that DOS itself is not reentrant; as the comment explains, you can't safely call a DOS interrupt from an ISR. DOS provides a special global flag known as INDOS to let callers know if it's safe to make an interrupt check; it was excluded above for brevity and because we used an unused interrupt.

The final common pitfall for DOS-based ISRs is it is possible for multiple TSRs to hook the same interrupt. For example, App A and App B can both decide they want the same interrupt. Depending on the application, it may chain interrupts down, or it may claim an interrupt entirely for itself. This can lead to infuriatingly complicated issues to debug if the other TSR is not well-behaved. Microsoft and IBM eventually provided built-in TSR multiplexing in DOS in the form of int 2F, but the API is extremely difficult to use and failed to solve many of the inherent issues.

The Stack and Calling Conventions

Let's take a momentary digression from TSRs to look at how functions work and how they interact with the stack. From an instruction perspective, Intel processors provide a "call" opcode which pushes the current instruction pointer to the stack, and then unconditionally jumps to a given location. It doesn't, however, define the behavior of how arguments are passed or the management of the stack. As such, developers have created conventions to specify how the stack and arguments should be passed from one function to another.

For non-programmers, the stack can be considered to be a "working space" where a program can store local variables and temporary information such as result values. In contrast to the heap, stacks are relatively small, and are essentially localized to a given function. For historical reasons, the stack grows 'down' from upper memory addresses to lower memory addresses. The stack pointer SP always points to the top of the stack. When information is pushed to the stack with a "push" operation, the value saved is stored in memory to the location pointed at SP and the register itself is decremented by the size. In contrast, deleting an item from the stack simply increments SP allowing new information to override the old.

For example, let's assume we have a C function with the following prototype:

// By default, most C compilers use the CDECL calling convention on x86
int example(int a, int b) {
	// We'll do stuff here
	return a+b;
}

Unlike most architectures, x86 defines multiple types of calling conventions. Of these, the most common are stdcall (used primarily by Windows), and cdecl (C Declaration). For the code I write, I'm sticking to the cdecl convention for my own sanity. cdecl is what's known as a "caller-based" convention, which means the calling function is responsible for cleaning up the stack at the end of a function. Here's what the calling code looks like in assembly:

example_call:
	mov ax, 4
	mov bx, 5

	; Arguments go in left to right
	push ax
	push bx

	; Under CDECL, names are decorated with a _ to indicate
	; they're a function, so example becomes _example
	call _example

	; Now we need to clean the stack up
	add sp, 4

	; The return value (9) comes back in ax
	; all other registers are smashed (aka their
	; values are not preserved into or out of the
	; function)

Fairly straight forward, right? Let's look at how this function might be implemented so we can discuss the base pointer (BP) as well. Here's what _example looks like:

_example:
	; Setup stack frame
	push bp
	mov bp, sp

	; The stack now has the following layout
	; bp[+2] stack frame
	; bp[+4] int a
	; bp[+6] int b

	; Move values from the stack to registers
	mov ax, [bp+4]
	mov bx, [bp+6]
	add ax, bx

	pop bp
	ret

BP, or the base pointer, can be considered a reference point for where each function begins and ends. Whenever we enter or leave a function, the base pointer forms the base of the stack for that function (hence the name). These reference points are known as stack frames, and since every function copies SP (which always points to the top of the stack) to BP, you can always tell where you are relative to other functions. Debuggers, for example, walk the stack to determine where they currently are by comparing the values of BP to known offsets.

Near and Far Calls

Before we leave the topic of calling conventions, the final point to bring up are near and far calls. In the previous article, I discussed that 16-bit processors can only reference up to 64 kilobytes of memory directly at any given time. As such, if you need to reference code or data outside that 64k window, you need to change the segment so it's pointing in the right location.

For functions, code that's within the same segment is known as a near call. Near calls are equivalent to normal function calls on most other architectures. Far calls in contrast include the required segment, and load CS as part of the function call. Far calls are made by the "call far" instruction, and require the called function to use the "retf" instruction to indicate they need to return far. Far calls have a fairly high performance hit due to the segment change, and thus should be limited as much as possible

In the previous interrupt handler example code, we saw that we had to do a far call to chain to previous TSRs. The reason for this is that interrupt service handling is essentially a special case of a far call; the processor has to change to the ISR's segment in memory. When we chain to another interrupt handler, we have to do the same thing. If you're still confused, the following example will clear things up.

TSRs In Action

So now that we have the basis of TSRs in our heads, let's look at how they're managed and installed by the operating system. To do that, we need an actual DOS installation. While TSRs do work in DOSbox, DOSbox has some unusual quirks with its environment that make it not 100% accurate to actual DOS (for example, all interrupts have an installed default handler; FreeDOS at least does not do this).

Installing FreeDOS

Fortunately for free software, FreeDOS exists which is a (mostly) compatible free software re-implementation of DOS 5. Installation is pretty much identical to what DOS 5 would have been like if it was shipped on a CD vs. floppy disks

*

The CD is bootable, and starting it up in VirtualBox brings up this boot menu.

*

The installer offers to start FDISK to create a boot partition. Users of MS-DOS FDISK should find this more or less identical to the standard FDISK.COM

* *

After which DOS installation takes a few minutes, and then promptly crashes. For reasons I can't figure out, the included JemmEx memory extender refuses to work under VirtualBox. Fortunately, EMM386 is happy to do the job, and after a quick reboot, I get dumped to C:\

After configuring WatTCP, and firing up the built in FTP server, I can copy my TSRs over without issue. Of course, given that DOS uses VESA graphics, I can't copy and paste. Fortunately for my sanity, FreeDOS (and MS DOS) support redirecting the terminal with the CTTY command. After a little bit of fiddling with VirtualBox's settings, I get this:

*

Copy and paste for the win. Anyway, now that we have a decent to use testing environment, let's get into the practical aspect of this.

DOS Memory Layout

After doing a clean reboot of the system, FreeDOS reports the following as its memory usage:

C:\>mem

Memory Type       Total       Used     Free
--------------- ---------  -------- --------
Conventional         639K       50K     589K
Upper                 36K       31K       5K
Reserved             349K      349K       0K
Extended (XMS)    31,680K    5,626K  26,054K
---------------- --------  -------- --------
Total memory      32,704K    6,056K  26,648K

Total under 1 MB     675K       81K     594K

Total Expanded (EMS) 31M (32,571,392 bytes)
Free Expanded (EMS)  25M (26,705,920 bytes)

Largest executable program size   589K (602,672 bytes)
Largest free upper memory block     4K ( 4,096 bytes)
FreeDOS is resident in the high memory area.
C:\>

Lots of numbers, right? We'll do a more in-depth article about the types of memory, but let's do a brief primer here so that the output can be understood. Let's break these down step by step

Conventional

Conventional memory is what applications in DOS generally have available and refers to the lower 640k of the 1 MiB address space. Anything operating in real mode has to fit in this memory area. FreeDOS reports a total of 639k because a very small chunk of RAM at 0x0000 has to be reserved for the processor's interrupt tables, as well as a small part of COMMAND.COM that has to stay resident at all times to aid things like LOADALL. On this specific system, I have a few TSRs already installed to provide network services which is why a 50k block of conventional memory is already used.

Upper/Reserved

Above the 640k line is what's referred to as the "upper memory area", or UMA and is reserved by DOS. The upper memory area also has things like the monochrome and VGA memory buffers, as well as option ROMs, the DOS kernel, and the BIOS shadow map. Normally, this region of memory shouldn't be used by applications, but due to the fact that conventional memory can get very crowded, on most systems there are small but usable sections of memory in these areas, known as UMA blocks. A memory manager can determine which blocks are safe to use, and load applications or data into these chunks, a process known as "loading high". When we get into hiding our TSR, use of upper memory will become very important

Extended Memory

Memory that exists above 1M+64k (that 64k is special, see below), and cannot be directly accessed by real mode. Because neither DOS nor the BIOS can operate in 32-bit/protected mode, and that the 80286 processor could not easily switch from protected mode to real mode, accessing memory above the 1 MiB barrier required various amounts of trickery. Extended memory can extend up from 1 MiB to 4 GiB (which is the architectural limit of 32-bit processors). Accessing extended memory either requires entering protected mode, tricking the processor into unreal mode (which on the 80286 required the LOADALL instruction to put the processor in an invalid state), or using a BIOS service which did one of the previous two options to exchange blocks with conventional memory.

High Memory Area (not shown)

One important line to look at is "FreeDOS is resident in the high memory area." I've stated multiple times that 1 MiB is the limit of what Intel processors can address. As it turns out, this is only a partial truth. Remember that addressing in real mode is done in the form of segment:offset. So what happens if I load a segment value of FFFF?. Well it turns out we can address an additional 64 kilobytes of RAM beyond the 1 MiB barrier. This is known as the high memory area.

Due to many quirks related to the abomination known as A20 (which will get an entire section in the next article), the high memory area requires special rules and methods to access. The short version is that unless you have a memory manager, or are willing to manipulate the A20 line directly (which is dangerous), the HMA is not usable by general applications. We'll look more at this in a future article.

TSR Loading

So with that all out of the way, let's look at how a TSR is loaded. In the github repository, there's an example TSR known as tsr_example which, when loading, prints out the segment registers and the segment:offset of the next hook in memory. It's combined with a "callhook" program that simply runs int 0x66 to invoke it. So let's load it and see what happens:

C:\> tsr_demo
DOS loaded the COM with this:
CS: 0C9C
DS: 0C9C
SS: 0C9C
 

When our TSR is loaded, it reads and dumps out the segment registers, showing DOS loaded us at 0C9C. For COM files (or any executable that is 'tiny'), CS=DS=SS. When DOS loads a COM executable, the entire thing is copied into memory, CS/DS/SS are set to the execution point, and the process far calls to CS:0100 to begin execution. If we check our memory usage, we can see that it has dropped:

C:\>mem

Memory Type         Total     Used     Free
---------------- -------- -------- --------
Conventional         639K     115K     524K

NOTE: It shouldn't be using 50 kiB of RAM per run; the binary is only 324 bytes! I think I'm calculating the paragraphs-to-preserve number wrong, but I didn't get a chance to fix it by time this article went up. If someone wants to look at the code, check tsr_examine.asm; the TSR int call is at the very bottom of the file and based off example code I found elsewhere.

If we run callhook, we can determine that our TSR in fact installed successfully, and the previous hook is at 0000:0000 (which is skipped over).

C:\>callhook
CS: 0C9C
DS: 0C9C
SS: 1CB6
Previous hook is at 0000:0000

Note that SS is different. When a TSR is invoked (in this case by doing int 0x66), it inherits the running state of whatever application that was running at the time. It's the responsibility of the TSR to put the stack back the way it found it when it exits, else you'll cause random corruption in userspace applications.

Now lets look at see what happens if we invoke our TSR multiple times:

C:\>tsr_demo
DOS loaded the COM with this:
CS: 1CB6
DS: 1CB6
SS: 1CB6
C:\>tsr_demo
DOS loaded the COM with this:
CS: 2CD0
DS: 2CD0
SS: 2CD0
C:\>tsr_demo
DOS loaded the COM with this:
CS: 3CEA
DS: 3CEA
SS: 3CEA

With each load, we're loading higher in memory. DOS does not automatically rebase or relocate TSRs; they stay at whatever memory segment they were in when they terminated. As DOS automatically loads COM files as low as possible, each run is loaded at the next "available" section of RAM. Calling mem shows that our available conventional memory has dropped

C:\>mem

Memory Type Total Used Free
---------------- -------- -------- --------
Conventional 639K 308K 331K

So what now happens if we run callhook?

C:\>callhook
CS: 3CEA
DS: 3CEA
SS: 4D04
Previous hook is at 2CD0:0103
CS: 2CD0
DS: 2CD0
SS: 4D04
Previous hook is at 1CB6:0103
CS: 1CB6
DS: 1CB6
SS: 4D04
Previous hook is at 0C9C:0103
CS: 0C9C
DS: 0C9C
SS: 4D04
Previous hook is at 0000:0000

We chain through each version of the TSR, easily visible by CS/DS changing as we go upwards until we reach the 'stop' at 0000:0000. At this point, I think we have a fairly good grasp on how TSRs work in practice, what DOS gives us, and how interrupts work more in-depth. At this point, this article is already past the 4k word mark, so I'm going to cut this off here before the editors stage a revolution. So let me close this off with the fact I need some help with the community.

Help Wanted

As I mentioned in Part 1, for getting the keylogged data out of the system, I'm interested in using a non-TCP/IP based protocol. Up until the mid-90s, IPX and NetBIOS-only networks were still relatively common, and it wasn't until the domination of the 'modern' internet that TCP/IP became ubiquitous. After considerable amounts of research, I've decided that fitting in the theme of 'unusual yet neat', I'd like to extract the data out using AX.25 and ham radio equipment. The other alternative I may do is using IPX, as I found the original DOOM source code actually has a complete IPX driver on it. As of right now, I'm somewhat torn between doing this with AX.25 or IPX. The thing is though, I'm going to need some help to make AX.25-based keylogger a reality.

The use of standard radio would allow the keylogger to work on air-gapped computers and show how a potential exfiltration of data might have been done in environments predating TCP/IP. It would be fairly easy to modify a standard PC to hide a 2m or 70cm transmitter within the case and connect to it via I/O lines in an early form of the NSA's current Tailored Access Operations. It would also mean that the keylogger itself would be fairly useless for use in real-life which aids the goal of preventing proliferation of attack tools.

The problem is right now, I have a serious lack of equipment. While I'm a licensed ham in the United States (KD2JRT/Technician), the only equipment I have are two Baofeng UV-82s. What I need is to figure out a decent way to handle getting data broadcasted. I know it's at least theoretically possible to build a cable to hook the Baofengs up to a computer's mic/sound in, and use a software TNC (Terminal-Node Connector) to do AX.25. By doing so, I could simply connect the TNC to VirtualBox's serial port emulation, and “blamo”, AX.25 for DOS.

What I need from the community is two-fold:

  • Experience with doing AX.25 data in real life
  • Help building the necessary cables *or* loaning radio equipment with hardware TNCs on it

I'm currently in New York City for the foreseeable future. I could potentially build cables for my Baofengs myself but I don't currently have a soldering iron, and my living situation makes it rather difficult to do electronics work here. Depending on pricing, I can probably cover shipping and handling, or compensate out-of-pocket work done by a community member. If you're interested in helping, post a comment or send me an email (mcasadevall@soylentnews.org), and I'll be in touch.

Finally, if you've enjoyed this article, please consider subscribing or gifting a subscription. No account is required, as you can anonymously gift subscriptions to my alt-account, mcasadevall (6). I'm hoping we can raise enough money to fully pay off the stakeholders of the site, and perhaps get a small budget together to let me dedicate more time to content like this, or buy equipment to explore more obscure pieces of hardware (i.e., digging into doing some INIT coding on classic Mac OS, or something of that nature). I'd like to give thanks to all those subscribed, including jimtheowl after the previous article.

And with that, 73 de NCommander!

 
This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2) by LoRdTAW on Tuesday September 20 2016, @02:53PM

    by LoRdTAW (3755) on Tuesday September 20 2016, @02:53PM (#404259) Journal

    Bank switching was very useful on the old 8 bit systems which were limited by 16 bits of address. With x86 and flat 32bit memory spaces of ARM, It's not really useful as you have 4 or more GB of address space.

    The more interesting bank switching setup I seen was in a JK laser with a single 256k EPROM bank switched through an old Altera PLD with a 6809 CPU. Just 8k of RAM. Most of the extra space was for messages in multiple languages.

    Starting Score:    1  point
    Karma-Bonus Modifier   +1  

    Total Score:   2  
  • (Score: 1, Funny) by Anonymous Coward on Tuesday September 20 2016, @03:42PM

    by Anonymous Coward on Tuesday September 20 2016, @03:42PM (#404280)

    google "Intel PAE". This stuff is like Polio and Smallpox. Just when you think the world has seen its last of 'em...

    • (Score: 0) by Anonymous Coward on Wednesday September 21 2016, @01:36AM

      by Anonymous Coward on Wednesday September 21 2016, @01:36AM (#404613)

      Required for virtualization on the RPi 2/3 at least and probably any 32 bit arm processor.

      I imagine similiar is used for aarch64.

      As to 36bit PAE on Intel: The fucked up part is that they wasted all those other address bits (36 bit PAE used a 48 bit address space. And even x86_64 wastes everything past the 52nd or 53rd address bit for things other than addressing more memory!) The marketing names of addressing ranges rarely tell you their actual capabilties or utility. That said, PAE on intel provided an important interim solution before the jump to 64 bit, and during an era where the only systems using it would have been multinode NUMA boxes with a maximum of 4-8 gig to a cpu cluster (of 1-4 chips. 440fx/450gx boards usually maxxed out at 1 gigabyte whether 72 pin SIMMs or 168 pin DIMMs. The AMI bios available at the time only had digits for up to 8 gigs of RAM, and outside of clusters the standard RAM boards were only designed for 1 gig for 440FX (sharing that limitation with the later BX) or 8 gig for the 450GX (which practically speaking never had near that. The stock memory boards for revision 1 systems (which have a LOT of trace wires added for early design flaws) only had 1 gigabyte of commonly available slots on them (assuming 32 meg unregistered ecc/parity simms), with a maximum of 2-4 gigs if they supported addressing modes for registered or denser memory (While I am sure the 450 supported registered memory, I have no idea what the real supported memory for that generation was. The 440fx in comparison was limited to 8x128 if you could find 128M EDO dimms.