Stories
Slash Boxes
Comments

SoylentNews is people

posted by NCommander on Monday March 06 2017, @05:00PM   Printer-friendly
from the adventures-in-data-recovery-and-30-year-old-bugs dept.

One of my favorite hobbies is both retrocomputing projects, and the slightly more arcane field of software archeology; the process of restoring and understanding lost or damaged pieces of history. Quite a while ago, I came across the excellent OS/2 Museum, run by Michal Necasek which helps categorize many of the more obscure bits of PC history, including a series of articles about Xenix, Microsoft’s version of SVR2 UNIX.

What caught my attention were two articles talking about Xenix 386 2.2.3c, a virtually undocumented release that isn’t mentioned in much if any of the Santa Cruz Operation's (SCO, but see footnote) surviving literature of the time. Michal documented [1], [2] his efforts at the time, and ultimately concluded that the media was badly corrupted. Not knowing when to give up, I decided to give it a try and see if anything could be salvaged. As of this writing, and working with Michal, we’ve managed to achieve approximately 98% restoration of the product as it would have existed at the time.

Xenix 386 booted with uname

I’m going to write up the rather long and interesting quest of rebuilding this piece of history. I apologize in advance about the images in this article, but we only recently got serial functionality working again, and even then, early boot and installation has to be done over the console.

* - SCO in this case refers to the original Santa Cruz Operation, and not the later SCO Group who bought the name and started the SCO/Linux lawsuits.

Read more past the fold.

Historical Background

From a historical perspective, Xenix is interesting as it was one of the first (if not the first) operating systems to take advantage of Protected Mode on the iAPX 80286 without being hamstrung by lack of backwards compatibility. I’ve talked about the 286 before on SoylentNews, but to summarize, the 80286 was the first processor with Protected Mode. However, it didn’t support paging, and the switch from real mode (8086 compatibility) to protected mode was one way; there was no official way to return to real mode without restarting the processor, and neither DOS, nor BIOS could operate in Protected Mode. To my knowledge, it was the only operating system to adopt the view that a system would enter protected mode, and never return to 16-bit compatibility. As such, it’s implementation of protected mode is somewhat different than most people are familiar with.

Instead, the 80286 was intended to allow running legacy DOS applications in real mode, while people would upgrade to new protected mode operating systems and software. The much loathed real mode segmentation system was revamped as well due to the new 32-bit register size, and it was now possible to have segments up to 16 MiB (a tremendous amount of memory at the time) in size, allowing applications to operate with a de-facto flat memory model.

Correction: Wow, I went wrong here. The 80286's protected mode allowed segments to reside in a 24-bit address space, but were limited to 16-bits (64k) in length. The 80386 changed the rules to allow larger segment size by using additional fields on the LDT and GDT to extend the base, limit and a size modifer.

Additionally, Xenix was one of the most polished, and featureful UNIX systems of it's time. Out of the box, the system was the originator of virtual terminals, and supported both UUCP networking, and RS-232 serial based MicNet, and bridging between the two. MicNet appears to have been Microsoft's answer to AppleTalk as a very low cost networking solution, and allowed multiple systems to appear as one single UUCP node on the bang path. We'll explore both these features in later articles.

For software installation, Xenix's "custom" utility provided full featured package management, installation and removal, and even allowed per-file selection, relatively on par with modern Linux package management. Beside the stock operating system, Xenix had official add-on packages for international support, K&R based C compilers for DOS and Xenix, and a text processing system based on AT&T's troff. Third party solutions provided STREAMS and TCP/IP support before these features were added in Xenix 2.3.

System administration utilities for the most part were interactive, and easy to use, allowing for quick and easy setup of networking, printers, and user administration, and the system could dual boot with DOS. Combined with the visual shell, it's likely one of the best experiences you could get on a UNIX system of the era, and in many ways, still holds up today, nearly 30 years later. Microsoft was pushing Xenix heavily, and for a time, it was intended as the true replacement to the 16-bit DOS. However, fate intervened.

In 1984, following the break-up of Bell System (https://en.wikipedia.org/wiki/Breakup_of_the_Bell_System) into the baby bells, AT&T decided to enter the computer market and directly sell UNIX System V. Microsoft decided that they didn’t want to compete against AT&T, and began to collaborate with IBM to create what would become known as OS/2. In 1987, Microsoft transferred ownership of Xenix to the Santa Cruz Operation, and SCO began porting the operating system to take advantage of the 80386 and creating Xenix 386.

The most commonly known release of Xenix 386 is the 2.3, supported alongside the earlier Xenix 286 2.2 releases, and SCO’s Support Level Supplements simultaneously supported both releases. The SLS index only shows a single update for “Xenix 386 2.2.1-2.2.3” for UUCP, but an examination of that update shows that this appears to be a mislabeling, as the binaries it contained target the 286.

So, what exactly is this unusual 2.2.3c release then? To find that out, I needed to get the thing running.

Stumbling Towards Boot

The images floating around on the internet come in two forms, a set of TeleDisk TD0 images, and a group of raw 720 kilobyte raw images, suitable for use in a VM (or with dd). Much later in our recovery effort, we eventually determined that the TD0 images were the originals, and the raw images were later created from these.

Initially though, I just wanted to get it to start. The image files contained six operating system (known as N1-6) disks, “Basic Utilities” (B1-2) disks, “Extended Utilities” (X1-5) disks, three International Supplement disks, and a single games disk. An initial examination of the disks showed that N1 and N2 had a Xenix filesystem, and the rest were simply raw tar archives that I could extract with GNU tar (with some warnings). The vast majority of data looked intact, so I grabbed QEMU, and popped N1 in and booted it up.

N1 Boot

Unfortunately, the system would hang almost immediately after. Some testing revealed that the same issue existed on Bochs. PCjs got a bit further, but kernel panicked nearly immediately. Somewhat surprising to me though was VirtualBox not only booted, it got to the first step of the installer.

Language Selection

Some time later, I did discover the failure here, but I’ll save that story for another article :). */evil*

With the first hurdle passed, it wasn’t long before another problem reared its ugly head (more later). Unfortunately, shortly after that, the system would hang trying to partition /dev/hd0.

Partitioning Hang

Some trial and error showed that if I started the system up without any IDE drives, I could successfully get through to the partitioning screen. As I know Michal had gotten farther in his resurrection attempt, I dropped him an email, and began to dig into the both the boot hang, and the IDE driver, and get a debugging build of VirtualBox setup. As we exchanged emails, I learned Michal had not only found the IDE issue, he also had managed to extract a full set of debugging symbols and offsets, and some tips with using the VirtualBox debugger.

I’ll let him explain in his own words:

Hi Michael,

Here’s my analysis. The wd1010 driver in this version of Xenix is just plain wrong, and they were just lucky that it worked.

The problem is unquestionably with the INITIALIZE PARAMETERS command. The command is automatically executed by the _wdio routine if it finds that it hasn’t been done yet. All the code is in _wdio. It writes all the registers except for the command register. Then it potentially executes a loop which writes the command register and immediately reads the status register. If the error bit is set, the command is written again and the loop repeats until the error register is not set.

What happens in VirtualBox is that reading the status register clears the interrupt triggered by INITIALIZE PARAMETERS. That is the correct behavior, because reading the status register is *supposed* to clear the interrupt. Now at this point the CPU runs with interrupts enabled, but the disk interrupt is masked because the driver executed _spl5 further up the call stack in _wdstrategy. The interrupt is cleared from the device and from the controller, and the OS never receives it.

But the OS relies on the interrupt. It’s supposed to execute _wdintr, notice that INITIALIZE PARAMETERS was executed, set up a RECALIBRATE command into _wdjob and call _wdio again to continue with I/O. Once the interrupt from RECALIBRATE is processed, _wdjob is set up with a read or a write command, _wdio is executed, and the actual I/O happens.

Because the interrupt is cleared too soon, the state machine breaks down and the OS just sits there totally idle because it has nothing to do.

It appears that in old drives, INITIALIZE PARAMETERS [took] some non-negligible time to execute and reading the status register right after writing the command did not clear the interrupt because the command hadn’t yet set it. But then it is wrong to read the status register because if the command is going to fail, it’s probably going to take some time to fail, too.

This would be solved by making INITIALIZE PARAMETERS take a millisecond or two to complete. It is probably much easier to patch Xenix to do what it should have been doing all along, i.e. reading the alternate status register (3F6h instead of 1F7h) which does not clear interrupts.

A 30-Year Old Bug

For those less versed in ATA/IDE interfaces, let me translate this into more basic English. On x86 compatible machines, access to the hard drive is controlled via a dedicated hard-drive controller and managed via the port I/O interface on the process (using in/out opcodes). ATA commands are written to these registers. In this case, Xenix is sending the INITIALIZE PARAMETERS command which brings the drive out of reset, and sets up the addressing mode.

The designers of the ATA specification designed it such a way that I/O operations can be asynchronous; the CPU sends a command, and then goes to do something else. When the hard drive is ready for more, it raises an interrupt, telling the processor to send another command. This interrupt is cleared by reading from the primary status register at 0x1F7. This behavior is by design and has been a part of the ATA specification since day one. In some cases however, one may simply want to poll the drive to know its status without changing interrupt statuses. For this purpose, an alternate status register at 0x3F7 is provided.

Xenix uses lazy initialization; that is to say that a device isn’t initialized until it’s used; the wd driver is never executed until something accesses /dev/hd0, and thus why it hangs at partitioning and not during IPL. When fdisk starts, the wd driver attempts to initialize the drive, and immediately reads the status register to check for any possible error codes. Afterwards, it waits for the IDE controller to generate an interrupt letting it know the drive is ready. In doing so, Xenix clears the interrupt it would get from the INITIALIZE PARAMETERS command, and gets stuck in a spinloop. As such, the hang is caused by a legitimate bug in Xenix in its IDE implementation and can occur on real hardware.

It’s hard to say if this was actually a problem in 1987, however, older releases of Xenix were known to be incredibly picky about the hardware they would work on, and prevailing logic on USENET was that older releases of Xenix would flat out break on any processor faster than 50 Mhz, partially due to bugs like this. However, Xenix 2.3 (which was released not long after this version) rewrote the wd driver to not suffer from this race condition, so it likely was as much a problem then as it was now. As Michal noted, its possible to read the status register without clearing the interrupt, and get the behavior Xenix wants. One quick hex edit later, and I now get this.

Disk Geometry Select Partition Finishing up

Success! Due to the fact that it uses CHS (Cylinder, Head, and Sector) addressing and bypasses the BIOS, Xenix tops out at a maximum drive size of 504 MiB. After a few basic questions, I’m prompted to remove N2, and reboot.

Reboot

N1 goes back in as per the instructions, I cross my fingers, push Enter and …

Dreaded Z Hang

It hangs. Crud.

In our next installment, we'll go into trying to manually start the operating system when the only commands we have are tar, mount, dd, and sh, along with the Xenix manifest files, and thereby crash head first into Xenix's copy protection.

 
This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2) by edIII on Monday March 06 2017, @07:24PM

    by edIII (791) on Monday March 06 2017, @07:24PM (#475760)

    Same here. I didn't even know of Xenix until I got up this morning and started reading SN. This is very cool stuff.

    --
    Technically, lunchtime is at any moment. It's just a wave function.
    Starting Score:    1  point
    Karma-Bonus Modifier   +1  

    Total Score:   2