
SoylentNews is people

posted by NCommander on Monday March 06 2017, @05:00PM
from the adventures-in-data-recovery-and-30-year-old-bugs dept.

Two of my favorite hobbies are retrocomputing projects and the slightly more arcane field of software archeology: the process of restoring and understanding lost or damaged pieces of history. Quite a while ago, I came across the excellent OS/2 Museum, run by Michal Necasek, which helps catalog many of the more obscure bits of PC history, including a series of articles about Xenix, Microsoft’s version of SVR2 UNIX.

What caught my attention were two articles talking about Xenix 386 2.2.3c, a virtually undocumented release that isn’t mentioned in much if any of the Santa Cruz Operation's (SCO, but see footnote) surviving literature of the time. Michal documented [1], [2] his efforts at the time, and ultimately concluded that the media was badly corrupted. Not knowing when to give up, I decided to give it a try and see if anything could be salvaged. As of this writing, and working with Michal, we’ve managed to achieve approximately 98% restoration of the product as it would have existed at the time.

Xenix 386 booted with uname

I’m going to write up the rather long and interesting quest of rebuilding this piece of history. I apologize in advance about the images in this article, but we only recently got serial functionality working again, and even then, early boot and installation has to be done over the console.

* - SCO in this case refers to the original Santa Cruz Operation, and not the later SCO Group who bought the name and started the SCO/Linux lawsuits.

Read more past the fold.

Historical Background

From a historical perspective, Xenix is interesting as it was one of the first (if not the first) operating systems to take advantage of Protected Mode on the iAPX 80286 without being hamstrung by lack of backwards compatibility. I’ve talked about the 286 before on SoylentNews, but to summarize, the 80286 was the first processor with Protected Mode. However, it didn’t support paging, and the switch from real mode (8086 compatibility) to protected mode was one way; there was no official way to return to real mode without restarting the processor, and neither DOS nor the BIOS could operate in Protected Mode. To my knowledge, Xenix was the only operating system to adopt the view that a system would enter protected mode and never return to 16-bit compatibility. As such, its implementation of protected mode is somewhat different from what most people are familiar with.

Instead, the 80286 was intended to allow running legacy DOS applications in real mode, while people would upgrade to new protected mode operating systems and software. The much loathed real mode segmentation system was revamped as well due to the new 32-bit register size, and it was now possible to have segments up to 16 MiB (a tremendous amount of memory at the time) in size, allowing applications to operate with a de-facto flat memory model.

Correction: Wow, I went wrong here. The 80286's protected mode allowed segments to reside in a 24-bit address space, but limited them to 16 bits (64k) in length. The 80386 changed the rules to allow larger segment sizes by using additional fields in the LDT and GDT entries to extend the base and limit, and to add a size modifier.
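To make the correction concrete, here is a minimal Python sketch of how an 8-byte segment descriptor decodes on each processor. The field layout follows Intel's documented descriptor format; the function name `decode_descriptor` and the example byte strings are purely illustrative and are not taken from Xenix.

```python
def decode_descriptor(raw: bytes, cpu: str = "386") -> dict:
    """Decode an 8-byte x86 segment descriptor (GDT or LDT entry).

    Hypothetical helper for illustration only. On the 80286 the last
    two bytes are reserved; on the 80386 they extend base and limit.
    """
    if len(raw) != 8:
        raise ValueError("a segment descriptor is exactly 8 bytes")

    limit = raw[0] | (raw[1] << 8)                   # limit bits 15..0
    base = raw[2] | (raw[3] << 8) | (raw[4] << 16)   # base bits 23..0
    access = raw[5]                                  # type / DPL / present

    if cpu == "386":
        limit |= (raw[6] & 0x0F) << 16               # limit bits 19..16
        base |= raw[7] << 24                         # base bits 31..24
        if raw[6] & 0x80:                            # granularity bit set:
            limit = (limit << 12) | 0xFFF            # limit counts 4 KiB pages

    return {"base": base, "limit": limit, "access": access}


# 286 view: 24-bit base (here 1 MiB), segment capped at 64 KiB.
seg286 = decode_descriptor(bytes([0xFF, 0xFF, 0x00, 0x00, 0x10, 0x92, 0x00, 0x00]), "286")

# 386 view: same low six bytes plus the extension bytes, giving a 4 GiB flat segment.
seg386 = decode_descriptor(bytes([0xFF, 0xFF, 0x00, 0x00, 0x00, 0x92, 0xCF, 0x00]), "386")
```

The segment length in bytes is `limit + 1`, so the first descriptor covers 64 KiB and the second covers the full 32-bit address space.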

Additionally, Xenix was one of the most polished and featureful UNIX systems of its time. Out of the box, the system was the originator of virtual terminals, and supported both UUCP networking and RS-232 serial based MicNet, and bridging between the two. MicNet appears to have been Microsoft's answer to AppleTalk as a very low cost networking solution, and allowed multiple systems to appear as one single UUCP node on the bang path. We'll explore both these features in later articles.

For software installation, Xenix's "custom" utility provided full featured package management, installation and removal, and even allowed per-file selection, relatively on par with modern Linux package management. Besides the stock operating system, Xenix had official add-on packages for international support, K&R based C compilers for DOS and Xenix, and a text processing system based on AT&T's troff. Third party solutions provided STREAMS and TCP/IP support before these features were added in Xenix 2.3.

System administration utilities for the most part were interactive, and easy to use, allowing for quick and easy setup of networking, printers, and user administration, and the system could dual boot with DOS. Combined with the visual shell, it's likely one of the best experiences you could get on a UNIX system of the era, and in many ways, still holds up today, nearly 30 years later. Microsoft was pushing Xenix heavily, and for a time, it was intended as the true replacement to the 16-bit DOS. However, fate intervened.

In 1984, following the break-up of Bell System (https://en.wikipedia.org/wiki/Breakup_of_the_Bell_System) into the baby bells, AT&T decided to enter the computer market and directly sell UNIX System V. Microsoft decided that they didn’t want to compete against AT&T, and began to collaborate with IBM to create what would become known as OS/2. In 1987, Microsoft transferred ownership of Xenix to the Santa Cruz Operation, and SCO began porting the operating system to take advantage of the 80386 and creating Xenix 386.

The most commonly known release of Xenix 386 is the 2.3, supported alongside the earlier Xenix 286 2.2 releases, and SCO’s Support Level Supplements simultaneously supported both releases. The SLS index only shows a single update for “Xenix 386 2.2.1-2.2.3” for UUCP, but an examination of that update shows that this appears to be a mislabeling, as the binaries it contained target the 286.

So, what exactly is this unusual 2.2.3c release then? To find that out, I needed to get the thing running.

Stumbling Towards Boot

The images floating around on the internet come in two forms: a set of TeleDisk TD0 images, and a group of raw 720 kilobyte images, suitable for use in a VM (or with dd). Much later in our recovery effort, we determined that the TD0 images were the originals, and the raw images were later created from these.

Initially though, I just wanted to get it to start. The image files contained six operating system (known as N1-6) disks, “Basic Utilities” (B1-2) disks, “Extended Utilities” (X1-5) disks, three International Supplement disks, and a single games disk. An initial examination of the disks showed that N1 and N2 had a Xenix filesystem, and the rest were simply raw tar archives that I could extract with GNU tar (with some warnings). The vast majority of data looked intact, so I grabbed QEMU, and popped N1 in and booted it up.

N1 Boot

Unfortunately, the system would hang almost immediately after. Some testing revealed that the same issue existed on Bochs. PCjs got a bit further, but the kernel panicked nearly immediately. Somewhat surprisingly to me, though, VirtualBox not only booted, it got to the first step of the installer.

Language Selection

Some time later, I did discover the failure here, but I’ll save that story for another article :). */evil*

With the first hurdle passed, it wasn’t long before another problem reared its ugly head (more later). Unfortunately, shortly after that, the system would hang trying to partition /dev/hd0.

Partitioning Hang

Some trial and error showed that if I started the system up without any IDE drives, I could successfully get through to the partitioning screen. As I knew Michal had gotten farther in his resurrection attempt, I dropped him an email, began to dig into both the boot hang and the IDE driver, and got a debugging build of VirtualBox set up. As we exchanged emails, I learned Michal had not only found the IDE issue, he had also managed to extract a full set of debugging symbols and offsets, and had some tips on using the VirtualBox debugger.

I’ll let him explain in his own words:

Hi Michael,

Here’s my analysis. The wd1010 driver in this version of Xenix is just plain wrong, and they were just lucky that it worked.

The problem is unquestionably with the INITIALIZE PARAMETERS command. The command is automatically executed by the _wdio routine if it finds that it hasn’t been done yet. All the code is in _wdio. It writes all the registers except for the command register. Then it potentially executes a loop which writes the command register and immediately reads the status register. If the error bit is set, the command is written again and the loop repeats until the error register is not set.

What happens in VirtualBox is that reading the status register clears the interrupt triggered by INITIALIZE PARAMETERS. That is the correct behavior, because reading the status register is *supposed* to clear the interrupt. Now at this point the CPU runs with interrupts enabled, but the disk interrupt is masked because the driver executed _spl5 further up the call stack in _wdstrategy. The interrupt is cleared from the device and from the controller, and the OS never receives it.

But the OS relies on the interrupt. It’s supposed to execute _wdintr, notice that INITIALIZE PARAMETERS was executed, set up a RECALIBRATE command into _wdjob and call _wdio again to continue with I/O. Once the interrupt from RECALIBRATE is processed, _wdjob is set up with a read or a write command, _wdio is executed, and the actual I/O happens.

Because the interrupt is cleared too soon, the state machine breaks down and the OS just sits there totally idle because it has nothing to do.

It appears that in old drives, INITIALIZE PARAMETERS [took] some non-negligible time to execute and reading the status register right after writing the command did not clear the interrupt because the command hadn’t yet set it. But then it is wrong to read the status register because if the command is going to fail, it’s probably going to take some time to fail, too.

This would be solved by making INITIALIZE PARAMETERS take a millisecond or two to complete. It is probably much easier to patch Xenix to do what it should have been doing all along, i.e. reading the alternate status register (3F6h instead of 1F7h) which does not clear interrupts.

A 30-Year Old Bug

For those less versed in ATA/IDE interfaces, let me translate this into more basic English. On x86 compatible machines, access to the hard drive is controlled via a dedicated hard-drive controller and managed via the port I/O interface on the processor (using in/out opcodes). ATA commands are written to these registers. In this case, Xenix is sending the INITIALIZE PARAMETERS command, which brings the drive out of reset and sets up the addressing mode.
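As a rough illustration of what "writing a command to the registers" means, the sketch below lists the port writes for an INITIALIZE PARAMETERS command on the primary IDE channel. The port numbers and the 0x91 command code come from the ATA specification; the function name and the returned list of `(port, value)` pairs are hypothetical, and this is not a reconstruction of the actual Xenix `_wdio` code, which (per Michal's email) writes all the task-file registers before the command register.

```python
# Primary-channel ATA task-file ports (from the ATA specification).
SECTOR_COUNT_PORT = 0x1F2
DRIVE_HEAD_PORT = 0x1F6
COMMAND_PORT = 0x1F7           # reads back as the status register

CMD_INITIALIZE_PARAMETERS = 0x91


def initialize_parameters_writes(heads, sectors_per_track, drive=0):
    """Return the (port, value) writes that tell a drive its CHS geometry.

    Hypothetical helper: a real driver would issue these with `out`
    instructions instead of returning a list.
    """
    if not 1 <= heads <= 16:
        raise ValueError("the head field is 4 bits wide: 1..16 heads")
    if not 1 <= sectors_per_track <= 255:
        raise ValueError("sectors per track must fit in one byte")
    if drive not in (0, 1):
        raise ValueError("drive must be 0 (master) or 1 (slave)")

    # Bits 7 and 5 are set by convention, bit 4 selects the drive,
    # and the low nibble carries the highest head number.
    drive_head = 0xA0 | (drive << 4) | (heads - 1)
    return [
        (SECTOR_COUNT_PORT, sectors_per_track),
        (DRIVE_HEAD_PORT, drive_head),
        (COMMAND_PORT, CMD_INITIALIZE_PARAMETERS),   # command goes last
    ]


writes = initialize_parameters_writes(heads=16, sectors_per_track=63)
```

The command register is written last because the write to 0x1F7 is what actually starts the command.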

The designers of the ATA specification designed it in such a way that I/O operations can be asynchronous; the CPU sends a command, and then goes to do something else. When the hard drive is ready for more, it raises an interrupt, telling the processor to send another command. This interrupt is cleared by reading from the primary status register at 0x1F7. This behavior is by design and has been a part of the ATA specification since day one. In some cases however, one may simply want to poll the drive to know its status without changing interrupt statuses. For this purpose, an alternate status register at 0x3F6 is provided.

Xenix uses lazy initialization; that is to say that a device isn’t initialized until it’s used. The wd driver is never executed until something accesses /dev/hd0, which is why it hangs at partitioning and not during IPL. When fdisk starts, the wd driver attempts to initialize the drive, and immediately reads the status register to check for any possible error codes. Afterwards, it waits for the IDE controller to generate an interrupt letting it know the drive is ready. But by reading the status register, Xenix has already cleared the interrupt it would get from the INITIALIZE PARAMETERS command, so it ends up waiting forever for an interrupt that never arrives. As such, the hang is caused by a legitimate bug in Xenix's IDE implementation and can occur on real hardware.
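The race can be sketched with a toy model. Everything below is a simplification written for this article, not emulator or Xenix code: `ToyDrive` completes every command instantly (as a fast emulated drive would), and `init_drive` stands in for the driver's "issue command, then check for errors" sequence. The only behavior taken from the ATA specification is that reading the status register at 0x1F7 clears the pending interrupt, while reading the alternate status register (3F6h, as in Michal's email) does not.

```python
STATUS_PORT = 0x1F7       # reading this acknowledges the interrupt
ALT_STATUS_PORT = 0x3F6   # same status bits, no side effect

STATUS_READY = 0x50       # DRDY | DSC, with the ERR bit (0x01) clear


class ToyDrive:
    """Toy drive that finishes every command instantly (an assumption)."""

    def __init__(self):
        self.irq_pending = False

    def write_command(self, command):
        # The command completes at once and raises the interrupt line.
        self.irq_pending = True

    def read(self, port):
        if port == STATUS_PORT:
            self.irq_pending = False   # side effect of the primary register
        elif port != ALT_STATUS_PORT:
            raise ValueError("this toy model only has status registers")
        return STATUS_READY


def init_drive(drive, status_port):
    """Issue INITIALIZE PARAMETERS, check for errors, and report whether
    the completion interrupt is still there for the handler to receive."""
    drive.write_command(0x91)
    status = drive.read(status_port)
    error = bool(status & 0x01)
    return (not error) and drive.irq_pending


buggy = init_drive(ToyDrive(), STATUS_PORT)       # interrupt swallowed
fixed = init_drive(ToyDrive(), ALT_STATUS_PORT)   # interrupt survives
```

With the primary register the interrupt is gone before the interrupt handler can ever see it, which is the hang; switching the poll to the alternate register leaves it pending.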

It’s hard to say if this was actually a problem in 1987. However, older releases of Xenix were known to be incredibly picky about the hardware they would work on, and prevailing logic on USENET was that older releases of Xenix would flat out break on any processor faster than 50 MHz, partially due to bugs like this. Xenix 2.3 (which was released not long after this version) rewrote the wd driver to not suffer from this race condition, so it likely was as much a problem then as it is now. As Michal noted, it's possible to read the status register without clearing the interrupt, and get the behavior Xenix wants. One quick hex edit later, and I now get this.

Disk Geometry Select Partition Finishing up

Success! Due to the fact that it uses CHS (Cylinder, Head, and Sector) addressing and bypasses the BIOS, Xenix tops out at a maximum drive size of 504 MiB. After a few basic questions, I’m prompted to remove N2, and reboot.
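The 504 MiB figure falls out of simple arithmetic if we assume the classic PC CHS ceilings of 1024 cylinders, 16 heads, and 63 sectors per track with 512-byte sectors. Those specific limits are an assumption on my part for this sketch; I have not confirmed from the driver which fields Xenix actually caps.

```python
def chs_capacity_bytes(cylinders, heads, sectors_per_track, sector_size=512):
    """Total bytes addressable with a given CHS geometry."""
    return cylinders * heads * sectors_per_track * sector_size


# Assumed classic limits: 1024 cylinders x 16 heads x 63 sectors.
limit_bytes = chs_capacity_bytes(1024, 16, 63)
limit_mib = limit_bytes // 2**20
```

Here `limit_bytes` works out to 528,482,304 bytes, which is exactly 504 MiB.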

Reboot

N1 goes back in as per the instructions, I cross my fingers, push Enter and …

Dreaded Z Hang

It hangs. Crud.

In our next installment, we'll go into trying to manually start the operating system when the only commands we have are tar, mount, dd, and sh, along with the Xenix manifest files, and thereby crash head first into Xenix's copy protection.

 
  • (Score: 0) by Anonymous Coward on Monday March 06 2017, @07:00PM (7 children)

    by Anonymous Coward on Monday March 06 2017, @07:00PM (#475744)

    Intel wrote an RTOS called RMX. It was probably running on the 80286 before Xenix, seeing as it came from Intel. You can still buy a copy. It runs the London Underground real-time train control. Lots of binaries are still 16-bit. The kernel obviously switched first; perhaps this is the case with Xenix.

    https://en.wikipedia.org/wiki/RMX_(operating_system) [wikipedia.org]

    One can run 16-bit binaries on 32-bit hardware. You can run more of them, or have more filesystem cache, if the kernel is aware of the 32-bit hardware. The first enhancement would likely be to set the granularity bit in the segment descriptors for code and data. This allows segments above the old 16-megabyte limit. After that you can allow larger segment limits, allow 32-bit code, and/or enable the page tables.

  • (Score: 2) by NCommander on Monday March 06 2017, @07:15PM (6 children)

    by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Monday March 06 2017, @07:15PM (#475755) Homepage Journal

    16-bit is a bit complicated in this situation since in the case of Xenix, we're dealing with the 286's version of protected mode, and not the 386's version of it. Xenix 386 2.2.3 *does* set up paging (bit 31 of cr0 is set, and a page table is installed; confirmed by debugger), but as far as I can tell, it's unused. It also does some checks for known 80386 B0 errata such as the IMUL stepping bug.

    Real mode can essentially be considered a 20-bit architecture because it combines a 16-bit segment value (shifted left by four bits) with a 16-bit offset, for 20 bits total, mapped directly to physical memory addresses. 286 protected mode on the other hand is a 24-bit architecture (16 MiB), due to the limitations of the GDT and LDT; this is broadly similar to the m68k situation on Macs, where the processor had 32-bit registers but only 24 address lines, which created an ugly world where you could have 32-bit incompatibilities if software did stupid crap like tagged pointers.

    It *is* actually possible to run real-mode code unmodified in protected mode, or at least under segmented protected mode, as long as the real mode binary followed specific rules. There's an entire section in the iAPX 286 Programmer's Reference discussing running real-mode code in protected mode; OS/2 could do the same thing: 16-bit code running in protected mode. It's not however possible to use DOS or BIOS calls under this. As such, Xenix 386 maintained backwards compatibility with its PC/XT, AT, and 286 versions.

    --
    Still always moving
    • (Score: 0) by Anonymous Coward on Tuesday March 07 2017, @06:33AM (5 children)

      by Anonymous Coward on Tuesday March 07 2017, @06:33AM (#475927)

      There are reasons to set up paging, even if it mostly isn't used.

      For example, RMX for the 80386 sets up paging, but the paging only gets used if you run flat-model 32-bit executables. If you just run segmented executables (either 16-bit or 32-bit) the paging doesn't really get used.

      Xenix could be similar: you happen to only have 16-bit segmented executables, but the capability exists to run other types.

      Paging can also be used to extend an OS past 16 megabytes of memory without fixing code that would break if the granularity bit were set in the segment descriptors. The OS might internally depend on being able to position segments such that they are not 4096-byte aligned. In fact, this is probable. Many OSes of this era do something truly horrifying: they call into normal 16-bit BIOS functions from 16-bit protected mode. This was sort of workable, given that there were few BIOS vendors (just IBM and Compaq) and one could disassemble the BIOS to see if it would be safe to call. Access to the BIOS data area at 0040:0000 would involve segment 0x40 with a base of 0x400. Use of the NULL segment didn't happen much in those BIOSes, generally happening before entry to protected mode. Use of A20 wrap-around could be handled via faults, via paging (80386 and better), or via actually messing with A20.

      • (Score: 2) by NCommander on Tuesday March 07 2017, @06:53AM (4 children)

        by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Tuesday March 07 2017, @06:53AM (#475931) Homepage Journal

        Xenix actually is one of the few operating systems that ran into a forward compatibility issue; Xenix 286 will not run on a 386 due to the fact that it loads bad values into the last two words of the LDT which are normally unused in that processor. That causes bad juju on the 386 which uses that for the increased segment size parameters.

        There are a few specific 80386 binaries on the disks, but none appear to use paging capabilities as far as I can tell. I haven't done an in-depth analysis to see if it does actually page those binaries or simply takes advantage of the larger segment size. I don't have a set of development tools that match this version of Xenix, but the Xenix 2.3 386 tools do work on this release, so I could probably try compiling a paged binary and see if it works.

        Xenix (several versions) also overwrites the low memory segment at 0x40 on startup, blowing away the EBDA while it's at it. This is a hint as to why it refuses to boot on specific emulators and on some machines depending on the BIOS. I haven't checked if it does BIOS calls to gather data before switching to protected mode, or if it does some sorta segment magic to do them once leaving real mode. Notably, at least this version of Xenix does NOT set up a V86 task in the TSS to talk to the BIOS (2.3 might).

        The iAPX 286 programmer's manual actually has an entire section talking about running real mode code within protected mode and how it can safely be done. I wouldn't be surprised if BIOSes of the era were somewhat designed with this use case in mind, even though it was never formally standardized.

        --
        Still always moving
        • (Score: 0) by Anonymous Coward on Tuesday March 07 2017, @08:33AM (3 children)

          by Anonymous Coward on Tuesday March 07 2017, @08:33AM (#475943)

          By "horrible", I mean it. OSes of that era would call the BIOS without setting the V86 bit. They'd just... load some segment and other registers then do a far call, or they'd do a software int, or they'd do an iret. There might even be data segments that are NOT set up as in real mode, and the BIOS is expected to work with them. You might wonder how this interacts with DMA, for example if you ask the BIOS to do floppy IO while the segments are not real-mode compatible. One solution is to make BIOS calls in ring 3 and virtualize the DMA controller by catching the "in" and "out" instructions as faults.

          Free emulators have gone through a BIOS change. Most now use seabios, which forked off of the old Bochs BIOS to convert the code to be compatible with gcc and gas. The old Bochs BIOS used bcc and as86, same as the old Linux boot sector. Seabios is a modern PCI BIOS that requires 32-bit instruction support. Seabios requires PCI video, requires that 0xc0000 to 0xfffff be RAM (instead of ROM or ISA MMIO), and seems to have limited support for being called from 16-bit protected mode. The old Bochs BIOS has none of those problems and fits in less space. It can support ISA MMIO for stuff like multi-port serial, SCSI, and telephone voice interfaces. This is not to say that the old Bochs BIOS is problem-free. It does not maintain the "alt" key modifier bit in the EBDA. Neither free BIOS supports the Alt-SysRq hooks for OS switching.

          BTW, looking to hire people who can understand this kind of stuff

          • (Score: 1) by kvaltyr on Tuesday March 07 2017, @02:35PM (1 child)

            by kvaltyr (6512) on Tuesday March 07 2017, @02:35PM (#476022)

            Any further info? Obviously not NCommander here and I'm not super familiar with the specific vagaries of DOS / early x86 processors but I am a competent reverse engineer and have experience with 32/64 bit x86 assembly and some embedded stuff. Would be happy doing anything that isn't web development.

            • (Score: 0) by Anonymous Coward on Tuesday March 07 2017, @07:24PM

              by Anonymous Coward on Tuesday March 07 2017, @07:24PM (#476140)

              That experience is about perfect. The work varies quite a bit, involving many kinds of CPU and OS from the ancient to the very latest, but is nearly always low-level. We do lots of reverse engineering. We write emulators. We use assembly code. It's USA only, no H1B or greencard. You get extreme flex-time without any expectation of overtime. Skill requirements are kind of vague because the work varies; we hire enough people to cover everything.

              So, hmmm, I have a gmail account that I guess is already exposed to spammers. I'm acahalan. Some sort of resume would be good, but we can chat about things first if you like; many people leave valuable low-level things off of their resumes because the web shops don't appreciate it properly.

          • (Score: 2) by NCommander on Tuesday March 07 2017, @03:56PM

            by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Tuesday March 07 2017, @03:56PM (#476045) Homepage Journal

            Oh, I'm aware of the type of abuse old OSes did. I'm just saying that Xenix 386, being designed for the 386, could have used Virtual 8086 mode to talk to the BIOS if it wanted to. It didn't. VirtualBox uses OpenWatcom to build its BIOS (which is a fork of SeaBios with some duct tape to let it work with some older OSes; I'd probably use wcc16 over bcc just because I get support for slightly more modern C standards).

            I'd be interested in hearing what kind of work you're offering. My email is listed publicly; just put a subject that makes sure I don't think it's spam (put Xenix or something). I worked professionally on TianoCore on AArch64, and I consider myself at least semi-proficient in real mode x86, though I never worked with it on a professional basis.

            Notably, now that I'm actually awake, you could have theoretically made a BIOS fully compliant with the protected mode of the era, either by requiring the base operating system to set aside specific GDT selector(s) for it, or by making your entry points only work off CS and fit with near pointers. As a last resort, the BIOS could have always saved the GDTR via SGDT, changed the protected mode table via LGDT, and restored it on the way out.

            For modern emulation/virtualization, you could always handle some bits of "magic" by simply catching the calls to/from the BIOS and, if in protected mode, doing magic to make sure the OS gets what it wants.

            --
            Still always moving