(This is the 23rd of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
There comes a point during network protocol development when someone decides to aggregate requests and/or acknowledgements. Don't do this. For acknowledgements, it will screw statistical independence of round-trips. It may also set implicit assumptions about the maximum rate of packet loss in which a protocol may work.
In the case of requests, aggregation may be a huge security risk. When I first encountered this problem, I didn't know what I was facing but, instinctively, it made me very uneasy. The problem was deferred but not resolved. A decision was made to implement a UDP server such that a one packet request led to zero or one packets in response. (This ignores IPv4 fragmentation and/or intended statistical dependence of multiple responses.)
In the simplest form, every request generates exactly one response. This greatly simplifies protocol analysis. It is only complicated by real-world situations, such as packet loss (before or after reaching a server) and crap-floods (accidental or intentional). Some of this can be handled with a hierarchy of Bloom filters but that's a moderate trade of time versus state.
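As a sketch only, one tier of such a filter might look like the following C; the sizes, the hash mixing and the choice to key on a digest of source address plus payload are my assumptions rather than anything from the original server:

```c
#include <stdint.h>
#include <stddef.h>

#define BLOOM_BITS (1u << 16)              /* 8KB of state per tier */

struct bloom { uint8_t bits[BLOOM_BITS / 8]; };

/* FNV-1a style mixing; the constants and two-probe scheme are illustrative. */
static uint32_t mix(uint32_t h, const uint8_t *p, size_t n)
{
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

/* Key on whatever identifies a repeat: perhaps source address plus a digest
   of the request payload. Returns non-zero if the key was possibly seen. */
static int bloom_test_and_set(struct bloom *b, const uint8_t *key, size_t n)
{
    uint32_t h1 = mix(2166136261u, key, n) % BLOOM_BITS;
    uint32_t h2 = mix(0x9e3779b9u, key, n) % BLOOM_BITS;
    int seen = ((b->bits[h1 / 8] >> (h1 % 8)) & 1u) &&
               ((b->bits[h2 / 8] >> (h2 % 8)) & 1u);
    b->bits[h1 / 8] |= (uint8_t)(1u << (h1 % 8));
    b->bits[h2 / 8] |= (uint8_t)(1u << (h2 % 8));
    return seen;
}
```

One reading of a hierarchy is to keep two or three such tiers of different ages, clear the oldest on rotation and only treat a request as a duplicate when it hits in every live tier. That is the moderate trade of time versus state.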
On the server, there was great concern for security. In particular, logging was extensive to the point that a particular code-path could set a bit within a response code which was logged but never sent to a client. Indeed, a collection of log utilities grew up around the server, performing basic tasks such as teeing the primary text log and bulk inserting it into an indexed, relational OLAP database. (A task that systemd has yet to achieve with any competence or integrity.) Doing this correctly meant that text logs could be compressed and archived, real-time search remained available, and an absence of bulk inserts raised a warning over SMS.
However, the asymmetry between the size of request packets and response packets created pressure to aggregate requests. This was particularly pressing on a kernel, such as that of Mac OS X 10.6, which limited UDP buffers to a total of 3.8MB.
The Heartbleed attack left me with extremely mixed feelings. I was concerned that my ISPs would be hacked. I was relieved that it didn't increase my workload. It also resolved my unease, although that was not immediately apparent. SSL negotiation begins with an escape sequence out of HTTP. From there, a number of round-trips allow common ciphers to be established and keys to be exchanged. Unfortunately, none of this process is logged. This was a deliberate design decision to maintain a legacy log format and to implement ciphers with a loosely inter-operable third-party library which has no access to the server's log infrastructure.
This immediately brought to mind an old EngEdu Aspect Orientation presentation. Apparently, logging is a classic use case for aspect orientated programming. (People get stuck on efficient implementation of aspects but I believe techniques akin to vtable compression can be applied to aspects, such as Bloom filters. A more pressing problem is register allocation for return stack, exception stack and other state. Perhaps pushing the state of a Bloom filter of dynamic vtable addresses on a return stack can double as a guard value?)
Anyhow, the current HTTPS implementations completely fail to follow good practice. In particular, if each round-trip of negotiation was logged, it would be possible to find clients doing strange things and failing to connect. And with appropriate field formats, truncated strings would not be accepted. Ignoring all of this, an HTTPS log entry is a summary of a transaction whereas we want each stage. How would this apply to aggregated UDP requests?
An inline sequence of requests invites problems because the retrospective concatenation of requests may not be isolated in all cases. Having a request type which is a set of requests is no better. Firstly, it is not possible to prevent third-parties from ever implementing such a request type. Secondly, it is not possible to prevent third-parties from ever accepting nested request sets. Perhaps it would be better to make the core protocol handle a set of requests? Erm, why are we adding this bloat to every request when it does not preclude the previous cases?
And there's the core problem. It is absolutely not possible to prevent protocol extensions which are obviously flawed to anyone who understands software architecture. Nor is it possible to prevent more subtle cases.
(This is the 22nd of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
I've noted that a small fixed length cell format is viable for communication between computers. What's the fascination with this task?
For these reasons, it'll get used even in circumstances where it seems like a really tortuous choice.
The first place where it will get used is between a host computer and a speaker array's sound processor. For development, this will use an Arduino Due. This oddly named device contains an 84MHz ARM Cortex M3 processor. Unfortunately, it also comes with a hateful development environment. Furthermore, support code (boot-loader, libraries) is supplied under differing licences. Most critically, the fast USB interface is largely undocumented and data transfer may be initially implemented over a virtual serial port.
Although it seems mad to send 32 byte, bit stuffed, fixed length cells over a virtual serial port over USB, it has the following advantages:-
There is one horrible exception to using the cell structure everywhere. By volume, the typical case for data transfer is a host computer sending sound samples to a sound array processor. This communication occurs in one direction over USB. When data is received by a USB interface, it may be transferred to main memory at a specified address. It may be an advantage to perform this such that:-
This should be sufficient to implement many of the basic requirements of a speaker array within the available processing power.
(This is the 21st of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
It is possible to tunnel arbitrarily long packets over fixed length cell networks. This works with conventional error detection and error correction schemes. Even if you dispute this, I'd like to describe my preferred, concise implementation.
A 24 byte fixed length cell is significantly smaller than ATM's 53 byte fixed length cell. However, 24 bytes is sufficient for reliable signaling, voice communication, slow-scan video and encapsulation of other network protocols. Furthermore, 24 bytes after 4B5B (or suchlike) bit stuffing is 240 bits. With the addition of a 16 bit cell frame marker, we have 256 bits. So, 24 bytes get encoded as 32 bytes. Over high speed networking, this can be implemented with an eight bit binary counter. Over low speed networking, multiple channels can be bit-banged in parallel.
There exists a low overhead method for bit stuffing. There exists a cell frame marker which always violates bit stuffing. Therefore, nothing in the payload can imitate a frame marker. Therefore, the system is fairly immune to packet-in-packet attack without further consideration.
Addressing may be performed with a routing tag within each cell and a source address and/or destination address within each packet. Partial decode of the bit stuffing allows cells to be routed without decoding or encoding contents in full. Combined with techniques such as triple-buffering, each channel requires no more than 96 bytes excluding pointers and one common decode buffer. Eight channels require less than 1KB including pointers and common state. Therefore, it may be possible to implement cell networking on very basic hardware. This includes an eight bit micro-controller with less than 1KB RAM. Furthermore, it is possible to perform routing of packets which exceed 1KB via such a device.
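As a sanity check of that budget, here is an illustrative C sketch of the per-channel state; the field names and the split between per-channel state and common state are mine:

```c
#include <stdint.h>

#define CELL_RAW     24u   /* decoded payload bytes per cell                   */
#define CELL_ENCODED 32u   /* 24 bytes after 4B5B plus the 16 bit frame marker */
#define CHANNELS      8u

/* Triple-buffered encoded cells: 3 x 32 = the 96 bytes quoted above.
   The few bytes of bookkeeping fall under the excluded pointers and state. */
struct channel {
    uint8_t cell[3][CELL_ENCODED];
    uint8_t fill;          /* bytes received into the active cell */
    uint8_t active;        /* index of the cell currently filling */
    uint8_t route;         /* partially decoded routing tag       */
};

/* Eight channels plus one common full-decode buffer: well under 1KB. */
struct router {
    struct channel ch[CHANNELS];
    uint8_t decode[CELL_RAW];
};
```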
However, more resources are required to perform security functions. In particular, key exchange and hash verification is very likely to require fields which exceed a 24 byte cell. Therefore, secure end-to-end communication with a leaf node requires a device which can unpack a payload which spans multiple cells. It remains desirable to implement triple-buffering at this level but it is also desirable to have an MTU which greatly exceeds 1KB. This is in addition to cryptography state, entropy state and application state. Despite these constraints, it is possible for a system with 64KB RAM to provide secure console and graphical services which are extremely tolerant to packet loss.
(This is the 20th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
While I await components for a toy robot, a somewhat toy 52 Watt quadcopter and a very serious 3D speaker array system, I'll explain some previous research which is not widely known. I'll start with an introduction, move to implementation details, limitations and a possible solution.
Cells
Local Area Networks typically use variable length packets. Our dependence upon wired Ethernet and wireless Ethernet is so widespread that people have difficulty imagining any other techniques. However, while Ethernet dominates short spans of network, long-distance connections invariably use fixed length cells. This includes the majority of digital satellite communication, cell-phones and broadband systems. The division between LAN and WAN [Local Area Networking and Wide Area Networking] remains very real due to pragmatic reasons.
Variable length packets maximize bandwidth but fixed length cells maximize reliability. For long-distance communication, reliability is more important than bandwidth. Indeed, without reliability over a long span, bandwidth is zero.
Framing
Finding the start of a packet is difficult and cumbersome. Typically, there is a start sequence. This applies even in the trivial case of RS-232 serial communication in which start bits, stop bits, payload bits and parity bits are all configurable. For Ethernet, the start sequence is a known pattern of eight bytes. For Zigbee, it is a shorter pattern of four bytes. This is inefficient but acceptably so within the span of a broadcast network. However, a preamble is highly insecure because the patterns used in the start sequence are also valid patterns within a payload. This makes Ethernet, Zigbee and many other protocols vulnerable to packet-in-packet attack. I've described the danger of packet-in-packet attack applied to AFDX but this is often met with disbelief.
Cell structures don't have this problem because data is inserted between a regular spacing of boundary markers. Admittedly, this arrangement is superior for continuous point-to-point links; especially when a continuous stream of empty padding cells is sent to maintain a link. This isn't an option for burst protocols between many nodes but it is very useful between long-term partners.
Cell boundary markers work in a similar fashion to NTSC horizontal sync markers and allow transmitter and receiver to stay on track over long durations. In the very worst case, the marker pattern should never be off by more than one bit from its expected position. (Any difference occurs because the transmitter oscillator and the receiver oscillator run at very similar frequencies but not identical frequencies.) Furthermore, when this case occurs, the contents of a cell is known to be suspect.
Adaption
It is very useful to transfer packets over cells. This is achieved by fragmenting packets across multiple cells. In this case, a proportion of a cell is required to indicate the first fragment, the last fragment and/or a fragment number. For the remainder of the cell, each packet begins at a cell payload boundary. The last cell of a set may be padded with zeros. This ensures that the next packet begins at the next cell payload boundary.
A common case is also a worked example. If a single byte payload is sent over TCP/IPv4 over PPPoA over AAL5 [ATM Adaptation Layer 5] then packet length is typically 41 bytes (20 byte IPv4 header, 20 byte TCP header, one byte payload) and cell size is 48 bytes. This would appear to fit into one cell. However, AAL5 encapsulation overhead is 6-8 bytes per cell plus one bit(!) within ATM's five byte header. Anyhow, in this scenario, a packet may be split at the end of its TCP header. In typical cases, TCP/IPv4 packets fragment over PPPoA over AAL5. In all cases, TCP/IPv4 packets fragment over PPPoE over AAL5. However, there is a large amount of inefficiency with this arrangement. It almost shouts "This is badly implemented! Fix me!"
(Technical details are taken from an expired patent application and are therefore public domain.)
A consideration of AAL5's large headers and a consideration of cell re-fragmentation across multiple cell networks led to the insight that cells within a set should be sent backwards with a cell number. This arrangement permits several tricks with buffers and state machines. In particular, it permits simple low-bandwidth implementations and optimized high-bandwidth implementations. In all cases, receipt of fragment zero indicates that the final cell within a set has been received. A counter may determine if fragments are missing. FEC may be performed and/or the next protocol layer may be informed of holes.
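A minimal receiver for this scheme might look like the following C sketch; the fragment bound, payload size and field names are illustrative:

```c
#include <stdint.h>
#include <string.h>

#define FRAG_BYTES 24u    /* payload bytes carried per fragment (illustrative) */
#define MAX_FRAGS  64u    /* illustrative upper bound on fragments per packet  */

struct reassembly {
    uint8_t  data[MAX_FRAGS * FRAG_BYTES];
    uint32_t expected;    /* fragment number expected next (counts down) */
    uint32_t missing;     /* fragments skipped so far                    */
    int      started;
};

/* Fragments arrive in descending order: N-1, ..., 1, 0.  Fragment zero is
   always the final cell of the set, so its arrival ends the packet.
   Returns non-zero when the set is complete (possibly with holes). */
static int on_fragment(struct reassembly *r, uint32_t frag,
                       const uint8_t *payload)
{
    if (frag >= MAX_FRAGS)
        return 0;                            /* out of range: ignore    */
    if (!r->started) {
        r->started  = 1;
        r->expected = frag;
    }
    if (frag < r->expected)                  /* a counter detects holes */
        r->missing += r->expected - frag;
    memcpy(&r->data[frag * FRAG_BYTES], payload, FRAG_BYTES);
    r->expected = frag ? frag - 1u : 0u;
    return frag == 0u;
}
```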
Where cells are re-fragmented, a decrementing cell offset allows arbitrary re-fragmentation without knowing the exact length of encapsulated payload. (This incurs up to one cell of additional padding per re-fragmentation.) In some cases, length of bridged headers may not be known either. This occurs at near-line-speed without receipt of a full packet.
Further consideration of payload size and resilience leads to an encapsulation header with two or three BER fields. The first field is the fragment number. This can be partially decoded without consideration of packet type. The second field is packet type. For trivial decoders, this acts as a rip-stop for a corrupted fragment number. In the case of frag=0, a third field is present. This is the exact payload length. Many protocols which could be encapsulated (Ethernet, IPv4, IPv6, TCP, UDP) already have one or more payload lengths. This field covers trivial protocols which don't specify their own length. It is typically no more than two bytes. (For BER, this allows values up to 2^14-1, which is 16383 bytes.) It can be de-normalized with leading zeros but this introduces significant inefficiency in boundary cases. Specifically, a de-normalized byte may incur one extra cell. However, during particular cases of re-fragmentation, there may be boundary cases where the payload length field is itself an ambiguous length. In this case, it must be assumed to be maximum length. Foreseeably, this leads to inefficiency when handling particular packet lengths.
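I am assuming that the BER fields are the familiar base-128 form with a continuation bit on every byte except the last, which matches two bytes reaching 2^14-1. A sketch of encoding and decoding such a field:

```c
#include <stdint.h>
#include <stddef.h>

/* Base-128 field: high bit set on every byte except the last.
   Two bytes reach 2^14-1 = 16383. */
static size_t ber_encode(uint32_t value, uint8_t *out)
{
    uint8_t tmp[5];
    size_t  n = 0;

    do {
        tmp[n++] = value & 0x7Fu;
        value  >>= 7;
    } while (value);

    for (size_t i = 0; i < n; i++)           /* most significant group first */
        out[i] = tmp[n - 1 - i] | (i + 1 < n ? 0x80u : 0x00u);
    return n;
}

/* Returns the number of bytes consumed, or zero if the field is truncated;
   a truncated length field is the ambiguous case described above and must
   be treated as maximum length. */
static size_t ber_decode(const uint8_t *in, size_t avail, uint32_t *value)
{
    uint32_t v = 0;
    for (size_t i = 0; i < avail && i < 5; i++) {
        v = (v << 7) | (in[i] & 0x7Fu);
        if (!(in[i] & 0x80u)) { *value = v; return i + 1; }
    }
    return 0;
}
```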
This arrangement allows arbitrary packets to be tunnelled over protocols ranging from CANBus (8 bytes) to TETRA (10-16 bytes) to ATM (48 bytes) to USB (512-1024 bytes).
A variant of this arrangement permits Ethernet over a two byte payload. This is an extreme example which remains impractical even if you convince someone that it is possible. A 16 bit payload can be divided into a variable length fragment number field and a variable length payload field. If the first bit is zero then 10 bits represent fragment number and five bits represent payload. If the first bit is one then 11 bits represent fragment number and four bits represent payload. It is implicit that all 10 bit fragment numbers precede all 11 bit fragment numbers. Regardless, this is sufficient to send Ethernet over two byte payloads in a manner which is optimized for shorter Ethernet packets. Jumbo Ethernet over CANBus is suggested as an exercise.
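A sketch of the two byte packing in C; the struct and helper names are mine:

```c
#include <stdint.h>

/* Two byte payload variant: the first bit selects the split between
   fragment number and payload bits. */
struct mini_cell {
    uint16_t frag;        /* 10 or 11 bit fragment number (counts down) */
    uint8_t  data;        /* 5 or 4 payload bits                        */
    uint8_t  data_bits;   /* how many payload bits are carried          */
};

static uint16_t mini_pack(const struct mini_cell *c)
{
    if (c->data_bits == 5)       /* leading 0: 10 bit fragment, 5 bit payload */
        return (uint16_t)(((c->frag & 0x3FFu) << 5) | (c->data & 0x1Fu));
    /* leading 1: 11 bit fragment, 4 bit payload */
    return (uint16_t)(0x8000u | ((c->frag & 0x7FFu) << 4) | (c->data & 0x0Fu));
}

static void mini_unpack(uint16_t w, struct mini_cell *c)
{
    if (w & 0x8000u) {
        c->frag      = (w >> 4) & 0x7FFu;
        c->data      = (uint8_t)(w & 0x0Fu);
        c->data_bits = 4;
    } else {
        c->frag      = (w >> 5) & 0x3FFu;
        c->data      = (uint8_t)(w & 0x1Fu);
        c->data_bits = 5;
    }
}
```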
(This is the 19th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
I'm hoping to create a system which can drive up to 32 speakers for US$300. The system is intended for playback of high-quality, pre-recorded media but would also be suitable for highly lossy streaming. Where a client is able to prioritize data retrieval, it is possible to sacrifice three dimensional sound then two dimensional sound then one dimensional sound (stereophonic) before sacrificing frequency response of zero dimensional sound (monophonic). This provides the most immersive experience with the least bandwidth. In favorable circumstances, only 5% of the data is critical and this may survive in an environment of 70% packet loss. This is far outside the range of TCP window-scaling which has become the favored distribution mechanism of multi-national corporations (and those who would emulate them).
After gathering requirements and devising a specification, I expect to receive a clone Arduino Due and five Microchip MCP4921 12 bit SPI DACs. The former is likely to irritate me and the latter will sound horrible; especially if they only bias a speaker in one direction. However, this is sufficient for testing. Assuming I don't destroy more hardware, I'll have enough hardware to test a five speaker, 3D surround sound system. Assuming hardware doesn't wear at an alarming rate, I expect to have something working by Sep 2017. Unfortunately, such an estimate may be optimistic. Although suitable hardware is not expected within the next 10 days, software development can begin.
I don't have a 3D recording system and so test data is going to be extremely limited. It is likely to be a batch conversion from monophonic or stereophonic WAV to Ambisonic WXYZ format implicitly held in a quadraphonic WAV. Conversion may perform effects such as soundstage rotation. This would be followed by a stub implementation which streams a quadraphonic WAV to an Arduino. And how will this be achieved? Well, it definitely won't integrate as a sound device which inter-operates with a host operating system. (If your operating system doesn't already support sound-fields then such integration will, at best, present a two dimensional 7.1 interface but is more likely to appear as a stereophonic device.) It is likely that the Arduino won't appear as a dedicated USB device. From documentation, it appears that the interface will be a virtual serial port over USB. This may be sufficient for testing but it won't be suitable for deployment.
In the general case, few liberties can be taken with a virtual serial port and therefore extensive framing of data is required. It can be assumed that transfers are contiguous up to 512 bytes. However, this may be more apparent to a device using Serial::readBytes() rather than a host using POSIX read(). The majority of data is transferred from host to device and it may be possible to relax framing constraints in this direction only. In the most paranoid case, it may be preferable to send bit-stuffed, fixed-length cells in both directions. However, this greatly increases processing load at both ends. Thankfully, it isn't required to also be DC-balanced. From a failed venture, I already have rights to tested code which performs this function.
(This is the 18th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
I'm hoping to create a system which can drive up to 32 speakers for US$300 and I'm enthused that competitors are charging obscene prices for similar equipment. For example:-
I've made significant progress comparing micro-controller architectures. AVR has a distinct advantage for low-level operations given that it has 32×8 bit registers which can be paired into 16 bit registers. ARM has a distinct advantage for DSP functionality and deliberately aims to provide a single-cycle multiply (or multiply-accumulate). However, ARM provides 13 general-purpose registers and, in Thumb mode (16 bit instructions), this is restricted to 8×32 bit registers.
To implement a chunky-to-planar bit matrix transpose, the latter has a small advantage. Likewise, at any given speed, one-cycle multiply out-performs two-cycle multiply. So, ARM is the target architecture. XMos would be a great architecture but it is too niche for me.
By chance, I found that an Arduino Due meets or greatly exceeds specification in all categories with the exception that there is 96KB RAM. This is exactly the size required for triple-buffering of 4 channel, 32 bit PCM audio at 48kHz when transferred in 2048 sample blocks. It is possible to transfer audio in smaller chunks but this may incur (additional) scope for audio glitches when decoding video with a frame rate below 30Hz.
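The arithmetic behind "exactly the size required" is worth pinning down; a compile-time sketch:

```c
/* Triple-buffering of 4 channel, 32 bit PCM at 48kHz in 2048 sample blocks. */
enum {
    CHANNELS         = 4,
    BYTES_PER_SAMPLE = 4,                                     /* 32 bit PCM */
    BLOCK_SAMPLES    = 2048,
    BLOCK_BYTES      = CHANNELS * BYTES_PER_SAMPLE * BLOCK_SAMPLES, /* 32768 */
    TRIPLE_BYTES     = 3 * BLOCK_BYTES                              /* 98304 */
};

_Static_assert(TRIPLE_BYTES == 96 * 1024, "exactly 96KB of SRAM");
```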
An Arduino Due has hardware SPI but it would be unable to handle 25Mb/s reliably. Or, more accurately, a suitable sub-multiple of 84MHz. So, the chunky-to-planar bit matrix transpose remains useful. However, with 54 GPIO, it may be possible to bit-bang 32 or more SPI DACs directly. One option to reduce cabling is to have four or more satellite DACs with four or more channels per satellite DAC.
Any use of Arduino incurs linking to code under multiple open source licences. This includes GPL2 and Creative Commons libraries and an LGPL boot-loader. At the very least, use of an Arduino boot-loader requires compiled code to be distributed. This applies even if a cloned Arduino uses an Arduino boot-loader. If you don't like these terms then seek alternatives.
Anyhow, hardware requirements can be arranged around a US$37.40 board if implementation is open and buffer size is amended.
(This is the 17th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
For a speaker array, basic problems between host computer and a micro-controller can be overcome. An outline solution is host -> USB2.0 -> device -> SPI -> DACs. Blocks of sound are transferred over USB. Each block nominally represents 48kHz sound for up to 1/24 second (2000 samples or so). However, without exceeding the USB2.0 full-speed bandwidth limitation of 12Mb/s it is possible to transfer:-
Each block of samples is sent with a type, a length and one or more checksums. When this data is placed into a triple-buffering system, the micro-controller may seamlessly switch type when processing the next buffer.
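As an illustration only, a block and its header might look like this; the field names, widths and the 2000 sample worst case are my assumptions rather than a finished wire format:

```c
#include <stdint.h>

/* One block of samples from host to device. */
struct block_header {
    uint8_t  type;        /* e.g. 4 channel Ambisonic, stereo, silence */
    uint8_t  flags;
    uint16_t samples;     /* time-steps in this block (800 to 2000)    */
    uint32_t crc_header;  /* checksum over the header fields above     */
    uint32_t crc_payload; /* checksum over the samples that follow     */
};

/* The device rotates three of these; the decoder only switches `type`
   when it moves on to the next fully verified buffer. */
struct block_buffer {
    struct block_header hdr;
    int32_t samples[2000 * 4];  /* worst case: 2000 time-steps x 4 channels */
};
```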
Selection of cost-effective components is an art that I haven't mastered. My technique is to obliquely search EBay by functionality. This gives an overview of surplus components and cloned components. From this, it is trivial to find official datasheets. This invariably encounters warnings from manufacturers to not use legacy components in new designs and instead use components which, back on EBay, are up to 10 times more expensive. Obviously, I could use comparison functionality on the more advanced retail websites but this provides an overview.
After reading many datasheets, I'm not much further ahead. What DACs should be used? Maybe Analog Devices AD1952? Linear Technologies LTC2664 16 channel I2S DAC? Maxim MAX5318 18 bit SPI DAC? Or one of the many other choices?
After staring at I2S for a long time, it appears that, yes, it has a passing similarity to I2C or SPI with the exception that:-
Some components very obviously follow the technique pioneered by Dallas Semiconductor where the device is made with different modes of operation. In this case, different interfaces are notched out with a laser according to market demand. Given that DACs may be laser tuned, this is one of the most obvious places to increase margin on commodity components.
Some DACs interfacing with SPI or I2S may be connected to a serial stream in parallel and then selectively slurp data via a hand-over signal. This allows DACs to scale without incurring bit errors from, for example, typical SPI daisy-chaining devices in series.
I considered the possibility of performing I2S (or suchlike) without a dedicated interface. This would provide the most design flexibility because the serial format would be defined entirely in software. If one DAC is discontinued then it would be possible to modify software (and board wiring) and continue with a different DAC. However, 32 × 16 bit samples at 48kHz is a bit-rate of approximately 25Mb/s. To raise and lower one clock signal from software requires at least 50 MIPS. This excludes processing power to perform any other functionality. Toggling can be amortized by ganging eight or more serial streams. However, this requires an intermediary, such as a shift register - or a chunky-to-planar bit matrix transpose, such as performed by a Commodore Amiga Akiko Chip. 4014 parallel-to-serial shift registers are too slow (and cumbersome).
The task of interest is to take eight bytes of data and output, for example, the bottom bits of each byte to a micro-controller's parallel port. Then one pin can be toggled. This acts as a clock for eight separate serial streams but only requires two instructions to signal a change of state to all downstream devices. Unfortunately, the transpose which precedes output is processor intensive. If a CPU has suitable bit rotate operations through a carry flag or suchlike, it may be possible to zig-zag in 64 clock cycles or so. 64 conditional tests would require two or three clock cycles for each test. Is there a faster method? The benefit would be a greater volume of output and possibly more channels. (Something akin to VGA Mode X graphics popularized by Quake.) Or reduce power consumption. Or a reduced hardware specification.
The simple software transform requires one or more instructions per bit - and that assumes sufficient registers and flags. When I first encountered this problem, I considered a chain of rotates via one flag register. However, after consideration of quadtrees and matrix multiplication optimization, it is "obvious" to me that a matrix of 2^n×2^n bits can be transposed in n iterations. For 8×8 bits, three iterations are required. The first iteration swaps two opposing 4×4 blocks. The second iteration swaps two opposing 2×2 blocks in each quadrant. The third iteration swaps individual bits. If bytes are held in separate variables, this requires eight registers to hold the data and more registers for bitmasks and intermediate values. This works poorly on many micro-controllers. For example, ARM Thumb mode only has eight general, directly addressable registers. Thankfully, values can be ganged into 16 bits, 32 bits or even 64 bits. This significantly reduces the quantity of registers required. It also greatly reduces the number of instructions (and clock-cycles) required for a transpose operation.
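With the eight bytes ganged into one 64-bit value (row i in bits 8i to 8i+7), the three iterations reduce to three masked swap-and-shift passes. A C sketch; the packing convention is mine:

```c
#include <stdint.h>

/* Transpose an 8x8 bit matrix packed into a 64-bit word: input byte i holds
   the next 8 bits of serial stream i; output byte j holds bit j of every
   stream, ready to write to an 8 bit parallel port.
   Each pass is a delta swap: t = (x ^ (x >> d)) & m; x ^= t ^ (t << d). */
static uint64_t transpose8x8(uint64_t x)
{
    uint64_t t;

    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ULL;   /* swap 4x4 blocks  */
    x ^= t ^ (t << 28);

    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCULL;   /* swap 2x2 blocks  */
    x ^= t ^ (t << 14);

    t = (x ^ (x >> 7))  & 0x00AA00AA00AA00AAULL;   /* swap single bits */
    x ^= t ^ (t << 7);

    return x;
}
```

Writing each result byte to the parallel port and toggling the shared clock then gives the two instructions per change of state described above.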
The overall result is that 25Mb/s can be bit-banged with less than 15 MIPS of processing power. However, this only applies if eight streams are bit-banged in parallel. Other functionality, including 6 million multiplies per second, may remain within a 40MHz processing budget.
A Forbes contributor says that "US Newspapers' Problems Come From Their Former Monopoly, Not The Duopoly Of Facebook And Google."
That is only a part of the problem. There are far larger ones.
First, the prices of their newspapers. The skinny little State Journal-Register costs a full dollar and has very little news you won't find in other outlets. The Illinois Times prints theirs free, making money from advertising alone, and it is superior to the incredibly poor SJ-R.
But mostly it's how abysmal their web sites are. Know why I'm not reading your ads? No, not AdBlock; it isn't installed. It's because I've read the article in less time than the incredibly bloated web page loads and far faster than the even more bloated ads load. By the time the ads finish loading, I've already closed the tab. The St Louis Post-Dispatch is abysmal with loading; a full thirty seconds, then it goes blank, and takes another full minute, and every article is like that! They, and almost every other paper, badly need a competent webmaster. Except for extremely long or graphics-laden pages, the damned thing should load in seconds. Hire someone competent, who actually knows HTML and doesn't have to resort to one of those stupid programs that take your 5k of text and turn it into a 5 meg page. Today's sites load slower on high speed internet than back in the 33k dialup days.
Then there's "click to read more" after only half a paragraph is displayed. What in the hell is wrong with those morons? They expect me to subscribe to this garbage and actually PAY for it after annoying me?? STUPIDITY!
Then there are so many stupid pages that render in a six point typeface, gray on white, on a tablet that when you zoom, the ads completely cover the text! With morons like that working for your paper you expect me to believe anything you've written? The science rags are the worst about this, but Newsweek isn't any better. Zoom the page and the stupid social media bullshit covers the text!
Look, morons, nobody goes to your stupid site because it's got a "cool" interface, they go to find out what's happening in the world, and you seem to work hardest at making that as difficult as possible. And you expect me to PAY you for that? How fucking stupid can a person be?
Then there's the quality problem. Two decades ago I rarely saw a typo and never a grammatical error; these days few articles are error-free. You idiots expect me to PAY for that unprofessional garbage?
No, the newspapers are dying from blood loss, caused by repeatedly shooting themselves in the foot. Fire the idiots and you might start making money again! Of course, if you're the publisher, that means you have to fire yourselves, because you're the most moronic of all!
(This is the 16th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
Practical problems with a speaker array:-
Cable Losses
It is reasonable to assume that a computer or media player will be adjacent to a display and therefore much of the electronics will be in front of a user. If speakers are to be placed around one or more users then speakers at the rear require longer cables. With 4 Ohm speakers and cheap cable, energy loss within cables is significant to the extent that speakers at the front will be obviously louder.
The ideal solution is to use good cables and equal length cable. However, a workaround is to attenuate the volume of the front speakers. This is a horrible bodge because it is a software fix for a hardware problem. The preferred solution is to remedy hardware first then use software to trim values if strictly necessary.
Speed Of Sound
Speakers should be triggered such that sound arrives simultaneously at a user's ears. The speed of sound is relatively slow. Therefore, speakers should be placed at identical distances from a user or some kind of compensation mechanism is required. In a domestic environment, the latter is required. Again, this is inappropriately pushed into software. While speakers are nominally driven in unison, speakers which are nearer to a user should use historical inputs. This creates multiple problems. Historical inputs require a circular buffer which increases memory requirements. It also incurs a level of indirection which increases processing requirements. The memory requirements also determine the maximum difference in distances. How significant is this difference?
The speed of sound is approximately 343m/s. (This varies with air pressure and humidity. Some systems take this into account.) Divide by sample frequency to obtain distance travelled at each time-step. At 48kHz, the distance is approximately 7mm. That's about 1/4". At higher frequencies, this distance is proportionately shorter.
If we assume a budget of about 16KB to implement a 1024 element circular buffer then the maximum difference between speaker distances is about 7 metres at 48kHz. (Higher sampling frequencies require a smaller range of distances or proportionately larger buffers. Other parts of the system require larger buffers and therefore the circular buffer may similarly grow in proportion.)
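A sketch of such a circular buffer in C, assuming one 16 byte WXYZ time-step per slot, whole-sample delays and no correction for air pressure or humidity:

```c
#include <stdint.h>

#define SAMPLE_RATE      48000u
#define SOUND_MM_PER_SEC 343000.0f
#define DELAY_SLOTS      1024u            /* 1024 x 16 bytes = 16KB */

/* One WXYZ time-step: 4 channels of 32 bit samples (16 bytes). */
struct timestep { int32_t wxyz[4]; };

static struct timestep history[DELAY_SLOTS];
static uint32_t head;                     /* index of the newest time-step */

/* Whole samples of delay for a speaker, relative to the farthest speaker:
   roughly one sample per 7mm at 48kHz. Computed once at setup. */
static uint32_t delay_samples(float dist_mm, float max_dist_mm)
{
    return (uint32_t)((max_dist_mm - dist_mm) * SAMPLE_RATE
                      / SOUND_MM_PER_SEC + 0.5f);
}

/* Store the newest input once per time-step... */
static void push(const struct timestep *in)
{
    head = (head + 1u) % DELAY_SLOTS;
    history[head] = *in;
}

/* ...then each speaker reads its own, possibly older, time-step.
   The farthest speaker uses delay 0; delays must stay below DELAY_SLOTS. */
static const struct timestep *read_delayed(uint32_t delay)
{
    return &history[(head + DELAY_SLOTS - delay) % DELAY_SLOTS];
}
```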
The inverse square energy dissipation is another reason for attenuation of speaker volume. Specifically, the nearest speakers should be driven at lower volume to compensate for differences in speaker distance. This would not be required if all speakers were at a uniform distance.
DAC Skew
A sequential program which drops data sequentially into registers may incur a small amount of signal skew. In air, this probably incurs less than 7mm of skew but it bothers me greatly. Perhaps it is because it is akin to an analog off-by-one problem.
Host Latency
A host decodes one block of audio, sends data out, decodes one frame of video and displays it. However, audio and video triple buffering may not align. Specifically, audio and video may be half a frame or more out of phase. However, with video transit over PCI Express and audio transit over USB2.0 (and contention over both), audio is likely to lag behind video. This requires audio to be sent ahead.
Host Skew
It is envisioned that audio will be played in rotation from three buffers and that, in the general case, the three buffers will be equal size. (There may be transitions where this assumption is false.) There may be jitter over a contested bus when receiving audio and therefore, one buffer should be replaced when it directly opposes the pointer for data being played. If the pointer is either side of this value, then one sample should be skipped or played twice. This is a tacky method to synchronize input because it goes completely against all of the theory about linear reproduction and fundamental frequencies. However, it works and it is computationally cheap.
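One possible reading of that scheme, sketched in C: treat the three buffers as a single ring and nudge playback by one sample whenever a refill arrives and the play pointer is not diametrically opposite the write position. The names and the exact decision rule are my guesses at the intent:

```c
#include <stdint.h>

#define BUF_SAMPLES  2000u
#define RING_SAMPLES (3u * BUF_SAMPLES)   /* three buffers viewed as one ring */

/* Called when a refill lands at write_pos while playback is at play_pos.
   Returns 0 for no correction, +1 to play one sample twice (device clock
   running fast) or -1 to skip one sample (device clock running slow). */
static int adjustment_on_refill(uint32_t play_pos, uint32_t write_pos)
{
    uint32_t ideal = (write_pos + RING_SAMPLES / 2u) % RING_SAMPLES;

    if (play_pos == ideal)
        return 0;                               /* perfectly opposed */

    /* Signed distance of the play pointer from the ideal position. */
    int32_t diff = (int32_t)((play_pos + RING_SAMPLES - ideal) % RING_SAMPLES);
    if (diff > (int32_t)(RING_SAMPLES / 2u))
        diff -= (int32_t)RING_SAMPLES;

    return diff > 0 ? +1 : -1;
}
```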
Crash Or Disconnection
If buffers are not replaced in a timely manner, three buffers may be played in a tight loop of 0.12 seconds or shorter. This case should be avoided. If the host crashes or disconnects, audio should be silenced. If the micro-controller crashes, recovery is likely within a very brief period. However, occasional host polling is required. State, such as matrix multiplication constants, and buffers may require two or more frames to recover.
Errors In Data
There may be insufficient time to re-send a block of data. Blocks with obvious errors should incur silence. Blocks with minor errors may be recovered. The rate of errors and the probability of mis-classification are an empirical matter.
(This is the 15th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
I started with the observation that computing mostly consists of paper simulation rather than structured information. I started describing a URL-space to overcome this limitation. Then I mentioned problems with network addressing and packet payload size which affect a multi-cast streaming server. After describing the outline of streaming audio hardware and software (part 1, part 2, part 3, part 4, part 5) and some speaker design considerations, we've spanned the extent of this project. The remainder is detail, in-filling and corollaries.
The first detail is how to interface a speaker array to a host computer. For simplicity, we'll assume one source of WXYZ Ambisonic sound-field at 48kHz within an .AVI file. This is four channel audio. As previously described, that's one channel of omnidirectional sound and three channels of directional sound (left-minus-right, front-minus-back, top-minus-bottom). For each time-step (48000 times per second), a four element vector is multiplied with a 4×32 element matrix to obtain the output for each speaker in an array. This requires about 6.1 million multiplications per second. However, what hardware processes this data? A host computer? A dedicated processor? Some kind of analog process?
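Whatever the answer, the per-time-step work itself is small and regular. A C sketch with fixed-point Q15 gains, which are my assumption; 48000 × 4 × 32 gives the 6.1 million multiplies per second quoted above:

```c
#include <stdint.h>

#define SPEAKERS 32
#define AMBI_CH   4                      /* W, X, Y, Z */

/* Speaker decode matrix: Q15 gains, one row per speaker. */
static int16_t gain[SPEAKERS][AMBI_CH];

/* One time-step: multiply the 4 element WXYZ vector by the 4x32 matrix.
   Called 48000 times per second. */
static void decode_timestep(const int32_t wxyz[AMBI_CH], int16_t out[SPEAKERS])
{
    for (int s = 0; s < SPEAKERS; s++) {
        int64_t acc = 0;
        for (int c = 0; c < AMBI_CH; c++)
            acc += (int64_t)gain[s][c] * wxyz[c];   /* Q15 x Q31 -> Q46 */
        int32_t v = (int32_t)(acc >> 31);           /* back to Q15      */
        if (v >  32767) v =  32767;                 /* saturate for a   */
        if (v < -32768) v = -32768;                 /* 16 bit DAC       */
        out[s] = (int16_t)v;
    }
}
```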
Analyze the situation and choose suitable interfaces. Matrix input is 48kHz × 4 channels × 32 bits. That's about 6.1Mb/s. Assuming a daisy-chain of 16 bit SPI DACs, matrix output is 48kHz × 32 channels × 16 bits. That's about 25Mb/s. Considering a list of suitable host interfaces against availability and cost (Ethernet, USB, FireWire, SPI, SCSI), 10Mb/s Ethernet and 12Mb/s USB2.0 provide suitable bandwidth and the latter would be the most conventional.
How much RAM is required (and does this affect packet size or type)? We assume the .AVI has 24Hz, 25Hz, 30Hz, 50Hz or 60Hz video only. For each frame, this requires transfer of 2000, 1920, 1600, 960 or 800 time-steps where each is 4 channels × 32 bits. This requires triple buffering of 32000 byte buffers. So, a micro-controller or DSP of the following specification is required:-
So, a 40MHz micro-controller with, USB, SPI, hardware multiply and 128KB RAM would be sufficient. It may even be possible to perform 64 bit multiplication on such hardware. However, this specification doesn't have much headroom. In particular, multiple sound sources require mixing by the host computer. This is particularly awkward if one sound source is Ambisonic while another is stereo. Regardless, communication from host to micro-controller should include the following:-
Meanwhile, host computer should include the following:-
A board with this functionality would cost about US$2 per channel. However, this excludes power, connectors or a box. Due to SPI allowing open-loop control, it is possible to keep the same firmware and make versions of this system with fewer audio outputs.
Connectors may be two bare wire terminals per speaker, one phono connector per speaker or one headphone socket per speaker pair. The latter is the most compact and cost-effective.
How does this system differ from Dolby Atmos? Dolby Atmos permits 128 point sound sources to be mixed for 500 people. That's a particularly ambitious sweet-spot. It potentially requires matrix multiplication for a 128×50 matrix (or taller) per audio time-step. This is in contrast to a 4×32 matrix (or shorter) per audio time-step.