(This is the 15th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
I started with the observation that computing mostly consists of paper simulation rather than structured information. I started describing a URL-space to overcome this limitation. Then I mentioned problems with network addressing and packet payload size which affect a multi-cast streaming server. After describing the outline of streaming audio hardware and software (part 1, part 2, part 3, part 4, part 5) and some speaker design considerations, we've spanned the extent of this project. The remainder is detail, in-filling and corollaries.
The first detail is how to interface a speaker array to a host computer. For simplicity, we'll assume one source of WXYZ Ambisonic sound-field at 48kHz within an .AVI file. This is four channel audio. As previously described, that's one channel of omnidirectional sound and three channels of directional sound (left-minus-right, front-minus-back, top-minus-bottom). For each time-step (48000 times per second), a four element vector is multiplied with a 4×32 element matrix to obtain the output for each speaker in an array. This requires about 6.1 million multiplications per second. However, what hardware processes this data? A host computer? A dedicated processor? Some kind of analog process?
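A minimal sketch of this inner loop, assuming 32 bit floating point coefficients precomputed from the speaker positions (names are illustrative):

```c
#define CHANNELS 4   /* W, X, Y, Z */
#define SPEAKERS 32

/* One audio time-step: a 4 element vector multiplied with a 4x32 matrix.
 * At 48000 time-steps per second, this is 48000 * 4 * 32, or about 6.1
 * million multiplications per second. */
void decode_timestep(const float input[CHANNELS],
                     const float matrix[SPEAKERS][CHANNELS],
                     float output[SPEAKERS])
{
    for (int s = 0; s < SPEAKERS; s++) {
        float sum = 0.0f;
        for (int c = 0; c < CHANNELS; c++)
            sum += matrix[s][c] * input[c];
        output[s] = sum;   /* one output sample for one speaker */
    }
}
```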
Analyze the situation and choose suitable interfaces. Matrix input is 48kHz × 4 channels × 32 bits. That's about 6.1Mb/s. Assuming a daisy-chain of 16 bit SPI DACs, matrix output is 48kHz × 32 channels × 16 bits. That's about 25Mb/s, but the output side is carried by SPI rather than the host interface. Considering a list of suitable host interfaces against availability and cost (Ethernet, USB, FireWire, SPI, SCSI), 10Mb/s Ethernet and 12Mb/s full-speed USB provide suitable bandwidth for the input stream and the latter would be the most conventional.
How much RAM is required (and does this affect packet size or type)? We assume the .AVI carries video at 24Hz, 25Hz, 30Hz, 50Hz or 60Hz. For each frame, this requires transfer of 2000, 1920, 1600, 960 or 800 time-steps, where each time-step is 4 channels × 32 bits. This requires triple buffering of 32000 byte buffers, or 96KB in total.
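The buffer arithmetic, as a sketch (the worst case is 24Hz video):

```c
enum {
    TIMESTEP_BYTES = 4 * 4,                        /* 4 channels x 32 bits    */
    FRAME_STEPS    = 48000 / 24,                   /* 2000 time-steps at 24Hz */
    BUFFER_BYTES   = FRAME_STEPS * TIMESTEP_BYTES, /* 32000 bytes per frame   */
    TOTAL_BYTES    = 3 * BUFFER_BYTES              /* triple buffering: 96000 */
};
```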
So, a 40MHz micro-controller or DSP with USB, SPI, hardware multiply and 128KB RAM would be sufficient. It may even be possible to perform 64 bit multiplication on such hardware. However, this specification doesn't have much headroom. In particular, multiple sound sources require mixing by the host computer. This is particularly awkward if one sound source is Ambisonic while another is stereo. Regardless, communication from host to micro-controller should include, at minimum, the 4×32 matrix coefficients and the buffered audio time-steps.
Meanwhile, the host computer should handle .AVI demultiplexing, mixing of multiple sound sources and computation of the matrix coefficients.
A board with this functionality would cost about US$2 per channel. However, this excludes power, connectors or a box. Because SPI allows open-loop control, it is possible to keep the same firmware and make versions of this system with fewer audio outputs.
Connectors may be two bare wire terminals per speaker, one phono connector per speaker or one headphone socket per speaker pair. The latter is the most compact and cost-effective.
How does this system differ from Dolby Atmos? Dolby Atmos permits 128 point sound sources to be mixed for 500 people. That's a particularly ambitious sweet-spot. It potentially requires matrix multiplication for a 128×50 matrix (or taller) per audio time-step. This is in contrast to a 4×32 matrix (or shorter) per audio time-step.
(This is the 14th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
I had difficulty getting an operational amplifier to work. In the case of audio, signals are often positive and negative relative to a ground wire. It isn't obvious how to get this working with a battery powered audio amplifier. Most of the explanations and circuit diagrams omit details about virtual ground and expect it to be known already.
The correct method for wiring audio input to a battery powered audio amplifier involves powering the operational amplifier from the battery and also connecting two equal, large-value resistors in series across this power supply. The mid point between the resistors sits at half the supply voltage. This can be used as a reference input to an operational amplifier. It should also be tied to the ground wire of the input.
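As a worked check, the two resistors form a voltage divider across the battery:

$$V_{\text{mid}} = V_{\text{bat}} \cdot \frac{R_2}{R_1 + R_2} = \frac{V_{\text{bat}}}{2} \quad \text{when} \quad R_1 = R_2$$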
This arrangement allows an operational amplifier to work with "positive" and "negative" voltages without incurring any voltage clipping. Shortly after learning this, I ensured that a friend at my makerspace avoided the fruitless avenues that I encountered. This has spurred development which continues to the present. This includes irregular use of a TDA7379.
After some fun experiments with mains fluorescent lights and sine waves at 101Hz or 121Hz, my friend made a laser-cut box for a speaker. The first attempt had a horrible resonance. It was really glaringly bad. With hindsight, it was a poor choice to make a cuboid box with two identical lengths. After some modification to the design, a much better result was obtained.
When someone at the makerspace subsequently said they wanted to make cube speaker enclosures, with JBL speakers and a US$12 Chinese amplifier, I thought it would be truly awful. The concept was good. The speakers were supposed to fit into Ikea pigeon-hole shelving. Official accessories for this shelving include wicker baskets and wine racks. It is also possible to purchase computer cases which are the same size. Speakers would be an obvious addition. That's the general opinion when the concept is explained.
However, I was concerned about the resonance of a cube box, a speaker brand favored by Pimp My Ride rice-racers and an amplifier fresh off of EBay. I was expecting it to be dire. However, after being invited to a sound test, I was astounded. I don't have much sound test experience but I've never heard anything so good. The amplifier was excellent and JBL make really good 4 Ohm speakers. The secret is to make a cube box with really thick panels. Don't skimp. Make it from particle board but make sure it is heavy, 3/4" particle board.
Let's run through some known content. AC/DC? No complaints. Foo Fighters? No complaints. Bob Marley? It is possible to hear the limitations of the analog recording *and* mastering processes. Boston? Oh, that was special. I've never heard such distinct sound separation. It was possible to hear the phasing effect when one enclosure was twice the distance of the other.
I hoped these speakers would be commercialized. There was even an outside possibility of them getting branded as BBC monitor speakers. Unfortunately, the designer was another former Apple developer from the lean period in the 1990s who subsequently endured health and financial problems. Even worse, the cube speaker designer died in Jan 2017. However, some of the work continues.
There is no substitute for a speaker cluster with an 18 inch (or larger) woofer but a car speaker in a thick box is an acceptable compromise for a small apartment. A one foot cube speaker should be considered the mid-range option but there is plenty of scope for cheaper alternatives. I wonder if it is desirable to put a two inch or three inch speaker inside a six inch box? This would be a miniature version of a cube speaker and they would probably be sold in packs of six or packs of eight to reduce shipping cost. (Actually, packs of eight may allow re-use of packaging.)
The really cheap option would be US$2 stereo speakers which plug into a headphone socket. I attempted to repair one variant for an ex-housemate. They are impressively cheap to the extent that I laughed over the hair-thin wire and the plastic box which holds the speaker coil against the magnet. Regardless, this is a very economical option to bootstrap a 30 element speaker array. I hope that a theoretical US$20 version would have better, longer cables and would be sufficiently loud and consistent.
(This is the 13th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
I sent similar text to a friend:-
It occurred to me that you've got no idea what I've been doing. Essentially, I'm going up the stack to increase value. The merit of doing this was demonstrated very succinctly by a friend who made US$800 per month on EBay by selling 3D printers and 3D printer parts. (Apparently, people pay US$50 for a print head which consists of a bolt with a hole through its length and a coil of wire which acts as a heating element.) My friend explained principles which could be applied widely and they seem to work. Specifically, adding another step in the chain often doubles value.
When this principle is applied to network protocols, the obvious move is to have content which can be delivered over a network. Even if this strategy fails, the content has value and alternative methods of distribution can be found. Following this principle, I suggested the development of content. I wish that I had emphasised this much more. Since suggesting this, companies such as Amazon have:-
- Formed streaming video divisions.
- Developed content in parallel with distribution systems.
- Gained subscribers for content.
- Obtained awards for content.
Indeed, it has been noted that traditional US broadcasters received no awards at the 2015 Golden Globes and this trend may continue.
Streaming video may be a saturated market but streaming audio has been neglected. With bandwidth sufficient for competitive streaming video, it is now possible to stream high-quality, 24 bit, surround-sound audio. Indeed, from empirical research, it appears that audio can be streamed over a connection with 70% packet loss and still retain higher quality than a RealAudio or Skype stream with 0% packet loss.
From attempts to replicate the work of others, I've found a method to split audio into perceptual bands. If particular bands are retrieved with priority, it is possible to obtain approximately half of the perceptual quality in 5% of the bandwidth. The technique uses relatively little processing power; to the extent that it is possible to encode or decode CD quality audio at 500KB/s on an eight year old, single core laptop.
The technique uses the principle of sigma-delta coding where it is not possible to represent an unchanging level of sound. This limitation can be mitigated by having a hierarchy of deltas. (And where this fails, we follow Meridian Lossless Packing and provide a channel for residual data.) Ordinarily, most people would choose a binary hierarchy but this leaves us two techniques deep and still encountering significant technical problems. Specifically, a binary hierarchy of sigma-delta encodings practically doubles the size of the encoding and may increase the required processing power by a factor of 40 or more.
A consideration of buffers allows other hierarchical schemes to be considered. The buffer for encoding and decoding is w*x^y samples where w is always a multiple of 8 (to allow encodings to always be represented with byte alignment). After rejecting x=1 (the trivial implementation of sigma-delta encoding) and x=2 (binary hierarchy), other values were investigated. x=4, x=8 and x=9 resolve to other cases. The most promising case is x=3 which provides a good balance between choice of buffer size, minimum frequency response, packet requests and re-requests, perceptual quality in the event of packet loss, encoding overhead and processing power requirements.
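A minimal sketch of the x=3 case, assuming each parent is the rounded mean of its three children (the predictor and the names are illustrative assumptions, not the production technique):

```c
#include <stdint.h>

#define Y 2   /* tiers below the root; buffer is 3^Y = 9 samples */

/* In-place encode: tiers[0] ends with 1 coarse value, tiers[1] with 3
 * deltas, tiers[2] with 9 deltas. The caller provides tiers[t] of length
 * 3^t, with the samples pre-loaded into tiers[Y]. */
void encode_hierarchy(int32_t *tiers[Y + 1])
{
    for (int t = Y; t > 0; t--) {
        int n_parents = 1;
        for (int i = 1; i < t; i++) n_parents *= 3;   /* 3^(t-1) */
        for (int p = 0; p < n_parents; p++) {
            int32_t *c = &tiers[t][3 * p];
            tiers[t - 1][p] = (c[0] + c[1] + c[2]) / 3;  /* parent approximation */
            /* The three deltas of one parent sum to approximately zero:
             * the bias which dovetails across tiers. */
            c[0] -= tiers[t - 1][p];
            c[1] -= tiers[t - 1][p];
            c[2] -= tiers[t - 1][p];
        }
    }
}

/* Exact inverse: rounding in the mean does not matter because the deltas
 * record whatever the approximation missed. */
void decode_hierarchy(int32_t *tiers[Y + 1])
{
    for (int t = 1; t <= Y; t++) {
        int n_parents = 1;
        for (int i = 1; i < t; i++) n_parents *= 3;
        for (int p = 0; p < n_parents; p++) {
            int32_t *c = &tiers[t][3 * p];
            c[0] += tiers[t - 1][p];
            c[1] += tiers[t - 1][p];
            c[2] += tiers[t - 1][p];
        }
    }
}
```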
Unlike x=5 or x=7, x=3 also provides the most bias for arithmetic compression. Given that encoding is represented as 1:3 hierarchies of differences, an approximation at one level often creates three opposing approximations at the next level. Over one cascade, zero, one, two or three increases correspond with three, two, one or zero decreases and, in aggregate, a bias at one tier dovetails with an opposing bias in the preceding tier to the extent that arithmetic compression of 20% can be expected with real-world 16 bit audio data.
x=3 also provides a compact representation for URLs when requesting fragments of data from stateless servers. Many representations are functionally equivalent. One particularly graphic representation [not enclosed] shows how tiers of ternary data may be represented in one byte of a URL. Although the representation appears sparse, approximately half of the possible representations are used and therefore only one bit per byte is wasted. An alternative representation is pairs of bits, aa bb cc dd, ee ff gg hh, ii jj kk ll where each pair may be 01, 10 or 11 when traversing down the tree and 00 is a placeholder when the desired node has been reached. This creates the constraint that sequences of 00 must be contiguous and stem from one end. However, this also allows URLs to be abbreviated to reduce bandwidth. A further representation provides three sub tiers per byte rather than four but allows logging in printable ASCII.
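A sketch of the pairs-of-bits representation, assuming branch digits 0-2 and the 00 placeholder for the end of the path (the function name is illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Pack a ternary tree path into bytes, four 2-bit fields per byte.
 * Field values 1..3 select a branch; 0 marks the end of the path, so
 * trailing zero fields are contiguous and stem from one end. */
size_t pack_path(const uint8_t *branches, size_t depth,
                 uint8_t *out, size_t out_cap)
{
    size_t n_bytes = (depth + 3) / 4;   /* four branches per byte */
    if (n_bytes > out_cap) return 0;
    for (size_t b = 0; b < n_bytes; b++) out[b] = 0;
    for (size_t i = 0; i < depth; i++) {
        /* branches[i] is 0, 1 or 2; store as 1, 2 or 3 */
        out[i / 4] |= (uint8_t)((branches[i] + 1) << (6 - 2 * (i % 4)));
    }
    return n_bytes;
}
```

Of the 256 possible byte values, only the 121 with contiguous trailing zero pairs are valid (3^4 + 3^3 + 3^2 + 3 + 1 = 121), which matches the observation that approximately half of the representations are used and roughly one bit per byte is wasted.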
The BBC's Dirac video codec shows that it is possible to encode frames of video in an analogous manner. Specifically, frames of video can be encoded as a tree of differences. Trees may be binary trees, ternary trees or other shapes.
Overall, this follows a strategy in which:-
- Users have a compelling reason to use a system.
- Value is obtained by folding applications into a URL-space rather than fitting code to legacy constraints.
- Servers have low computational and interactive requirements.
Continuing from the previous article, here is an outline specification sent to a de-correlation expert:-
A system exists for transferring data. This system allows data to be transferred reliably outside of the parameters of TCP/IP with Window Scaling and DSACK. Development of the system began as an independent project but has subsequently been funded by [redacted]. At one point, the system was being developed by five programmers, one mathematician, one graphic designer and two administrative staff.
Regardless, the system does not process data as it is received but it is desirable to add such functionality as a library. This capability has been identified as a financially viable market. (See Janus Friis and Niklas Zennström's development of Kazaa, Skype, Joost and other ventures.) Furthermore, the market can be segmented into real-time streaming of low-quality audio and delayed delivery of high quality audio. In each case, it is possible to exceed the perceptual quality of Compact Discs and Long Playing Records respectively.
Many systems provide streaming below this quality (FM radio, digital radio, NICAM, MP3 delivery, RealAudio) but the ability to provide higher quality audio will improve as bandwidth increases. Even if this market is not broadly viable, it remains a lucrative niche for audiophiles with large disposable income. Even if this market is unviable, there remain applications for higher quality audio or a more compact representation of lossless audio. This may be with or without an accompanying video codec.
The benchmark for sound quality is Compact Disc "Red Book" audio; introduced in 1980. This specifies 2KB sectors with Reed-Solomon regenerative checksums. Each sector provides 1,024 16 bit PCM samples for one of two audio channels at a sample rate of 44.1kHz. Techniques to smooth the 1% of dropped sectors create a frequency floor of approximately 43Hz irrespective of data encoded on a disc. This creates a notable absence of bass frequencies and a perceptual comb of frequencies which are integer multiples of 43Hz.
Although modern audio codecs have good perceptual response and raise the nominal sampling frequency from 44.1kHz to 48kHz or higher, the encoded datarate of MP3, AAC and similar codecs is often less than 25KB/s over two channels. This is opposed to 176KB/s for CD audio.
For an extended period, one of the system developers has been aware of efforts by a company in Cambridge, England which may be regarded as a direct competitor. Meridian first came to our attention around 1982 when the development of [MLP] Meridian Lossless Packing was demonstrated in a particularly rigged manner on the BBC [British Broadcasting Corporation]'s programme Tomorrow's World. The demo involved the chassis of a CD player and a two digit LED display. When playing a conventional Compact Disc, the display read "16" to indicate 16 bit PCM [Pulse Code Modulation] data. When playing an MLP Disc, the display read "4" to indicate 4 bits per sample after MLP. However, a cursory investigation of MLP, such as http://www.meridian-audio.com/w_paper/mlp_jap_new.PDF via http://en.wikipedia.org/wiki/Meridian_Lossless_Packing, reveals that the lossless encoding technique removes an *average* of 11 bits of data per sample - leaving approximately 5 bits per sample. Furthermore, the switch from 16 bit to 4 bit was performed during a cutaway shot and therefore the BBC may have been complicit in this rigged demo. Finally, any attempt to exhibit the perceptual quality of MLP over a television broadcast was futile because the channel capacity of television audio is lower than Compact Disc quality.
Further involvement with Meridian came in the form of a declined interview after a long discussion about digital speaker synchronisation and the computational requirements of wireless digital headphones.
One of the system developers is also aware of SACD [Super Audio Compact Disc], an audio encoding standard championed by Professor Jamie Angus and included in the first and second revisions of the Sony PlayStation 3. Attempts to re-implement this technique in a digital system required an inordinate amount of bandwidth. Regardless, experiments led to a novel encoding technique and a greater appreciation of analog circuitry.
Regardless, Meridian's technique for reducing datarate is an ideal template for further development, although care has to be taken to avoid known patents. The basic MLP specification is that:-
- An encoding technique should allow multiple, arbitrary 24 bit streams to be interleaved without change to the bit sequences.
- An encoding technique may make assumptions about the presence of return-to-zero data and therefore compression may only occur when suitable audio data is presented.
- Significant compression may be obtained by observing correlations between channels.
- Finite Impulse Response filters can be used to reduce the volume of data to be encoded.
- Lossy techniques and predictive systems may be deployed within an encoder and a decoder. A residual channel of data makes lossy, deterministic techniques into a lossless system.
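A minimal sketch of the last point, assuming the simplest possible predictor (repeat the previous sample); the residual channel turns this lossy, deterministic prediction into a lossless round trip:

```c
#include <stdint.h>

/* Encoder: emit only what the predictor missed. */
void encode_residual(const int32_t *in, int32_t *residual, int n)
{
    int32_t prediction = 0;
    for (int i = 0; i < n; i++) {
        residual[i] = in[i] - prediction;  /* predictor error */
        prediction  = in[i];               /* decoder can form the same prediction */
    }
}

/* Decoder: the same predictor plus the residual reconstructs exactly. */
void decode_residual(const int32_t *residual, int32_t *out, int n)
{
    int32_t prediction = 0;
    for (int i = 0; i < n; i++) {
        out[i]     = prediction + residual[i];
        prediction = out[i];
    }
}
```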
Further requirements for streaming are as follows:-
- It is desirable to split streams into components so that they may be prioritised to maximise perceptual quality.
- Therefore, it is desirable to extract signal from streams prior to the application of techniques such as FIR.
In addition to streaming and otherwise exceeding the MLP specification, the following is desirable:-
- It should be possible to decode sound on legacy hardware. An embedded device should be able to decode one channel of sound and output 6 bit quality or better. A legacy desktop should be able to decode two or more channels of sound at 8 bit quality.
- It should also be possible to encode sound of limited quality on legacy hardware.
- When data is missing, it should be possible to re-construct sound with frequencies below 43Hz. Ideally, it should be an option to encode and re-construct sound with frequencies which are multiples of 1µHz or less. (At sample rates of 44.1kHz or more, this may require block sizes exceeding 44 billion samples. For 24 bit samples, this requires 120GB or more per channel. It may also require handling of RMS [Root Mean Square] errors with a magnitude of 82 bits or more. See the worked figures after this list.)
- Entropy encoding may be more effective if there is a bias in the symbol frequencies. Therefore, blocks may be a multiple of x^y where x is odd rather than even.
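A worked check of the block-size figures above, assuming 1µHz resolution at 44.1kHz and signed 24 bit samples (so squared errors reach 2^46):

$$N = \frac{44100\ \mathrm{Hz}}{10^{-6}\ \mathrm{Hz}} = 4.41\times10^{10}\ \text{samples}, \qquad 3\ \text{bytes} \times N \approx 132\ \mathrm{GB\ per\ channel}$$

$$\log_2\!\big(N\cdot(2^{23})^2\big) = \log_2 N + 46 \approx 82\ \text{bits}$$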
Finally, all techniques are valid. A direct clone of SACD or MLP is not desirable due to licensing issues. However, simplified or novel techniques around SACD and MLP are highly desirable.
Further mention of Meridian, sigma-delta encoding and related topics.
What is the minimum frequency response which should be represented by an audio codec? Well, what is the best analog or digital system and is it worthwhile to approach, match or exceed its specification?
Analog record players have an exceptionally good minimum frequency response. Technically, the minimum frequency response is determined by the duration of the recording. So, a 45 minute long-play record has a fundamental frequency of 1/2700Hz - which is about 0.0004Hz. Does this make any practical difference? Most definitely yes. People lampoon analog purists or state that LP and CD are directly equivalent. Some of the lampooning is justified but the core argument is true. Most implementations of Compact Disc Audio don't reproduce the format's theoretical low-frequency response.
Many Compact Disc players boldly state "1-Bit DAC Oversampling" or suchlike. The 1-Bit DAC is simplistically brilliant and variants are used in many contexts. However, oversampling is the best of a bad bodge. The optics of a Compact Disc are variable and, even with FEC in the form of Reed-Solomon encoding, about 1% of audio sectors don't get read. What happens in this case? Very early CD players would just output nothing. This technique was superseded with duplicated output. (Early Japanese CD players had an impressive amount of analog circuitry and no other facility to stretch audio.)
Eventually, manufacturers advanced to oversampling techniques. In the event that data cannot be obtained optically from disc, gaps are smoothed with a re-construction from available data. Unfortunately, there is a problem with this technique. Nothing below 43Hz can be re-constructed. 2KB audio sectors have 1024×16 bit samples and samples are played at exactly 44.1kHz. So, audio sectors are played at a rate of approximately 43Hz and any technique which extrapolates audio waves across a missing sector has a fundamental frequency of 43Hz. Given that drop-outs occur with some correlation at a rate of 1%, this disrupts any reproduction of frequencies below 43Hz. For speech, this would be superior to GSM's baseline AMR which has 50Hz audio blocks. For music, this is a deal-breaker.
Remove the intermittent reading problems and the fundamental frequency is the length of the recording. So, a 16 bit linear PCM .WAV of 74 minutes has a fundamental frequency of approximately 0.0002Hz. The same data burned as audio sectors on a Compact Disc only has a fundamental frequency of 43Hz. (I'll ignore the reverse of this process due to cooked sectors.)
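The arithmetic behind both figures:

$$f_{\text{sector}} = \frac{44100\ \mathrm{Hz}}{1024\ \text{samples}} \approx 43.07\ \mathrm{Hz}, \qquad f_{\text{recording}} = \frac{1}{74 \times 60\ \mathrm{s}} \approx 0.000225\ \mathrm{Hz}$$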
So, it is trivial to retain low frequencies within a digital system. Just ensure that all data is present and correct. Furthermore, low frequencies require minimal storage and can be prioritized when streaming.
A related question is the minimum frequency response for a video codec. That's an equally revealing question.
(This is the 10th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
Audio and video streaming benefits from stream priorities and three dimensional sound can be streamed in four channels but how can this be packed to avert network saturation while providing the best output when data fails to arrive in full?
The first technique is to de-correlate audio. This involves cross-correlation and auto-correlation. Cross-correlation has a strict hierarchy. Specifically, all of the audio channels directly or indirectly hinge upon a monophonic audio channel. It may be useful for three dimensional sound to be dependent upon two dimensional sound, which is dependent upon stereophonic sound, which is dependent upon monophonic sound. Likewise, it may be useful for 5.1 surround sound and 7.1 surround sound to be dependent upon the same stereophonic sound. Although this allows selective streaming and decoding within the constraints of available hardware, it creates a strict pecking-order when attempting to compress common information across multiple audio channels.
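One rung of this hierarchy, sketched as plain reversible mid/side coding (function names are illustrative): the monophonic tier is the mid, and the stereophonic tier only adds a typically small left-minus-right side channel.

```c
#include <stdint.h>

/* Integer-reversible mid/side, as used by lossless codecs.
 * Assumes arithmetic right shift for negative values. */
void ms_encode(int32_t l, int32_t r, int32_t *mid, int32_t *side)
{
    *side = l - r;               /* left-minus-right */
    *mid  = r + (*side >> 1);    /* reconstructible mid */
}

void ms_decode(int32_t mid, int32_t side, int32_t *l, int32_t *r)
{
    *r = mid - (side >> 1);
    *l = *r + side;
}
```

The same pattern repeats upward: each extra dimension only adds difference channels against the tier below.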
Cross-de-correlated data is then auto-de-correlated and the resulting audio channels are reduced to a few bits per sample per channel. In the most optimistic case, there will always be one bit of data per sample per channel. This applies regardless of sample quality. For muffled, low quality input, it won't be much higher. However, for high quality digital mastering, expect a residual of five bits per sample per channel. So, on average, it is possible to reduce WXYZ three dimensional Ambisonics to 20 bits per time-step. However, that's just an average. So, we cannot arrange this data into 20 priority levels where, for example, vertical resolution dips first.
Thankfully, we can split data into frequency bands before or after performing cross-de-correlation. By the previously mentioned geometric series, this adds a moderate overhead but allows low frequencies to be preserved even when bandwidth is grossly inadequate.
It also allows extremely low frequencies to be represented with ease.
(This is the ninth of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
A well-known trick with streaming media is to give priority to audio. For many viewers, video can glitch and frame rate can reduce significantly if audio quality is maintained. Within this constraint, monophonic audio can provide some continuity even when surround sound or stereo sound fails. This requires audio to be encoded as a monophonic channel and a left-minus-right channel but MLP demonstrates that mono, stereo and other formats can be encoded together in this form and decoded as appropriate. How far does this technique go? Well, Ambisonics is the application of Laplace Spherical Harmonics to audio rather than chemistry, weather simulation or a multitude of other purposes.
After getting through all the fiddly stuff, like cardioids, A-format, B-format, UHJ format, soundfield microphones, higher order Ambisonics and "just what the blazes is Ambisonics?", we get to the mother of all 3D sound formats and why people buy hugely expensive microphones, record ambient sounds and sell the recordings to virtual reality start-ups who apply trivial matrix rotation to obtain immersive sound.
Yup. That's it. Record directional sound. Convert it into one channel of omnidirectional sound and three channels of directional sound (left-minus-right, front-minus-back, top-minus-bottom). Apply sines and cosines as required or mix like a pro. The result is a four channel audio format which can be streamed as three dimensional sound, two dimensional sound, one dimensional sound or zero dimensional sound and mapped down to any arrangement of speakers.
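A minimal sketch of that conversion for one monophonic source, assuming the common first-order B-format convention with W attenuated by 1/√2 and angles in radians:

```c
#include <math.h>

/* Pan one mono sample s to WXYZ. W is omnidirectional; X, Y and Z are the
 * front-minus-back, left-minus-right and top-minus-bottom components. */
void encode_bformat(float s, float azimuth, float elevation,
                    float *w, float *x, float *y, float *z)
{
    *w = s * 0.70710678f;                       /* 1/sqrt(2), omnidirectional */
    *x = s * cosf(azimuth) * cosf(elevation);   /* front-minus-back */
    *y = s * sinf(azimuth) * cosf(elevation);   /* left-minus-right */
    *z = s * sinf(elevation);                   /* top-minus-bottom */
}
```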
For technical reasons, a minimum of 12 speakers (and closer to 30 speakers) are required for full fidelity playback. This can be implemented as a matrix multiplication with four inputs and 30 outputs for each time-step of audio. The elements of the matrix can be pre-computed for each speaker's position, to attenuate recording volume and to cover differences among mis-matched speakers. (Heh, that's easier than buying 30 matched speakers.) At 44.1kHz (Compact Disc quality), 1.3 million output samples (5.3 million multiplies) per second are required. At 192kHz, almost six million output samples (23 million multiplies) per second are required for immersive three dimensional sound.
For downward compatibility, it may be useful to encode 5.1 surround sound, 7.1 surround sound with Ambisonics. Likewise, it may be useful to arrange speakers such that legacy 5.1 surround sound, 7.1 surround sound, 11.1 surround sound or 22.2 surround sound can be played without matrix mixing.
Using audio amplifiers, such as the popular PAM8403, it is possible to put 32×3W outputs in a 1U box. This is sufficiently loud for most domestic environments.
Test data for audio codecs:-
Test data for video codecs:-
(This is the eighth of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
How can the most data be sent to the most users using the least hardware and bandwidth? Minimize transient state. TCP has a large, hidden amount of state; typically more than 1MB per connection for bulk transfers. This hidden cost can be eliminated with ease but it introduces several other problems. Some of these problems are solved, some are unsolved.
The most obvious reason to use TCP is to obtain optimal PMTU utilization. Minimizing packet header overhead is a worthy goal but it doesn't work reliably in practice. Multi-path TCP, or just an increasingly dynamic network, makes optimal PMTU discovery a rapidly shifting task. Ignoring this, is it really worth stuffing payloads with the optimal number of bytes when all opportunity to multi-cast is discarded? That depends upon workload but it only takes a moderate number of scale-out cases to skew the general case. One such case is television.
Optimal PMTU utilization also fails when an ISP uses an unusual PMTU. PMTU discovery is significantly more challenging when IPv6 has no in-flight packet fragmentation and many IPv6 devices implement the minimum specification of a 1280 byte payload. I'd be inclined to ignore that but tunneling and inter-operability mean the intersection rather than union of IPv4 and IPv6 features has to be considered. (It has been noted that IPv6 raises the specified minimum payload, but triple-VLAN IPv4 over Ethernet has a larger MTU and is the worst case in common use.)
UDP competes at a disadvantage within a POSIX environment. The total size of UDP buffers may be severely hobbled. Likewise, POSIX sendfile() is meaningless when applied to UDP. (This is the technique which allows lighttpd to serve thousands of static files from a single-threaded server. However, sendfile() only works with unencrypted connections. Netflix has an extension which allows TLS session keys to be shared with the FreeBSD kernel but the requirement to encrypt significantly erodes TCP's advantage.)
Some people have an intense dislike of UDP streaming quality but most of that experience occurred prior to BitTorrent or any Kodi plug-in which utilizes BitTorrent in real-time. No-one complains about the reliability or accuracy of BitTorrent although several governments and corporations would love to permanently eliminate anything resembling BitTorrent plus Kodi.
From techniques in common use, a multi-cast video client has several options when a packet is dropped and there is sufficient time for re-send, such as requesting re-transmission from the server, fetching the missing fragment from a peer or re-constructing it from FEC data.
When time for re-send is insufficient, there are further options, such as repeating the previous frame, interpolating across the gap or dropping to a lower quality stream.
For a practical example of low quality live-streaming with high quality recording, see FPV drones. In this scenario, remote control airplanes may broadcast low quality video. Video may be monochromatic and/or analog NTSC. Video may be stereoscopic and possibly steerable from an operator's headset. Several systems super-impose altitude, bearing, temperature, battery level and other information. With directional antennas, this is sufficient to fly 10km or more from a launch site. Meanwhile, 1920×1080p video is recorded on local storage. The low quality video is sufficient for precision flying while the high quality video can be astoundingly beautiful.
Anyhow, UDP video can be as live and crappy as a user chooses. Or it may be the optimal method for distributing cinema quality video. And a user may choose either case when viewing the same stream.
(This is the seventh of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
Using UDP for streaming video seems idiotic and is mostly deprecated. However, it has one huge advantage. It works really well with a large number of concurrent users. Within reason, it is worthwhile to maximize this number and minimize cost. Network games can be quite intensive in this regard, with SecondLife previously rumored to have two users per server and WorldOfWarcraft having 20 users per server. At the other extreme, Virtudyne (Part 1, Part 2, Part 3, Part 4) aimed to run productivity software with 20 million users per server. That failed spectacularly after US$200 million of investment.
Maintaining more than 10000 TCP connections requires a significant quantity of RAM; often more than 1MB per connection. A suitably structured UDP implementation doesn't have this overhead. For example, it is possible to serve 40 video streams from a single-threaded UDP server with a userspace of less than 1MB and kernel network buffers less than 4MB. All of the RAM previously allocated to TCP windowing can be re-allocated to disk caching. Even with a mono-casting implementation, there are efficiency gains when multiple users watch the same media. With stochastic caching techniques, it is easy for network bandwidth to exceed storage bandwidth.
There is another advantage with UDP. FEC can be sent speculatively or such that one resend may satisfy multiple clients who each lack differing fragments of data. It is also easy to make servers, caches and clients which are tolerant to bit errors. This is particularly important if a large quantity of data is cached in RAM for an extended period.
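A sketch of the multi-client repair trick, assuming the simplest FEC of all, an XOR parity packet over a group of fragments: clients missing *different* fragments each recover their own gap from the same multicast packet.

```c
#include <stdint.h>
#include <stddef.h>

/* Compute one parity fragment over a group of equal-length fragments. */
void xor_parity(const uint8_t *fragments, size_t n_fragments,
                size_t fragment_len, uint8_t *parity)
{
    for (size_t i = 0; i < fragment_len; i++) parity[i] = 0;
    for (size_t f = 0; f < n_fragments; f++)
        for (size_t i = 0; i < fragment_len; i++)
            parity[i] ^= fragments[f * fragment_len + i];
}
/* A client holding all fragments but one XORs the parity with what it has
 * to regenerate the missing fragment - whichever fragment that is. */
```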
So, what is a suitable structure for a UDP server? Every request from a client should incur zero or one packets of response. Ideally, a server should respond to every packet. However, some communication is obviously corrupted, malformed or malicious and isn't worth the bandwidth to respond. Also, in a scheme where all communication state is held by a client and all communication is pumped by a client, resends are solely the responsibility of a client.
From experience, it is wrong to implement a singleton socket or a set of client sockets mutually exclusive with a set of server sockets. Ideally, library code should allow multiple server sockets and multiple client sockets to exist within userspace. This facilitates cache or filter implementation where one program is a client for upstream requests and a server for downstream requests.
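Finally, a minimal sketch of the request/response discipline described above (the request format and port are illustrative assumptions): every datagram earns at most one datagram in reply and the server holds no per-client state.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);             /* illustrative port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) return 1;

    for (;;) {
        unsigned char req[64], reply[1280];  /* conservative IPv6-safe payload */
        struct sockaddr_in peer;
        socklen_t peer_len = sizeof peer;
        ssize_t n = recvfrom(fd, req, sizeof req, 0,
                             (struct sockaddr *)&peer, &peer_len);
        if (n < 8) continue;                 /* malformed or malicious: no reply */
        /* The client names the fragment it wants; the server keeps no session
         * and all resends are the client's responsibility. */
        memcpy(reply, req, 8);               /* echo the fragment identifier */
        /* ... fill the remainder of reply[] from cache or disk here ... */
        sendto(fd, reply, sizeof reply, 0, (struct sockaddr *)&peer, peer_len);
    }
}
```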