My Ideal Operating System, Part 1

Posted by cafebabe on Tuesday April 17 2018, @03:01PM (#3152)
1 Comment
OS

I'm quite impressed with the concept of an exo-kernel. It is one of many variants in which functionality is statically or dynamically linked with user-space code. This variant could be concisely described as applying the principles of micro-controller development to desktop applications and servers.

In the case of micro-controllers, it is typical to include what you need, when you need it, and deploy on bare hardware. Need SPI to access Micro SD and read-only FAT32 to play MP3 audio? Well, include those libraries. Otherwise don't.

In the case of desktop applications, it is possible to include stub libraries when executing as one process on a legacy operating system, or to include full libraries and deploy on bare hardware or a virtual machine.

In the case of a server, the general trend is towards containers of various guises. While there are good reasons to aggregate under-utilized systems into one physical server, peak performance may be significantly reduced. For x86, the penalty was historically 15% due to Intel's wilful violation of the Popek and Goldberg virtualization requirements. After Spectre and Meltdown, some servers incur more than one third additional overhead. Ignoring performance penalties, container bloat and the associated technical debt, the trend is to place each network service and application in its own container. This creates numerous failure modes when containers start in a random order: init systems avoid race conditions within one container, but if each service runs in a separate container, this trivial safeguard is defeated.

Regardless, in the case of a server, an application may require a JavaScript Just In Time compiler, read-only NFS access to obtain source code for compilation and a database connection. All of this may run inside a container with externally enforced privileges. However, there is considerable overhead to provide network connections within the container's kernel-space while the compiler (and application) run in user-space. In the unlikely event that a malicious party escapes from the JavaScript, nothing is gained if network connections are managed in a separate memory-space. If we wish to optimize for the common case, we should have application and networking all in user-space or all in kernel-space. Either option requires a small elevation of privileges but the increased efficiency is considerable compared to the increased risk.

Running an application inside a container may require a fixed allocation of memory unless there is an agreed channel to request more. People may recoil in horror at the concept of provisioning memory and storage for applications but the alternative is the arrangement popularized by Microsoft and Apple where virtual memory is over-committed until a system becomes unstable and unresponsive. The default should be a system which is as secure and responsive as an 8 bit computer - and one which provides an overview of what the system is doing at all times.

Similar arrangements may apply to storage. It is possible to have an arrangement where a kernel enforces access to local storage partitions and ensures that file meta-data is vaguely consistent but applications otherwise have raw access to sectors. If this seems similar to the UCSD p-code filing system, a Xerox Alto or my ideal filing system, that is entirely understandable. Xerox implementations of OO, GUIs and storage remain contentious but storage is the least explored.

The concept of an exo-kernel makes this feasible at the current scale of complexity and has certain benefits. For example, I previously proposed use of an untrusted computer for multi-media and trustworthy computers for physical security and process control. Trustworthy computers currently fall into three cases:-

  1. Relatively trustworthy micro-controllers of 40MHz or more. These have limited power dissipation and may be programmed on-site to user requirements. This limits the ability to implement unwanted functionality. It may be possible to access micro-controller memory via radio but this is a tedious task if each site has a bespoke configuration.
  2. Legacy 8 bit computers of 2MHz or less. Tampered firmware must work within very limited resources. It is also slow and difficult to tamper with a system which is constructed 10 years or more after an attack.
  3. A mini-computer design which is likely to run at 0.1MHz or less. It cannot rely upon security by obscurity, but a surface mount 8 bit micro-coded mini-computer simulating a 64 bit virtual machine is, at present, an unusual case for an aspiring attacker.

In the previous proposal, there is a strict separation of multi-media and physical processes with the exception that some audio functionality may be available on trustworthy devices. This was limited to a micro-controller which may encode or decode lossy voice data, decode lossy MP3 audio or decode lossless balanced ternary Ambisonics at reduced quality. Slower devices may decode monophonic lossless balanced ternary audio at low quality. The current proposal offers more choices for current and future hardware. As one of many choices, the Contiki operating system is worth consideration. It was originally a GUI operating system for the Commodore 64 with optional networking. It is now typically used on AVR micro-controllers without a GUI. I previously assumed that Contiki was written in optimized 6502 assembly and then re-written for other systems but this is completely wrong. It is 100% portable C which runs on the Commodore 64, MacOS, Linux, AVR, ARM and more.

How does Contiki achieve cross-platform portability with no architecture specific assembly to save processor registers during context switches? That's easy. It doesn't context switch because it implements shared-memory, co-operative multi-tasking. How else do you expect it to work on systems without super-user mode or memory management? I've suffered Amiga WorkBench 1.3, Apple MacOS 6 and RISC OS, so I know that co-operative multi-tasking is flaky. However, when everything is written in a strict subset of C and statically checked, larger implementations are less flaky than legacy systems.
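
For the curious, the trick is easy to sketch in portable C. The macros below are made up for illustration (they are not Contiki's API), but they show how a task can yield and resume with no saved registers and no private stack: the entire "context" is one integer holding a line number.

    /* A protothread-style sketch in plain, portable C.  Hypothetical
     * macros, NOT Contiki's API; the point is that a task's context is
     * one integer, so there is nothing to context switch. */
    #include <stdio.h>

    typedef struct { int line; } pt_t;

    #define PT_BEGIN(pt)  switch ((pt)->line) { case 0:
    #define PT_YIELD(pt)  do { (pt)->line = __LINE__; return; case __LINE__:; } while (0)
    #define PT_END(pt)    } (pt)->line = 0

    static pt_t blinker, logger;

    static void blinker_task(pt_t *pt)
    {
        PT_BEGIN(pt);
        for (;;) {
            puts("blinker: LED on");
            PT_YIELD(pt);            /* hand the CPU back voluntarily */
            puts("blinker: LED off");
            PT_YIELD(pt);
        }
        PT_END(pt);
    }

    static void logger_task(pt_t *pt)
    {
        PT_BEGIN(pt);
        for (;;) {
            puts("logger: tick");
            PT_YIELD(pt);
        }
        PT_END(pt);
    }

    int main(void)
    {
        /* The "scheduler" is a plain loop: no context switch, no MMU, no
         * supervisor mode.  Each task runs until its next yield point. */
        for (int i = 0; i < 3; i++) {
            blinker_task(&blinker);
            logger_task(&logger);
        }
        return 0;
    }

Contiki's real protothreads are, give or take, a more polished version of this switch-on-a-line-number trick, which is why the whole system stays in portable C.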

Contiki's example applications include a text web browser and desktop calculator. Typically, these are compiled together as one program and deployed as one system image. The process list is typically fixed at compilation but it is possible to load additional functionality into a running process. This is akin to adding a plug-in or dynamic library. Although it is possible to have dynamic libraries and suchlike, this increases system requirements. Specifically, it requires a filing system and some platform specific understanding of library format. Although there is a suggested GUI and Internet Protocol stack, there are no assumptions about audio, interrupts or filing system. Although Contiki is not advertised as an exo-kernel, it is entirely compatible with the philosophy of including what you want, when you want it.

With relatively little work, it would be possible to make a text console window system and/or web browser and/or streaming media player with the responsiveness, stability and security of an Amiga 500 running a Mod player on interrupt. It is also possible to migrate unmodified binaries to a real-time operating system. In this arrangement, all GUI tasks run co-operatively in shared memory in the bottom priority thread. All real-time processes pre-empt the GUI. If the GUI goes astray, it can be re-initialized in a fraction of a second with minimal loss of state and without affecting critical tasks. This arrangement also allows development and testing under Unix via XWindows or Aqua. In the long-term, it may be possible to use Contiki as a scaffold and then entirely discard its code.

If media player plug-ins are restricted to one scripting language (such as Lua which runs happily on many micro-controllers), it is possible to make a media player interface which is vastly more responsive than Kodi - even when running on vastly inferior hardware. As an example, an 84MHz Atmel micro-controller may drive a VGA display and play stereo audio at 31kHz. Similar micro-controllers are available in bulk for less than US$1. Although this arrangement has a strict playback rate and no facility for video decode, it is otherwise superior to a 900MHz Raspberry Pi running Kodi.

They're water-soluble but that doesn't mean they're harmless

Posted by Azuma Hazuki on Friday April 13 2018, @07:51PM (#3145)
15 Comments
/dev/random

I've been tinkering with various B vitamins recently since discovering what seems to be an MTHFR polymorphism or six in my genome. It's just a guess, as I can't spare the money for testing, but the immediate positive effects I've felt from certain forms of certain vitamins all but confirm a) MTHFR SNPs and b) an over-methylation pattern. Which *sounds* paradoxical at first, but really isn't.

People tend to be a little flippant with vitamin C and the B-family since they're water-soluble, reasoning "eh, if I overdose all it means is I get really expensive and really yellow pee." Nooooot...exactly. That's not wrong, but the little buggers will do plenty else before they exit via the kidneys. Here's what I've noticed:

Niacin/B3 - Produces the famous "niacin flush," though much less pronounced than in the first week of taking. About 100-200mg daily. Supposedly there's no harm in taking small (10) integer multiples of this dose, even though 200mg is supposedly almost 2 weeks' worth. Calms me down immensely and helps me sleep. It's also supposed to be good for lowering cholesterol, which is well within normal limits for me, but every little bit helps. Overall definitely a positive.

Pyridoxine/B6 (as pyridoxal-5-phosphate) - Holy crap, this is bad for me. It makes me sleepy and weak and ravenously hungry, then incredibly angry after I eat. How angry? I scared off an almost seven foot tall, 300-pound-plus man at work today. He actually decided not to order because, and this is a direct quote, "Your body language. You're angry and it's scaring me." Now yes, I look pretty much like a six-foot, Caucasian version of my namesake in glasses, and yes, I've been nicknamed "Grumpy Cat" by three separate co-workers at three separate jobs, but that is *bad.* Not touching this one again, at least not before work. Seems to be amping up my metabolism and producing (a lot) more catecholamines such as adrenaline, which would explain the effects.

Folate (as 6(S)-5-methylfolate) - This is the big tell that I've got an MTHFR problem. I felt immediate relief within half an hour after my first dose. Makes me feel, somehow, wet and cool and "fluffy" inside. Not as calming as niacin but still helps, just in a different way. Good synergy. I'm taking this once every few days now, after having spent 2 weeks repleting myself with a daily dose. I don't seem to need anywhere near as much caffeine since starting this one either.

Cobalamin/B12 (as adenosylcobalamin) - Another one for the "nope" column, at least no more than once every two weeks. Has similar effects to B6, though produces more anxiety than outright hostility. I am guessing it's causing either too much glutamate in the brain or, like B6, possibly upregulating stress hormones.

Vitamin C (as ascorbic acid with bioflavonoids, e.g., rutin and quercetin) - I can't tell if this is having any effects, but it doesn't seem to hurt and is important for iron processing, which in turn is necessary during Shark Week. Taking daily seems not to hurt anything, and might have helped me fight off the last two incipient colds I got.

People need to treat these things with more respect. We get people saying "oh supplements don't work," but if that were the case, there's no way they'd be having such pronounced and immediate effects. And, it seems everyone's body is different and even their metabolisms differ from day to day, so in the end, everyone needs to tailor their supplements and the doses thereof to their own physiology. Overall this is a net positive for me, but I'm probably going to avoid the B6...

Trump Reverses Course and Proposes Rejoining TPP

Posted by DeathMonkey on Thursday April 12 2018, @06:58PM (#3141)
22 Comments
News

President Trump, in a stunning reversal, told a gathering of farm state lawmakers and governors on Thursday morning that he was directing his advisers to look into rejoining the multicountry trade deal known as the Trans-Pacific Partnership, a deal he pulled out of within days of assuming the presidency.

Trump Reverses Course and Proposes Rejoining Trans-Pacific Partnership

My Ideal Filing System, Part 2

Posted by cafebabe on Thursday April 05 2018, @04:54PM (#3125)
4 Comments
Software

I might get funding for a de-duplicated filing system. I took the opportunity to send this to a friend with a background in physics, accounting, video editing, COBOL, Java and Windows:-

I'm feeling flush so I'm sending you a letter. Don't complain that you only receive demands for money. You also get the occasional note from a nutjob.

I won't mention health because it can be extremely frustrating to read about such matters. However, I assume that we're both ambling along to some extent. I made moderate progress with electronics and processor design but nothing of particular merit. Gardening is also progressing. I'm growing some parsley for when we next have fish.

I think that I stopped my ex-boss from bidding on chunks of old super-computer clusters. Even if GPUs weren't cheaper, we'd soon get caught by the horrendous cost of electricity. Due to Moore's law, a 10 year old computer is likely to require at least 30 times more energy per unit of computation. I think this also stopped various cryptographic currency schemes which would presumably run on said hardware.

I presume that my ex-boss continues to seek funding from investors diversifying out of minerals. That would explain continued focus on hardware and high finance. My ex-boss recently suggested a scheme involving file storage. The unstated assumption would be that storage is not out-sourced to a third party. In particular, this would avoid the dumb-ass case of out-sourcing storage to a direct competitor.

I was initially hostile to this idea due to previous research. The technology is viable but the market invariably has one party which accidentally or deliberately prices storage below cost. It only takes one idiot (or genius) to skew the market for everyone. There are also periods when one party freely hosts illicit content and flames out. MegaUpload and RapidShare are two examples.

Anyhow, there is an outside chance that the third iteration of my de-duplicating filing system may gain significant funding and widespread deployment. We had a long discussion about de-duplication during a two hour walk. I'm not sure if you followed all of the details. Apologies if you followed very closely.

The first iteration of my work was prototyped inside a database. It used fixed length records and implemented block de-duplication. Deployed systems typically use anything from 1KB blocks to 4MB blocks (with alarming variation in the checks for unique data). Larger block sizes are better for databases and video. Unexpectedly, 4KB works particularly well with legacy Microsoft Office documents and you can confirm for yourself that these files are exact multiples of 4KB. However, block de-duplication fails horribly with PKZip files. This is important because PKZip is the basis for numerous contemporary file formats, including Java Archives and .docx
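
As a toy illustration of block de-duplication - not the prototype described above - the following reads its input in 4KB blocks, hashes each one and counts how many have been seen before. The file name, table size and FNV-1a hash are arbitrary choices for the example; a real system would use a cryptographic hash and verify block contents before trusting a match.

    /* Toy fixed-size block de-duplication.  FNV-1a and a tiny table are
     * used purely for brevity; collisions in this sketch simply evict
     * the previous entry, which a real system would never tolerate. */
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096
    #define TABLE_SIZE 65536            /* arbitrary in-memory index size */

    static uint64_t seen[TABLE_SIZE];   /* 0 = empty slot */

    static uint64_t fnv1a(const unsigned char *p, size_t n)
    {
        uint64_t h = 1469598103934665603ULL;
        while (n--) { h ^= *p++; h *= 1099511628211ULL; }
        return h ? h : 1;               /* reserve 0 for "empty" */
    }

    int main(int argc, char **argv)
    {
        FILE *f = fopen(argc > 1 ? argv[1] : "input.bin", "rb");
        if (!f) { perror("fopen"); return 1; }

        unsigned char block[BLOCK_SIZE];
        size_t n, total = 0, unique = 0;

        while ((n = fread(block, 1, BLOCK_SIZE, f)) > 0) {
            uint64_t h = fnv1a(block, n);
            size_t slot = (size_t)(h % TABLE_SIZE);
            total++;
            if (seen[slot] != h) {      /* new block: store it */
                seen[slot] = h;
                unique++;
            }                           /* else: only a reference is needed */
        }
        fclose(f);
        printf("%zu blocks read, %zu unique\n", total, unique);
        return 0;
    }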

The second iteration of my work was an attempt to migrate to variable length blocks. The implementation split files at markers typically found within JPEGs and PKZip files. However, I never found a general case for this process and work was slowed because performance was awful. Compression improved the best case at the expense of the worst case. It also led to the problem of how to efficiently store thousands of different lengths of compressed blocks. When prototyping in a database, this can be delegated to the database but how would this be implemented efficiently in a file or raw disk partition? A solution must exist because it was widely used on desktop computers via Stacker and MS-DOS 6's DoubleSpace. The computer scientist Donald Knuth described Fibonacci lengths for memory allocation, possibly in 1964. This is more efficient than the obvious powers of two because the ratio between adjacent sizes is about 1.6 rather than 2, and therefore waste is reduced. With some concessions for powers of two, this is implemented in Sun Microsystems' Arena memory allocator. The same principle applies to storage. For prototyping, a crude arrangement is a de-normalised database with one fixed length database table for each Fibonacci length. There is an analogous arrangement for raw disk.
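
The waste argument is easy to check numerically. This throwaway program - not the prototype code - prints the worst-case internal waste when a request is rounded up to the next Fibonacci length versus the next power of two; the former settles at roughly 38% (the 3/8 mentioned below) and the latter at 50%.

    /* Worst request for a size class is one byte larger than the class
     * below it.  Fibonacci classes grow by ~1.618x, powers of two by 2x,
     * so the worst-case slack is ~38% versus ~50%. */
    #include <stdio.h>

    int main(void)
    {
        long fib[30] = { 1, 2 };
        for (int i = 2; i < 30; i++)
            fib[i] = fib[i - 1] + fib[i - 2];

        puts("Fibonacci class   worst-case waste");
        for (int i = 10; i < 20; i++) {
            long worst_req = fib[i - 1] + 1;   /* just misses the class below */
            printf("%15ld   %5.1f%%\n", fib[i],
                   100.0 * (fib[i] - worst_req) / fib[i]);
        }

        puts("Power-of-two class   worst-case waste");
        for (long cls = 1024; cls <= (1L << 19); cls *= 2) {
            long worst_req = cls / 2 + 1;
            printf("%18ld   %5.1f%%\n", cls,
                   100.0 * (cls - worst_req) / cls);
        }
        return 0;
    }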

None of these prototypes handled random access or full-text indexing, like EMC's systems. I was also concerned that, in the most pessimistic case, 3/8 of storage was wasted and, in the typical case, 3/16 was wasted. Waste can be reduced by tweaking figures. However, it should be apparent that there are diminishing returns that cannot be eliminated - unless the uncompressed data is stored in blocks of Fibonacci length.

That's the basis for the third iteration. Although it overcomes previous limitations, it has not been prototyped because it is expected to require far more work than the previous iterations. This is particularly true after a similar number of iterations prototyping full-text indexing. Someone sensible and numerate should ensure that talk of Fibonacci numbers, golden ratios, factorials, random walks, birthday paradoxes and Poisson distributions is not the work of a numerological fringe lunatic heading towards a third educational but uncompetitive implementation. (If you're mad enough, there may also be an opportunity to define your own rôle in this venture.)

I've been surprised how far prototypes scale. 1KB blocks with a 32 bit primary key allow 2^42 bytes (4TB) to be stored - and this was never a limitation when it only transferred 11 blocks per second. Search engines are similar. It is ridiculously easy to spread search terms over 4096 hash elements and compress the data to increase retrieval speed by a factor of four. This works for millions of web pages without difficulty. However, at scale, a comprehensive query syntax requires at least six tiers of application servers.

For Fibonacci file de-duplication, the base case is a one byte block. Or perhaps a one bit block. That's obviously ridiculous but it is feasible to map 13 byte sequences to an 8 byte reference and 21 byte sequences to a different set of 8 byte references. You may be incredulous that such a system works. If you are able to refute this assertion then you would save me considerable work. However, if I am correct then we may have the basis of a venture where units historically sold for US$20000 and clusters up to US$3 million. These figures are optimistic but managed systems generally sold for about 20 times more than the cost of the hardware.
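
To make the code-book idea concrete, here is a minimal sketch - not the proposed implementation - of a table that issues sequential 8 byte references for 13 byte sequences and returns the existing reference when a sequence repeats. The table size, hash and linear probing are arbitrary; a real system needs a persistent, clustered index.

    /* Hypothetical 13-byte-to-8-byte code-book: unseen sequences are
     * issued the next 64 bit reference, repeats reuse the old one. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define SEQ_LEN 13
    #define SLOTS   (1u << 16)

    struct entry {
        unsigned char seq[SEQ_LEN];
        uint64_t      ref;               /* 0 = empty */
    };

    static struct entry book[SLOTS];
    static uint64_t     next_ref = 1;

    static uint64_t codebook_lookup(const unsigned char *seq)
    {
        uint64_t h = 1469598103934665603ULL;        /* FNV-1a for slot choice */
        for (int i = 0; i < SEQ_LEN; i++) { h ^= seq[i]; h *= 1099511628211ULL; }

        for (uint32_t slot = h % SLOTS; ; slot = (slot + 1) % SLOTS) {
            if (book[slot].ref == 0) {               /* unseen: issue a reference */
                memcpy(book[slot].seq, seq, SEQ_LEN);
                book[slot].ref = next_ref++;
                return book[slot].ref;
            }
            if (memcmp(book[slot].seq, seq, SEQ_LEN) == 0)
                return book[slot].ref;               /* duplicate: reuse it */
        }
    }

    int main(void)
    {
        const unsigned char *a = (const unsigned char *)"the quick fox";  /* 13 bytes */
        const unsigned char *b = (const unsigned char *)"jumps over it";  /* 13 bytes */

        printf("a -> %llu\n", (unsigned long long)codebook_lookup(a));
        printf("b -> %llu\n", (unsigned long long)codebook_lookup(b));
        printf("a -> %llu (same reference, no new storage)\n",
               (unsigned long long)codebook_lookup(a));
        return 0;
    }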

I still like your idea of putting a weedy computer into a 4U box primarily so that it may display a huge logo on the front. Unfortunately, implementation is likely to require substantial quantities of RAM. Rules of thumb indicate that ISAM databases typically scale about 10 times further than cache RAM. More advanced schemes, such as InnoDB, typically scale about 100 times further than cache RAM. ZFS scales about 1000 times further than cache RAM. Given the proposed structure, it is extremely optimistic to allocate one byte of RAM per kilobyte of storage. Systems which fail to observe this ratio may encounter a sudden performance wall. This precludes the use of very small computers without RAM expansion. In particular, a Raspberry Pi is not suitable to control a dozen hard-disks. A Raspberry Pi is not sufficiently reliable anyway, but it should give an indication that proper server hardware is required with all of the associated costs.

I hope to convince you that such a system would work. I would achieve this by defining a huge number of cases and mostly showing that cases can be reduced to previous cases. The most trivial case is opening an uncompressed text file, inserting one comma and saving the file with a different name. In this case, blocks (of any size) prior to the change may be shared among both versions of the file. If the block size is large then the tail of any file has minimal opportunity to obtain the benefits of block de-duplication. However, if the blocks are 21 bytes or smaller then common blocks are likely to occur among five versions or so. In the worst case, the probability of 21 files having no common tail is the inverse of the factorial of 21. This is less than 1/(5×10^19). And for the 22nd case, there is zero probability that savings do not occur.
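
The factorial figure is easy to confirm with a few throwaway lines of C; 21! is about 5.1×10^19, so the probability quoted above is indeed below 1/(5×10^19).

    /* Checking the 21! arithmetic quoted above. */
    #include <stdio.h>

    int main(void)
    {
        double fact = 1.0;
        for (int i = 2; i <= 21; i++)
            fact *= i;
        printf("21!   = %.3e\n", fact);        /* roughly 5.109e+19 */
        printf("1/21! = %.3e\n", 1.0 / fact);  /* roughly 1.957e-20 */
        return 0;
    }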

From empirical testing, formal names, common phrases and URLs are easily de-duplicated. Likewise, the boilerplate of database websites has common sequences. Take 1000 web pages from any forum or encyclopedia and it is trivial to reduce the sequences by a factor of three. You may be concerned that this scheme fails when it encounters the first binary file. This is not the case. Identical files are trivial to de-duplicate. Identical files with staggered offsets (as is typical within email attachments) have a strict upper bound on duplication. Compressed files are similarly amenable. For example, within PKZip archives, compression re-starts for each file within the archive. Therefore, for each unchanged file within an archive, there is also a strict upper bound for duplication. Of particular note, this overcomes all of the shortcomings of implemented prototypes.

The most interesting cases occur with remote desktops and video. From discussions about the shortcomings of Citrix, I argued that video in a web page blurs the distinction between desktops and video. However, it remains common practice to use distinct techniques to maintain the crispness and responsiveness of each while reducing bandwidth, memory and processing power. De-duplicating video greatly reduces the value of streaming quadtree video but it does not eliminate it. Therefore, the numerous cases of video compression should be regarded as one case within this proposal.

In the trivial case, consider a desktop with one byte per pixel and each window (or screen) being a grid of 64×64 pixel tiles. Previously, each tile was divided into ragged quadtrees and each leaf used different encoding under various non-obvious quality metrics. (This is why I was using pictures and video of Elysium, Iron Man and Avril Lavigne as test data.) The current proposal works more like your suggestion to use RLE [Run-Length Encoding]. Specifically, on a Fibonacci length boundary, of 13 pixels or more, contiguous pixels (plain or patterned) may be shared within one tile, between tiles and across unlimited application window redraws. Most curiously, contiguous pixels may be shared by unlimited users within an office. This allows screen decoration, window decoration and application interface to have references to common sequences of pixels across desktops. This can be achieved in a relatively secure manner by assigning references randomly and dis-connecting clients which attempt to probe for unauthorised chunks of screen.
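
As a sketch of the run-length side of this - not the proposed tile codec - the following scans one row of 8 bit pixels and reports runs of 13 or more identical pixels as spans that could be replaced by a reference into a shared pool, leaving shorter spans literal. The row contents are invented for the example; 13 is simply the smallest Fibonacci length mentioned above.

    /* Toy run-length pass over one row of 8 bit pixels. */
    #include <stdio.h>

    #define MIN_RUN 13

    static void encode_row(const unsigned char *px, int width)
    {
        int i = 0;
        while (i < width) {
            int j = i + 1;
            while (j < width && px[j] == px[i])
                j++;
            if (j - i >= MIN_RUN)
                printf("shareable run: value=0x%02x length=%d\n", px[i], j - i);
            else
                printf("literal span:  length=%d\n", j - i);
            i = j;
        }
    }

    int main(void)
    {
        unsigned char row[64];
        for (int i = 0; i < 64; i++)        /* background, noise, background */
            row[i] = (i < 40) ? 0x20 : (i < 50) ? (unsigned char)i : 0x80;
        encode_row(row, 64);
        return 0;
    }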

In the general case, each user has access to zero or more transient storage pools and zero or more persistent storage pools. Each pool may be shared within one trust domain. This includes public read-only or read/write access. Each pool may be encrypted. Each pool may be accessed via 8 byte references and/or file name. Each pool may have a full-text index. And pools may be used in a union because this allows text, HTML, archives and multi-media to be indexed and searched in a useful manner.

However, the feature that I think will interest you the most is the ability to shoot, edit and distribute video without using lossy compression. This arrangement has a passing resemblance to Avid proxy format which stores 16×16 pixel tiles as a 32 bit lossy texture. Instead, 64×64 pixel tiles (or similar) are stored as 64 bit lossless references and no proxy is required. You may think that pixel dance would prevent de-duplication but you are not thinking on a large enough scale. Four variations of brightness across 13 samples is a maximum of 2^26 (64 million) variations. However, these variations are not restricted to local matches. Variations may be matched almost anywhere across any frame of video. Or, indeed, any data in the same storage pool. Admittedly, savings for the first million videos or so will be dismal but returns are increasingly likely as the volume of data increases. A well known result from search engines is that unique search terms grow proportionately to the square root of the number of documents. This applies to a corpus of documents of any language and any average length. I'm applying the same result to byte sequences within multi-media. For 13 bytes (104 bits), the square root is 2^52 permutations. Longer sequences occur less frequently but retain an essential supporting rôle.

Lossy compression of audio and video provides better than average matching because pixel dance is smoothed and the remaining data has correlation. Data may be locally compressed with some efficiency but the remaining data is skewed towards some permutations. Any skew improves the probability of matches beyond chance. This becomes ridiculously common when there are exabytes of data within the system. This may seem like an article of faith with shades of network effects and dot com economics but there is an upper bound to this lunacy.

Sun Microsystems found that 2^128 bytes is the maximum useful size for a contiguous filing system. In classical physics, the minimum energy required to store or transmit one bit of information is one Planck unit of energy (or the absence of one Planck unit of energy). Very approximately, 2^64 Planck units of energy is sufficient to boil a kettle and 2^128 Planck units of energy is sufficient to boil all of the water in all of the world's oceans. Therefore, operations on a sufficiently large filing system are sufficient to cause environmental havoc. My insight is to take this bound, apply well known results from search engines and resource allocation, and reduce everything successively down to references of 64 bits. If that isn't sufficient then it may not be possible for any solar system to support requirements. In this case, the filing system may be installed adjacent to, inside, or using a black hole. No known vendor supports this option.

I suggested that I work on a presentation while my ex-boss worked on a business plan which explained how to avoid obvious problems. I reduced a market overview and USP [Unique Selling Point] to 23 slides. I haven't seen any business plan but I optimistically assume that slides have been shown to investors. As a contingency, I am now working on a business plan. This is an extremely competitive market with an alarming number of failures. I prefer to offer on-site storage to the private sector. We may have to offer off-site storage. In the worst case, we may allow one instance to be used as a dumping ground. Sale of apparel is definitely unviable. Advertising is also unlikely to work. Ignore the intrusiveness of advertising and the extent to which it is a security problem. As advertising has become more targeted, the signalling power of an expensive advert is lost. This is why adverts on the web are somewhere between tacky and payola.

In a public storage pool, each sequence of bytes requires an owner and access count. Each owner would be billed periodically for storage and would receive a proportion of retrieval fees. However, an owner may receive a charge-back if fees have been received for trademarks, copyright and public domain. This would be at a punitive rate for the purpose of discouraging abuse. Ideally, illegal content would be assigned to a locked account and this would prevent copies being distributed. Unfortunately, retaining a reference version of illegal content is invariably illegal.

In this arrangement, it is expected that the most popular accounts would receive payment, a mid tier would obtain indefinite free use and a bottom tier would pay a nominal fee to access data. Anti-social use would be met with contractual and economic penalty. However, persistent trolls have been known to collectively raise US$12000 or more. So, penalty may have to be very steep.

Illegal content varies with jurisdiction. By different chains of reasoning, it may be illegal to depict almost anyone. In some jurisdictions, depiction of women is illegal. In other jurisdictions (including the ones in which we are most likely to incorporate), equality applies. Therefore, it would be illegal to block depictions of women without also blocking depictions of men. By a different chain, if people are created in God's image, if Allah is God and if depiction of Allah is illegal blasphemy then it is obviously illegal to depict anyone. By another chain, identity and self-identity may also make depiction illegal. Unfortunately, this cannot be resolved by asking someone what they want to view because profiling nationality is illegal in Sweden and profiling religion is illegal in Germany.

However, I have yet to mention the most ludicrous case. In some jurisdictions, such as the UK, illegal child pornography extends to the depiction of cartoons and text files. In the most absurd case, a sequence of punctuation may be illegal. In the case of a de-duplicated filing system, some sequences of punctuation, possibly by the same author, may appear in multiple files. In some files, a sequence of punctuation may be legal. In other files, the same instance of punctuation may be illegal. A similar situation may occur with medical information. A sentence recorded by a doctor may require "secure" storage. The same text may be published by someone else. Some content may be licensed on a discriminatory basis; most commonly this involves separate cases for private sector and/or public sector and/or individuals. This can be resolved contractually by defining all access as commercial. Oddly, this doesn't affect GNU Public License Version 2 but excludes a large amount of photography and music. See previous cases for photography. There is no technological solution for these problems. Indeed, it should be apparent that almost everyone has files which would be illegal in one jurisdiction or another.

Competitor analysis has been more fruitful. One competitor gets so much correct and so much completely wrong. TarSnap.Com can be concisely described as a progression of my second prototype. Data is split in a manner which covers all cases. This was achieved with a mathematical proof which leads to the use of a rolling hash. This sequentially splits a file into variable lengths with a skewed Poisson distribution. However, throughput (and compression) is probably terrible because it uses the same zlib compression algorithm and, entertainingly, a defensive architecture due to historical buffer overflows in zlib. Marketing and encryption are good but this is all wasted by out-sourcing storage to a direct competitor. Indeed, the system is only viable because the sub-contractor does not charge for the extensive internal bandwidth required to perform integrity checks. So, data is encrypted and the only copy is sent directly to a rival where it is mixed with two tiers of other clients' data. That'll be spectacular when it fails. Unfortunately, that's the competent end of the off-site storage market.
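
For anyone unfamiliar with rolling hashes, here is a sketch of content-defined chunking in the same spirit - not Tarsnap's actual algorithm. A Rabin-Karp style hash rolls over a 48 byte window and a chunk boundary is declared whenever the low 12 bits of the hash hit a fixed pattern, so boundaries follow content rather than offsets and an insertion near the start of a file does not shift every later chunk.

    /* Content-defined chunking with a rolling hash; window size, prime
     * and boundary mask are arbitrary choices giving ~4KB average chunks. */
    #include <stdint.h>
    #include <stdio.h>

    #define WINDOW 48
    #define MASK   0x0fffu
    #define PRIME  31u

    int main(int argc, char **argv)
    {
        FILE *f = fopen(argc > 1 ? argv[1] : "input.bin", "rb");
        if (!f) { perror("fopen"); return 1; }

        uint64_t pow_w = 1;                 /* PRIME^WINDOW, to drop old bytes */
        for (int i = 0; i < WINDOW; i++)
            pow_w *= PRIME;

        unsigned char win[WINDOW] = { 0 };
        uint64_t hash = 0;
        long chunk_start = 0, pos = 0;
        int c;

        while ((c = fgetc(f)) != EOF) {
            int idx = (int)(pos % WINDOW);
            /* Roll: multiply, add the new byte, drop the byte leaving the window. */
            hash = hash * PRIME + (unsigned char)c - pow_w * win[idx];
            win[idx] = (unsigned char)c;
            pos++;

            if (pos - chunk_start >= WINDOW && (hash & MASK) == MASK) {
                printf("chunk: offset=%ld length=%ld\n",
                       chunk_start, pos - chunk_start);
                chunk_start = pos;
            }
        }
        if (pos > chunk_start)
            printf("chunk: offset=%ld length=%ld (tail)\n",
                   chunk_start, pos - chunk_start);
        fclose(f);
        return 0;
    }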

I investigated implementation of a general purpose de-duplicating filing system. It has become accepted practice to develop and deploy filing systems outside a kernel by using FUSE, PUFFS or similar. This increases security, isolation and monitoring. It also allows filing systems to be implemented, in any language, using any existing filing system, database or network interface. In the manner that it is possible to mount a read-only CDROM image and view the contents, it is possible to mount a de-duplicated storage pool if it has a defined format within any other filing system. With suitable hashing and caching, a really dumb implementation would have considerable merit. For my immediate purposes, I could use it as a local cache of current and historical web sites.

Anyhow, please back-check my figures. Optionally, tell me that it is an impractical idea or consider defining your own rôle writing monitoring software, accounting software, video editing software or whatever you'd like to do.

Yours De-Dupididly

Evil Doctor Compressor

I didn't include the slides but the collated text is as follows:-

Fibonacci Filing System

The Problem

  • The volume of digital data is growing exponentially.
  • Per hour, scientific data which is synthesized and extrapolated is greater than all scientific observations from the 20th century.

Moore's Law

  • Components on a silicon chip double every two years.
  • Digital camera quality doubles every 17 months.
  • Computer network capacity doubles every nine months.
  • We are drowning in data!

Coping Strategy: Compression

  • Techniques to reduce volume of data include:-
    • Content-aware compression.
    • Content-oblivious compression.
  • If content is known then compression may be lossy or lossless.
  • Lossy compression is typically applied to audio and video.

Code-Book Compression

  • Succession of compression techniques are increasingly effective but require increasing amount of working memory:-
    • PKZip - Uses tiers of small code-books.
    • GZip - Refinement of PKZip.
    • BZip2 - Burrows-Wheeler Transform.
    • LZMA - Uses code-book with millions of entries but requires up to 900MB RAM.

File De-Duplication

  • A filing system may eliminate storage of redundant data.
  • Covers all cases because it applies to all data handled by all applications.
  • Techniques work at different levels of granularity:-
    • Hash de-duplication applies to checksum of file.
    • Block de-duplication works on fixed-size chunks.
    • Byte de-duplication provides tightest compression.

Hash De-Duplication

  • Used in closed systems to provide a modicum of tamper-proofing.
    • Used in legal and medical systems.
  • Used in open systems to provide a document reference.
    • Root cause of Dropbox and MegaUpload's privacy and copyright problems.
  • Does not work with changing data.

Block De-Duplication

  • Implemented by LVM, ZFS and other systems.
  • Commonly used to provide daily, rolling snapshots of live systems.
  • Block size is mostly immaterial.

Byte De-Duplication

  • Most effective scheme.
  • Most fragile scheme.
  • Most energy intensive scheme.
  • Candidates for compression may require up to 0.5 million comparison operations.
  • May be implemented with custom hardware to reduce time and energy.

The Unique Selling Point

  • General trend is towards larger block sizes.
  • This provides an approximate fit for the throughput of hard-disks and solid-state storage.
  • However, unconventional block sizes approximate optimal efficiency without the processing overheads.

Fibonacci Numbers

  • Common sequence in mathematics.
  • Discovered by X in X when studying breeding of rabbits but occurs throughout nature:-
    • Flower petals.
    • Pine-cones.
    • Snail shells.
  • Used by humans in Greek architecture and computer memory allocation.

Counting Theorem

  • Fibonacci numbers may be used in compression schemes if the counting theorem is not violated. Specifically, there must be a unique mapping:-
    • From data to code-word.
    • From code-word to data.
  • We propose:-
    • Mapping 13 byte data to eight byte code-word.
    • Mapping 21 byte data to eight byte code-word.

System Capacity

  • LZ compression schemes, such as PKZip, use tiers of code-books with thousands of entries.
  • LZMA compression schemes use one code-book with millions of entries.
  • Proposed scheme uses two code-books, each with 16 billion billion entries.
  • This is sufficient to hold a maximum of 557056 exabytes of unique data and any multiple of duplicate data.

Theoretical Upper Bound

  • ZFS uses 128 bit addressing on the basis that:-
  • ℏ × 2^128 is vast.
  • Where:-
    • ℏ is the minimum detectable quantity of energy
    • and "vast" is enough energy to boil all of the water in all of the world's oceans.

Infinite Monkey Test

  • Can an infinite number of monkeys hitting keys on an infinite number of typewriters exhaust storage? Yes, but with difficulty.
  • Historical text files used 96 symbols.
  • For a 13 byte sequence, this is a maximum of 0.0029% of the possible input permutations but it exceeds the capacity of the code-book by more than a factor of three billion.

Infinite Monkey Test

  • Capacity is exceeded if input occurs without duplication.
  • This becomes increasingly difficult as data accumulates within a system.
  • Can be achieved maliciously if pseudo-random input is designed such that it never repeats.
  • This case can be countered with a traditional file quota arrangement.
  • It is also trivial to identify anomalous use.

Birthday Paradox

  • Why is it increasingly difficult to exhaust storage capacity?
  • This is due to the counter-intuitive birthday paradox: in a group of just 23 people, it is more likely than not that two of them share a birthday.
  • Worst case for matching is approximately the square of the number of candidates.
  • This bridges the gap between a code-book with 2^64 entries and an upper bound of 2^128 (or less).

Video De-Duplication

  • Standard video encoding may greatly reduce volume of data. However:-
    • Compression between adjacent pixels may occur due to reduction in contrast.
    • Compression is unlikely to involve more than 16 frames of video.
  • A global code-book provides additional opportunities for processing lossless or lossy video.

Audio De-Duplication

  • Trivial to share silence in uncompressed audio.
  • De-duplication of near silence increases as examples become more common.
  • Easier to de-duplicate compressed audio.
  • AMR (used in cell phones) uses a scheme which minimizes latency and survives high loss and corruption. Data is also fixed-length.
  • Duplicates become increasingly common with duration of conversation.

Clustering

  • An unlimited number of computers may participate in one storage pool.
  • Each node is allocated a unique numeric range within each code-book.
  • New code-words are replicated and distributed using standard techniques.
  • Duplicate mappings may occur within a concurrent system. Normally, this is inconsequential.

Full-Text Indexing

  • Where block size is smaller than an average sentence (and smaller than a long word):-
    • There is an upper bound for the quantity of search-terms within a fragment of text.
    • There is an upper bound for search-term length within a fragment of text.
    • There is strictly zero or one search-terms between adjacent fragments of text.
  • Reduced size and scope means that a search engine is almost a corollary of de-duplication.

Applications

  • Traditional applications include:-
    • Enterprise document and mailbox storage.
    • File versioning and database snapshots.
    • Media distribution.
    • Clustered web caching.

Applications

  • Novel applications include any permutation of Project Xanadu, search engine, media broadcast, remote desktop and virtual reality:-
    • Distributed storage and caching with full-text index.
    • Hyper-text and hyper-media with robust citations and micro-payment.
    • Multi-cast, lossless, live, streaming surround sound and/or high-definition video.

Moto C Plus is the Queen of Budget LineageOS Phones

Posted by Azuma Hazuki on Saturday March 31 2018, @09:20PM (#3115)
6 Comments
Mobile

After nearly 6 years with a Motorola Photon Q (XT897/"Asanti") I finally decided it's time for a new phone, and spent the outrageous sum of just-under-$130-including-shipping for a Moto C Plus, screenglass, and rubber body armor.

Motorola, now a subsidiary of Lenovo, is known for most of its phones being fairly amenable to rooting and unlocking. This particular one also happens to be GSM-enabled and support dual SIMs, on the off chance I ever leave the US and want a data plan. Its specs are, by today's standards, unimpressive: quad-core MT6737M CPU (4x ARMv8 A53 @ 28nm, somewhat slower than a Snapdragon 425 for reference), 2 GB of memory, 16GB of eMMC flash, and 5" 1280x720 screen. The body is all plastic, though it's not bad plastic, and the battery is a surprising 4,000 mAh that weighs more than the phone itself does.

Now, it turns out the C Plus is *not* as easily unlockable as, say, the G4 is. In particular, the usual fastboot commands such as get_unlock_data simply fail with "unknown remote command" errors. The stock ROM is also kind of pants, though it's at least a fairly vanilla Android 7.0 rather than Madokami-forgive-us-all MIUI or TouchWiz.

However, there is a program called SP Flash Tool that is able to write directly to MediaTek devices' internal flash over a USB port. This is not for the faint of heart, as it requires carefully-crafted scatter files with the exact starting addresses and lengths corresponding to each and every piece of the stock firmware, and if you mess up by even one byte, you will very likely hard-brick your phone. For even more heart-pounding excitement, the way to get a custom ROM on here is not to use this tool, but to pop into an advanced mode and specify where to start writing (if you're curious, it's 0x2d80000 and no, that's not a typo) and with what file.

The purpose here is to flash a custom build of TeamWin recovery, known by its uncomplimentary acronym TWRP. And *this* involves dissecting the machine, removing its battery, holding VolDown, and hooking it up to a PC via USB cable, *in that order.* Somehow, Flash Tool is able to write to the device even though it's powered off and battery-less.

From here, disconnect from the PC, hold VolUp and Power, and select Recovery boot. After about 30s, TWRP will load, and you can pull up ADB in your shell, place the device in sideload mode, and "adb sideload /path/to/lineageos-14.1.zip," which goes a hell of a lot faster than you'd expect it to. But there's a catch: if you reboot now, you'll go into an endless bootloop, where the phone won't go past the Motorola logo. If you don't have a stock ROM or, preferably, a Nandroid backup, this is game over.

Turns out you *also* need to install SuperSU, and you need to do it in a special way: while in TWRP, pop a terminal, and do "echo SYSTEMLESS=true>/data/.supersu" before adb-sideloading a known-good SuperSU .zip onto the phone. The output will be quite verbose, and it will warn you that 1) on reboot, the device will likely restart at least once and 2) first boot will take "several minutes."

They're not kidding. I lost count of the exact amount of time, but I believe LineageOS sat there blowing bubbles to itself for a good 15 or 20 minutes, and I was literally seconds away from forcing a power-down and starting the entire process again with a stock ROM. It also takes well over a minute to boot; I counted 33 sets of right-to-left bubbles at the LineageOS loading screen per bootup, and that's after a good 30 seconds of the phone sitting at the Motorola logo with a nice fat warning about how the device can't be verified and might not work properly.

Yeah, that's the point: from Google's PoV, it's *not* working properly, because LineageOS seems not to have all that spying junk on it. I didn't even flash a GApps zip; instead, first thing I did was to sideload an FDroid archive, which is something like an open source version of the Play Store. Everything I need is on there and more.

So far, I am loving this thing. I've been zipping all over downtown Milwaukee looking for a new place to lease the last couple of days, and there is nothing like having a portable MP3 player for those long rides. And get this: after using it to play music for a good 3 hours, *the battery was still at 95% from a full charge this morning.* That is *nuts.* The Moto C Plus punches way, way above its weight with the proper love and attention paid to it.

So if anyone wants a good budget phone to run LineageOS, Resurrection Remix, etc on and doesn't mind doing some nailbitingly-scary stuff with direct flashing, I can heartily recommend the C Plus (NOT the C, that thing is junk). Happy flashing, everyone :)

Could we expose journals to search engines?

Posted by khallow on Saturday March 31 2018, @06:26AM (#3114)
23 Comments
Slash
A few days back, I was trying to remember a post I had written on a journal article of mine. I had no luck searching for it in Google so I ended up going through the journals until I found the post.

That inconvenience got me thinking. I've never seen a journal or journal post in a search result. So it appears to me that we're blocking the web crawlers from accessing journals. Now some sort of protection needs to be in place to keep SEO spammers from taking over the journal section. I grant that. But cool stuff ends up in the journals. We should show it to the world and maybe pull the world in a little in response.

So my first argument is that allowing most journals to be searchable would make them just as convenient to find as regular posts. Second, there's the cool stuff argument. For example, cafebabe or MDC (hey, even aristarchus carries his part of the load) come up with some crazy/cool stuff. It'll pull in people who would never care about the high comment stories.

Third, this has actually worked before. Way back when, Kuro5hin.org was yet another Slashdot replacement which had a similar layout and journals (called "diaries"). Several Soylentils such as myself, MDC, and mcgrew, have posted diary entries there which in turn were exposed to the world. Sometimes the degree of exposure got crazy. For example, my most cited academic paper is a diary on a defunct Department of Defense betting market that was killed off by political ignorance. The amusing thing is that the above article is occasionally cited by my user name "khallow" instead of by my real name.

Then there is the epic Fuck Natalee Holloway story by the author gbd. It had 1420 comments on it. That's because there was this peculiar, overwhelming obsession in the media and public with a white, female tourist who went missing on a Caribbean island, and this Kuro5hin.org journal was one of the scarce pieces of dissent at the time. This voice crying out in the wilderness got a lot of attention as a result.

So anyway, journal articles can contribute to the visibility and culture of SN in a useful way, if they're exposed to search engines.

Moving on, I grant that we'll need to have some sort of modest restrictions just so we don't get spam crowding out the good stuff. There are two ways spam can sneak/surge into journal articles. First, spammers can create their own journals; we've already had problems with that. Maybe put a robots.txt block on new journals from posters without a certain level of karma?

Second, spammers can dump posts into ancient journal articles (since comments on most journals stay open indefinitely, right?). Imagine a spammer who posts their penis enlargement ads into year old journal articles. Who would notice? Only the search engines would, which would be the point. If we can manage to make journals more visible without turning the section into a spam factory, it'd be greatly beneficial overall. Journals should be an essential part of our community, and putting them on a near equal footing with regular articles in the search engines would go a long way to making that happen. They can also provide a draw for interests that aren't being covered by mere news.

std::map<optimism, chagrin> disaster;

Posted by turgid on Tuesday March 27 2018, @08:18PM (#3105)
44 Comments
Code

I have been at the C++ again. After a few years I have been slowly managing to persuade people that directly testing (using TDD) the C++ code is a good idea.

Also, I have tried to put my C smugness and arrogance away in the spirit of doing things "the right way" i.e. in C++ and the way the earnest and eternally vigilant members of the C++ Inquisition would recommend.

A couple of weekends ago I was on a fairly long train journey so for entertainment I reacquainted myself with the C++ Frequently Questioned Answers and laughed out loud a couple of times much to the bemusement of Mrs Turgid.

I had been asked to supervise a much younger and inexperienced member of the team. He had too much to do and so I was asked to pick up some work he had started. Young people today... So I extracted some of his code into independent methods and put them under test with CPPUNIT which involved hacking on some nasty ANT build scripts (don't get me started...) just to add a few .so files to the linker command line. The build scripts are so bad that it takes upwards of 45 seconds to compile, link and run the unit tests (200 lines of code).

Now to the fun, std::map. Why oh why oh why? Well, because the STL and these are "algorithms" and they've been developed by people much cleverer than you and so they won't have bugs like the ones you would write yourself and they have performance criteria and they use templates so you get type checking at compile time and blah blah blah...

Yes, well, nobody expects the C++ Inquisition. Their main weapon is type safety and code reuse. OK, their two main weapons are type safety, code reuse and generics. Hang on, that's three. I'll come in again. Nobody expects the C++ Inquisition. Amongst their weapons are type safety, code reuse, generics, multiple inheritance, virtual methods, references, the STL... You get the idea.

And what was std::map being used for? To store pairs of strings and integers (hex) read out of an ASCII configuration file. How was the file parsed? sscanf()? No, some fancy stream object with operator<<. And what were the ASCII strings? Names of parameters. And there was a third column in the file that specified a width and was summarily ignored by the parser. And what about the names of the parameters? Well, they were looked up in the map at run time, hard-coded, to pull the values out of the map and put into internal variables with all kinds of shifts and shuffles on byte order. And what if the user changed the names of any of the parameters in the file? Yes, what indeed. The user will be editing this file.

Now I do need to use some sort of dynamic data structure myself in this project. I need to map strings to integers, but with integers as the keys this time. My table needs to be populated with the names of files read from a directory and the files sorted in order. If I were doing this in a sane language like C it would be relatively straightforward. Anyway, we're in C++ land now and the C++ Inquisition are in attendance. So I thought I'd take a leaf out of their book and use std::map<uint32_t, std::string> table or something (note the code is infected with stds all over the place, another cool feature) so I decided I'd better read the documentation. I thought I might use the insert() method and check for duplicate keys in the map. Nope, template error. It seems one must use operator[] but that doesn't check for existing keys, it just overwrites them. The suggested remedy? Ah, scan the entire map from the beginning each time to make sure the key isn't already there. Doesn't it throw one of these pesky exception things? I thought they were the Modern Way(TM)?

::iterator is fun. Try to iterate over an empty map, or to an entry that isn't there. How do you detect it? Well, ::iterator is some kind of pointer (you get at the data with ->first or ->second) so you might compare with NULL (sorry, 0 nowadays) but no way because operator== is not defined. The best advice is not to try to iterate over an empty map or to dereference an iterator that doesn't point to anything.

I could have read my file names into a (sorted) linked list checking for duplicates along the way. It would have been less than 50 lines of C, and I could have written it and tested it in the time it took me to get angry about C++ all over again.
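
For the record, it is roughly this, give or take error handling - POSIX dirent, names sorted on insertion, duplicates skipped:

    /* Roughly the "less than 50 lines of C" version: read a directory,
     * insert each name into a sorted singly-linked list, skip duplicates.
     * POSIX dirent only; error handling kept minimal for brevity. */
    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct node {
        char        *name;
        struct node *next;
    };

    static void insert_sorted(struct node **head, const char *name)
    {
        struct node **p = head;
        while (*p) {
            int cmp = strcmp(name, (*p)->name);
            if (cmp == 0)
                return;                     /* duplicate: ignore it */
            if (cmp < 0)
                break;                      /* insertion point found */
            p = &(*p)->next;
        }
        struct node *n = malloc(sizeof *n);
        n->name = strdup(name);
        n->next = *p;
        *p = n;
    }

    int main(int argc, char **argv)
    {
        DIR *d = opendir(argc > 1 ? argv[1] : ".");
        if (!d) { perror("opendir"); return 1; }

        struct node *head = NULL;
        struct dirent *de;
        while ((de = readdir(d)) != NULL)
            insert_sorted(&head, de->d_name);
        closedir(d);

        for (struct node *n = head; n; n = n->next)
            puts(n->name);
        return 0;
    }

(The real table is keyed by an integer derived from each name, but the shape - insert in sorted order, reject duplicates - is the same.)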

The word is chagrin. I have wasted very precious time and haven't even got any working code.

Edited 20180328 to use proper escape codes for angle brackets.

Trump Signs $1.3 Trillion Spending Bill

Posted by DeathMonkey on Friday March 23 2018, @07:55PM (#3097)
11 Comments
News

Reversing his veto threat, Trump signs the $1.3 trillion spending bill.

National Review says it's the Biggest Spending Increase Since 2009.

Trump briefly threatened to veto it. Mostly because it didn't spend enough: “... the BORDER WALL, which is desperately needed for our National Defense, is not fully funded,”

Spending wouldn't be such a problem except for the fact that they also cut revenue by over a trillion dollars with the tax bill.

I'm no mathematician but that seems to put a bit of crimp in his promise to eliminate the national debt in eight years.

If You Have Nothing To Hide, You Have Nothing To Fear...

Posted by NotSanguine on Saturday March 17 2018, @11:31PM (#3084)
23 Comments
News

Isn't that what the government types keep saying?

So why is it that the head of that self-same government doesn't play by those rules? Does he think the law doesn't apply to him?

In my mind, those who are privileged to represent us should be *more* forthcoming and avoid even the hint of any conflicts of interest, illegality or lack of transparency.

Why then does our current administration fear the Mueller investigation? Let's get to the truth and let the chips fall where they may.

Which is why it seems odd that our 'fearless' leader is acting like a frightened animal:

A lawyer for President Trump called on the Justice Department on Saturday to end the special counsel investigation into ties between Russia and the Trump campaign, shifting abruptly to a more adversarial stance as the inquiry appeared to be intensifying.
[...]
Mr. Dowd said at first that he was speaking on behalf of the president but later backed off that assertion. He did not elaborate on why he was calling for the end of the investigation, saying only: “Just end it on the merits in light of recent revelations.”

People close to the president were skeptical that Mr. Dowd was acting on his own. Mr. Trump has a history of using advisers to publicly test a message, giving him some distance from it. And Mr. Dowd’s comments came at a time when members of Mr. Trump’s legal team are jockeying to stay in his favor.

Hours later, the president echoed Mr. Dowd’s accusations of corruption in the theoretical “deep state” that Mr. Trump has long cast as a boogeyman working to undermine him.

”There was tremendous leaking, lying and corruption at the highest levels of the FBI, Justice & State,” he wrote on Twitter.

In the end, the truth will out. Let's hear it. Whether it's that there was no collusion with the Russians or if there was. That's actually less important than making sure we have the facts.

Armed teacher becomes accidental school shooter, injures 3

Posted by DeathMonkey on Wednesday March 14 2018, @11:19PM (#3077)
15 Comments
News

Just in case we were wondering if arming teachers is a good idea:

As thousands of students walked out of their schools on Wednesday to pressure Congress to approve gun control legislation, three other students were healing from wounds inflicted when a teacher’s firearm accidentally discharged in a California classroom.

The teacher, Dennis Alexander, who is also a city councilman in Seaside, Calif., was showing the students a gun on Tuesday during his advanced public safety class at Seaside High School when the gun accidentally went off, Marci McFadden, a spokeswoman for the Monterey Peninsula Unified School District, said in a phone interview on Wednesday.

Across the country, another school was also investigating a weapon that discharged accidentally this week. The Alexandria Police Department in Virginia said that a school resource officer accidentally fired his gun inside his office at George Washington Middle School on Tuesday morning. Nobody was hurt, the police said.

Teacher’s Gun Is Accidentally Fired During Public Safety Class, Injuring 3