I might get funding for a de-duplicated filing system. I took the opportunity to send this to a friend with a background in physics, accounting, video editing, COBOL, Java and Windows:-
I'm feeling flush so I'm sending you a letter. Don't complain that you only receive demands for money. You also get the occasional note from a nutjob.
I won't mention health because it can be extremely frustrating to read about such matters. However, I assume that we're both ambling along to some extent. I made moderate progress with electronics and processor design but nothing of particular merit. Gardening is also progressing. I'm growing some parsley for when we next have fish.
I think that I stopped my ex-boss from bidding on chunks of old super-computer clusters. Even if GPUs weren't cheaper, we'd soon get caught by the horrendous cost of electricity. Due to Moore's law, a 10 year old computer is likely to require at least 30 times more energy per unit of computation. I think this also stopped various cryptographic currency schemes which would presumably run on said hardware.
I presume that my ex-boss continues to seek funding from investors diversifying out of minerals. That would explain the continued focus on hardware and high finance. My ex-boss recently suggested a scheme involving file storage. The unstated assumption would be that storage is not out-sourced to a third party. In particular, this would avoid the dumb-ass case of out-sourcing storage to a direct competitor.
I was initially hostile to this idea due to previous research. The technology is viable but the market invariably has one party which accidentally or deliberately prices storage below cost. It only takes one idiot (or genius) to skew the market for everyone. There are also periods when one party freely hosts illicit content and flames out. MegaUpload and RapidShare are two examples.
Anyhow, there is an outside chance that the third iteration of my de-duplicating filing system may gain significant funding and widespread deployment. We had a long discussion about de-duplication during a two hour walk. I'm not sure if you followed all of the details. Apologies if you followed very closely.
The first iteration of my work was prototyped inside a database. It used fixed length records and implemented block de-duplication. Deployed systems typically use anything from 1KB blocks to 4MB blocks (with alarming variation in the checks for unique data). Larger block sizes are better for databases and video. Unexpectedly, 4KB works particularly well with legacy Microsoft Office documents and you can confirm for yourself that these files are exact multiples of 4KB. However, block de-duplication fails horribly with PKZip files. This is important because PKZip is the basis for numerous contemporary file formats, including Java Archives and .docx.
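To make that concrete, here's a rough Python sketch of fixed-length block de-duplication; the 4KB block size, SHA-256 hash and in-memory store are illustrative assumptions rather than what the prototype actually used:-

    import hashlib

    BLOCK_SIZE = 4096  # illustrative; deployed systems range from 1KB to 4MB

    def deduplicate(path, store):
        """Split a file into fixed-length blocks and store each unique block once.
        'store' maps a block's SHA-256 digest to its bytes; the function returns
        the list of digests needed to reconstruct the file."""
        recipe = []
        with open(path, 'rb') as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.sha256(block).digest()
                if digest not in store:       # only new data consumes space
                    store[digest] = block
                recipe.append(digest)
        return recipe

    def reconstruct(recipe, store):
        """Reassemble the original byte stream from the stored blocks."""
        return b''.join(store[digest] for digest in recipe)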
The second iteration of my work was an attempt to migrate to variable length blocks. The implementation split files at markers typically found within JPEGs and PKZip files. However, I never found a general case for this process and work was slowed because performance was awful. Compression improved the best case at the expense of the worst case. It also led to the problem of how to efficiently store thousands of different lengths of compressed blocks. When prototyping in a database, this can be delegated to the database but how would this be implemented efficiently in a file or raw disk partition? A solution must exist because it was widely used on desktop computers via Stacker and MS-DOS 6's DoubleSpace. The computer scientist, Donald Knuth, possibly in 1964, described Fibonacci lengths for memory allocation. This is more efficient than the obvious powers of two because successive sizes grow by a factor of about 1.6 rather than 2 and therefore waste is reduced. With some concessions for powers of two, this is implemented in Sun Microsystems' Arena memory allocator. The same principle applies to storage. For prototyping, a crude arrangement is a de-normalised database with one fixed length table for each Fibonacci length. There is an analogous arrangement for raw disk.
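The Fibonacci allocation is easier to see with a toy example; the table of slot sizes and the 13 byte minimum are my own illustrative choices:-

    # Illustrative sketch: allocate storage in Fibonacci-sized slots rather than
    # powers of two, so successive sizes grow by roughly 1.6 instead of 2.
    FIB_SIZES = [13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181]

    def slot_for(length):
        """Return the smallest Fibonacci slot that holds 'length' bytes."""
        for size in FIB_SIZES:
            if length <= size:
                return size
        raise ValueError("block too large for this toy table")

    def waste(length):
        """Fraction of the slot left unused, e.g. waste(22) is roughly 0.35."""
        slot = slot_for(length)
        return (slot - length) / slot

With powers of two, the worst case waste approaches a half; with Fibonacci slots it approaches 1/φ², roughly 0.38, which appears to be where the 3/8 figure below comes from.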
None of these prototypes handled random access or full-text indexing, like EMC's systems. I was also concerned that, in the most pessimistic case, 3/8 of storage is wasted and, in the typical case, 3/16 is wasted. Waste can be reduced by tweaking figures. However, it should be apparent that there are diminishing returns that cannot be eliminated - unless the uncompressed data is stored in blocks of Fibonacci length.
That's the basis for the third iteration. Although it overcomes previous limitations, it has not been prototyped because it is expected to require far more work than the previous iterations. This is particularly true after a similar number of iterations prototyping full-text indexing. Someone sensible and numerate should ensure that the talk of Fibonacci numbers, golden ratios, factorials, random walks, birthday paradoxes and Poisson distributions is not the numerology of a fringe lunatic heading towards a third educational but uncompetitive implementation. (If you're mad enough, there may also be an opportunity to define your own rôle in this venture.)
I've been surprised how far prototypes scale. 1KB blocks with a 32 bit primary key allow 2^42 bytes (4TB) to be stored - and this was never a limitation when it only transferred 11 blocks per second. Search engines are similar. It is ridiculously easy to spread search terms over 4096 hash elements and compress the data to increase retrieval speed by a factor of four. This works for millions of web pages without difficulty. However, at scale, a comprehensive query syntax requires at least six tiers of application servers.
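For example, spreading search terms over 4096 buckets needs nothing more sophisticated than this (the choice of hash is an illustrative assumption):-

    import hashlib

    NUM_BUCKETS = 4096

    def bucket_for(term):
        """Assign a search term to one of 4096 buckets via a stable hash."""
        digest = hashlib.sha256(term.lower().encode('utf-8')).digest()
        return int.from_bytes(digest[:2], 'big') % NUM_BUCKETS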
For Fibonacci file de-duplication, the base case is a one byte block. Or perhaps a one bit block. That's obviously ridiculous but it is feasible to map 13 byte sequences to an 8 byte reference and 21 byte sequences to a different set of 8 byte references. You may be incredulous that such a system works. If you are able to refute this assertion then you would save me considerable work. However, if I am correct then we may have the basis of a venture where units historically sold for US$20000 and clusters for up to US$3 million. These figures are optimistic but managed systems generally sold for about 20 times the cost of the hardware.
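As a sketch, each code-book is little more than a bidirectional mapping from a fixed-length sequence to an 8 byte reference; allocating references from an incrementing 64 bit counter is my illustrative assumption rather than a settled design:-

    import struct

    class CodeBook:
        """Toy code-book mapping fixed-length byte sequences to 8-byte references.
        One instance would hold 13-byte sequences, another 21-byte sequences."""
        def __init__(self, sequence_length):
            self.sequence_length = sequence_length
            self.by_sequence = {}   # sequence -> 8-byte reference
            self.by_reference = {}  # 8-byte reference -> sequence
            self.next_id = 0

        def encode(self, sequence):
            assert len(sequence) == self.sequence_length
            ref = self.by_sequence.get(sequence)
            if ref is None:
                ref = struct.pack('>Q', self.next_id)  # 64-bit big-endian reference
                self.next_id += 1
                self.by_sequence[sequence] = ref
                self.by_reference[ref] = sequence
            return ref

        def decode(self, ref):
            return self.by_reference[ref]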
I still like your idea of putting a weedy computer into a 4U box primarily so that it may display a huge logo on the front. Unfortunately, implementation is likely to require substantial quantities of RAM. Rules of thumb indicate that ISAM databases typically scale about 10 times further than cache RAM. More advanced schemes, such as InnoDB, typically scale about 100 times further than cache RAM. ZFS scales about 1000 times further than cache RAM. Given the proposed structure, it is extremely optimistic to allocate one byte of RAM per kilobyte of storage. Systems which fail to observe this ratio may encounter a sudden performance wall. This precludes the use of very small computers without RAM expansion. In particular, a Raspberry Pi is not suitable to control a dozen hard-disks. Nor is a Raspberry Pi sufficiently reliable, but it should give an indication that proper server hardware is required with all of the associated costs.
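Plugging illustrative numbers into that rule of thumb (a dozen hypothetical 4TB drives):-

    # Rule of thumb: one byte of RAM per kilobyte of managed storage.
    disks = 12                      # hypothetical figures for illustration
    disk_capacity_tb = 4
    storage_kb = disks * disk_capacity_tb * 10**9   # 1TB is 10^9 KB
    ram_bytes = storage_kb * 1                      # one byte of RAM per KB stored
    print(ram_bytes / 10**9)        # 48.0, i.e. roughly 48GB of RAM for 48TB of disk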
I hope to convince you that such a system would work. I would achieve this by defining a huge number of cases and mostly showing that cases can be reduced to previous cases. The most trivial case is opening an uncompressed text file, inserting one comma and saving the file with a different name. In this case, blocks (of any size) prior to the change may be shared among both versions of the file. If the block size is large then the tail of any file has minimal opportunity to obtain the benefits of block de-duplication. However, if the blocks are 21 bytes or smaller then common blocks are likely to occur among five versions or so. In the worst case, the probability of 21 files having no common tail is the inverse of the factorial of 21. This is less than 1/(5×10^19). And for the 22nd case, there is zero probability that savings do not occur.
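The factorial bound is trivial to check:-

    import math
    print(1 / math.factorial(21))   # about 1.96e-20, which is less than 1/(5 * 10**19)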
From empirical testing, formal names, common phrases and URLs are easily de-duplicated. Likewise, the boilerplate of database-driven websites has common sequences. Take 1000 web pages from any forum or encyclopedia and it is trivial to reduce the sequences by a factor of three. You may be concerned that this scheme fails when it encounters the first binary file. This is not the case. Identical files are trivial to de-duplicate. Identical files with staggered offsets (as is typical within email attachments) have a strict upper bound on duplication. Compressed files are similarly amenable. For example, within PKZip archives, compression re-starts for each file within the archive. Therefore, for each unchanged file within an archive, there is also a strict upper bound for duplication. Of particular note, this overcomes all of the shortcomings of implemented prototypes.
The most interesting cases occur with remote desktops and video. From discussions about the shortcomings of Citrix, I argued that video in a web page blurs the distinction between desktops and video. However, it remains common practice to use distinct techniques to maintain the crispness and responsiveness of each while reducing bandwidth, memory and processing power. De-duplicating video greatly reduces the value of streaming quadtree video but it does not eliminate it. Therefore, the numerous cases of video compression should be regarded as one case within this proposal.
In the trivial case, consider a desktop with one byte per pixel and each window (or screen) being a grid of 64×64 pixel tiles. Previously, each tile was divided into ragged quadtrees and each leaf used different encoding under various non-obvious quality metrics. (This is why I was using pictures and video of Elysium, Iron Man and Avril Lavigne as test data.) The current proposal works more like your suggestion to use RLE [Run-Length Encoding]. Specifically, on a Fibonacci length boundary of 13 pixels or more, contiguous pixels (plain or patterned) may be shared within one tile, between tiles and across unlimited application window redraws. Most curiously, contiguous pixels may be shared by unlimited users within an office. This allows screen decoration, window decoration and application interfaces to have references to common sequences of pixels across desktops. This can be achieved in a relatively secure manner by assigning references randomly and dis-connecting clients which attempt to probe for unauthorised chunks of screen.
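A rough sketch of finding shareable runs within a tile; the 64×64 tile, one byte per pixel and the 13 pixel minimum are as above, everything else is an illustrative assumption:-

    MIN_RUN = 13  # only runs at least one Fibonacci length long are worth sharing

    def shareable_runs(pixels):
        """Yield (start, length, value) for runs of identical pixels that are at
        least MIN_RUN long. 'pixels' is a flat sequence of one-byte pixel values
        for a 64x64 tile; shorter runs are left for ordinary encoding."""
        start = 0
        for i in range(1, len(pixels) + 1):
            if i == len(pixels) or pixels[i] != pixels[start]:
                run = i - start
                if run >= MIN_RUN:
                    yield (start, run, pixels[start])
                start = i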
In the general case, each user has access to zero or more transient storage pools and zero or more persistent storage pools. Each pool may be shared within one trust domain. This includes public read-only or read/write access. Each pool may be encrypted. Each pool may be accessed via 8 byte references and/or file name. Each pool may have a full-text index. And pools may be used in a union because this allows text, HTML, archives and multi-media to be indexed and searched in a useful manner.
However, the feature that I think will interest you the most is the ability to shoot, edit and distribute video without using lossy compression. This arrangement has a passing resemblance to the Avid proxy format which stores 16×16 pixel tiles as a 32 bit lossy texture. Instead, 64×64 pixel tiles (or similar) are stored as 64 bit lossless references and no proxy is required. You may think that pixel dance would prevent de-duplication but you are not thinking on a large enough scale. Four variations of brightness across 13 samples is a maximum of 2^26 (roughly 67 million) variations. However, these variations are not restricted to local matches. Variations may be matched almost anywhere across any frame of video. Or, indeed, any data in the same storage pool. Admittedly, savings for the first million videos or so will be dismal but returns are increasingly likely as the volume of data increases. A well known result from search engines is that unique search terms grow proportionately to the square root of the number of documents. This applies to a corpus of documents of any language and any average length. I'm applying the same result to byte sequences within multi-media. For 13 bytes (104 bits), the square root of the space is 2^52 permutations. Longer sequences occur less frequently but retain an essential supporting rôle.
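The arithmetic behind those figures is easy to back-check:-

    import math
    print(4 ** 13 == 2 ** 26)              # True: four brightness levels over 13 samples
    print(2 ** 26)                         # 67108864, roughly 67 million variations
    print(math.isqrt(2 ** 104) == 2 ** 52) # True: square root of the 13 byte (104 bit) space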
Lossy compression of audio and video provides better than average matching because pixel dance is smoothed and the remaining data has correlation. Data may be locally compressed with some efficiency but the remaining data is skewed towards some permutations. Any skew improves the probability of matches beyond chance. This becomes ridiculously common when there are exabytes of data within the system. This may seem like an article of faith with shades of network effects and dot com economics but there is an upper bound to this lunacy.
Sun Microsystems found that 2^128 bytes is the maximum useful size for a contiguous filing system. In classical physics, the minimum energy required to store or transmit one bit of information is one Planck unit of energy (or the absence of one Planck unit of energy). Very approximately, 2^64 Planck units of energy is sufficient to boil a kettle and 2^128 Planck units of energy is sufficient to boil all of the water in all of the world's oceans. Therefore, operations on a sufficiently large filing system are sufficient to cause environmental havoc. My insight is to take this bound, apply well known results from search engines and resource allocation, and reduce everything successively down to references of 64 bits. If that isn't sufficient then it may not be possible for any solar system to support requirements. In this case, the filing system may be installed adjacent to, inside, or using a black hole. No known vendor supports this option.
I suggested that I work on a presentation while my ex-boss worked on a business plan which explained how to avoid obvious problems. I reduced a market overview and USP [Unique Selling Point] to 23 slides. I haven't seen any business plan but I optimistically assume that the slides have been shown to investors. As a contingency, I am now working on a business plan. This is an extremely competitive market with an alarming number of failures. I prefer to offer on-site storage to the private sector. We may have to offer off-site storage. In the worst case, we may allow one instance to be used as a dumping ground. Sale of apparel is definitely unviable. Advertising is also unlikely to work. Ignore the intrusiveness of advertising and the extent to which it is a security problem. As advertising has become more targeted, the signalling power of an expensive advert has been lost. This is why adverts on the web are somewhere between tacky and payola.
In a public storage pool, each sequence of bytes requires an owner and an access count. Each owner would be billed periodically for storage and would receive a proportion of retrieval fees. However, an owner may receive a charge-back if fees have been received for trademarks, copyright or public domain material. This would be at a punitive rate for the purpose of discouraging abuse. Ideally, illegal content would be assigned to a locked account and this would prevent copies being distributed. Unfortunately, retaining a reference version of illegal content is invariably illegal.
In this arrangement, it is expected that the most popular accounts would receive payment, a mid tier would obtain indefinite free use and a bottom tier would pay a nominal fee to access data. Anti-social use would be met with contractual and economic penalties. However, persistent trolls have been known to collectively raise US$12000 or more. So, penalties may have to be very steep.
Illegal content varies with jurisdiction. By different chains of reasoning, it may be illegal to depict almost anyone. In some jurisdictions, depiction of women is illegal. In other jurisdictions (including the ones in which we are most likely to incorporate), equality applies. Therefore, it would be illegal to block depictions of women without also blocking depictions of men. By a different chain, if people are created in God's image, if Allah is God and if depiction of Allah is illegal blasphemy then it is obviously illegal to depict anyone. By another chain, identity and self-identity may also make depiction illegal. Unfortunately, this cannot be resolved by asking someone what they want to view because profiling nationality is illegal in Sweden and profiling religion is illegal in Germany.
However, I have yet to mention the most ludicrous case. In some jurisdictions, such as the UK, illegal child pornography extends to the depiction of cartoons and text files. In the most absurd case, a sequence of punctuation may be illegal. In the case of a de-duplicated filing system, some sequences of punctuation, possibly by the same author, may appear in multiple files. In some files, a sequence of punctuation may be legal. In other files, the same instance of punctuation may be illegal. A similar situation may occur with medical information. A sentence recorded by a doctor may require "secure" storage. The same text may be published by someone else. Some content may be licensed on a discriminatory basis; most commonly this involves separate cases for the private sector and/or public sector and/or individuals. This can be resolved contractually by defining all access as commercial. Oddly, this doesn't affect the GNU General Public License, Version 2, but it excludes a large amount of photography and music. See the previous cases for photography. There is no technological solution for these problems. Indeed, it should be apparent that almost everyone has files which would be illegal in one jurisdiction or another.
Competitor analysis has been more fruitful. One competitor has so much correct and so much completely wrong. Tarsnap.com can be concisely described as a progression of my second prototype. Data is split in a manner which covers all cases. This was achieved with a mathematical proof which leads to the use of a rolling hash. This sequentially splits a file into variable lengths with a skewed Poisson distribution. However, throughput (and compression) is probably terrible because it uses the same zlib compression algorithm and, entertainingly, a defensive architecture due to historical buffer overflows in zlib. Marketing and encryption are good but this is all wasted by out-sourcing storage to a direct competitor. Indeed, the system is only viable because the sub-contractor does not charge for the extensive internal bandwidth required to perform integrity checks. So, data is encrypted and the only copy is sent directly to a rival where it is mixed with two tiers of other clients' data. That'll be spectacular when it fails. Unfortunately, that's the competent end of the off-site storage market.
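For comparison, here is a bare-bones sketch of splitting data with a rolling hash; the window size, toy polynomial hash and boundary mask are illustrative choices of mine and not Tarsnap's actual parameters:-

    WINDOW = 48                    # bytes in the rolling window (an illustrative choice)
    BASE, PRIME = 257, 1000003     # toy polynomial hash parameters
    MASK = (1 << 13) - 1           # boundary test giving an average chunk of roughly 8KB

    def chunk_boundaries(data):
        """Yield the end offset of each chunk of 'data' (a bytes object). A boundary
        is declared wherever the rolling hash of the previous WINDOW bytes has its
        low 13 bits set, so chunk lengths follow a skewed (roughly geometric)
        distribution and an insertion only disturbs the boundaries near it."""
        if len(data) <= WINDOW:
            yield len(data)
            return
        top = pow(BASE, WINDOW - 1, PRIME)
        h = 0
        for i in range(WINDOW):
            h = (h * BASE + data[i]) % PRIME
        for i in range(WINDOW, len(data)):
            if (h & MASK) == MASK:
                yield i
            h = ((h - data[i - WINDOW] * top) % PRIME * BASE + data[i]) % PRIME
        yield len(data)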
I investigated implementation of a general purpose de-duplicating filing system. It has become accepted practice to develop and deploy filing systems outside a kernel by using FUSE, PUFFS or similar. This increases security, isolation and monitoring. It also allows filing systems to be implemented, in any language, using any existing filing system, database or network interface. In the manner that it is possible to mount a read-only CDROM image and view the contents, it is possible to mount a de-duplicated storage pool if it has a defined format within any other filing system. With suitable hashing and caching, a really dumb implementation would have considerable merit. For my immediate purposes, I could use it as a local cache of current and historical web sites.
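In that spirit, a really dumb implementation can be little more than a directory of hashed chunks plus a recipe file per stored object; the layout and naming below are invented purely for illustration:-

    import hashlib, os

    class DumbStore:
        """Minimal content-addressed store: each unique chunk is written once to
        chunks/<sha256>, and each stored object becomes a 'recipe' listing digests."""
        def __init__(self, root):
            self.chunk_dir = os.path.join(root, 'chunks')
            self.recipe_dir = os.path.join(root, 'recipes')
            os.makedirs(self.chunk_dir, exist_ok=True)
            os.makedirs(self.recipe_dir, exist_ok=True)

        def put(self, name, data, chunk_size=4096):
            digests = []
            for offset in range(0, len(data), chunk_size):
                chunk = data[offset:offset + chunk_size]
                digest = hashlib.sha256(chunk).hexdigest()
                path = os.path.join(self.chunk_dir, digest)
                if not os.path.exists(path):        # duplicate chunks cost nothing
                    with open(path, 'wb') as f:
                        f.write(chunk)
                digests.append(digest)
            with open(os.path.join(self.recipe_dir, name), 'w') as f:
                f.write('\n'.join(digests))

        def get(self, name):
            with open(os.path.join(self.recipe_dir, name)) as f:
                digests = f.read().split()
            out = []
            for digest in digests:
                with open(os.path.join(self.chunk_dir, digest), 'rb') as f:
                    out.append(f.read())
            return b''.join(out)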
Anyhow, please back-check my figures. Optionally, tell me that it is an impractical idea or consider defining your own rôle writing monitoring software, accounting software, video editing software or whatever you'd like to do.
Yours De-Dupididly
Evil Doctor Compressor
I didn't include the slides but the collated text is as follows:-
Fibonacci Filing System
The Problem
The volume of digital data is growing exponentially.
Every hour, more scientific data is synthesized and extrapolated than was gathered by observation during the entire 20th century.
Moore's Law
Components on a silicon chip double every two years.
Digital camera quality doubles every 17 months.
Computer network capacity doubles every nine months.
We are drowning in data!
Coping Strategy: Compression
Techniques to reduce volume of data include:-
Content-aware compression.
Content-oblivious compression.
If content is known then compression may be lossy or lossless.
Lossy compression is typically applied to audio and video.
Code-Book Compression
A succession of compression techniques is increasingly effective but requires an increasing amount of working memory:-
PKZip - Uses tiers of small code-books.
GZip - Refinement of PKZip.
BZip2 - Burrows-Wheeler Transform.
LZMA - Uses code-book with millions of entries but requires up to 900MB RAM.
File De-Duplication
A filing system may eliminate storage of redundant data.
Covers all cases because it applies to all data handled by all applications.
Techniques work at different levels of granularity:-
Used in closed systems to provide a modicum of tamper-proofing.
Used in legal and medical systems.
Used in open systems to provide a document reference.
Root cause of Dropbox and MegaUpload's privacy and copyright problems.
Does not work with changing data.
Block De-Duplication
Implemented by LVM, ZFS and other systems.
Commonly used to provide daily, rolling snapshots of live systems.
Block size is mostly immaterial.
Byte De-Duplication
Most effective scheme.
Most fragile scheme.
Most energy intensive scheme.
Candidates for compression may require up to 0.5 million comparison operations.
May be implemented with custom hardware to reduce time and energy.
The Unique Selling Point
General trend is towards larger block sizes.
This provides an approximate fit for the throughput of hard-disks and solid-state storage.
However, unconventional block sizes approximate optimal efficiency without the processing overheads.
Fibonacci Numbers
Common sequence in mathematics.
Described by Leonardo of Pisa (Fibonacci) in 1202 when studying the breeding of rabbits but occurs throughout nature:-
Flower petals.
Pine-cones.
Snail shells.
Used by humans in Greek architecture and computer memory allocation.
Counting Theorem
Fibonacci numbers may be used in compression schemes if the counting theorem is not violated. Specifically, there must be a unique mapping:-
From data to code-word.
From code-word to data.
We propose:-
Mapping 13 byte data to eight byte code-word.
Mapping 21 byte data to eight byte code-word.
System Capacity
LZ compression schemes, such as PKZip, use tiers of code-books with thousands of entries.
LZMA compression schemes use one code-book with millions of entries.
The proposed scheme uses two code-books, each with 16 billion billion entries.
This is sufficient to hold a maximum of 557,056 petabytes of unique data and any multiple of duplicate data.
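A quick arithmetic check, assuming one 13 byte and one 21 byte code-book, each with 2^64 entries:-

    entries = 2 ** 64                            # one 8-byte reference per entry
    unique_bytes = entries * 13 + entries * 21   # both code-books fully populated
    print(unique_bytes / 2 ** 50)                # 557056.0 petabytes of unique data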
Theoretical Upper Bound
ZFS uses 128 bit addressing on the basis that:-
ℏ × 2^128 is vast.
Where:-
ℏ is the minimum detectable quantity of energy
and "vast" is enough energy to boil all of the water in all of the world's oceans.
Infinite Monkey Test
Can an infinite number of monkeys hitting keys on an infinite number of typewriters exhaust storage? Yes, but with difficulty.
Historical text files used 96 symbols.
For a 13 byte sequence, this is a maximum of roughly 0.0003% of the possible input permutations but it exceeds the capacity of the code-book by a factor of more than three million.
Infinite Monkey Test
Capacity is exceeded if input occurs without duplication.
This becomes increasingly difficult as data accumulates within a system.
Can be achieved maliciously if pseudo-random input is designed such that it never repeats.
This case can be countered with a traditional file quota arrangement.
It is also trivial to identify anomalous use.
Birthday Paradox
Why is it increasingly difficult to exhaust storage capacity?
This is due to the counter-intuitive birthday paradox where, in a group of only 23 people, it is more likely than not that two share a birthday.
Matches become likely once the number of possible values is no more than approximately the square of the number of candidates.
This bridges the gap between a code-book with 2^64 entries and an upper bound of 2^128 (or less).
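The classical figure can be checked directly:-

    def probability_shared_birthday(people):
        """Probability that at least two people in a group share a birthday."""
        p_all_distinct = 1.0
        for i in range(people):
            p_all_distinct *= (365 - i) / 365
        return 1 - p_all_distinct

    print(probability_shared_birthday(23))   # about 0.507, just over one half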
Video De-Duplication
Standard video encoding may greatly reduce volume of data. However:-
Compression between adjacent pixels may occur due to reduction in contrast.
Compression is unlikely to involve more than 16 frames of video.
A global code-book provides additional opportunities for processing lossless or lossy video.
Audio De-Duplication
Trivial to share silence in uncompressed audio.
De-duplication of near silence increases as examples become more common.
Easier to de-duplicate compressed audio.
AMR (used in cell phones) uses a scheme which minimizes latency and survives high loss and corruption. Data is also fixed-length.
Duplicates become increasingly common with duration of conversation.
Clustering
An unlimited number of computers may participate in one storage pool.
Each node is allocated a unique numeric range within each code-book.
New code-words are replicated and distributed using standard techniques.
Duplicate mappings may occur within a concurrent system. Normally, this is inconsequential.
Full-Text Indexing
Where block size is smaller than an average sentence (and smaller than a long word):-
There is an upper bound for the quantity of search-terms within a fragment of text.
There is an upper bound for search-term length within a fragment of text.
There is strictly zero or one search-term between adjacent fragments of text.
Reduced size and scope means that a search engine is almost a corollary of de-duplication.
Applications
Traditional applications include:-
Enterprise document and mailbox storage.
File versioning and database snapshots.
Media distribution.
Clustered web caching.
Applications
Novel applications include any permutation of Project Xanadu, search engine, media broadcast, remote desktop and virtual reality:-
Distributed storage and caching with full-text index.
Hyper-text and hyper-media with robust citations and micro-payment.