from the bitrot dept.
David Rosenthal discusses the last 25 years of digital preservation efforts with regard to academic journals. It's a long-standing problem, and discontinued journals continue to disappear from the Internet. Paper, microfilm, and microfiche are slow to degrade and are decentralized and distributed. Digital media are quick to disappear, and digital publications usually reside in a single physical place, which creates a single point of failure. It takes continuous, unbroken effort and money to keep digital publications accessible even if only one person or institution wishes to retain access. He goes into the last few decades of academic publishing and how we got here, and then brings up four points about preservation, especially with regard to Open Access publishing.
Lesson 1: libraries won't pay enough to preserve even subscription content, let alone open-access content.
[...] Lesson 2: No-one, not even librarians, knows where most of the at-risk open-access journals are.
[...] Lesson 3: The production preservation pipeline must be completely automated.
[...] Lesson 4: Don't make the best be the enemy of the good. I.e. get as much as possible with the available funds, don't expect to get everything.
He posits that focus should be on the preservation of the individual articles, not the journals as units.
(2020) Internet Archive Files Answer and Affirmative Defenses to Publisher Copyright Infringement Lawsuit
(2018) Vint Cerf: Internet is Losing its Memory
(2014) The Importance of Information Preservation
As the world slowly moves towards a 100% digital existence, and increasingly consumes their information online, we run the risk of destroying our own legacy. Consider this hypothetical future narrative:
Historians are at a loss to explain the demise of the first pan-human civilisation, as although they agree that the populace dwindled and went almost extinct at around AD 3500, there seem to be no surviving written historical records that can be dated any later than circa AD 2000.
It can only be assumed that around this time there was a sudden uptake of illiteracy, maybe caused by a new religion or global-governmental policy. There are surviving references to an organization or group known as the Inter Nets. We can only guess at what this actually was, but the commonly accepted theory is that it was actually some type of wearable mesh harness that prevented humans of this era from actually writing anything down.
Sound ridiculous? I'm not so sure. As information is continually and fully migrated from the printed page onto the Internet, we lose the permanency that a book or ancient scroll brings. Paper and parchment, when stored correctly, can survive for thousands of years, and if not, the information held within can be transcribed into replacement volumes when required. If it wasn't for the (well documented) fire that destroyed the Library of Alexandria, we'd still have knowledge of the information that was contained there today.
Vint Cerf, the godfather of the Internet, spoke in Sydney, Australia on Wednesday and issued a blunt call to action for a digital preservation regime for content and code to be quickly put in place to counter the existing throwaway culture that denies future generations an essential window into life in the past. He emphasized that this was especially needed for the WWW. Due to the volatile nature of electronic storage media as well as the format in which information is encoded, it is not possible to preserve digital material without prior planning and action.
[...] While the digital disappearance phenomenon is one which has mainly vexed official archivists and librarians for some years now, Cerf's take is that as everything goes digital from creation, the risk of accidental or careless memory loss increases correspondingly.
Archivists have for decades fought publicly for open document formats to hedge against proprietary and vendor risks – especially when classified material usually can only be made public after 30 to 50 years, sometimes longer.
From iTnews: Internet is losing its memory: Cerf
Internet Archive Tells Court its Digital Library is Protected Under Fair Use
The Internet Archive has filed its answer and affirmative defenses in response to a copyright infringement lawsuit filed by a group of publishers. Among other things, IA believes that its work is protected under the doctrine of fair use and the safe harbor provisions of the DMCA.
[...] The statement spends time explaining the process of CDL – Controlled Digital Lending – noting that the Internet Archive provides a digital alternative to traditional libraries carrying physical books. As such, it "poses no new harm to authors or the publishing industry."
[...] "The Internet Archive has made careful efforts to ensure its uses are lawful. The Internet Archive's CDL program is sheltered by the fair use doctrine, buttressed by traditional library protections. Specifically, the project serves the public interest in preservation, access and research—all classic fair use purposes," IA's answer reads.
"As for its effect on the market for the works in question, the books have already been bought and paid for by the libraries that own them. The public derives tremendous benefit from the program, and rights holders will gain nothing if the public is deprived of this resource."
Internet Archive's Answer and Affirmative Defenses (PDF).
Previously: Internet Archive Suspends E-Book Lending "Waiting Lists" During U.S. National Emergency
Authors Fume as Online Library "Lends" Unlimited Free Books
Publishers Sue the Internet Archive Over its Open Library, Declare it a Pirate Site
Internet Archive Ends "Emergency Library" Early to Appease Publishers
EFF and California Law Firm Durie Tangri Defending Internet Archive from Publisher Lawsuit
(Score: 0) by Anonymous Coward on Sunday September 20 2020, @06:07PM (1 child)
Doesn't the Internet Archive / Wayback Machine archive these sites?
If you have some you want archived, then just add them manually (but I have a feeling that the IA already archives this whole class of data?)
(Score: 0) by Anonymous Coward on Sunday September 20 2020, @07:52PM
Trust no one, especially 'free' ones.
(Score: 2) by looorg on Sunday September 20 2020, @06:13PM (3 children)
Overall isn't the digital age just going to be one big black hole as far as a lot of historians will and are concerned? Sure, we lost a lot of things from the past, but some of the stuff gets preserved. Digital can all just go *poof* or get deleted at a moment's notice. Then there is all the encrypted data that might just exist as big data blobs we can't access, so they might as well be deleted or gone.
That said, I guess it's going to be somewhat similar in science for the future, except that a lot of the important work gets reprinted and reused a lot. So it won't go away unless we are talking apocalyptic changes to society. But a lot of the poor, or bad, scientific papers might as well just be piped straight to /dev/null and won't be missed by anyone. Few if any people read them and they usually don't provide much in the realm of long-term value. So the good science is probably going to be around.
For the private sector I guess a similar dark hole could be around for smaller companies, as they might lose a lot of their data and work too. Only really important stuff is kept on a paper record, and those records might just be a print-out, and those have horrible retention as the years go by.
The great digital loss will probably be a lot of the common, or normal, people data for those scientists that are into observing normal people like. After all more and more people are leaving less and less of a permanent footprint behind in the form of non-digital documentation. But still probably not a great loss for humanity as a whole that we don't have a perfect preservation of all the things people posted on social media.
(Score: 0) by Anonymous Coward on Sunday September 20 2020, @06:23PM
Speaking of encrypted blobs, doesn't Assange still have that file called Insurance.zip floating in the ether? Time to pop that baby open.
(Score: 2) by c0lo on Monday September 21 2020, @04:30AM
Internet... between the need of "the right to be forgotten" and the *poof* of Open Access Journals.
(Score: 2) by PiMuNu on Monday September 21 2020, @10:09AM
> Overall isn't the digital age just going to be one big black hole as far as a lot of historians will and are concerned?
Nope. Plenty of plastic waste to trawl through. Fossil record even will have quite something to say about the digital age.
(Score: 3, Interesting) by bzipitidoo on Sunday September 20 2020, @06:55PM (12 children)
Digital storage can be way cheaper. I find somewhat suspect these protestations that budget problems are crippling efforts to operate digitally. I should guess copyright law is a bigger impediment.
As far as not letting the perfect be the enemy of the good, we've also been bad at encoding data. Portable Document Format sucks, but for many things, it's the best we have. PDF is notorious for bloat. (For instance, for that and other reasons, whenever the green site linked to a PDF, it was accompanied by a statement "Warning: PDF".) And, while fairly open, it's still too proprietary. Most of all, it can be a total pain to modify. A misspelling or typo often can't be easily fixed. It wasn't designed to be modified, and I think that was a fundamental mistake. One thing I find very interesting is arXiv's preference for the source, in LaTeX. They will take a PDF, but they'd rather have the LaTeX.
(Score: 0) by Anonymous Coward on Sunday September 20 2020, @07:54PM
and lose any accompanying images
(Score: 0) by Anonymous Coward on Sunday September 20 2020, @09:17PM
ArXiv's desire for LaTeX is no different than your preference for your music in something like OGG: you can convert it into whatever format you need it to down the road.
(Score: 2) by hendrikboom on Sunday September 20 2020, @10:45PM
Compress it to djvu.
(Score: 5, Interesting) by deimtee on Sunday September 20 2020, @11:46PM (8 children)
That was the intention. PDF wasn't meant to be an archive format; it was designed for the print industry. It was meant to be a replacement for hardcopy. You wrote your document in whatever program you liked, and created a PDF by "printing" to the PDF driver. To the system, a PDF driver appeared to be a PostScript printer. The PDF file included all fonts and graphics and should appear the same on any machine capable of displaying or printing it. This greatly reduced the need for back and forth proof copies and edits between creators and printers.
Adobe Acrobat 4.0 was about the peak of this usability and was really good at what it did.
Of course people started saying "wouldn't it be great if we could fix this typo instead of resupplying the whole file" and so PitStop was created to edit PDFs. It mostly even acted like hardcopy editing.
Then Adobe started trying to add other functionality like forms and playable multimedia. There was even a push by Adobe to use PDFs as web pages at one point - "Make your website look the same on every display device" appealed to designers but was hated by users. PDF is still universally used in the print industry, but in my opinion most of the stuff added since Acrobat 4.0 has been worse than useless.
No problem is insoluble, but at Ksp = 2.943×10^−25 Mercury Sulphide comes close.
(Score: 2) by canopic jug on Monday September 21 2020, @04:12AM (2 children)
On top of that, PDF is a scripting language not a presentation or layout language. We'd be much better off standardizing on TeX.
Money is not free speech. Elections should not be auctions.
(Score: 0) by Anonymous Coward on Monday September 21 2020, @06:34AM (1 child)
Isn't TeX still scripting since it's Turing complete?
(Score: 0) by Anonymous Coward on Monday September 21 2020, @09:16AM
Postscript (an interpreted programming language) is also Turing Complete,
...but not intended to be human-written. Originally files were all ASCII, but later raster images were allowed too.
Fun tip -- if you have .ps source files, they can be displayed directly by lightweight ebook reader SumatraPDF.
(Score: 0) by Anonymous Coward on Monday September 21 2020, @10:15AM (2 children)
You know the saying if it ain't broken don't fix it? So many things were great at first but everyone tries to 'fix' that which ain't broken to stay relevant instead of just leaving it alone and they end up just making everything worse.
That's why you now have ads on paid cable TV. It ain't broken but, hey, let's find a way to milk even more money out of everyone.
(Score: 0) by Anonymous Coward on Monday September 21 2020, @10:17AM (1 child)
Also it's why copy protection laws kept on getting continuously extended and expanded. It wasn't 'broken' per say but the industry had to keep on selfishly lobbying to make the laws even more and more ridiculous to the point where now pretty much everyone hates these laws and they hate the industry for lobbying for them.
(Score: 0) by Anonymous Coward on Monday September 21 2020, @10:21AM
err .... per se *
(Score: 2) by bzipitidoo on Monday September 21 2020, @02:44PM (1 child)
> PDF ... was designed for the print industry.
Yes, a digital format designed for print. You appreciate the irony in that.
Thanks for mentioning Pitstop. I didn't know of it. The tools I know are the libre ones, such as Okular, Evince, pdftohtml, pdftotext, and several other command line tools that start with "pdf". Of course none of them can quite do all the things the commercial tools from Adobe can do. For instance, Okular can add text, but it can't do a proper job of adding images. While it has the ability to add images, it does so in a non-standard way that can only be read by Okular. Oh, and that method of adding text is not making use of the fillable forms ability that was added to the PDF standard, it's just a simple insertion of additional text. Not that the blank form the other business made available made use of fillable forms, still pretty rare to see that, which pretty much forces the users to use the hackish way of inserting text that Okular can do.
Speaking of other functionality, the business I'm working with is trying to use PDF as an all-purpose digital document format for business. They need to fill in and sign documents which are often provided in PDF format only. It's another weirdness about the business world that they must have the source documents they used to create the PDF, but they behave as if giving those out is giving away trade secrets or something. The exact same form in PDF is okay, but the docx original is top secret! Why, if you had the source, your business might just copy their business's document, and, and, use it! Not that docx is a great format for business either.
So the easiest way to sign a document is to print it out, sign the paper by hand, then scan it, to PDF. (Some Very Important People sign documents so often they've had a custom rubber stamp of their signature made. Yeah, the literal rubber stamp.) That can also be the easiest, fastest way to fill out a form. Of course that loses all the text, thanks to the scanner treating the scan as a simple raster image. But that's the best way to scan a text document because OCR is not reliable enough to be trusted with an automatic conversion of a raster image back to text. Businesses sure don't want to spend the time and money to have a person check and correct errors in OCR jobs. So there's another factor that bloats the heck out of document storage. Let's take a sheet that was 20k of text, and turn it into 350k or more of raster image, woohoo! Good for sellers of digital storage.
Adobe has added digital signatures to the PDF standard. I am not clear on what they mean by that. Sounds like it's a joke. We're going to read a handwritten signature from one of those crappy pen tablets that are typically attached to a credit card processing machine, and digitally sign it with some sort of DRM-like, self-certified public key that is worthless for proving that a signature is genuine thanks to the self-certifying manner in which it is created, and pass that off as a "secure" digital signature. And more, we're just going to trust that the signature we're digitally signing is not a forgery. Maybe Adobe even tells businesses how to do it right to make it genuinely secure, but this involves too much work, so, wink, wink, businesses skip it and take the easy and insecure route. Oh well, handwritten signatures always were extremely weak proof that a document has been accepted and endorsed, so it's not like Adobe and their customers taking shortcuts has made matters any worse. I am particularly amused by the software that doesn't even use a touch tablet (even if it is being run on such a device), instead giving the users a choice of several different handwriting fonts. Guess you're supposed to pick whichever font looks closest to your actual handwriting, but no one asks even for that.
(Score: 2) by deimtee on Tuesday September 22 2020, @02:39AM
I was a variable data specialist for a while. Been out of the print industry a few years now, but one of the ways I used to add stuff to PDFs was to put one in another [word|quark|indesign] document as a full page image, add text boxes or pics as required, then print that to the PDF driver (Acrobat Distiller) to make a new PDF incorporating the new elements.
Seems like you could almost automate that. Drop a pdf form in a hot folder, have Office or whatever open a new document and drop it in as the background, paste in a pic of your signature (black text on a transparent background), stop at that point to let you drag/resize the signature, click another custom script button to create the new PDF (called "DocumentName_signed.pdf") and drop it in your signed document output folder. You could even have it open up in a pdf-reader to check it worked.
No problem is insoluble, but at Ksp = 2.943×10^−25 Mercury Sulphide comes close.
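The hot-folder idea in the comment above could be sketched roughly as follows. This is only an outline under stated assumptions: the names `process_hot_folder`, `signed_name`, and `copy_only_stamp` are hypothetical, the step that actually places a signature image on the page is left as a pluggable function (the placeholder here just copies the file), and the interactive drag/resize step the comment mentions is not modelled.

```python
import shutil
from pathlib import Path
from typing import Callable


def signed_name(source: Path) -> str:
    """Build the output name, e.g. 'Lease.pdf' -> 'Lease_signed.pdf'."""
    return f"{source.stem}_signed{source.suffix}"


def process_hot_folder(
    inbox: Path,
    outbox: Path,
    stamp: Callable[[Path, Path], None],
) -> list[Path]:
    """Run `stamp` on every PDF in `inbox` that has no signed copy yet.

    `stamp(source, target)` is a caller-supplied (hypothetical) function
    that writes a signed version of `source` to `target`.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    produced = []
    for source in sorted(inbox.glob("*.pdf")):
        target = outbox / signed_name(source)
        if target.exists():
            # Already handled on an earlier pass; leave it alone.
            continue
        stamp(source, target)
        produced.append(target)
    return produced


def copy_only_stamp(source: Path, target: Path) -> None:
    """Placeholder stamp: copies the file unchanged.

    A real version would overlay the signature image using a PDF tool
    of your choice; that part is deliberately not shown here.
    """
    shutil.copyfile(source, target)
```

Calling `process_hot_folder(Path("inbox"), Path("signed"), copy_only_stamp)` from a scheduled job or a simple loop would give the "drop it in, pick it up signed" behaviour, with the real stamping function swapped in for the placeholder.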
(Score: 1) by fustakrakich on Sunday September 20 2020, @09:58PM
They hold up pretty well. Then carve our Nvidia v. ATI benchmarks on the tunnel walls. Two thousand years from now the archeologists will think we were either writing prayer books or a shopping list.
La politica e i criminali sono la stessa cosa..
(Score: -1, Flamebait) by Anonymous Coward on Sunday September 20 2020, @10:42PM
Some journals are better off disappeared.
(Score: 2) by darkfeline on Sunday September 20 2020, @10:48PM (3 children)
> It takes continuous, unbroken effort and money to keep digital publications accessible even if only one person or institution wishes to retain access.
This sounds like it was paid for by the academic journal cabal. Make it legal to share and copy these, and let all the universities and researchers set up torrent seedboxes. I doubt all of the academic papers in the world together take up more space than a modern high quality full length movie.
Join the SDF Public Access UNIX System today!
(Score: 3, Interesting) by HiThere on Monday September 21 2020, @01:31AM
Anything that relies on dynamically maintained documents is dubious for archival data. CDs are better than DVDs, because they are more robust against damage/deterioration.
The problem is the difference between easy access and good archival quality, and the answer should be "use different media". Also ANYTHING that depends on encrypted keys being kept available is right out the window. It's totally useless for archival data. (Even if you can break the encryption, it makes it a lot more subject to errors causing the whole thing to be unreadable.)
The problem is CDs aren't stable over long periods of time. They're good for multiple decades if handled carefully, but they are inherently unstable, so they probably won't hold up for a century even in ideal conditions. (This isn't inherently true, but it's true for the versions that could be written by a home computer.) Microfiche were a lot better in this regard, but reading them by computer was a real problem.
The thing is, there hasn't been a lot of work done on producing archival quality media. There's little reward when you produce it, because most customers are more interested in ease of use, and lasting "long enough". Currently probably the best choice for large quantities of data is removable disk drives, but that's hardly archival quality. It lasts a decade or two if there aren't any unexpected problems. After that recovering the data is likely to be a major project, requiring opening the sealed drive, replacing the lubricants, and resealing it...at best.
(Score: 2) by canopic jug on Monday September 21 2020, @03:17AM (1 child)
I agree that it should be legal to share, copy, and re-distribute articles indefinitely. Torrents (when not centralized) would be one easy, currently existing publication technology, but what mechanism do you propose to ensure the authenticity and general integrity of said documents? The situation we have now is that they are sourced from a single web site. While that ensures the authenticity, it also introduces a single point of failure. If we encourage a distributed model, which we should and which is long overdue, then you have the problem of making sure that the article and its contents have not changed either by accident or on purpose.
Money is not free speech. Elections should not be auctions.
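One common way to approach the integrity half of that question is a checksum manifest: whoever first publishes the articles records a cryptographic digest for each file, and any mirror or reader can recompute the digests later. The sketch below is only an illustration; the function names `sha256_of`, `build_manifest`, and `verify` are made up for this example. It detects changes, accidental or deliberate, only if the manifest itself comes from a trusted source. Authenticity would additionally need something like a publisher's digital signature over the manifest, which is not shown.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to `folder`) to its digest."""
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def verify(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List the manifest entries that are missing or whose content changed."""
    problems = []
    for name, expected in manifest.items():
        candidate = folder / name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(name)
    return problems
```

A mirror would run `verify(Path("mirror"), manifest)` against the published manifest and re-fetch anything it reports.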
(Score: -1, Troll) by Anonymous Coward on Monday September 21 2020, @05:12AM
Blockchain! That's the answer to everything!
(Score: 2) by legont on Monday September 21 2020, @02:01AM (1 child)
Perhaps, they should not kill Aaron Swartz https://en.wikipedia.org/wiki/Aaron_Swartz
"Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
(Score: 0) by Anonymous Coward on Monday September 21 2020, @01:31PM
“Backups are for wimps. Real men upload their data to an FTP site and have everyone else mirror it.” - Linus Torvalds
(Score: 0) by Anonymous Coward on Monday September 21 2020, @03:22AM
Can't have the dirty plebes sharing information and research, they might get smart and angry.