David Rosenthal discusses the last 25 years of digital preservation efforts for academic journals. It is a long-standing problem, and discontinued journals continue to disappear from the Internet. Paper, microfilm, and microfiche are slow to degrade and are decentralized and distributed. Digital media are quick to disappear, and digital publications usually exist in only one physical place, which creates a single point of failure. It takes continuous, unbroken effort and money to keep digital publications accessible, even if only one person or institution wishes to retain access. He goes through the last few decades of academic publishing and how we got here, then brings up four points about preservation, especially with regard to Open Access publishing.
Lesson 1: libraries won't pay enough to preserve even subscription content, let alone open-access content.
[...] Lesson 2: No-one, not even librarians, knows where most of the at-risk open-access journals are.
[...] Lesson 3: The production preservation pipeline must be completely automated.
[...] Lesson 4: Don't make the best be the enemy of the good. I.e. get as much as possible with the available funds, don't expect to get everything.
He posits that focus should be on the preservation of the individual articles, not the journals as units.
Previously:
(2020) Internet Archive Files Answer and Affirmative Defenses to Publisher Copyright Infringement Lawsuit
(2018) Vint Cerf: Internet is Losing its Memory
(2014) The Importance of Information Preservation
(Score: 3, Interesting) by bzipitidoo on Sunday September 20 2020, @06:55PM (12 children)
Digital storage can be way cheaper. I find somewhat suspect these protestations that budget problems are crippling efforts to operate digitally. I should guess copyright law is a bigger impediment.
As far as not letting the perfect be the enemy of the good, we've also been bad at encoding data. Portable Document Format sucks, but for many things, it's the best we have. PDF is notorious for bloat. (For instance, for that and other reasons, whenever the green site linked to a PDF, it was accompanied by a statement "Warning: PDF".) And, while fairly open, it's still too proprietary. Most of all, it can be a total pain to modify. A misspelling or typo often can't be easily fixed. It wasn't designed to be modified, and I think that was a fundamental mistake. One thing I find very interesting is arxiv's preference for the source, in LaTeX. They will take a PDF, but they'd rather have the LaTeX.
(Score: 0) by Anonymous Coward on Sunday September 20 2020, @07:54PM
and lose any accompanying images
(Score: 0) by Anonymous Coward on Sunday September 20 2020, @09:17PM
ArXiv's desire for LaTeX is no different than your preference for your music in something like OGG: you can convert it into whatever format you need down the road.
(Score: 2) by hendrikboom on Sunday September 20 2020, @10:45PM
Compress it to djvu.
(Score: 5, Interesting) by deimtee on Sunday September 20 2020, @11:46PM (8 children)
That was the intention. PDF wasn't meant to be an archive format, it was designed for the print industry. It was meant to be a replacement for hardcopy. You wrote your document in whatever program you liked, and created a PDF by "printing" to the PDF driver. To the system, a PDF driver appeared to be a PostScript printer. The PDF file included all fonts and graphics and should appear the same on any machine capable of displaying or printing it. This greatly reduced the need for back and forth proof copies and edits between creators and printers.
Adobe Acrobat 4.0 was about the peak of this usability and was really good at what it did.
Of course people started saying "wouldn't it be great if we could fix this typo instead of resupplying the whole file" and so PitStop was created to edit PDFs. It mostly even acted like hardcopy editing.
Then Adobe started trying to add other functionality like forms and playable multimedia. There was even a push by Adobe to use PDFs as web pages at one point - "Make your website look the same on every display device" appealed to designers but was hated by users. PDF is still universally used in the print industry, but in my opinion most of the stuff added since Acrobat 4.0 has been worse than useless.
If you cough while drinking cheap red wine it really cleans out your sinuses.
(Score: 2) by canopic jug on Monday September 21 2020, @04:12AM (2 children)
On top of that, PDF is a scripting language not a presentation or layout language. We'd be much better off standardizing on TeX.
Money is not free speech. Elections should not be auctions.
(Score: 0) by Anonymous Coward on Monday September 21 2020, @06:34AM (1 child)
Isn't TeX still scripting since it's Turing complete?
(Score: 0) by Anonymous Coward on Monday September 21 2020, @09:16AM
PostScript (an interpreted programming language) is also Turing complete,
https://en.wikipedia.org/wiki/PostScript#The_language
...but not intended to be human-written. Originally files were all ASCII, but later raster images were allowed too.
Fun tip -- if you have .ps source files, they can be displayed directly by lightweight ebook reader SumatraPDF.
(Score: 0) by Anonymous Coward on Monday September 21 2020, @10:15AM (2 children)
You know the saying if it ain't broken don't fix it? So many things were great at first but everyone tries to 'fix' that which ain't broken to stay relevant instead of just leaving it alone and they end up just making everything worse.
That's why you now have ads on paid cable TV. It ain't broken but, hey, let's find a way to milk even more money out of everyone.
(Score: 0) by Anonymous Coward on Monday September 21 2020, @10:17AM (1 child)
Also it's why copy protection laws kept on getting continuously extended and expanded. It wasn't 'broken' per say but the industry had to keep on selfishly lobbying to make the laws even more and more ridiculous to the point where now pretty much everyone hates these laws and they hate the industry for lobbying for them.
(Score: 0) by Anonymous Coward on Monday September 21 2020, @10:21AM
err .... per se *
(Score: 2) by bzipitidoo on Monday September 21 2020, @02:44PM (1 child)
> PDF ... was designed for the print industry.
Yes, a digital format designed for print. You appreciate the irony in that.
Thanks for mentioning PitStop. I didn't know of it. The tools I know are the libre ones, such as Okular, Evince, pdftohtml, pdftotext, and several other command line tools that start with "pdf". Of course none of them can quite do all the things the commercial tools from Adobe can do. For instance, Okular can add text, but it can't do a proper job of adding images. While it has the ability to add images, it does so in a non-standard way that can only be read by Okular. Oh, and that method of adding text does not use the fillable forms ability that was added to the PDF standard; it's just a simple insertion of additional text. Not that the blank form the other business made available used fillable forms (it's still pretty rare to see that), which pretty much forces users into the hackish way of inserting text that Okular can do.
Speaking of other functionality, the business I'm working with is trying to use PDF as an all purpose digital document format for business. They need to fill in and sign documents which are often provided in PDF format only. It's another weirdness about the business world that they must have the source documents they used to create the PDF, but they behave as if giving those out is giving away trade secrets or something. The exact same form in PDF is okay, but the docx original is top secret! Why, if you had the source, your business might just copy their business's document, and, and, use it! Not that docx is a great format for business either.
So the easiest way to sign a document is to print it out, sign the paper by hand, then scan it to PDF. (Some Very Important People sign documents so often they've had a custom rubber stamp made of their signature. Yeah, the literal rubber stamp.) That can also be the easiest, fastest way to fill out a form. Of course that loses all the text, thanks to the scanner treating the scan as a simple raster image. But that's the best way to scan a text document because OCR is not reliable enough to be trusted with an automatic conversion of a raster image back to text. Businesses sure don't want to spend the time and money to have a person check and correct errors in OCR jobs. So there's another factor that bloats the heck out of document storage. Let's take a sheet that was 20k of text, and turn it into 350k or more of raster image, woohoo! Good for sellers of digital storage.
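To put rough numbers on that bloat, here is a back-of-the-envelope sketch. The 20k and 350k figures are the ones quoted above; the page size (US Letter), the 300 dpi resolution, and the 1-bit black-and-white depth are assumptions picked only for illustration, and the function names are made up for this example.

```python
def bloat_factor(text_kb: float, scan_kb: float) -> float:
    """How many times larger the scanned page is than the original text."""
    return scan_kb / text_kb


def raw_raster_bytes(width_in: float, height_in: float, dpi: int,
                     bits_per_pixel: int) -> int:
    """Uncompressed size of a scanned page, ignoring row padding and headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # Figures from the comment: 20k of text becomes 350k of raster image.
    print("bloat factor:", bloat_factor(20, 350))
    # Assumed scan settings: US Letter, 300 dpi, 1 bit per pixel.
    print("raw scan bytes:", raw_raster_bytes(8.5, 11, 300, 1))
```

Under those assumed settings the uncompressed bitmap comes to about 1 MB, so a 350k scan is already a compressed image and is still many times the size of the text it replaced.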
Adobe has added digital signatures to the PDF standard. I am not clear on what they mean by that. Sounds like it's a joke. We're going to read a handwritten signature from one of those crappy pen tablets that are typically attached to a credit card processing machine, and digitally sign it with some sort of DRM-like, self-certified public key that is worthless for proving that a signature is genuine thanks to the self-certifying manner in which it is created, and pass that off as a "secure" digital signature. And more, we're just going to trust that the signature we're digitally signing is not a forgery. Maybe Adobe even tells businesses how to do it right to make it genuinely secure, but this involves too much work, so, wink, wink, businesses skip it and take the easy and insecure route. Oh well, handwritten signatures always were extremely weak proof that a document has been accepted and endorsed, so it's not like Adobe and their customers taking shortcuts has made matters any worse. I am particularly amused by the software that doesn't even use a touch tablet (even if it is being run on such a device), instead giving the users a choice of several different handwriting fonts. Guess you're supposed to pick whichever font looks closest to your actual handwriting, but no one asks even for that.
(Score: 2) by deimtee on Tuesday September 22 2020, @02:39AM
I was a variable data specialist for a while. Been out of the print industry a few years now, but one of the ways I used to add stuff to PDFs was to put one in another [word|quark|indesign] document as a full page image, add text boxes or pics as required, then print that to the PDF driver (Acrobat Distiller) to make a new PDF incorporating the new elements.
Seems like you could almost automate that. Drop a pdf form in a hot folder, have Office or whatever open a new document and drop it in as the background, paste in a pic of your signature (black text on a transparent background), stop at that point to let you drag/resize the signature, click another custom script button to create the new PDF (called "DocumentName_signed.pdf") and drop it in your signed document output folder. You could even have it open up in a pdf-reader to check it worked.
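The hot-folder part of that idea could be sketched roughly like this. It is only a sketch under stated assumptions: the folder layout and the function names are hypothetical, and the actual step of placing the signature on the page is left to a caller-supplied `stamp` function, since that would need a PDF library or an office suite that this example does not assume. The placeholder `copy_only_stamp` just copies the file unchanged.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def signed_name(pdf_path: Path) -> str:
    """Build the output name, e.g. DocumentName.pdf -> DocumentName_signed.pdf."""
    return f"{pdf_path.stem}_signed.pdf"


def process_hot_folder(hot: Path, out: Path,
                       stamp: Callable[[Path, Path], None]) -> List[Path]:
    """Run stamp(src, dst) on every PDF in the hot folder not yet processed."""
    out.mkdir(parents=True, exist_ok=True)
    produced = []
    for pdf in sorted(hot.glob("*.pdf")):
        target = out / signed_name(pdf)
        if target.exists():
            continue  # a signed copy already exists, skip it
        stamp(pdf, target)
        produced.append(target)
    return produced


def copy_only_stamp(src: Path, dst: Path) -> None:
    """Placeholder (assumption): copies the PDF without adding a signature."""
    shutil.copyfile(src, dst)
```

Calling `process_hot_folder(Path("hot"), Path("signed"), copy_only_stamp)` on a schedule would cover the "drop it in, get it back" part; the manual drag/resize step described above would live inside whatever replaces the placeholder.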