
SoylentNews is people

posted by Fnord666 on Wednesday May 26 2021, @01:13PM
from the vendor-capture dept.

There are still a few months to fix this, but for now the US Patent and Trademark Office's (USPTO) Acting Commissioner for Patents, Andrew Faile, and Chief Information Officer, Jamie Holcombe, have announced that starting January 1st, 2022, the USPTO will institute a surcharge for applicants that are not locked into Microsoft products via the proprietary DOCX format. From that date onwards, the USPTO will move away from PDF and require all filers to use that proprietary format or face an arbitrary surcharge when filing.

First, we delayed the effective date for the non-DOCX surcharge fee to January 1, 2022, to provide more time for applicants to transition to this new process, and for the USPTO to continue our outreach efforts and address customer concerns. We've also made office actions available in DOCX and XML formats and further enhanced DOCX features, including accepting DOCX for drawings in addition to the specification, claims, and abstract for certain applications.

One of several major problems with the plans is that DOCX is a proprietary format. There are several variants of DOCX, and each of them is really only supported by a single company's products. Some other products have made progress in reverse engineering it, but are hindered by the lack of documentation. DOCX is a competitor to the fully documented, open standard OpenDocument Format, also known as ISO/IEC 26300.

DOCX is not to be confused with OOXML, though it often is. While OOXML, also known as ISO/IEC 29500, is technically standardized, it is incompletely documented and only vaguely related to DOCX. The DOCX format itself is neither fully documented nor standard. So the USPTO is also engaged in spreading disinformation by asserting that it is.

Previously:
(2015) Microsoft Threatened the UK Over Open Standards



Related Stories

Microsoft Threatened the UK Over Open Standards 40 comments

When the UK government announced plans to shift to the .odf Open Document Format, and away from Microsoft's proprietary .doc and .docx formats, Microsoft threatened to move its research facilities out of the UK.

The prime minister's director of strategy at the time, Steve Hilton, said that "Microsoft phoned Conservative MPs with Microsoft R&D facilities in their constituencies and said we will close them down in your constituencies if this goes through." "We just resisted. You have to be brave," Hilton said.


Although I am not a great lover of Microsoft, I'm not sure that this is any different than many other companies who will try to protect their profits - and, arguably, the jobs of their employees - when they can see the potential for the loss of business. But perhaps other companies are a little more subtle - especially when it is obvious that official papers will one day become public knowledge.

[Editor's Comment: This submission has been significantly edited - comment is not attributable to sigma]

[Editor's Comment: Please see public apology regarding this story.]

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 3, Insightful) by Anonymous Coward on Wednesday May 26 2021, @01:31PM (13 children)

    by Anonymous Coward on Wednesday May 26 2021, @01:31PM (#1138904)

    Why would they transition away from PDF to docx, I can think of only one reason.

    Which brings me to the next question, who is getting paid to screw us all over?

    • (Score: 2, Informative) by Anonymous Coward on Wednesday May 26 2021, @03:15PM (6 children)

      by Anonymous Coward on Wednesday May 26 2021, @03:15PM (#1138958)

      Sadly there is a very important reason to move from PDF to another format for legal documents: Modern PDF formats are no longer deterministic, meaning that the contents can change depending on when or where the file is viewed.

      • (Score: 1, Insightful) by Anonymous Coward on Wednesday May 26 2021, @03:52PM (5 children)

        by Anonymous Coward on Wednesday May 26 2021, @03:52PM (#1138974)

        And docx doesn't have that problem? Or more importantly, doesn't YET have that problem? Just because a mis-feature like that is in one format does not mean it can't/won't be implemented elsewhere.

        If this were really the issue, then the CORRECT solution would be to mandate use of a PDF SUBSET instead. Anything that does not work in Acrobat Reader 5.0.5 will automatically get stripped out.

        • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @04:39PM (3 children)

          by Anonymous Coward on Wednesday May 26 2021, @04:39PM (#1138992)

          Considering all the issues with word macro viruses of the past, it seems rather foolish to go with any standard that allows for things other than text to be embedded directly into the document.

          • (Score: 5, Interesting) by ElizabethGreene on Wednesday May 26 2021, @07:52PM (2 children)

            by ElizabethGreene (6748) on Wednesday May 26 2021, @07:52PM (#1139069)

            Patents rely very heavily on images, so you'd need a combination of text and images.

            If only we had some kind of hypertext markup language that allowed the combination of text and images into some form of document. It'd be really cool if it allowed you to specify headings, sections, subsections, image captions, etc. too.

            • (Score: 2) by nostyle on Thursday May 27 2021, @02:02AM (1 child)

              by nostyle (11497) on Thursday May 27 2021, @02:02AM (#1139154) Journal

              So why don't word processors input/output something like SGML anyway?

              • (Score: 2) by ElizabethGreene on Thursday May 27 2021, @03:47PM

                by ElizabethGreene (6748) on Thursday May 27 2021, @03:47PM (#1139335)

                The realization I'm working through is that they all do output something like SGML.

                For .rtf the control words all start with \ and there is some funny grouping with brackets.

                {\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
                  This is some {\b bold} text.\par
                  }

                (Source: Wikipedia)

                LaTeX likes \control words too.

                PDF was shown above.

                For .docx, sgml, and html, the control words are xml-like tags.

                Where I'm sitting it looks like it's markup all the way down.
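
                As a minimal sketch of the "xml-like tags" point for .docx: a .docx file is a ZIP package whose main body conventionally lives in word/document.xml. The Python below builds a tiny stand-in package in memory and pulls the paragraph text back out using only the standard library. The sample XML and the helper names make_demo_package and paragraph_texts are made up for this illustration, and the package omits the other parts a real .docx needs, so Word would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A deliberately tiny document body, made up for this example.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>This is some </w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>"
    "<w:r><w:t> text.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def make_demo_package():
    """Return bytes of a ZIP holding only word/document.xml (not a full .docx)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", DOCUMENT_XML)
    return buffer.getvalue()


def paragraph_texts(package_bytes):
    """Yield the plain text of each <w:p> paragraph in the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Text lives in <w:t> elements nested inside runs (<w:r>).
        yield "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))


paragraphs = list(paragraph_texts(make_demo_package()))
print(paragraphs)
```

                The bold run is the same idea as the {\b bold} group in the RTF sample: a tag wrapped around plain text, only with angle brackets instead of backslashes.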

                The ideal file format would be tiny in total file size (compressed text), contain trivially parse-able markup around the clear text (html), high resolution digital images of the text as the creator intended it to be displayed (tiff), some kind of pinning between the images and clear text to allow copy/paste intelligently (word), metadata about the file source/history(?), and be forward/backward compatible forever (plain text). I don't think it exists.

                Obligatory XKCD [xkcd.com]

        • (Score: 1, Informative) by Anonymous Coward on Thursday May 27 2021, @01:46AM

          by Anonymous Coward on Thursday May 27 2021, @01:46AM (#1139151)

          It's just one of many problems with docx, but it gives them something to point at while embracing the 'open' OOXML 'standard'. That docx isn't OOXML doesn't matter either. As long as there is enough noise to keep people distracted and disoriented then any attempt at a coherent objection can be subverted.

    • (Score: 5, Informative) by Anonymous Coward on Wednesday May 26 2021, @05:02PM (5 children)

      by Anonymous Coward on Wednesday May 26 2021, @05:02PM (#1139007)

      Why would they transition away from PDF to docx, I can think of only one reason.

      Then your ability to reason out possible alternatives needs some exercise.

      Why transition away from PDF? Because the legacy PDF system is based upon using PDF as a carrier for bitmap scanned images. Most of the PDFs simply contain a scanned image of a sheet of paper. The reason for DOCX is to receive the data prior to converting it to "layout format" so that the actual text content can be extracted and the higher level document structure detected (i.e., what is a header, where paragraphs start/end, etc.). PDF, even when the PDF is textual, is a pure layout format. Internally, a textual PDF is simply a series of instructions to position letters at specific x,y positions on a virtual page. There are no concepts of "this block of text is a paragraph" or "this line is a level 2 heading line". Extracting the text from a PDF is possible (provided the PDF authoring library includes the, sadly, optional table mapping byte values within the PDF to Unicode code points) but all the higher level document structure is gone. You simply get letters or words positioned at x,y coordinates on a virtual printed sheet of paper.
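
      To make the "pure layout format" point concrete, here is a minimal Python sketch. It assumes a hypothetical extractor has already handed back text runs as (x, y, text) tuples; the sample data and the function name rebuild_lines are invented for illustration and do not come from any real PDF library. Grouping by baseline and sorting by x recovers the lines, but nothing in the data says which line is a heading or where a paragraph begins.

```python
from collections import defaultdict

# Hypothetical output of a PDF text extractor: (x, y, text) runs in the
# order they appear in the file, which need not be reading order.
placed_runs = [
    (72.0, 650.0, "paragraph of the specification."),
    (72.0, 700.0, "BACKGROUND"),
    (130.0, 664.0, "is the first"),
    (72.0, 664.0, "This text"),
]


def rebuild_lines(runs):
    """Group runs by baseline (y) and sort each line left to right (x)."""
    by_baseline = defaultdict(list)
    for x, y, text in runs:
        by_baseline[y].append((x, text))
    lines = []
    # PDF y coordinates grow upward, so the top of the page is the largest y.
    for y in sorted(by_baseline, reverse=True):
        words = [text for _, text in sorted(by_baseline[y])]
        lines.append(" ".join(words))
    return lines


lines = rebuild_lines(placed_runs)
for line in lines:
    print(line)
```

      Any guess that "BACKGROUND" is a heading has to come from heuristics such as font size or spacing, which is the structure a word processor file would have stated outright.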

      Now, you'll naturally want to move the goalposts to, "ok, but why DOCX?". And the answer there is trivially simple. Because the vast majority of the attorneys are already using msword to craft the documents in the first place, so they are just accepting what the law firms already use. Convincing law firms that they now need to install LibreOffice instead of msword to then write an ODT file would be like trying to pull teeth from chickens.

      Is DOCX the best choice, no. But it is the pragmatic choice given that almost all of the law firms are already using msword.

      • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @06:15PM

        by Anonymous Coward on Wednesday May 26 2021, @06:15PM (#1139039)

        Incorrect. There is one group that needs to correct the error: Microsoft. They need to generate compatible files, not the other way around.

        DOCX still has many features defined by OLD standards from Word... going back to Word3 (1990). Get the job done right ONCE!!!

        F..king Monopoly.

      • (Score: 1, Insightful) by Anonymous Coward on Wednesday May 26 2021, @08:01PM

        by Anonymous Coward on Wednesday May 26 2021, @08:01PM (#1139073)

        fuck the lawyers and the dumb ass whores at the USPTO! Extorted public money? Open fucking formats. Stupid pieces of shit!

      • (Score: 1, Informative) by Anonymous Coward on Thursday May 27 2021, @02:48AM

        by Anonymous Coward on Thursday May 27 2021, @02:48AM (#1139166)

        "Convincing lawfirms that the now need to install LibreOffice instead of msword to then write an ODT file would be like trying to pull teeth from chickens. "

        No need to convince anyone. Msword can save files in ODT format.

      • (Score: 2) by FatPhil on Thursday May 27 2021, @12:29PM (1 child)

        So you're saying we need a Rich Text Format instead? Or if we wish parts of the document (such as a ToC, index, or body text) to be able to refer to other parts of the document (such as sections, images, or tables within), or even external documents, then some kind of Hypertext Markup Language?
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
  • (Score: 5, Insightful) by DannyB on Wednesday May 26 2021, @01:34PM (12 children)

    by DannyB (5839) Subscriber Badge on Wednesday May 26 2021, @01:34PM (#1138905) Journal

    Consider the long term benefits of locking patents into Microsoft DOCX format.

    Patents last a long time. Way too long.

    Patents last longer than Microsoft's interest in backward compatibility, despite Microsoft investments to make its software the most backward. (see: MS Office, Plays For Sure, Zune, Windows phone 6, Windows Phone 7, Windows Phone 8, Windows Vista, Windows 8, Visual Basic, Windows for tablets, Windows on ARM, and a list of other internal technologies that they lost interest in, but not as quickly as Google loses interest in its own products. Yet Microsoft makes sure to keep important things like 16 bit and 32 bit components.)

    Raise your hand if you've ever had a Microsoft Office application that was unable to open a Microsoft document created from a significantly older version of that application?

    And what is the reason for this (in)sane behavior? Earlier binary Microsoft docs were once little more than memory binary images of internal data structures, or so I once read. This is what made it extremely difficult even for Microsoft to completely reverse engineer. One would have to look at the source code to analyze what fields and values were where and what they actually meant vs what they might be labelled or what the comments might say.

    Imagine a bright and glorious future where patents are enforced, but nobody anywhere can produce a readable copy of the patent, but can produce proof the patent was granted.

    We should also consider the mechanism by which the USPTO grants patents.

    The patent examination process is not well understood by most people. Once a patent is received, the patent examiner carefully places the application into a room full of other patent applications. Then kittens are released into the room with PATENT GRANTED stamps affixed to their feet. The kittens are then returned to their holding area to await the next round of patent examination. The patent examiners collect the applications from the floor and look to see which patent applications were granted.

    The USPTO is going to need additional funding to have a new process in which they carefully track the existence of temporary internal printed copies of patents to ensure their destruction once the patent is fully examined and granted.

    --
    Scissors come in consumer packaging that cannot be opened without scissors.
    • (Score: 3, Insightful) by JoeMerchant on Wednesday May 26 2021, @05:00PM (10 children)

      by JoeMerchant (3937) on Wednesday May 26 2021, @05:00PM (#1139004)

      I'm sorry, hasn't Libre Office supported .docx format for like the last 15 years or something? It's not exactly a lock-in format if you can access it with well supported FOSS.

      --
      Україна досі не є частиною Росії. https://en.interfax.com.ua/news/general/878601.html Слава Україні 🌻
      • (Score: 1, Insightful) by Anonymous Coward on Wednesday May 26 2021, @08:45PM

        by Anonymous Coward on Wednesday May 26 2021, @08:45PM (#1139089)

        Sort-of. I had a resume done professionally, and the final product included a pdf and a docx. The latter was not at all usable in Libre Office thanks to all the fancy formatting. I could open the docx to look at it, but that was about it, and the layout didn't match the pdf.

      • (Score: 2) by Gaaark on Wednesday May 26 2021, @09:04PM (1 child)

        by Gaaark (41) Subscriber Badge on Wednesday May 26 2021, @09:04PM (#1139093) Journal

        But why say "Hey, if it won't open in MS-Office, just open it in Libreoffice." when you could just say "Hey, everyone just use Libreoffice and fuck MS-Office."

        Why make everyone use MS-Office, but sometimes you'll have to use Libreoffice, especially when you'll have to use it for older docs? Why not just Libreoffice from the start?

        --
        --- Please remind me if I haven't been civil to you: I'm channeling MDC. ---Gaaark 2.0 ---
        • (Score: 2) by JoeMerchant on Thursday May 27 2021, @01:22AM

          by JoeMerchant (3937) on Thursday May 27 2021, @01:22AM (#1139142)

          There was a time around 2005 when I was using Open Office because M$ office couldn't f-ing handle what I was trying to do (embed more than 6 high resolution photographs in a single document.) Same format, but if I edited it in "real" MS Office, it would lock up or crash, but the same document with 12+ embedded photos from my digital camera would edit save and print just fine from Open Office, and ironically after creating it in Open Office you could even view it in MS Office O.K. but just don't mess with the embedded photos in Office 2004 or whatever it was unless you wanted to see a BSOD.

          --
          Україна досі не є частиною Росії. https://en.interfax.com.ua/news/general/878601.html Слава Україні 🌻
      • (Score: 2) by DannyB on Wednesday May 26 2021, @09:17PM (6 children)

        by DannyB (5839) Subscriber Badge on Wednesday May 26 2021, @09:17PM (#1139098) Journal

        I'm sorry, hasn't Libre Office supported .docx format for like the last 15 years or something?

        That is not an excuse to require important public documents to be locked in to a format belonging to one company.

        PDF viewers are more widely supported than anything that can read DOCX.

        --
        Scissors come in consumer packaging that cannot be opened without scissors.
        • (Score: 2) by JoeMerchant on Wednesday May 26 2021, @11:41PM (5 children)

          by JoeMerchant (3937) on Wednesday May 26 2021, @11:41PM (#1139122)

          I guess the point is that M$ doesn't exactly own the format.

          --
          Україна досі не є частиною Росії. https://en.interfax.com.ua/news/general/878601.html Слава Україні 🌻
          • (Score: 2) by PartTimeZombie on Thursday May 27 2021, @12:30AM (4 children)

            by PartTimeZombie (4827) on Thursday May 27 2021, @12:30AM (#1139134)

            But they do, and it's pointed out in the summary:

            There are several variants of DOCX and each of them are really only supported by a single company's products. Some other products have had progress in beginning to reverse engineering it, but are hindered by the lack of documentation.

            Which is Libreoffice's problem. If Microsoft documented the format then Libreoffice could support it properly, which is why Microsoft don't document it properly.

            • (Score: 2) by JoeMerchant on Thursday May 27 2021, @01:19AM (3 children)

              by JoeMerchant (3937) on Thursday May 27 2021, @01:19AM (#1139140)

              Well, from a practical aspect, I have submitted .docx documents to all kinds of people over the past decade+ without ever paying for a M$ license. Stay away from the fancy formatting and it renders fine on both sides.

              Is it right? No. There should be an open standard that both sides support 100%. Will that happen? Doesn't seem likely- but from a practical view: you can submit all kinds of .docx documents without ever paying the M$ tax - and as others have pointed out: 95%+ of patent applications are submitted by lawyers who 99.9%+ have M$ licenses anyway. Is it right? No. Does it matter? In all ways but the idealistic, no.

              --
              Україна досі не є частиною Росії. https://en.interfax.com.ua/news/general/878601.html Слава Україні 🌻
              • (Score: 2) by PartTimeZombie on Thursday May 27 2021, @01:45AM (2 children)

                by PartTimeZombie (4827) on Thursday May 27 2021, @01:45AM (#1139150)

                Stay away from the fancy formatting and it renders fine on both sides.

                While that is largely true, if Microsoft documented the format properly that wouldn't matter either.

                • (Score: 2) by JoeMerchant on Thursday May 27 2021, @02:15AM (1 child)

                  by JoeMerchant (3937) on Thursday May 27 2021, @02:15AM (#1139158)

                  if Microsoft documented the format properly that wouldn't matter either.

                  Yeah, M$ hasn't been properly documenting their shit since forever. In 1991 I got a persistent "Error 67" from MS DOS 3.whatever it was. Thing was, the printed manual only described up through error 64.

                  --
                  Україна досі не є частиною Росії. https://en.interfax.com.ua/news/general/878601.html Слава Україні 🌻
                  • (Score: 2) by PartTimeZombie on Thursday May 27 2021, @03:06AM

                    by PartTimeZombie (4827) on Thursday May 27 2021, @03:06AM (#1139171)

                    You're right, and that is the point. If they documented stuff they'd open themselves up to competition.

    • (Score: 0) by Anonymous Coward on Thursday May 27 2021, @02:14AM

      by Anonymous Coward on Thursday May 27 2021, @02:14AM (#1139156)

      Raise your hand if you've ever had a Microsoft Office application that was unable to open a Microsoft document created from a significantly older version of that application?

      It's not like Microsoft has a monopoly on this though. TaxCut 2020 won't open my return for 2018.

  • (Score: 4, Insightful) by looorg on Wednesday May 26 2021, @01:54PM (9 children)

    by looorg (578) on Wednesday May 26 2021, @01:54PM (#1138912)

    While I find it quite interesting, and stupid, to lock into one format, one should perhaps not make too big a deal out of it. Nothing stops you from just taking your normal .doc file and renaming it .docx -- and that will work just fine, or that is to say it will load into MS Word without issues. If you have a PDF file you can just convert it to a doc file, etc. But I doubt this is the main issue here.

    I assume the USPTO has some kind of forms, available in docx, with which it is trying to streamline the application process: some program or script grabs things from the document and inserts them into its systems, and that becomes somewhat harder when everyone uses their own formats, systems, or ways of applying. They probably already have an API for it, so that as soon as you submit it they can take all that info and insert it into their system in the proper places.

    Lots of companies these days try this sort of thing, with fairly abysmal results. They take some document of unknown format and they try to use "AI" to pull the relevant info from the documents. It's usually fairly easy with things like phone numbers, email addresses and postal addresses and such with a fairly high success rate but after that shit just tends to fall apart rapidly. So if they can control the format and know exactly where what is placed in a document it becomes so much easier to do this sort of thing. No more "AI" guess work, you can just pull it cause you know where it is going to appear.

    • (Score: 2) by SomeGuy on Wednesday May 26 2021, @03:56PM

      by SomeGuy (5632) on Wednesday May 26 2021, @03:56PM (#1138977)

      "Nothing stops you from just taking your normal .doc file and rename it .docx"

      Perhaps for now. But what about further down the road when Microsoft pulls the plug? (because "security" )

      That also does not help the real problem either, as .DOC is not fully documented or open and does not 100% reliably represent exactly how a document would appear when printed.

    • (Score: 2) by bzipitidoo on Wednesday May 26 2021, @05:09PM (6 children)

      by bzipitidoo (4388) Subscriber Badge on Wednesday May 26 2021, @05:09PM (#1139012) Journal

      PDF really sucks in a number of ways. Firstly, for edits, it is the absolute worst, least editable format there is. Granted, it wasn't meant to be edited, but we have learned that inability to make edits is not an advantage, it's a disadvantage, a big one. Making changes directly to a PDF is possible, and there are many tools for doing that, but it is a major pain. Even just lifting the text out and pasting it back into a word processor may not be straightforward. For one thing, there is no requirement that the text in a PDF be ordered in the same order as read.

      The next huge disadvantage of PDF is how very wasteful of space it is. Was it really so impossible to standardize on the math necessary to calculate letter positions? It's like they never heard of FORTRAN, and that people grappled with that sort of problem, of exactly reproducing mathematical results on different hardware, all the way back to the 1950s. PDF avoids that issue by explicitly coding the position of everything, down to the individual letters. Very costly way to dodge that issue, and also, the chief reason why PDF is such a pain to edit. To add to the waste, the way PDF encodes positions is very inefficient. Every time the letter spacing is changed, the standard requires that the opacity be specified. And so, in 99% of PDFs with text, over and over and over, the opacity is set to 100%. Opacities other than 100% just aren't used that much.

      Of course, switching to docx is just trading one set of troubles and limitations for another arguably worse set.

      • (Score: 2, Insightful) by Anonymous Coward on Wednesday May 26 2021, @05:35PM

        by Anonymous Coward on Wednesday May 26 2021, @05:35PM (#1139026)

        in terms of patent applications, inability to make edits should be considered a feature, not a bug.

      • (Score: 1, Informative) by Anonymous Coward on Wednesday May 26 2021, @05:38PM (4 children)

        by Anonymous Coward on Wednesday May 26 2021, @05:38PM (#1139027)

        All of this is because PDF was designed to be:

        Electronic Paper

        That is why a PDF simply specifies the x,y position of each letter/word on the page. Because PDF was meant to reproduce a sheet of paper in an electronic form.

        Everything else shoehorned in later (editing, notes, highlighting, etc.) was added to try to keep it relevant in view of better formats for those other tasks.

        • (Score: 2) by bzipitidoo on Wednesday May 26 2021, @11:00PM (3 children)

          by bzipitidoo (4388) Subscriber Badge on Wednesday May 26 2021, @11:00PM (#1139117) Journal

          > Electronic Paper

          And that idea is fundamentally flawed. There's nothing holy about paper. Paper is great stuff, it's worked for thousands of years for information storage, but now, now we at last have technology that has many advantages over paper, as well as a few disadvantages that are massively outweighed by all those advantages. It's just crazy to hobble our tech by forcing it to act similarly to paper's limitations.

          What reason, really, is there to make it hard to edit an electronic document? If it's so that readers can enjoy some assurance that a document has not been altered, not corrupted, that's a phantom. It's not even as good as security through obscurity. It's security through inconvenience. Not only does PDF fail miserably at being completely unalterable, there are plenty of reasons why we shouldn't want that feature, and indeed should regard it as a misfeature. Reason we think inalterability might be good, and think that property of PDF is desirable and fondly imagine that it is much more intended and effective than it actually is, is a sort of romanticism about the past, and a worldview of text and information as valuable, precious, and, something that can be coveted, hoarded, and denied to others, rather like gold. As if unchangeability and permanence makes a format a worthy medium to hold holy texts such as the Ten Commandments. We have digital signatures that are far better than the too vaunted imagined permanence and unchangeability of ink on paper, or even "carved in stone".

          That kind of thinking and reverence is wrong, and bad. It's liking the book, and the paper in it, more than the contents of the book. Liking that PDF is hard to change is one manifestation of that attitude. Strong belief in IP is another manifestation.

          • (Score: 3, Insightful) by Anonymous Coward on Thursday May 27 2021, @01:41AM (2 children)

            by Anonymous Coward on Thursday May 27 2021, @01:41AM (#1139148)

            Electronic Paper

            And that idea is fundamentally flawed.

            Ah, grasshopper, for this I will have to take you on a travel through time.

            What reason, really, is there to make it hard to edit an electronic document?

            You appear to have cause and effect reversed based on that statement. PDF was not designed to be intentionally hard to edit. Editing a PDF after it was produced was not even in the design parameters. PDF is hard to edit because it is electronic paper, not the other way round.

            For this time warp, I have to drag you back in time to circa 1991-1992, when PDF was first being designed at Adobe. This was before the general public had any knowledge of "the internet" (it existed, but most had no access, so to them it did not exist). Documents were created on a multitude of different word processors, themselves running on a multitude of different operating systems for a multitude of different machines. Yes, the seeds had been sown for the Intel chips to take over the world, and for ms to take over the world, but neither crop had borne much fruit yet, so both were just one of many competing to be the eventual winners.

            Electronic communication, such as it was, consisted of someone trying to email a file via Compuserve or The Source or Prodigy or maybe AOL (I think AOL did exist, but AOL had not introduced the masses to the internet yet). The problem was, given 12+ different word processors, unless the recipient had the same word processor, sometimes on the same hardware, and with the same set of font files installed (font files cost real money then, so this part was not always certain) there was no real guarantee that what the recipient saw when viewing your attachment looked the same as how it looked when you authored it before sending. And if you did use any of those fancy font files you might have bought, chances were, the destination view either differed dramatically, or did not show anything at all. So the only way to be sure the recipient saw your document in all its pixel perfect glory as you saw was to print it to paper, stuff the paper in an envelope, attach postage, and mail it to the recipient. And for some folks (who still exist today, they are the ones designing websites with fixed widths and heights on everything that won't adjust to the view screen, because their art is more important than the message to them) it was very important that the receiver see all the wonderful fancy formatted glory that was put into crafting the document.

            And perish the thought of simply opening a foreign document format from word-processor WP in word-processor XY. Word processors opened/saved their own formats and did not even acknowledge that other formats existed. And if you were lucky enough to have a conversion program for the two formats you needed to convert between, then often what came out as an XY file from a WP file input looked like it had been passed through a shredder along the way.

            This was the world that Adobe was operating within. They were, at this point, primarily targeting desktop publishing activities, as in fancy magazine layouts/etc. where the positioning of every single item on a page was (in their mind) critically important. And they wanted to create some kind of electronic file format that would allow a desktop publisher designer to design a page and electronically convey that designed page to someone else, and for that someone else to have a reasonable hope of seeing it look the same as it did when the designer finished designing it. And given that the standard medium of exchange, at the time, was printed sheets of paper, and that their (Adobe's) tools were tools for creating fancy formatted sheets of paper, they decided to create a file format that contained the typesetting instructions to place font glyphs at specific locations on a virtual sheet of paper. Actually, they created PDF as a cut-down, non-Turing-complete version of their own PostScript programming language which had, again, itself been designed for placing marks on pieces of paper. I.e., most everything they had created at this time was designed around "creating sheets of printed paper". So it is only natural that their new document format was created as "electronic paper".

            And "electronic paper" it really is. The low level PDF instructions to place font glyphs on the output "electronic paper" consist of basically three types:

            1. pick a font to use, and scale it to some size
            2. position a virtual cursor at an x,y coordinate on the virtual sheet of paper
            3. draw text glyphs at the current virtual cursor position

            And, for circa 1991-1992, and for Adobe's intended use (as a final destination output format for a page design) this probably seemed like a very reasonable thing to do. The intent was that if someone needed to edit, they would edit the source and then print-to-pdf all over again, not attempt to edit the pdf.

            And, if you consider the two "drawing primitives" provided: "position cursor at x,y" and "draw text at cursor" you see why editing is very hard. In order to insert a word in the middle of a line, an editor has to first insert the text, then take the remainder of the line and adjust the x,y coordinates to shift it over. Then it has to figure out what text would extend past the page margin (with nothing in the document telling it where the margin was in the original document) and move that text to the next line, where it has to repeat the process of adjusting all these x,y positions to make things line back up again. And if even one word falls off the bottom of a page, then the editor program has to repeat this adjusting of x,y positions for every other page after the current page in the PDF, adjusting all those "position cursor at x,y" and "draw text at cursor" commands for each page.
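
            To make that concrete, here is a toy sketch in Python. Nothing in it is real PDF: the fixed glyph width, the line height, the margins, and the function names are all assumptions made up for illustration. It only models "a page is a list of (x, y, word) draw commands" and shows how many of them change when one word is inserted.

```python
# Toy model, NOT real PDF: a page is a list of (x, y, word) draw commands.
# All constants below are assumptions chosen for illustration only.
CHAR_W = 6               # assumed fixed glyph width, in points
LINE_H = 14              # assumed line height, in points
LEFT, RIGHT = 72, 200    # margins; a real PDF does not store these at all
TOP = 700                # y coordinate of the first line


def layout(words):
    """Typeset words into (x, y, word) draw commands, wrapping at RIGHT."""
    ops, x, y = [], LEFT, TOP
    for w in words:
        width = len(w) * CHAR_W
        if x > LEFT and x + width > RIGHT:   # word would cross the margin
            x, y = LEFT, y - LINE_H          # so move to the next line
        ops.append((x, y, w))
        x += width + CHAR_W                  # advance past the word and a space
    return ops


def insert_word(ops, index, word):
    """Insert one word; every later position has to be recomputed."""
    words = [w for _, _, w in ops]
    words.insert(index, word)
    return layout(words)


before = layout("the quick brown fox jumps over the lazy dog again and again".split())
after = insert_word(before, 1, "very")

# Count the original draw commands that no longer appear unchanged.
moved = sum(1 for op in before if op not in after)
print(moved, "of", len(before), "draw commands had to change")
```

            With these made-up numbers, every command after the insertion point ends up with new coordinates, and the last word spills onto a fourth line. A real editor has it worse, because it has to guess the margins and line height instead of reading them from constants.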

            So the "hard to edit" part came about as a result of the design of how it would draw text on a page, which itself was derived from how Postscript drew text on a page. PDFs are not hard to edit because Adobe set out to make them hard to edit; they are hard to edit because Adobe decided to capture the literal positioning data that a typesetting program generates to position items on a page as the document storage format. Doing that allowed them to guarantee one of the very early marketing slogans for PDF (it looks identical everywhere, actually I think it was "it prints identically everywhere"), but also resulted in a file storage format that was extremely hard to edit.

            • (Score: 3, Interesting) by bzipitidoo on Thursday May 27 2021, @04:57AM (1 child)

              by bzipitidoo (4388) Subscriber Badge on Thursday May 27 2021, @04:57AM (#1139183) Journal

              Thanks, but I know all that. I was there.

              Correct, Adobe wasn't worrying about editability one way or the other, and yes, the intended way to change a PDF is to edit the source and generate a new PDF. I am glad you mention Postscript. Postscript was intended as instructions for printers. It was of course easy to divert those printer instructions to a file, and then what was needed was a reader that could render them to a graphical display screen. Also needed was graphics hardware capable of displaying the results-- 640x480 is real tight, and higher resolutions than that didn't become widely available until the 1990s. Soon it became more common to display a postscript file to the screen than print it to paper.

              Once you get away from paper, and consider how best to store knowledge digitally, it's rather obvious that PDF is terrible. The interesting part of any document is the contents, not typesetting data. The two should not be jumbled all together. HTML has a lot of shortcomings, but it is just a plain better approach to the problem. The world isn't standardized on 8.5x11 inch paper. PDF cannot adjust to a change of size, HTML can. PDF's very rigidity is why it has to contain the fonts, HTML easily accommodates changing to different fonts. And, shouldn't the readers have the option to pick the font, if they don't like the writer's choice? PDF lets the writer dictate that and other such details to the reader. HTML gives the reader much more control.

              You mention that all the word processors saved documents by basically dumping their working memories to files, and this resulted in nothing being compatible with anything else. Yes, but there was a standard then, and it was even free and open: LaTeX, and before that, TeX.

              • (Score: 0) by Anonymous Coward on Friday May 28 2021, @03:06PM

                by Anonymous Coward on Friday May 28 2021, @03:06PM (#1139653)

                Thanks, but I know all that. I was there.

                A fact that is impossible to tell from just a username.

                Soon it became more common to display a postscript file to the screen than print it to paper.

                Which I suspect had a large impact on Adobe's invention of PDF. PDF is the Postscript font and rendering engine hooked up to a different set of formatting instructions. Since Postscript was already calculating exact pixel positioning for every item drawn on a page, having PDF simply be a format that archived that positioning info meant that PDF was not much of a change from Postscript (the single biggest difference is dropping the general purpose programming language part of Postscript). PDF is largely what you get if you start with Postscript, remove all the general purpose programming language commands, and give the "drawing commands" different names.

                Once you get away from paper, and consider how best to store knowledge digitally, it's rather obvious that PDF is terrible.

                100% agreement. PDF is not at all a good format in which to store data of any form. The one and only thing PDF does well is preserve the physical page layout of the printed document, hence my referring to PDF as electronic paper. It really is little more than electronic paper.

                The interesting part of any document is the contents, not typesetting data. The two should not be jumbled all together. HTML has a lot of shortcomings, but it is just a plain better approach to the problem. The world isn't standardized on 8.5x11 inch paper.

                Also full agreement. PDF is rigid. Just as a physical sheet of paper can't change size to accommodate some difference in viewing, neither can a PDF. PDFs simply preserve the exact pixel positioning of everything on the page.

                PDF lets the writer dictate that and other such details to the reader. HTML gives the reader much more control.

                Yup, and there is probably an underlying reason (beyond that Adobe simply distilled Postscript down to just the "drawing commands") for why PDF is so rigid. Have you ever had the misfortune to work with any of the "page designer" or "page layout" crowd? I.e., the folks one hires to do the magazine layout and decide how things should look? These folks, almost 100%, all value the "design" (the layout, where things are positioned, how much space is here, how big this font is set) over the actual "content" of anything they produce. A huge part of this is because for them, often, when they are producing a layout design, the content is something like lorem ipsum [wikipedia.org] text (i.e., meaningless filler) and so the only thing they deal with, and the only thing they can use to pat each other on the back for "job well done" is the layout (i.e., the physical arrangement of stuff on the page).

                These same folks are also almost rabid in their belief that an end recipient of their wondrous "design" should only ever be able to see their wondrous "design" in its exact, pixel perfect, positional glory. This comes in large part from the "design" being all they have to congratulate themselves about, since the content, for them, when they did the job was just lorem ipsum. And back in the late 80's and early 90's, this was the world to which Adobe was pandering with their software offerings: the page layout designer who was rabid in his/her belief that their design should never be modified from the beautiful work of art they created by anyone viewing it later on any medium. With this being their world, it is no wonder that the folks at Adobe who dreamed up converting Postscript into what became PDF saw no problems whatsoever with PDF's rigidity. The expectation in their world was that the document storage format should rigidly preserve their wondrous design for everyone to marvel at who later viewed it.

                These same folks are also why HTML has been soiled by CSS that provides the ability to do pixel exact, unchanging, positioning and sizing. They simply could not handle the concept that something they "designed" might be modified by an end user's browser such that things were no longer exactly positioned where they, the designer, decided they should be positioned. Every single CSS declaration where there is the ability to exactly position some HTML element is there as pandering to this world view on the part of the layout artists.

                You mention that all the word processors saved documents by basically dumping their working memories to files, and this resulted in nothing being compatible with anything else.

                Nope, I said nothing of the sort. Someone else has mentioned that MSWord's old DOC format was basically a memory dump from Word's heap, and that fact has been known for some time. But whoever mentioned that wasn't me. What I said was there were something like 12 different word-processors, each reading/writing its own file format (each format specific to the WP that wrote it), and with none of the 12 providing much of any ability to interoperate with the others (i.e., read/write the other 11 formats that were not their own). But I did not say that all 12 were memory dumps. They might have all been memory dumps, or maybe only one of them was a memory dump (msword's doc format). But I never said they were all memory dumps, just that they were all incompatible with sharing with each other.

                but there was a standard then, and it was even free and open: LaTeX, and before that, TeX.

                Indeed, yes, there was. And unless one was an academic going for their doctorate in one of the sciences that published via TeX/LaTeX one generally knew nothing of the existence of those tools. A format based on TeX/LaTeX source, plus enough extra baggage to carry any custom fonts used by the TeX/LaTeX source, would have been a far superior way to exchange documents that were also useful as data sources than PDF will ever be.

    • (Score: 1, Informative) by Anonymous Coward on Wednesday May 26 2021, @05:31PM

      by Anonymous Coward on Wednesday May 26 2021, @05:31PM (#1139023)

      I assume they, USPTO, have some kind of forms, available in docx, where they are trying to streamline the application process where they have some program or script that grabs things from the document and inserts it into their systems and that becomes somewhat harder when everyone uses their own formats or systems or ways of application. They probably already have an API for it so that as soon as you submit it then can take all that info and insert it into their system in the proper places.

      This is the reason (except the "forms" part is not). The DOCX file is XML inside a zip wrapper. The backend computer systems open up the zip, trawl through the XML, and extract the text data out of the document, preserving the overall document structure (what is a paragraph, what is a level 3 heading, inserted images, tables, etc.) in the process of "inserting" it into the systems.
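
      For the curious, that extraction step can be sketched in a few lines of Python. This is only a guess at the general shape, not the USPTO's actual backend, which I have not seen. The element names (w:p, w:pPr, w:pStyle, w:r, w:t) and the namespace are the usual WordprocessingML ones; the sample file built here is a stripped-down stand-in that holds only word/document.xml, so Word itself would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs(docx_bytes):
    """Yield (style name, text) for each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    for p in root.iterfind(".//w:p", NS):
        style = p.find("w:pPr/w:pStyle", NS)
        # Paragraphs without an explicit style are reported as "Normal" here.
        name = style.get(f"{{{W}}}val") if style is not None else "Normal"
        text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        yield name, text


# Hypothetical stand-in document: one heading and one body paragraph.
SAMPLE_XML = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading3"/></w:pPr><w:r><w:t>Claims</w:t></w:r></w:p>
<w:p><w:r><w:t>1. A widget </w:t></w:r><w:r><w:t>comprising a gear.</w:t></w:r></w:p>
</w:body></w:document>"""


def make_sample():
    """Zip the sample XML up the way a .docx stores its main document part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", SAMPLE_XML)
    return buf.getvalue()


result = list(paragraphs(make_sample()))
for style_name, text in result:
    print(style_name, "|", text)
```

      The point is that "this paragraph is a level 3 heading" is sitting right there in the file as data, so nothing has to be guessed.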

  • (Score: 0, Troll) by Anonymous Coward on Wednesday May 26 2021, @01:59PM

    by Anonymous Coward on Wednesday May 26 2021, @01:59PM (#1138915)

    while this actual serious threat is crashing down onto us, the FOSS community is too busy in a freenode circle jerk to do anything useful

  • (Score: 3, Insightful) by TheGratefulNet on Wednesday May 26 2021, @02:01PM (6 children)

    by TheGratefulNet (659) on Wednesday May 26 2021, @02:01PM (#1138918)

    ahem.

    raise your hand if your lawyer does not have a copy of Word.

    (*crickets*)

    thought so.

    while this is annoying, it's not a real world problem. no one but lawyers does the actual wordsmithing and filing of these things.

    I'd bet there's not a single lawyer in the WORLD that has zero access to MS Word.

    Not. A. Single. One.

    --
    "It is now safe to switch off your computer."
    • (Score: 2) by bzipitidoo on Wednesday May 26 2021, @04:49PM

      by bzipitidoo (4388) Subscriber Badge on Wednesday May 26 2021, @04:49PM (#1138996) Journal

      Yes, raise your hand if your lawyer is not a captive of the all embracing Microsoft Office monopoly.

      (*crickets*)

      Monopolies are a real world problem. Where do you start? I can hardly think of a better starting point than a monopoly that is wholly artificial. So, so easy to break. They don't even have to design an open format, there already are a number of acceptable ones. In such a case, the monopolists are the ones that have to fight like mad to keep the monopoly.

      I suppose part of the motivation is that this is, after all, the patent office, who can hardly not be a leading member of the community of Intellectual Property extremism. They no doubt believe their jobs depend upon continued worship of the principles, however flawed, of IP.

      I'd start cracking this egg at government agencies and organizations that aren't so closely connected to IP. For instance, Argonne National Laboratory used MS Office the last time I looked, about a decade ago.

    • (Score: 5, Insightful) by PinkyGigglebrain on Wednesday May 26 2021, @05:11PM

      by PinkyGigglebrain (4458) on Wednesday May 26 2021, @05:11PM (#1139013)

      I'd bet there's not a single lawyer in the WORLD that has zero access to MS Word.

      You have a valid point, everyone has access to Microsoft's current DOCX format in some way or other. FOR NOW.

      What happens in 5-10 years when MS stops supporting their current DOCX version?

      If you're going to say that the Patent Office will just need to keep older versions of Word handy for this kind of situation, you're ignoring, or are oblivious to, the logistical and licensing nightmare that would entail. Not to mention the extra burden on the IT staff to archive and maintain the older versions. Add in the fact that MS will eventually stop providing standalone versions of Office/Word and move everything to the cloud, and this move by the USPTO, if they follow through with it, is going to cause major headaches down the road on so many levels it's not funny.

      Over the years I've had several situations where someone would come to me and tell me that they can't open this or that document in Word, but it worked fine with the last version of Word. When I looked at the files in question it would invariably turn out that they were in a format that MS used to use and stopped supporting in the latest version of Office/Word. For some of them there was a plug-in that would re-enable support; for others the only option was to find an old version of Office or use Open/Libre Office, which ironically has better support for older MS formats than MS themselves provide, to open and convert the document to something the user's current version of Word could open.

      Legal documents need to be stored in an open, well documented, and widely available format that is accessible by all without having to jump through hoops or pay additional costs to some corporation for the ability to read and use what should be freely available.

      Microsoft's DOCX format is none of those.

      --
      "Beware those who would deny you Knowledge, For in their hearts they dream themselves your Master."
    • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @05:33PM (2 children)

      by Anonymous Coward on Wednesday May 26 2021, @05:33PM (#1139024)

      ahem.

      raise your hand if your lawyer does not have a copy of Word.

      (*crickets*)

      And this is the very reason why DOCX was picked. Almost every single lawyer was already using msword (or, more likely, their secretary was using msword by typing from the dictation tape, but msword still).

      • (Score: 3, Insightful) by canopic jug on Wednesday May 26 2021, @05:45PM (1 child)

        by canopic jug (3949) on Wednesday May 26 2021, @05:45PM (#1139029) Journal

        And this is the very reason why DOCX was picked. Almost every single lawyer was already using msword (or, more likely, their secretary was using msword by typing from the dictation tape, but msword still).

        Ok then. Which version of DOCX or MSWord? It was in the summary, but I'll point out again that DOCX is not a standard format. Not only that, it has changed before and will change again. For a historical example, look at how many incompatible versions of "DOC" there were. The incompatibilities between the different versions were the major reason for being able to force the market into paying for upgrades to new versions as M$ held the monopoly on the file formats and the suite monopoly was a secondary follow-on effect of that. The USPTO is building their registration system on sand by not selecting a standard format or better yet an open standard.

        --
        Money is not free speech. Elections should not be auctions.
        • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @05:54PM

          by Anonymous Coward on Wednesday May 26 2021, @05:54PM (#1139033)

          Ok then. Which version of DOCX or MSWord?

          You'd need to click through to the article, then click through to the federal register notice, to see if a particular DOCX or msword version was specified.

          I have not done so, but I very much suspect that a specific version of DOCX or specific msword version was not specified, and that the notice simply says "DOCX" as if "DOCX" is an unchanging entity.

          The USPTO is building their registration system on sand by not selecting a standard format or better yet an open standard.

          Indeed they are. The choice is pragmatic because almost all the firms are using msword. But yes, the foundation will be shaken when ms releases a newer word and starts adding incompatible bits of XML into the docx files. With the result that USPTO will then have to play "catchup" to the changes.

    • (Score: 0) by Anonymous Coward on Thursday May 27 2021, @01:18PM

      by Anonymous Coward on Thursday May 27 2021, @01:18PM (#1139272)

      they should standardise on LaTeX, let's make lawyers earn their money for once... actually that's not far enough, they should write their documents in Forth...

  • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @02:17PM (2 children)

    by Anonymous Coward on Wednesday May 26 2021, @02:17PM (#1138929)

    Which of the two is "freer", considering both are proprietary formats owned by Adobe and Micro$oft respectively?

    • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @02:25PM

      by Anonymous Coward on Wednesday May 26 2021, @02:25PM (#1138935)

      Neither, use djvu.

    • (Score: 2) by canopic jug on Wednesday May 26 2021, @02:28PM

      by canopic jug (3949) on Wednesday May 26 2021, @02:28PM (#1138938) Journal

      PDF is ISO 32000 [iso.org] with the PDF/UA variant as ISO 14289 [iso.org]. So being an actual standard, that is substantially more open and facilitating freedom than anything M$ has to offer. The proprietary M$ formats remain undocumented and, probably, undocumentable.

      For what it's worth, M$ still does not have proper support for the OpenDocument Format, ISO 26300. There used to be a plug-in which enabled it, but that was discontinued / blocked and what is in its place loses data, especially formatting. That data loss has not been patched in a decade so it looks quite intentional. The USPTO should 1) not require that filers (whether lawyers or not) be limited to the customers of any single company, and 2) not reward such anti-competitive behaviors.

      --
      Money is not free speech. Elections should not be auctions.
  • (Score: 1, Touché) by Anonymous Coward on Wednesday May 26 2021, @02:19PM

    by Anonymous Coward on Wednesday May 26 2021, @02:19PM (#1138933)

    this is fantastic! we just need to "take out m$" and all patents are LIBERATED! weeeeeh!

    more srsly, the anti grav, home fusion etc patents need to report home when being written and some "zero day" pulled from the drawer to plausibly explain why the computer "suddenly" crashed t-hehehe.

  • (Score: 2) by mcgrew on Wednesday May 26 2021, @02:25PM (1 child)

    by mcgrew (701) <publish@mcgrewbooks.com> on Wednesday May 26 2021, @02:25PM (#1138936) Homepage Journal

    How big of a bribe Microsoft paid, and who they paid it to? That said, most SF magazines insist on .DOC for submissions. It rankles, I hate Microsoft Word. Fortunately, Lo and Oo both write .DOC files now that can be opened with Word.

    --
    Older than dirt? Kid, I was a BETA TESTER for dirt! We never did get all the bugs out.
    • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @02:45PM

      by Anonymous Coward on Wednesday May 26 2021, @02:45PM (#1138947)

      >> That said, most SF magazines insist on .DOC for submissions.

      Exactly as predicted by Philip K. Dick

  • (Score: 4, Interesting) by Mojibake Tengu on Wednesday May 26 2021, @03:10PM (2 children)

    by Mojibake Tengu (8598) on Wednesday May 26 2021, @03:10PM (#1138955) Journal

    Most probably the USPTO backend uses (and expects) resplendent Google DOCX, not Microsoft DOCX format.

    And obviously, those two are not exactly compatible with each other. That would explain the comments from some confused patent submitters that their submissions made in Microsoft Word are failing for technical reasons.

    You The People are screwed, anyway.

    --
    The edge of 太玄 cannot be defined, for it is beyond every aspect of design
    • (Score: 3, Interesting) by PinkyGigglebrain on Wednesday May 26 2021, @05:18PM (1 child)

      by PinkyGigglebrain (4458) on Wednesday May 26 2021, @05:18PM (#1139017)

      Most probably the USPTO backend uses (and expects) resplendent Google DOCX, not Microsoft DOCX format.

      I didn't know Google had taken a page from Microsoft's playbook. Thanks for the info.

      I wonder when Google will start on the "Extinguish" part of MS's EEE strategy.

      --
      "Beware those who would deny you Knowledge, For in their hearts they dream themselves your Master."
      • (Score: 0) by Anonymous Coward on Thursday May 27 2021, @02:44AM

        by Anonymous Coward on Thursday May 27 2021, @02:44AM (#1139165)

        Well, Google has implemented a lot of things and run them just long enough that no viable competition could arise before deciding to discontinue them. Why extinguish something after the fact if you can also nip it in the bud without the potential fallout that the former strategy could entail if done too overtly?

  • (Score: 1, Funny) by fustakrakich on Wednesday May 26 2021, @03:14PM (4 children)

    by fustakrakich (6150) on Wednesday May 26 2021, @03:14PM (#1138956) Journal

    Why is this being allowed to happen? I mean, we expect this from the republicans, right?

    --
    La politica e i criminali sono la stessa cosa..
    • (Score: 1, Informative) by Anonymous Coward on Wednesday May 26 2021, @03:19PM (1 child)

      by Anonymous Coward on Wednesday May 26 2021, @03:19PM (#1138961)
      • (Score: 1) by fustakrakich on Wednesday May 26 2021, @03:32PM

        by fustakrakich (6150) on Wednesday May 26 2021, @03:32PM (#1138966) Journal

        Well, isn't the goal to maximize your ROI? I mean, politics, it's just business, right?

        --
        La politica e i criminali sono la stessa cosa..
    • (Score: 0, Insightful) by Anonymous Coward on Wednesday May 26 2021, @03:28PM (1 child)

      by Anonymous Coward on Wednesday May 26 2021, @03:28PM (#1138965)

      >> I mean, we expect this from the republicans, right?

      The only difference between the Republicans and Democrats is that the Republicans don't hypocritically pretend to be other than what they are.

      • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @05:49PM

        by Anonymous Coward on Wednesday May 26 2021, @05:49PM (#1139030)

        You mean like pretending to be a small business supporter? Or be against running up huge deficits? Or a supporter and defender of the Constitution? Or a supporter of free trade and open markets?

        Maybe you are right, but only on a technicality. By refusing to lay out a party platform, they established the fact that they do not stand for ANYTHING, except perhaps the belief that goes against the core of what it means to be an American that it is acceptable, nay desirable, to swear fealty to a person over the ideal of democracy (there is actually some pretty old document that gets talked about the beginning of every July that says a thing or two about this topic).

  • (Score: 3, Interesting) by bloodnok on Wednesday May 26 2021, @03:27PM

    by bloodnok (2578) on Wednesday May 26 2021, @03:27PM (#1138963)

    What happens if you hide your *real* patent inside some hidden docx section of another patent? Or just append hidden text to critical sentences to extend the patent's scope beyond all reason.

    Who knows how future versions of Word would handle such things? Maybe the hidden patents will suddenly be visible in word 2024.

    Sounds like a patent troll's (and word developer's) paradise.

    __
    The major

  • (Score: 2) by progo on Wednesday May 26 2021, @03:27PM (1 child)

    by progo (6356) on Wednesday May 26 2021, @03:27PM (#1138964) Homepage

    I heard that there are some gems in the .XLSX format "open" specification saying things like: "This function should behave … uhm … the way it does in Microsoft Excel."

  • (Score: 2) by Dr Spin on Wednesday May 26 2021, @03:44PM

    by Dr Spin (5239) on Wednesday May 26 2021, @03:44PM (#1138970)

    For your patents to apply in Europe, you need European patents, and I doubt you are allowed to file them in an MS proprietary format.

    As to the point that PDF is no longer deterministic, well, I am prepared to bet a significant number of Doge coins* that NOTHING from MS is deterministic, nor ever will be.

    * Or Dog biscuits, which ever is worth more.

    --
    Warning: Opening your mouth may invalidate your brain!
  • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @05:00PM

    by Anonymous Coward on Wednesday May 26 2021, @05:00PM (#1139005)

    The last part of the quote got clipped off

    To make it easier for you, the USPTO is invalidating all patents on Word and the DOCX format

  • (Score: 3, Interesting) by ElizabethGreene on Wednesday May 26 2021, @07:45PM (4 children)

    by ElizabethGreene (6748) on Wednesday May 26 2021, @07:45PM (#1139067)

    Opinion: I don't understand why they'd do this. It feels like a pdf would be a better choice to me because most organizations block .docx in content inspection.

    Data point:
    Programmatically, it's pretty trivial to get data out of a .docx. Here's an example of content if you unzip one.
    D:\EMFTEST
    │   [Content_Types].xml
    │
    ├───docProps
    │       app.xml
    │       core.xml
    │       custom.xml
    │
    ├───word
    │   │   document.xml
    │   │   endnotes.xml
    │   │   fontTable.xml
    │   │   footnotes.xml
    │   │   settings.xml
    │   │   styles.xml
    │   │   webSettings.xml
    │   │
    │   ├───media
    │   │       image1.png
    │   │
    │   ├───theme
    │   │       theme1.xml
    │   │
    │   └───_rels
    │           document.xml.rels
    │
    └───_rels
            .rels

    That file contained a line of text with some formatting and an image. The image ends up in the expanded \word\media folder as image1.png. The text is in the \word\document.xml file. When I ran it through a quick strip-all-tags regex "<[^>]*>" search and replace with spaces it was human readable "This is some text In bold In italics bold and an image".
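
    For anyone who wants to try the same trick, here is roughly that search and replace in Python. It is a quick sketch, not a proper parser: the sample string is a hand-written fragment in the style of document.xml rather than the file from my test, and the regex approach leaves XML entities such as &amp;amp; undecoded.

```python
import re


def strip_tags(xml_text):
    """Replace every <...> tag with a space, then collapse runs of whitespace."""
    spaced = re.sub(r"<[^>]*>", " ", xml_text)
    return " ".join(spaced.split())


# Hand-written fragment in the style of word/document.xml (illustration only).
sample = (
    "<w:p><w:r><w:t>This is some text </w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>In bold</w:t></w:r></w:p>"
)

plain = strip_tags(sample)
print(plain)
```

    Replacing tags with a space instead of nothing is what keeps adjacent runs from being glued together into one word.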

    Question: I've never had to work with PDFs under the hood. Are they similar or more difficult to parse?

    • (Score: 2, Interesting) by Anonymous Coward on Wednesday May 26 2021, @09:35PM

      by Anonymous Coward on Wednesday May 26 2021, @09:35PM (#1139102)

      Question: I've never had to work with PDFs under the hood. Are they similar or more difficult to parse?

      That depends upon how you define "more difficult".

      PDFs under the hood are literally just a sequence of "instructions" to an abstract virtual printer to position bits of text on a two dimensional pixel grid. And depending upon how the print driver decided to generate the pdf, that positioning could be:

      1. positioning individual letters
      2. positioning individual words
      3. positioning portions of lines
      4. positioning whole lines
      5. a mixture of all of the above

      But, there is no way to "position" anything larger than a line. PDF includes no ability to do internal word-wrap for a block of text. So the largest piece of text that can be positioned is a whole line. Any word-wrap has to be performed by the generator as it generates the PDF file.

      What goes missing in the conversion to PDF is any direct indication of higher level document structure (i.e., where paragraphs begin/end, which line of bold text is a level 3 header vs. a level 2 header, which sequences of text were lists in the original doc, whether a block of text was actually a table in the original document, etc.).

      This loss of higher level doc structure is why the switch from PDF to DOCX. The DOCX file contains data relating to those higher level structural components, so that data can be directly preserved and extracted.

      Using a PDF as input means that higher level document structure has to be inferred by a lot of AI like program code that will not get the inference correct in all cases.

      Plus, a little known aspect of PDF (unless one is trying to write a PDF generator) is that PDF allows for an arbitrary mapping between byte values in the PDF file and the font glyphs that appear on the "virtual paper" that is the PDF output image. I.e., normally, in ASCII/Unicode, a value 65 decimal is a capital letter A. But in a PDF file, one can define any byte value to map to the font glyph that draws a letter A. So inside the PDF, letter A could be decimal 65, but it could also be decimal zero or decimal 37 or decimal 255. The pdf, when rendered to an image, would show a capital A for any of those codes, but for extracting text back out of the pdf, the PDF is supposed to contain an optional mapping table that says "in this PDF, for this font, a decimal 46 is a unicode capital letter A". Sadly, because this table is optional, some print drivers omit it to make "a smaller pdf file" with the result that while the PDF will print or view correctly, there is no way to extract the text from the pdf without doing something like running OCR against it.
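
      A toy illustration of that mapping problem in Python. The byte values and the table are made up, and a plain dict stands in for the real mapping table, which has its own format inside the PDF:

```python
def decode(codes, to_unicode=None):
    """Turn glyph codes back into text using the font's mapping table.

    Returns None when the table is missing, since the codes alone say
    nothing about which characters the glyphs represent.
    """
    if to_unicode is None:
        return None
    # Codes absent from the table become U+FFFD (replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Hypothetical font encoding: the generator assigned these codes arbitrarily.
to_unicode = {0: "A", 37: "B", 255: "C"}

# What the page content would hold for text that renders as "BAC".
codes = bytes([37, 0, 255])

with_table = decode(codes, to_unicode)   # text recovered via the table
without_table = decode(codes)            # table omitted: nothing to recover
naive = codes.decode("latin-1")          # treating the codes as characters
print(repr(with_table), repr(without_table), repr(naive))
```

      The viewer still draws the right glyphs in all three cases, because drawing only needs code to glyph. It is the reverse direction, code to character, that needs the optional table.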

    • (Score: 2, Interesting) by Anonymous Coward on Wednesday May 26 2021, @09:47PM (2 children)

      by Anonymous Coward on Wednesday May 26 2021, @09:47PM (#1139105)

      The PDF instructions to "draw" a line of text equivalent to your 'line' above could be something like this:

      BT
      /Times-Roman 12 Tf
      1 0 0 1 0.0 0.0 Tm
      (This is some text In) Tj
      /Times-Bold 12 Tf
      1 0 0 1 21.0 0.0 Tm
      (bold) Tj
      /Times-Roman 12 Tf
      1 0 0 1 25.0 0.0 Tm
      (In) Tj
      /Times-BoldItalics 12 Tf
      1 0 0 1 29.0 0.0 Tm
      (italics bold) Tj
      /Times-BoldItalics 12 Tf
      1 0 0 1 43.0 0.0 Tm
      (and an image) Tj
      ET

      Note: I say "could" because this is not the only way this text could be typeset. The driver could position and draw each letter separately. The driver could group together text under the same font and do a font change (the /Times-* Tf lines) once, then position all of the text under that font, then change font, and position the next text pieces. The driver could draw things in reverse, up the page, down the page, diagonally, in a circle, etc. The order, as long as the same pixels are turned black, is irrelevant to you viewing the final product. But all those possible combinations make extracting text very challenging. Certainly not as easy as "remove XML tags and what is left is text".
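
      As a rough sketch of what an extractor is up against, here is a Python toy that reads draw commands written in the simplified one-per-line style used above, with the commands deliberately out of reading order. It is nowhere near a real PDF parser (no string escapes, no other operators, no compressed streams); the regexes and the sort rule are assumptions for this example only.

```python
import re

# Same style as the example above, but drawn out of reading order.
STREAM = """BT
/Times-Roman 12 Tf
1 0 0 1 25.0 0.0 Tm
(In) Tj
1 0 0 1 0.0 0.0 Tm
(This is some text In) Tj
/Times-Bold 12 Tf
1 0 0 1 21.0 0.0 Tm
(bold) Tj
ET"""

# "a b c d x y Tm" sets the text position; only x and y are captured here.
TM = re.compile(r"^[-\d.]+ [-\d.]+ [-\d.]+ [-\d.]+ ([-\d.]+) ([-\d.]+) Tm$")
# "(text) Tj" draws a string at the current position.
TJ = re.compile(r"^\((.*)\) Tj$")


def extract(stream):
    """Collect (position, text) runs, then sort them into reading order."""
    x = y = 0.0
    runs = []
    for line in stream.splitlines():
        m = TM.match(line)
        if m:
            x, y = float(m.group(1)), float(m.group(2))
            continue
        m = TJ.match(line)
        if m:
            runs.append((-y, x, m.group(1)))   # top of page first, then left to right
    runs.sort()
    return " ".join(text for _, _, text in runs)


recovered = extract(STREAM)
print(recovered)
```

      Even this toy has to guess: joining runs with a space is a made-up rule, and sorting by position falls apart as soon as text is rotated, set in columns, or drawn letter by letter.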

      • (Score: 2) by ElizabethGreene on Wednesday May 26 2021, @10:04PM (1 child)

        by ElizabethGreene (6748) on Wednesday May 26 2021, @10:04PM (#1139108)

        Thanks for explaining it.

        I don't want to undermine the other comments about undocumented features though. I have no doubt that some Word documents have similar fiddly bits around typesetting, positioning, etc. That's inevitable when you've got 35+ years of backwards compatibility under the hood.

        • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @10:18PM

          by Anonymous Coward on Wednesday May 26 2021, @10:18PM (#1139111)

          Very true, although for newly authored documents in newer versions of Word (i.e., those that support DOCX) most of those 35+ years of cruft are not present in the DOCX file. And, of course, an old copy of Word that never understood DOCX can't save its cruft into a DOCX file, because it can't write DOCX. One would have to open an ancient .doc in a modern Word and then save as DOCX to cause some of that "really old crud" to even appear.

          Is DOCX the best choice from a future compatibility standpoint? Most definitely not.

          Is DOCX a pragmatic choice that allows extracting text and document structure (lists/tables/images, paragraphs/etc.) because msword is what most of the law firms use, so they (the firms/attys) don't have to change much at all to supply the files? Most definitely yes.

          So it is a trade-off between having to play "tail wags dog" games periodically as MS changes the DOCX format and having every firm/atty. balk at the change, dig their heels in, and refuse to change (whereupon one gets nowhere).

          The current PDF filing system is essentially little more than "electrification of a US Postal Service envelope". It is based upon supplying virtual sheets of printed paper in a virtual PDF envelope, and is as close as possible to an exact clone of the earlier "print doc to paper, stuff paper in USPS envelope, mail envelope (with sufficient postage) to the USPTO" process. And that shift from physical paper and USPS to virtual paper and the internet was kept that minimal largely to prevent "law firms dig heels in, refuse to go along".

  • (Score: 1) by jman on Thursday May 27 2021, @12:57PM

    by jman (6085) Subscriber Badge on Thursday May 27 2021, @12:57PM (#1139266) Homepage

    Non-issue since LibreOffice can read and write M$ files.
