posted by Fnord666 on Wednesday May 26 2021, @01:13PM
from the vendor-capture dept.

There are still a few months to fix this, but for now the US Patent and Trademark Office's (USPTO) Acting Commissioner for Patents, Andrew Faile, and Chief Information Officer, Jamie Holcombe, have announced that starting January 1st, 2022, the USPTO will institute a surcharge for applicants that are not locked into Microsoft products via the proprietary DOCX format. From that date onwards, the USPTO will move away from PDF and require all filers to use that proprietary format or face an arbitrary surcharge when filing.

First, we delayed the effective date for the non-DOCX surcharge fee to January 1, 2022, to provide more time for applicants to transition to this new process, and for the USPTO to continue our outreach efforts and address customer concerns. We've also made office actions available in DOCX and XML formats and further enhanced DOCX features, including accepting DOCX for drawings in addition to the specification, claims, and abstract for certain applications.

One of several major problems with the plan is that DOCX is a proprietary format. There are several variants of DOCX, and each of them is really only supported by a single company's products. Some other products have made progress in reverse engineering it, but are hindered by the lack of documentation. DOCX is a competitor to the fully documented, open standard OpenDocument Format, also known as ISO/IEC 26300.

DOCX is not to be confused with OOXML, though it often is. While OOXML, also known as ISO/IEC 29500, is technically standardized, it is incompletely documented and only vaguely related to DOCX. The DOCX format itself is neither fully documented nor standard. So the USPTO is also engaged in spreading disinformation by asserting that it is.

Previously:
(2015) Microsoft Threatened the UK Over Open Standards


  • (Score: 3, Interesting) by ElizabethGreene on Wednesday May 26 2021, @07:45PM (4 children)


    Opinion: I don't understand why they'd do this. It feels like a pdf would be a better choice to me because most organizations block .docx in content inspection.

    Data point:
    Programmatically, it's pretty trivial to get data out of a .docx. Here's an example of content if you unzip one.
    D:\EMFTEST
    │   [Content_Types].xml
    │
    ├───docProps
    │       app.xml
    │       core.xml
    │       custom.xml
    │
    ├───word
    │   │   document.xml
    │   │   endnotes.xml
    │   │   fontTable.xml
    │   │   footnotes.xml
    │   │   settings.xml
    │   │   styles.xml
    │   │   webSettings.xml
    │   │
    │   ├───media
    │   │       image1.png
    │   │
    │   ├───theme
    │   │       theme1.xml
    │   │
    │   └───_rels
    │           document.xml.rels
    │
    └───_rels
            .rels

    That file contained a line of text with some formatting and an image. The image ends up in the expanded \word\media folder as image1.png. The text is in the \word\document.xml file. When I ran it through a quick strip-all-tags regex "<[^>]*>" search and replace with spaces it was human readable "This is some text In bold In italics bold and an image".
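A minimal Python sketch of the unzip-and-strip approach described above. It assumes only that a .docx is a zip archive holding `word/document.xml`, as the listing shows. The `make_sample_docx` helper and its markup are hypothetical stand-ins so the example runs without a real Word file; the archive it builds is not a complete document.

```python
import io
import re
import zipfile


def docx_plain_text(docx_bytes: bytes) -> str:
    """Return the text of word/document.xml with every XML tag replaced by a space."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Same regex as in the comment: anything between < and > becomes a space.
    stripped = re.sub(r"<[^>]*>", " ", xml)
    # Collapse the runs of whitespace left behind by adjacent tags.
    return " ".join(stripped.split())


def make_sample_docx() -> bytes:
    """Build a tiny stand-in archive (hypothetical, not a complete Word file)."""
    body = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<w:document><w:body><w:p>"
        "<w:r><w:t>This is some text</w:t></w:r>"
        "<w:r><w:rPr><w:b/></w:rPr><w:t>In bold</w:t></w:r>"
        "</w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


print(docx_plain_text(make_sample_docx()))
```

Stripping tags with a regex is only good for a quick look: it throws away the structure and leaves entity references such as `&amp;` undecoded.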

    Question: I've never had to work with PDFs under the hood. Are they similar or more difficult to parse?

  • (Score: 2, Interesting) by Anonymous Coward on Wednesday May 26 2021, @09:35PM


    Question: I've never had to work with PDFs under the hood. Are they similar or more difficult to parse?

    That depends upon how you define "more difficult".

    PDFs under the hood are literally just a sequence of "instructions" to an abstract virtual printer to position bits of text on a two dimensional pixel grid. And depending upon how the print driver decided to generate the pdf, that positioning could be:

    1. positioning individual letters
    2. positioning individual words
    3. positioning portions of lines
    4. positioning whole lines
    5. a mixture of all of the above

    But, there is no way to "position" anything larger than a line. PDF includes no ability to do internal word-wrap for a block of text. So the largest piece of text that can be positioned is a whole line. Any word-wrap has to be performed by the generator as it generates the PDF file.

    What goes missing in the conversion to PDF is any direct indication of higher level document structure (i.e., where paragraphs begin/end, which line of bold text is a level 3 header vs. a level 2 header, which sequences of text were lists in the original doc, whether a block of text was actually a table in the original document, etc.).

    This loss of higher level doc structure is why the switch from PDF to DOCX. The DOCX file contains data relating to those higher level structural components, so that data can be directly preserved and extracted.
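As a rough illustration of what "directly preserved and extracted" can look like, here is a small Python sketch that reads paragraph styles out of WordprocessingML with the standard library's XML parser. The `SAMPLE` markup is hand-written for this example and far simpler than what Word actually saves; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the main document part of a DOCX file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written sample (an assumption for illustration, not real Word output).
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>'
    "<w:r><w:t>Claims</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>1. A widget </w:t></w:r>"
    "<w:r><w:t>comprising a gear.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def paragraphs_with_styles(document_xml: str) -> list:
    """Return (style id, text) for each paragraph, in document order."""
    root = ET.fromstring(document_xml)
    result = []
    for para in root.iterfind(".//w:p", NS):
        style = para.find("w:pPr/w:pStyle", NS)
        # Paragraphs without an explicit style use the document default.
        style_id = style.get(f"{{{W}}}val") if style is not None else "(default)"
        # A paragraph's text is split across runs; join the pieces.
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
        result.append((style_id, text))
    return result


for style_id, text in paragraphs_with_styles(SAMPLE):
    print(f"{style_id}: {text}")
```

The point is that the paragraph boundaries and the heading level are stated in the file, so nothing has to be guessed from where ink lands on the page.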

    Using a PDF as input means that higher level document structure has to be inferred by a lot of AI like program code that will not get the inference correct in all cases.

    Plus, a little-known aspect of PDF (unless one is trying to write a PDF generator) is that PDF allows for an arbitrary mapping between byte values in the PDF file and the font faces that appear on the "virtual paper" that is the PDF output image. I.e., normally, in ASCII/Unicode, a value of 65 decimal is a capital letter A. But in a PDF file, one can define any byte value to map to the font face that draws a letter A. So inside the PDF, letter A could be decimal 65, but it could also be decimal zero or decimal 37 or decimal 255. The PDF, when rendered to an image, would show a capital A for any of those codes, but for extracting text back out of the PDF, the file is supposed to contain an optional mapping table that says "in this PDF, for this font, a decimal 46 is a Unicode capital letter A". Sadly, because this table is optional, some print drivers omit it to make "a smaller PDF file", with the result that while the PDF will print or view correctly, there is no way to extract the text from it without doing something like running OCR against it.
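A toy Python model of that mapping problem. It is not real PDF parsing: the code values, the two tables, and the function names are all made up for illustration. It only shows why rendering can succeed while text extraction fails when the optional lookup table is absent.

```python
from typing import Dict, Optional

# Hypothetical font: these byte codes select the glyphs drawn on the page.
GLYPH_FOR_CODE: Dict[int, str] = {0: "A", 37: "B", 255: "C"}

# The optional table that maps the same codes back to Unicode characters.
TO_UNICODE: Dict[int, str] = {0: "A", 37: "B", 255: "C"}


def render(codes: bytes) -> str:
    """What a viewer shows: rendering only needs the code-to-glyph mapping."""
    return "".join(GLYPH_FOR_CODE[c] for c in codes)


def extract_text(codes: bytes, to_unicode: Optional[Dict[int, str]]) -> Optional[str]:
    """What a text extractor recovers: nothing reliable without the table."""
    if to_unicode is None:
        return None  # the generator omitted the table; only OCR would help
    # Codes missing from the table become the Unicode replacement character.
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


page = bytes([0, 37, 255, 0])
print(render(page))                    # the page always looks right
print(extract_text(page, TO_UNICODE))  # recoverable when the table is present
print(extract_text(page, None))        # not recoverable when it was omitted
```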

  • (Score: 2, Interesting) by Anonymous Coward on Wednesday May 26 2021, @09:47PM (2 children)


    The PDF instructions to "draw" a line of text equivalent to your 'line' above could be something like this:

    BT
    /Times-Roman 12 Tf
    1 0 0 1 0.0 0.0 Tm
    (This is some text In) Tj
    /Times-Bold 12 Tf
    1 0 0 1 21.0 0.0 Tm
    (bold) Tj
    /Times-Roman 12 Tf
    1 0 0 1 25.0 0.0 Tm
    (In) Tj
    /Times-BoldItalics 12 Tf
    1 0 0 1 29.0 0.0 Tm
    (italics bold) Tj
    /Times-BoldItalics 12 Tf
    1 0 0 1 43.0 0.0 Tm
    (and an image) Tj
    ET

    Note: I say "could" because this is not the only way this text could be typeset. The driver could position and draw each letter separately. The driver could group together text under the same font and do a font change (the /Times-* Tf lines) once, then position all of the text under that font, then change font, and position the next text pieces. The driver could draw things in reverse, up the page, down the page, diagonally, in a circle, etc. The order in which the pieces are drawn is irrelevant to you viewing the final product, as long as the same pixels are turned black. But all those possible combinations make extracting text very challenging. Certainly not as easy as "remove XML tags and what is left is text".
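To make the ordering problem concrete, here is a toy Python extractor for the simplified stream above. It is a sketch under heavy assumptions: it understands only the exact `1 0 0 1 x y Tm` and `(text) Tj` forms used in this example, and ignores everything a real PDF would add (compressed streams, escaped parentheses, other positioning and text-showing operators). The function names are hypothetical.

```python
import re

CONTENT_STREAM = """\
BT
/Times-Roman 12 Tf
1 0 0 1 0.0 0.0 Tm
(This is some text In) Tj
/Times-Bold 12 Tf
1 0 0 1 21.0 0.0 Tm
(bold) Tj
/Times-Roman 12 Tf
1 0 0 1 25.0 0.0 Tm
(In) Tj
/Times-BoldItalics 12 Tf
1 0 0 1 29.0 0.0 Tm
(italics bold) Tj
/Times-BoldItalics 12 Tf
1 0 0 1 43.0 0.0 Tm
(and an image) Tj
ET
"""

# Matches either a position-setting line or a show-text line.
TOKEN = re.compile(
    r"1 0 0 1 (?P<x>-?[\d.]+) (?P<y>-?[\d.]+) Tm"
    r"|\((?P<text>[^)]*)\) Tj"
)


def extract_runs(stream: str) -> list:
    """Collect (x, y, text) for each piece of text, in drawing order."""
    runs = []
    x = y = 0.0
    for match in TOKEN.finditer(stream):
        if match.group("text") is None:
            # A Tm line: remember where the next text will be placed.
            x, y = float(match.group("x")), float(match.group("y"))
        else:
            runs.append((x, y, match.group("text")))
    return runs


def reading_order(runs: list) -> str:
    """Guess reading order from position: top to bottom, then left to right."""
    ordered = sorted(runs, key=lambda run: (-run[1], run[0]))
    return " ".join(text for _, _, text in ordered)


runs = extract_runs(CONTENT_STREAM)
print(reading_order(runs))
# Drawing the same pieces in reverse gives the same page and the same guess.
print(reading_order(list(reversed(runs))))
```

Even in this easy case the extractor has to sort by coordinates rather than trust the order of the file, and that is before per-letter positioning, rotation, or multiple columns enter the picture.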

    • (Score: 2) by ElizabethGreene on Wednesday May 26 2021, @10:04PM (1 child)


      Thanks for explaining it.

      I don't want to undermine the other comments about undocumented features though. I have no doubt that some Word documents have similar fiddly bits around typesetting, positioning, etc. That's inevitable when you've got 35+ years of backwards compatibility under the hood.

      • (Score: 0) by Anonymous Coward on Wednesday May 26 2021, @10:18PM


        Very true, although for newly authored documents in newer versions of Word (i.e., those that support DOCX) most of those 35+ years of cruft is not present in the DOCX file. And, of course, an old copy of Word that never understood DOCX can't save its cruft into a DOCX file, because it can't write DOCX. One would have to open an ancient .doc in a modern Word and then save it as DOCX for any of that "really old crud" to appear.

        Is DOCX the best choice from a future compatibility standpoint? Most definitely not.

        Is DOCX a pragmatic choice that allows extracting text and document structure (lists, tables, images, paragraphs, etc.) because MS Word is what most of the law firms use, so they (the firms/attorneys) don't have to change much at all to supply the files? Most definitely yes.

        So it is a trade-off between needing to play "tail wags dog" games periodically as MS changes the DOCX format and having every firm/attorney balk at the change, dig their heels in, and refuse to change (whereupon one gets nowhere).

        The current PDF filing system is essentially little more than "electrification of a US Postal Service envelope". It is based upon supplying virtual sheets of printed paper in a virtual PDF envelope, and is as close as possible to an exact clone of the "print doc to paper, stuff paper in USPS envelope, mail envelope (with sufficient postage) to the USPTO" process that came before. And that minimal shift from physical paper and USPS to virtual paper and the internet was the way it was largely to prevent "law firms dig heels in, refuse to go along".