SoylentNews is people

posted by Fnord666 on Wednesday May 26 2021, @01:13PM
from the vendor-capture dept.

There are still a few months to fix this, but for now the US Patent and Trademark Office's (USPTO) Acting Commissioner for Patents, Andrew Faile, and Chief Information Officer, Jamie Holcombe, have announced that starting January 1st, 2022, the USPTO will institute a surcharge for applicants that are not locked into Microsoft products via the proprietary DOCX format. From that date onwards, the USPTO will move away from PDF and require all filers to use that proprietary format or face an arbitrary surcharge when filing.

First, we delayed the effective date for the non-DOCX surcharge fee to January 1, 2022, to provide more time for applicants to transition to this new process, and for the USPTO to continue our outreach efforts and address customer concerns. We've also made office actions available in DOCX and XML formats and further enhanced DOCX features, including accepting DOCX for drawings in addition to the specification, claims, and abstract for certain applications.

One of several major problems with the plan is that DOCX is a proprietary format. There are several variants of DOCX, and each of them is really only supported by a single company's products. Some other products have made progress in reverse engineering it, but are hindered by the lack of documentation. DOCX is a competitor to the fully documented, open standard OpenDocument Format, also known as ISO/IEC 26300.

DOCX is not to be confused with OOXML, though it often is. While OOXML, also known as ISO/IEC 29500, is technically standardized, it is incompletely documented and only vaguely related to DOCX. The DOCX format itself is neither fully documented nor standard. So the USPTO is also engaged in spreading disinformation by asserting that it is.

Previously:
(2015) Microsoft Threatened the UK Over Open Standards


  • (Score: 2, Interesting) by Anonymous Coward on Wednesday May 26 2021, @09:35PM

    Question: I've never had to work with PDFs under the hood. Are they similar or more difficult to parse?

    That depends upon how you define "more difficult".

    PDFs under the hood are literally just a sequence of "instructions" to an abstract virtual printer to position bits of text on a two-dimensional page grid (the coordinates are abstract units, by default 1/72 inch, rather than pixels). And depending upon how the generating software (often a print driver) decided to produce the PDF, that positioning could be:

    1. positioning individual letters
    2. positioning individual words
    3. positioning portions of lines
    4. positioning whole lines
    5. a mixture of all of the above

    But there is no way to "position" anything larger than a line. PDF has no ability to word-wrap a block of text internally, so the largest piece of text that can be positioned is a whole line. Any word-wrap has to be performed by the generator as it generates the PDF file.
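As a rough illustration of the generator doing the wrapping itself, here is a minimal Python sketch. The function name and its defaults are made up for this example, and wrapping by character count is a simplification (a real generator would measure glyph widths from font metrics). The operators it emits (BT, Tf, Td, Tj, ET) are the standard PDF text operators, but this produces only a content-stream fragment, not a complete PDF file.

```python
import textwrap


def wrap_to_pdf_text_ops(text, x=72, y=720, leading=14, font="F1", size=12, max_chars=40):
    """Wrap text ourselves, then emit one positioned line per Tj operator.

    Hypothetical helper: PDF has no word-wrap operator, so the line breaks
    must be decided here, before anything is written to the file.
    """
    lines = textwrap.wrap(text, width=max_chars)  # crude wrap by character count
    ops = ["BT", f"/{font} {size} Tf", f"{x} {y} Td"]  # begin text, set font, set start
    for i, line in enumerate(lines):
        if i > 0:
            ops.append(f"0 -{leading} Td")  # move down to the start of the next line
        # Backslashes and parentheses must be escaped inside a PDF string literal.
        escaped = line.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        ops.append(f"({escaped}) Tj")  # show exactly one line of text
    ops.append("ET")  # end text
    return ops


for op in wrap_to_pdf_text_ops("PDF has no notion of a paragraph, only of text shown at a position."):
    print(op)
```

Note that once the operators are written out, nothing in them records that the lines ever belonged to the same paragraph.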

    What goes missing in the conversion to PDF is any direct indication of higher level document structure (i.e., where paragraphs begin/end, which line of bold text is a level 3 header vs. a level 2 header, which sequences of text were lists in the original doc, whether a block of text was actually a table in the original document, etc.).

    This loss of higher level doc structure is the reason for the switch from PDF to DOCX. The DOCX file contains data describing those higher level structural components, so that data can be directly preserved and extracted.
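To make that concrete, here is a small Python sketch that reads paragraph styles straight out of the XML. A DOCX file is a zip archive whose main text lives in word/document.xml. The sample built here is a stripped-down stand-in (an assumption for illustration: a real file carries more parts, such as [Content_Types].xml), and both function names are invented for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_sample_docx():
    """Build a minimal stand-in archive holding only word/document.xml."""
    body = (
        f'<w:document xmlns:w="{W}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Claims</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>A widget comprising </w:t></w:r><w:r><w:t>a sprocket.</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", body)
    return buf.getvalue()


def paragraphs_with_styles(docx_bytes):
    """Return (style name or None, text) for each paragraph element."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    result = []
    for p in root.iter(f"{{{W}}}p"):
        style = p.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        name = style.get(f"{{{W}}}val") if style is not None else None
        # A paragraph's text may be split over several runs; join them.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        result.append((name, text))
    return result


print(paragraphs_with_styles(make_sample_docx()))
```

The paragraph boundaries and the heading style are read directly from the markup; nothing has to be guessed.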

    Using a PDF as input means that the higher level document structure has to be inferred by a lot of AI-like heuristic code that will not get the inference correct in all cases.
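A toy version of that kind of inference, as a Python sketch: it guesses paragraph breaks purely from the vertical gap between positioned lines. The function name, the leading value, and the tolerance are all assumptions chosen for illustration, and real extractors use many more signals than this.

```python
def group_into_paragraphs(lines, leading=14, tolerance=2):
    """Guess paragraphs from (y, text) pairs listed top to bottom.

    PDF y coordinates grow upward, so y decreases as we move down the page.
    A gap larger than the normal line spacing is taken as a paragraph break.
    """
    paragraphs = []
    current = []
    prev_y = None
    for y, text in lines:
        if prev_y is not None and (prev_y - y) > leading + tolerance:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        prev_y = y
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


page = [(720, "First line"), (706, "continues here."), (678, "New paragraph")]
print(group_into_paragraphs(page))
```

A document that separates paragraphs by first-line indent instead of extra spacing would defeat this heuristic entirely, which is the sort of failure case meant above.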

    Plus, a little-known aspect of PDF (unless one is trying to write a PDF generator) is that PDF allows an arbitrary mapping from byte values in the PDF file to the glyphs that appear on the "virtual paper" that is the PDF output image. I.e., normally, in ASCII/Unicode, the value 65 decimal is a capital letter A. But in a PDF file, one can define any byte value to map to the glyph that draws a letter A. So inside the PDF, letter A could be decimal 65, but it could also be decimal zero or decimal 37 or decimal 255.

    The PDF, when rendered to an image, would show a capital A for any of those codes, but for extracting text back out, the PDF is supposed to contain an optional mapping table (the ToUnicode table) that says "in this PDF, for this font, a decimal 46 is a Unicode capital letter A". Sadly, because this table is optional, some print drivers omit it to make "a smaller PDF file", with the result that while the PDF will print or view correctly, there is no way to extract the text from it without doing something like running OCR against it.
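A small Python sketch of the problem. The byte codes and the mapping table below are invented for illustration, and the dictionary only stands in for the optional per-font table; it is not how a real PDF stores it.

```python
def naive_decode(codes):
    """Treat the bytes as if they were ordinary Latin-1/ASCII text."""
    return bytes(codes).decode("latin-1")


def extract_text(codes, to_unicode=None):
    """Map each byte code to a Unicode character via the font's table.

    Returns None when the table is missing: the page still renders, but
    the text cannot be recovered from the codes alone.
    """
    if to_unicode is None:
        return None
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Hypothetical subset font that numbered its glyphs 1, 2, 3
to_unicode = {1: "A", 2: "B", 3: "C"}
shown = [3, 1, 2]  # the bytes actually stored in the content stream

print(repr(naive_decode(shown)))        # control characters, not letters
print(extract_text(shown, to_unicode))  # recovered via the mapping table
print(extract_text(shown))              # table omitted: nothing to go on
```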
