Sakai / SAK-41874

Apache Tika 1.22

    Details

    • Type: Bug
    • Status: RESOLVED
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 19.1
    • Fix Version/s: 20.0 [Tentative]
    • Component/s: Kernel, Master
    • Labels:
      None
    • Test Plan:
      Please add a Test Plan here.

      Description

      Release 1.22 - 07/29/2019

      • NOTE: Known regression: PDFBOX-4587 – PDF passwords with codepoints
        between 0xF000 and 0xF0000 will cause an exception.
      • Add parser for HWP v5 files via SooMyung Lee (soomyung) and
        JinSup Kim (ddoleye) (TIKA-2909).
      • Fix order of closing streams to avoid "Failed to close temporary resource"
        exception in TesseractOCRParser (TIKA-2908).
      • Improve AutoDetectReader performance by caching encoding
        detector (TIKA-1568).
      • Prevent RTFParser from outputting illegal tag combinations (TIKA-2889).
      • Fix RereadableInputStream to release all resources (TIKA-2903).
      • Implement custom language identifier in the tika-eval module based on
        OpenNLP's language detector; add 18 languages and add common words
        lists for all 121 languages (TIKA-2790).
      • Fix NPE in MimeTypesReader.releaseParser() via Eamonn Saunders (TIKA-2896).
      • Fix RTFParser to extract more content (TIKA-2883).
      • Add clientSubmitTime to the metadata extracted from PST files (TIKA-2898).
      • Improve StreamingZipContainerDetector for xltx, xltm and
        several other file formats (TIKA-2886).
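
The PDFBOX-4587 regression noted for 1.22 depends only on which codepoints a password contains, so a deployment could screen passwords before handing them to the parser. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical (they are not part of Tika or PDFBox), and treating both bounds as inclusive is an assumption, since the note only says "between".

```java
// Illustrative helper only; not part of Tika or PDFBox.
final class PdfPasswordCodepointCheck {

    // Bounds quoted in the release note for PDFBOX-4587.
    // Inclusive bounds are an assumption; the note only says "between".
    static final int RANGE_START = 0xF000;
    static final int RANGE_END = 0xF0000;

    /** Returns true if any codepoint of the password falls in the reported range. */
    static boolean hasAffectedCodepoint(String password) {
        // codePoints() yields full codepoints, so surrogate pairs count once.
        return password.codePoints()
                .anyMatch(cp -> cp >= RANGE_START && cp <= RANGE_END);
    }

    public static void main(String[] args) {
        System.out.println(hasAffectedCodepoint("secret"));   // false
        System.out.println(hasAffectedCodepoint("pw\uF8FF")); // true
    }
}
```

Note that the range is wide: besides the private-use characters starting at U+F000, it also covers most supplementary-plane characters, such as emoji.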

      Release 1.21 - 05/14/2019

      • Add optional AUTO mode to OCR'ing of PDFs. If tesseract is installed
        and on the path, and this option is selected programmatically
        or via TikaConfig(), the PDFParser will use heuristics to decide
        whether or not to run OCR per page on PDFs (TIKA-2749).
      • The ZipContainerDetector now runs streaming detection up to its
        markLimit by default. To restore the legacy behavior (spool to file, or
        rely on the underlying file in a TikaInputStream), set markLimit=-1.
        The POIFSContainerDetector requires an underlying file and will try to
        spool the stream to disk; if the file's length exceeds markLimit, it
        does not attempt detection. Set markLimit to -1 for the legacy
        behavior here as well (TIKA-2849).
      • Upgrade PDFBox to 2.0.14 (TIKA-2834).
      • Add CSV detection and replace TXTParser with TextAndCSVParser;
        users can turn off CSV detection by excluding the TextAndCSVParser
        and adding back the TXTParser via tika-config (TIKA-2833).
      • Add a CSVParser. CSV detection is currently based solely on filename
        and/or information conveyed via Metadata (TIKA-2826).
      • General upgrades: asm, bouncycastle, commons-codec, commons-lang3, cxf,
        guava, h2, httpcomponents, jackcess, junrar, Lucene, mime4j, opennlp, parso,
        sqlite-jdbc (provided), zstd-jni (provided) (TIKA-2824).
      • Bundle xerces2 with tika-parsers (TIKA-2802).
      • Upgrade jaxb to 2.3.2 (TIKA-2819).
      • Upgrade jackson to 2.9.8 (TIKA-2717).
      • Update tika-eval's common tokens lists (TIKA-2822).
      • Handle bad tags in tika-eval more robustly (TIKA-2810).
      • Add reports for tags in tika-eval (TIKA-2809).
      • Extract text from SDT element within textboxes in .docx files (TIKA-2807).
      • Try to handle truncated OOXML files more robustly (TIKA-2765).
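
The markLimit rule given for the POIFSContainerDetector in the 1.21 notes (TIKA-2849) can be read as a small decision function. The sketch below models only the wording of the release note. The class and method names are hypothetical, this is not Tika's implementation, and the note does not specify behavior for negative values other than -1.

```java
// Hypothetical model of the markLimit rule as worded in the 1.21 release note;
// this is not Tika's code.
final class MarkLimitRuleSketch {

    // Value the release note gives for restoring the legacy behavior.
    static final int LEGACY = -1;

    /**
     * Returns true if detection would be attempted for a file of the given
     * length, following the rule "if the file's length is > markLimit, it will
     * not attempt detection; set markLimit to -1 for legacy behavior".
     */
    static boolean attemptsDetection(long fileLength, int markLimit) {
        if (markLimit == LEGACY) {
            return true; // assumed: legacy mode spools regardless of length
        }
        return fileLength <= markLimit;
    }

    public static void main(String[] args) {
        System.out.println(attemptsDetection(4_096L, 16_384));      // true
        System.out.println(attemptsDetection(1_000_000L, 16_384));  // false
        System.out.println(attemptsDetection(1_000_000L, LEGACY));  // true
    }
}
```

The practical consequence for an upgrade like this one is that large OLE2 files that were previously detected by spooling to disk may fall back to a more generic type unless markLimit is raised or set to -1.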

      People

      • Assignee: David Horwitz (dhorwitz)
      • Reporter: David Horwitz (dhorwitz)
      • Votes: 0
      • Watchers: 1
