- Type: Bug
- Status: Verified
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 11.1
- Fix Version/s: 11.5 [Tentative], 12.0
- Component/s: Kernel
- Labels: None
- 11 status: Resolved
- Previous Issue Keys: KNL-1471
https://dist.apache.org/repos/dist/release/tika/CHANGES-1.13.txt
Release 1.13 - 05/08/2016
- Upgrade to PDFBox 2.0.1 (TIKA-1285/TIKA-1959).
  MAJOR CHANGES in PDFParser:
  - The classic sequential parser is no longer available.
  - Tiff files are no longer extracted by default. See
    https://pdfbox.apache.org/2.0/dependencies.html#optional-components
    for optional components to process Tiff files.
  - Some truncated/corrupted files that had some content extracted
    with 1.8.x may have no content extracted in 2.0.x (see TIKA-1912).
- The MIT-NLP Information Extraction (MITIE) Named Entity
Recognition (NER) system is now supported in Tika
(TIKA-1913, GitHub-108).
- Tika now supports the use of the Yandex translation
service (TIKA-1943, GitHub-106).
- Tika now uses NER to extract scientific measurements
from text using either GROBID Quantities, which uses
conditional random fields, or NLTK, which uses regular
expressions (TIKA-1917, GitHub-104).
- Fixed JournalParser to handle null responses from
GROBID and to log a message (TIKA-1925).
- Refactored Language Detector into tika-langdetect module,
added default N-Gram implementation, Optimaize Lang
Detector and MIT Text.jl implementation
(TIKA-1872, TIKA-1696, TIKA-1723).
- Extract metadata from MP4 videos whether or not the
PooledTimeSeries parser is available, via Aditya Dhulipala
(TIKA-1844).
- Fix NPE when trying to get embedded image identifier in
WordParser (TIKA-1956).
- Improvements to MIME database for detection of Scientific
and other formats present in the TREC-DD-Polar dataset
(TIKA-1881, GitHub-85, TIKA-1883, TIKA-1884, TIKA-1886,
TIKA-1882).
- LinkContentHandler now extracts links from script tags
via Joseph Naegele (TIKA-1937).
- Handle per page IOExceptions more robustly in PDFParser (TIKA-1948).
- Upgrade commons-compress to 1.11 (TIKA-1949).
- Add detection for embedded MSChart.Graph files (TIKA-1033).
- Fix NPE in Sqlite parser from Nick C (TIKA-1927).
- Fix NPE in Open Document parser from Nick C (TIKA-1916).
- Upgrade mp4parser's isoparser to 1.1.7 (TIKA-1924 and TIKA-1931).
- Upgrade BouncyCastle to 1.54 (TIKA-1923).
- Upgrade Jackcess to 2.1.3 (TIKA-1922).
- Upgrade Drew Noakes' metadata-extractor to 2.8.1 (TIKA-1921).
- Upgrade Gson in tika-serialization to 2.6.2 (TIKA-1920).
- Upgrade commons-cli in tika-batch to 1.3.1 (TIKA-1919).
- Add XMPMM support to PDFParser and JpegParser via Jempbox (TIKA-1894).
- Move serialization of TikaConfig to tika-core and enable dumping
of the config file via tika-app (TIKA-1657).
- Tika now incorporates the Natural Language Toolkit (NLTK) from the
Python community as an option for Named Entity Recognition (TIKA-1876).
- Add support for XFA extraction via Pascal Essiembre (TIKA-1857).
- Upgrade to sqlite-jdbc 3.8.11.2 (TIKA-1861). NOTE: this dependency
is still <scope>provided</scope>. You need to include this dependency
in order to parse sqlite files.
- Upgrade to POI 3.15-beta1 (TIKA-1895).
- Upgrade to Jackson 2.7.1 (TIKA-1869).
- Upgrade to Apache SIS 0.6 (TIKA-1878).
- RichTextContentHandler moved from the Server package to Core (TIKA-1870).
- Added ZeroSizeFileDetector to support application/x-zerovalue via
Adesh Gupta (TIKA-1885).
- Addition of types information to Grobid quantities parser via
Can Menekse (TIKA-1965).