Description
For a few of the files mentioned in KNL-1278, Tika is detecting the MIME type incorrectly. We need to do one of the following:
- Find a way to override the magic priority definitions in tika-mimetypes.xml to improve detection (and contribute the fix back)
- Have a way to keep files with certain extensions from going through the detector
- Simply not allow certain types to be returned
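As a rough illustration of the second option, the sketch below checks the file extension before any content detection runs. The class name ExtensionBypass, the method fixedTypeFor and the extension table are all hypothetical and are not taken from Tika or from the existing code; the caller would fall back to the normal detector only when no fixed type is returned.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: not part of Tika or of the existing code base.
final class ExtensionBypass {
    // Assumed table of extensions that should skip content detection entirely.
    private static final Map<String, String> FIXED_TYPES = Map.of(
            "js", "application/javascript",
            "css", "text/css");

    private ExtensionBypass() {}

    /** Returns a fixed type when the extension is listed, otherwise empty. */
    static Optional<String> fixedTypeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension, so let the detector decide
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(FIXED_TYPES.get(ext));
    }
}
```

With this in place, a file such as jquery-1.6.1.min.js would get its type from the table and never reach the magic-based detection.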
Not sure of the best strategy, but there are enough specific files affected that it's a problem.
It was mentioned that the file jquery-1.6.1.min.js (http://blog.jquery.com/2011/05/12/jquery-1-6-1-released/) was being detected as text/html. This is probably caused by the following Tika issue, because the JavaScript contains the string <html:
https://issues.apache.org/jira/browse/TIKA-1141
It was also mentioned that some HTML files containing <html xmlns= are being returned as application/xhtml+xml when they really aren't XHTML, and this is causing a problem for Internet Explorer.
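For the XHTML case, one possible reading of "not allow certain types to be returned" is a small filter applied to whatever the detector reports. This is only a sketch: the class ReturnedTypeFilter and its remap table are hypothetical, and the assumption that application/xhtml+xml should be downgraded to text/html is a guess at the desired behaviour rather than a decided fix.

```java
import java.util.Map;

// Hypothetical post-detection filter: not part of Tika or of the existing code base.
final class ReturnedTypeFilter {
    // Assumed remap table: detected types that are never returned as-is.
    private static final Map<String, String> REMAPPED = Map.of(
            "application/xhtml+xml", "text/html");

    private ReturnedTypeFilter() {}

    /** Returns the replacement for a disallowed type, or the input unchanged. */
    static String filter(String detectedType) {
        if (detectedType == null) {
            return null; // nothing was detected, so there is nothing to remap
        }
        return REMAPPED.getOrDefault(detectedType, detectedType);
    }
}
```

Any type not listed in the table passes through untouched, so the filter only affects the types known to cause trouble.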