Sakai / SAK-39284

Tika is detecting some file types incorrectly

    Details

    • Type: Bug
    • Status: Verified
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.0
    • Fix Version/s: 10.6, 11.0
    • Component/s: Kernel
    • Labels:
      None
    • 10 status:
      Resolved
    • Property addition/change required:
      Yes
    • Previous Issue Keys:
      KNL-1306
    • Test Plan:
      1. Download http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.6.1.min.js
      2. Upload this file to Sakai resources
      3. Look at Edit Details on the file

      The mime type should be "text/javascript" or something with javascript in it. Previously it was being detected as application/xhtml+xml or text/html. By default, a special case now makes files with a .js extension skip inspection of the file body, so the type is determined from the extension alone.

      Ideally we'd also test that the other property works, but I'm not sure which files to test it with.

      content.mimeMagic.ignorecontent.mimetypes=text/html,text/plain
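The ticket does not spell out how Sakai interprets content.mimeMagic.ignorecontent.mimetypes. A minimal sketch of one plausible reading, assuming the listed types are ones that should not be trusted when they come from content sniffing, so the extension-derived type is used instead. The class and method names are hypothetical and are not Sakai's actual code.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of Sakai or Tika.
final class IgnoredContentTypes {
    private final Set<String> ignored;

    // propertyValue is the raw comma-separated property string,
    // e.g. "text/html,text/plain".
    IgnoredContentTypes(String propertyValue) {
        this.ignored = Arrays.stream(propertyValue.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    // Returns the content-detected type unless it is null or on the ignore
    // list; in those cases the extension-derived type is returned instead.
    String resolve(String detectedFromContent, String derivedFromExtension) {
        if (detectedFromContent == null
                || ignored.contains(detectedFromContent.toLowerCase(Locale.ROOT))) {
            return derivedFromExtension;
        }
        return detectedFromContent;
    }
}
```

Under this assumption, a .js file that content sniffing reports as text/html would fall back to the type derived from its extension, while a type not on the list (for example image/png) would pass through unchanged.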

      Description

      Tika is detecting the type incorrectly for a few of the files mentioned in KNL-1278. We need to do one of the following:

      • Find a way to override the tika-mimetypes.xml definitions for magic priority to improve the detection (and contribute back)
        or
      • Have a way to exclude files with certain extensions from going through the detector
        or
      • Just not allow certain types to be returned

      Not sure of the best strategy, but there are enough specific files affected that it's a problem.
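The second option, keeping files with certain extensions away from the detector, could look roughly like the sketch below. The class name, method names, and the hard-coded "js" set are assumptions for illustration; a real implementation would presumably read the extension list from configuration.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical gate in front of a content-based detector; not Sakai's code.
final class ExtensionBypass {
    // Assumed default, mirroring the ".js" special case from the test plan.
    private static final Set<String> SKIP_CONTENT = Set.of("js");

    // Returns the lower-cased extension without the dot, or "" if none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // True when the file body should be handed to the content detector;
    // false when only the extension should decide the mime type.
    static boolean shouldInspectContent(String fileName) {
        return !SKIP_CONTENT.contains(extensionOf(fileName));
    }
}
```

A caller would check shouldInspectContent(name) first and only pass the file body to the detector when it returns true.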

      It was mentioned that the file jquery-1.6.1.min.js (http://blog.jquery.com/2011/05/12/jquery-1-6-1-released/) was being detected as text/html. This is probably caused by the following Tika issue, because the JavaScript contains the string <html:

      https://issues.apache.org/jira/browse/TIKA-1141
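To illustrate why a script can be mistaken for HTML, here is a deliberately simplified sniffer that searches the start of a file for "<html". This is not Tika's actual magic-matching logic, only a sketch of the general failure mode; the class name and the window parameter are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Simplified illustration of content sniffing; not how Tika is implemented.
final class NaiveHtmlSniffer {
    // Reports HTML if "<html" appears anywhere in the first `window` bytes.
    static boolean looksLikeHtml(byte[] content, int window) {
        int n = Math.min(content.length, window);
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        return head.contains("<html");
    }
}
```

A JavaScript file that embeds an HTML fragment in a string literal satisfies this check even though it is not an HTML document, which is the kind of false positive described above.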

      It was also mentioned that some HTML files containing <html xmlns= are being returned as application/xhtml+xml even though they are not really XHTML, and this is causing a problem for Internet Explorer.

                People

                • Assignee:
                  jonespm Matthew Jones
                  Reporter:
                  jonespm Matthew Jones