Kernel
  1. Kernel
  2. KNL-589

Automatic file encoding detection

    Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.11
    • Fix Version/s: 1.1.14, 1.2.1
    • Component/s: Impl
    • Labels:
      None
    • 1.2 Status:
      Closed
    • CLE Team Issue:
      Yes
    • Previous Issue Keys:
      SAK-17558

      Description

      If you create a file with non standard characters (for example: España) and try to access it, you will get strange symbols. That's because charset is not set neither in ResourceProperties when you upload the file nor BaseContentService when serving the file. I attach the patch for our 2.6.x installation under kernel 1.0.12.
      1. content_encoding.txt
        4 kB
        David Roldán Martínez
      2. ejemplo.htm
        4 kB
        Sam Ottenhoff
      3. SAK-17558-kernel-DH.diff
        4 kB
        David Horwitz

        Issue Links

          Activity

          Hide
          David Horwitz added a comment -
          MAINTANCE TEAM: Unassigned so these get reviewed by the Maintance Team
          Show
          David Horwitz added a comment - MAINTANCE TEAM: Unassigned so these get reviewed by the Maintance Team
          Hide
          Anthony Vargo added a comment -
          2.7.x, r73739
          Show
          Anthony Vargo added a comment - 2.7.x, r73739
          Hide
          David Roldán Martínez added a comment -
          Hi David,

          We have found some issues in the patch I provided but I'm not sure if it fails only for resources stored before to apply the patch or not. If you create a new resource or replace an existing one, it works well.

          David :)
          Show
          David Roldán Martínez added a comment - Hi David, We have found some issues in the patch I provided but I'm not sure if it fails only for resources stored before to apply the patch or not. If you create a new resource or replace an existing one, it works well. David :)
          Hide
          David Roldán Martínez added a comment -
          The issues we found weren't related with this. The patch is 100 % functional.
          Show
          David Roldán Martínez added a comment - The issues we found weren't related with this. The patch is 100 % functional.
          Hide
          David Horwitz added a comment -
          This causes the regression described in SAK-18557 - the input stream is read past its beginning causing file corruption
          Show
          David Horwitz added a comment - This causes the regression described in SAK-18557 - the input stream is read past its beginning causing file corruption
          Hide
          David Horwitz added a comment -
          looking at the code this may be tricky to fix at this level - may be better to do this in the kernel
          Show
          David Horwitz added a comment - looking at the code this may be tricky to fix at this level - may be better to do this in the kernel
          Hide
          Anthony Whyte added a comment - - edited
          Reverted the fix in the 2.7.0 branch (David handled 2.7.x). I should note that the ibm dependency added r73739 has not been removed from the tool pom in either branch.

          <dependency>
          <groupId>com.ibm.icu</groupId>
          <artifactId>icu4j</artifactId>
          <version>4.2.1</version>
          </dependency>

          Show
          Anthony Whyte added a comment - - edited Reverted the fix in the 2.7.0 branch (David handled 2.7.x). I should note that the ibm dependency added r73739 has not been removed from the tool pom in either branch. <dependency> <groupId>com.ibm.icu</groupId> <artifactId>icu4j</artifactId> <version>4.2.1</version> </dependency>
          Hide
          Charles Hedrick added a comment - - edited
          Just to be clear, the problem is that the detector reads data. So if you pass it a stream, when it's finished the first 8000 bytes of the stream have been read. Not surprisingly, the file as uploaded is missing the first 8000 bytes.

          If you think this is important, you can do something like

          if (stream.markSupported()) {
             try {
                stream.mark(32);
                byte[32] buffer;
                stream.read(buffer);
                detector.setText(buffer)
                encoding = detector.detect().getName();
             } catch (Exception e) {
                encoding = "UTF-8";
             } finally {
                stream.reset();
             }
          }

          where 32 is some number large enough for the detector. I'm not sure what this is. However you'd need to verify that this works if end of file occurs while you're reading the 32 bytes.

          Is there a better approach then heuristic detection? Does the browser set an encoding when uploading? Is this reliable enough to use? (Given the state of browsers I'm prepared to believe that it's not.)

          Show
          Charles Hedrick added a comment - - edited Just to be clear, the problem is that the detector reads data. So if you pass it a stream, when it's finished the first 8000 bytes of the stream have been read. Not surprisingly, the file as uploaded is missing the first 8000 bytes. If you think this is important, you can do something like if (stream.markSupported()) {    try {       stream.mark(32);       byte[32] buffer;       stream.read(buffer);       detector.setText(buffer)       encoding = detector.detect().getName();    } catch (Exception e) {       encoding = "UTF-8";    } finally {       stream.reset();    } } where 32 is some number large enough for the detector. I'm not sure what this is. However you'd need to verify that this works if end of file occurs while you're reading the 32 bytes. Is there a better approach then heuristic detection? Does the browser set an encoding when uploading? Is this reliable enough to use? (Given the state of browsers I'm prepared to believe that it's not.)
          Hide
          Stephen Marquard added a comment -
          What's the status of this fix? It seems to have been committed but then reverted from 2-7-x and 2.7.0. Is the fix wrong?
          Show
          Stephen Marquard added a comment - What's the status of this fix? It seems to have been committed but then reverted from 2-7-x and 2.7.0. Is the fix wrong?
          Hide
          David Roldán Martínez added a comment -
          The fix is OK. We have it in production environment since april 2010
          Show
          David Roldán Martínez added a comment - The fix is OK. We have it in production environment since april 2010
          Hide
          David Horwitz added a comment -
          A couple of notes:

          the attached patch doesn't match the UPV commit to their branch - that patch def doesn't work.

          The code in the UVP branch looks like it needs to be reviewed both for the reading error and in that it apears to load the file into memory
          Show
          David Horwitz added a comment - A couple of notes: the attached patch doesn't match the UPV commit to their branch - that patch def doesn't work. The code in the UVP branch looks like it needs to be reviewed both for the reading error and in that it apears to load the file into memory
          Hide
          Stephen Marquard added a comment -
          Reverted the trunk commit in r81578 as it causes regressions as reported in SAK-18996.
          Show
          Stephen Marquard added a comment - Reverted the trunk commit in r81578 as it causes regressions as reported in SAK-18996 .
          Hide
          David Horwitz added a comment -
          A better place to do this would be in the kernel in the commitResource method - we already have logic for handling this kind of spooling behavior (in the antivirus code). This fix will only check files uploaded in the resources and not via other methods (dav etc)
          Show
          David Horwitz added a comment - A better place to do this would be in the kernel in the commitResource method - we already have logic for handling this kind of spooling behavior (in the antivirus code). This fix will only check files uploaded in the resources and not via other methods (dav etc)
          Hide
          David Horwitz added a comment -
          This patch proposes a minimalist approach based on the code in the UPV branch that both avoids the spooling problem reported my chuck and incorporating his suggestion for only limited text being loaded into arrays to avoid problems identified earlier.

          David - would yo be able to help test this?
          Show
          David Horwitz added a comment - This patch proposes a minimalist approach based on the code in the UPV branch that both avoids the spooling problem reported my chuck and incorporating his suggestion for only limited text being loaded into arrays to avoid problems identified earlier. David - would yo be able to help test this?
          Hide
          David Horwitz added a comment -
          Tested and working ok.

          My concern is, what happens with content previously uploaded? Do you think it will be needed to run a script to update resource properties?

          David
          ________________________________________
          De: David Horwitz [david.horwitz@uct.ac.za]
          Enviado el: viernes, 17 de septiembre de 2010 12:33
          Para: DAVID ROLDAN MARTINEZ
          Asunto: SAK-17558 - file encoding detection

          Hi David,

          I have added a minimalist patch to SAK-17558 based on your work. Would
          you be able to help test/review this?

          Thanks

          David
          Show
          David Horwitz added a comment - Tested and working ok. My concern is, what happens with content previously uploaded? Do you think it will be needed to run a script to update resource properties? David ________________________________________ De: David Horwitz [ david.horwitz@uct.ac.za ] Enviado el: viernes, 17 de septiembre de 2010 12:33 Para: DAVID ROLDAN MARTINEZ Asunto: SAK-17558 - file encoding detection Hi David, I have added a minimalist patch to SAK-17558 based on your work. Would you be able to help test/review this? Thanks David
          Hide
          Sam Ottenhoff added a comment -
          I believe this issue may have resulted in the regression described in KNL-653 (email sent out to users despite setting notification to zero). Can the watchers on this ticket take a quick look at that issue?
          Show
          Sam Ottenhoff added a comment - I believe this issue may have resulted in the regression described in KNL-653 (email sent out to users despite setting notification to zero). Can the watchers on this ticket take a quick look at that issue?
          Hide
          Jean-François Lévêque added a comment -
          I cannot find anything obvious in the changes, Sam. Any other suspect?
          Show
          Jean-François Lévêque added a comment - I cannot find anything obvious in the changes, Sam. Any other suspect?
          Hide
          David Horwitz added a comment -
          The second update done when fixing the encoding was certainly using the wrong notification options. This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8)
          Show
          David Horwitz added a comment - The second update done when fixing the encoding was certainly using the wrong notification options. This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8)
          Hide
          Sam Ottenhoff added a comment -
          > This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8)

          We could replicate the incorrect notifications with any type of file upload: images, PDFs, etc.
          Show
          Sam Ottenhoff added a comment - > This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8) We could replicate the incorrect notifications with any type of file upload: images, PDFs, etc.
          Hide
          Sam Ottenhoff added a comment -
          Can someone explain the original bug in the description or in the comments? The original description said "If you create a file with non standard characters (for example: España) and try to access it, you will get strange symbols." I am not able to replicate that bug in a 2.7.x instance. Can someone provide steps to replicate the original bug or explain what the original bug was?
          Show
          Sam Ottenhoff added a comment - Can someone explain the original bug in the description or in the comments? The original description said "If you create a file with non standard characters (for example: España) and try to access it, you will get strange symbols." I am not able to replicate that bug in a 2.7.x instance. Can someone provide steps to replicate the original bug or explain what the original bug was?
          Hide
          David Roldán Martínez added a comment -
          We've found that sometimes, when you uploaded a file (i.e., text or html) containing international characters (i.e España) they were not well displayed to the user. This was because ResourceProperties.PROP_CONTENT_ENCODING was set to UTF-8, independently from the real encoding of the file (i.e ISO-8859-1). That's why we use icu4j tools to detect file encoding and set ResourceProperties.PROP_CONTENT_ENCODING to the detected encoding.
          However, I've tried to reproduce it at qa1-nl a few minutes ago and I haven't been able to do it, though we have that problem in our 2.6 production environment.
          Show
          David Roldán Martínez added a comment - We've found that sometimes, when you uploaded a file (i.e., text or html) containing international characters (i.e España) they were not well displayed to the user. This was because ResourceProperties.PROP_CONTENT_ENCODING was set to UTF-8, independently from the real encoding of the file (i.e ISO-8859-1). That's why we use icu4j tools to detect file encoding and set ResourceProperties.PROP_CONTENT_ENCODING to the detected encoding. However, I've tried to reproduce it at qa1-nl a few minutes ago and I haven't been able to do it, though we have that problem in our 2.6 production environment.
          Hide
          Sam Ottenhoff added a comment -
          Hi David,

          How can I replicate this locally? How do I set the encoding of the file to ISO-8859-1?

          Show
          Sam Ottenhoff added a comment - Hi David, How can I replicate this locally? How do I set the encoding of the file to ISO-8859-1?
          Hide
          Sam Ottenhoff added a comment -
          Okay here are the steps to replicate this issue:

          1) Go to a course site -> Resources
          2) Upload the ejemplo.htm file attached to this JIRA
          3) Note that pre 2.8 Sakai instances would serve the file as UTF-8 even though the file is encoded as ISO-8859-1

          Sakai 2.8.0 should attempt to detect the encoding of the file and should correctly serve the file as ISO-8859-1.

          Ways to test are Firebug (check the response headers) or curl.
          Show
          Sam Ottenhoff added a comment - Okay here are the steps to replicate this issue: 1) Go to a course site -> Resources 2) Upload the ejemplo.htm file attached to this JIRA 3) Note that pre 2.8 Sakai instances would serve the file as UTF-8 even though the file is encoded as ISO-8859-1 Sakai 2.8.0 should attempt to detect the encoding of the file and should correctly serve the file as ISO-8859-1. Ways to test are Firebug (check the response headers) or curl.
          Hide
          Anthony Whyte added a comment -
          1.1.x r95856.
          Show
          Anthony Whyte added a comment - 1.1.x r95856.
          Hide
          Stephen Marquard added a comment -
          All the related issues need merging to 1.1.x. This change has caused several serious bugs which are fixed in the related issues.
          Show
          Stephen Marquard added a comment - All the related issues need merging to 1.1.x. This change has caused several serious bugs which are fixed in the related issues.
          Hide
          David Horwitz added a comment -
          all related fixes now merged
          Show
          David Horwitz added a comment - all related fixes now merged
          Hide
          Anthony Whyte added a comment -
          Changing the fix version to 1.1.13 to better express what version of kernel-1.1 contains the fixes that eliminate the regressions introduced by KNL-589.
          Show
          Anthony Whyte added a comment - Changing the fix version to 1.1.13 to better express what version of kernel-1.1 contains the fixes that eliminate the regressions introduced by KNL-589 .

            People

            • Assignee:
              David Horwitz
              Reporter:
              David Roldán Martínez
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: