Uploaded image for project: 'Kernel'
  1. Kernel
  2. KNL-589

Automatic file encoding detection

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.11
    • Fix Version/s: 1.1.14, 1.2.1
    • Component/s: Impl
    • Labels:
      None
    • 1.2 Status:
      Closed
    • CLE Team Issue:
      Yes
    • Previous Issue Keys:
      SAK-17558

      Description

      If you create a file with non standard characters (for example: España) and try to access it, you will get strange symbols. That's because charset is not set neither in ResourceProperties when you upload the file nor BaseContentService when serving the file. I attach the patch for our 2.6.x installation under kernel 1.0.12.

        Gliffy Diagrams

        1. content_encoding.txt
          4 kB
          David Roldán Martínez
        2. ejemplo.htm
          4 kB
          Sam Ottenhoff
        3. SAK-17558-kernel-DH.diff
          4 kB
          David Horwitz

          Issue Links

            Activity

            Hide
            David Horwitz added a comment -

            MAINTANCE TEAM: Unassigned so these get reviewed by the Maintance Team

            Show
            David Horwitz added a comment - MAINTANCE TEAM: Unassigned so these get reviewed by the Maintance Team
            Hide
            Anthony Vargo added a comment -

            2.7.x, r73739

            Show
            Anthony Vargo added a comment - 2.7.x, r73739
            Hide
            David Roldán Martínez added a comment -

            Hi David,

            We have found some issues in the patch I provided but I'm not sure if it fails only for resources stored before to apply the patch or not. If you create a new resource or replace an existing one, it works well.

            David

            Show
            David Roldán Martínez added a comment - Hi David, We have found some issues in the patch I provided but I'm not sure if it fails only for resources stored before to apply the patch or not. If you create a new resource or replace an existing one, it works well. David
            Hide
            David Roldán Martínez added a comment -

            The issues we found weren't related with this. The patch is 100 % functional.

            Show
            David Roldán Martínez added a comment - The issues we found weren't related with this. The patch is 100 % functional.
            Hide
            David Horwitz added a comment -

            This causes the regression described in SAK-18557 - the input stream is read past its beginning causing file corruption

            Show
            David Horwitz added a comment - This causes the regression described in SAK-18557 - the input stream is read past its beginning causing file corruption
            Hide
            David Horwitz added a comment -

            looking at the code this may be tricky to fix at this level - may be better to do this in the kernel

            Show
            David Horwitz added a comment - looking at the code this may be tricky to fix at this level - may be better to do this in the kernel
            Hide
            Anthony Whyte added a comment - - edited

            Reverted the fix in the 2.7.0 branch (David handled 2.7.x). I should note that the ibm dependency added r73739 has not been removed from the tool pom in either branch.

            <dependency>
            <groupId>com.ibm.icu</groupId>
            <artifactId>icu4j</artifactId>
            <version>4.2.1</version>
            </dependency>

            Show
            Anthony Whyte added a comment - - edited Reverted the fix in the 2.7.0 branch (David handled 2.7.x). I should note that the ibm dependency added r73739 has not been removed from the tool pom in either branch. <dependency> <groupId>com.ibm.icu</groupId> <artifactId>icu4j</artifactId> <version>4.2.1</version> </dependency>
            Hide
            Charles Hedrick added a comment - - edited

            Just to be clear, the problem is that the detector reads data. So if you pass it a stream, when it's finished the first 8000 bytes of the stream have been read. Not surprisingly, the file as uploaded is missing the first 8000 bytes.

            If you think this is important, you can do something like

            if (stream.markSupported()) {
            try

            { stream.mark(32); byte[32] buffer; stream.read(buffer); detector.setText(buffer) encoding = detector.detect().getName(); }

            catch (Exception e)

            { encoding = "UTF-8"; }

            finally

            { stream.reset(); }

            }

            where 32 is some number large enough for the detector. I'm not sure what this is. However you'd need to verify that this works if end of file occurs while you're reading the 32 bytes.

            Is there a better approach then heuristic detection? Does the browser set an encoding when uploading? Is this reliable enough to use? (Given the state of browsers I'm prepared to believe that it's not.)

            Show
            Charles Hedrick added a comment - - edited Just to be clear, the problem is that the detector reads data. So if you pass it a stream, when it's finished the first 8000 bytes of the stream have been read. Not surprisingly, the file as uploaded is missing the first 8000 bytes. If you think this is important, you can do something like if (stream.markSupported()) { try { stream.mark(32); byte[32] buffer; stream.read(buffer); detector.setText(buffer) encoding = detector.detect().getName(); } catch (Exception e) { encoding = "UTF-8"; } finally { stream.reset(); } } where 32 is some number large enough for the detector. I'm not sure what this is. However you'd need to verify that this works if end of file occurs while you're reading the 32 bytes. Is there a better approach then heuristic detection? Does the browser set an encoding when uploading? Is this reliable enough to use? (Given the state of browsers I'm prepared to believe that it's not.)
            Hide
            Stephen Marquard added a comment -

            What's the status of this fix? It seems to have been committed but then reverted from 2-7-x and 2.7.0. Is the fix wrong?

            Show
            Stephen Marquard added a comment - What's the status of this fix? It seems to have been committed but then reverted from 2-7-x and 2.7.0. Is the fix wrong?
            Hide
            David Roldán Martínez added a comment -

            The fix is OK. We have it in production environment since april 2010

            Show
            David Roldán Martínez added a comment - The fix is OK. We have it in production environment since april 2010
            Hide
            David Horwitz added a comment -

            A couple of notes:

            the attached patch doesn't match the UPV commit to their branch - that patch def doesn't work.

            The code in the UVP branch looks like it needs to be reviewed both for the reading error and in that it apears to load the file into memory

            Show
            David Horwitz added a comment - A couple of notes: the attached patch doesn't match the UPV commit to their branch - that patch def doesn't work. The code in the UVP branch looks like it needs to be reviewed both for the reading error and in that it apears to load the file into memory
            Hide
            Stephen Marquard added a comment -

            Reverted the trunk commit in r81578 as it causes regressions as reported in SAK-18996.

            Show
            Stephen Marquard added a comment - Reverted the trunk commit in r81578 as it causes regressions as reported in SAK-18996 .
            Hide
            David Horwitz added a comment -

            A better place to do this would be in the kernel in the commitResource method - we already have logic for handling this kind of spooling behavior (in the antivirus code). This fix will only check files uploaded in the resources and not via other methods (dav etc)

            Show
            David Horwitz added a comment - A better place to do this would be in the kernel in the commitResource method - we already have logic for handling this kind of spooling behavior (in the antivirus code). This fix will only check files uploaded in the resources and not via other methods (dav etc)
            Hide
            David Horwitz added a comment -

            This patch proposes a minimalist approach based on the code in the UPV branch that both avoids the spooling problem reported my chuck and incorporating his suggestion for only limited text being loaded into arrays to avoid problems identified earlier.

            David - would yo be able to help test this?

            Show
            David Horwitz added a comment - This patch proposes a minimalist approach based on the code in the UPV branch that both avoids the spooling problem reported my chuck and incorporating his suggestion for only limited text being loaded into arrays to avoid problems identified earlier. David - would yo be able to help test this?
            Hide
            David Horwitz added a comment -

            Tested and working ok.

            My concern is, what happens with content previously uploaded? Do you think it will be needed to run a script to update resource properties?

            David
            ________________________________________
            De: David Horwitz [david.horwitz@uct.ac.za]
            Enviado el: viernes, 17 de septiembre de 2010 12:33
            Para: DAVID ROLDAN MARTINEZ
            Asunto: SAK-17558 - file encoding detection

            Hi David,

            I have added a minimalist patch to SAK-17558 based on your work. Would
            you be able to help test/review this?

            Thanks

            David

            Show
            David Horwitz added a comment - Tested and working ok. My concern is, what happens with content previously uploaded? Do you think it will be needed to run a script to update resource properties? David ________________________________________ De: David Horwitz [david.horwitz@uct.ac.za] Enviado el: viernes, 17 de septiembre de 2010 12:33 Para: DAVID ROLDAN MARTINEZ Asunto: SAK-17558 - file encoding detection Hi David, I have added a minimalist patch to SAK-17558 based on your work. Would you be able to help test/review this? Thanks David
            Hide
            Sam Ottenhoff added a comment -

            I believe this issue may have resulted in the regression described in KNL-653 (email sent out to users despite setting notification to zero). Can the watchers on this ticket take a quick look at that issue?

            Show
            Sam Ottenhoff added a comment - I believe this issue may have resulted in the regression described in KNL-653 (email sent out to users despite setting notification to zero). Can the watchers on this ticket take a quick look at that issue?
            Hide
            Jean-François Lévêque added a comment -

            I cannot find anything obvious in the changes, Sam. Any other suspect?

            Show
            Jean-François Lévêque added a comment - I cannot find anything obvious in the changes, Sam. Any other suspect?
            Hide
            David Horwitz added a comment -

            The second update done when fixing the encoding was certainly using the wrong notification options. This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8)

            Show
            David Horwitz added a comment - The second update done when fixing the encoding was certainly using the wrong notification options. This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8)
            Hide
            Sam Ottenhoff added a comment -

            > This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8)

            We could replicate the incorrect notifications with any type of file upload: images, PDFs, etc.

            Show
            Sam Ottenhoff added a comment - > This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8) We could replicate the incorrect notifications with any type of file upload: images, PDFs, etc.
            Hide
            Sam Ottenhoff added a comment -

            Can someone explain the original bug in the description or in the comments? The original description said "If you create a file with non standard characters (for example: España) and try to access it, you will get strange symbols." I am not able to replicate that bug in a 2.7.x instance. Can someone provide steps to replicate the original bug or explain what the original bug was?

            Show
            Sam Ottenhoff added a comment - Can someone explain the original bug in the description or in the comments? The original description said "If you create a file with non standard characters (for example: España) and try to access it, you will get strange symbols." I am not able to replicate that bug in a 2.7.x instance. Can someone provide steps to replicate the original bug or explain what the original bug was?
            Hide
            David Roldán Martínez added a comment -

            We've found that sometimes, when you uploaded a file (i.e., text or html) containing international characters (i.e España) they were not well displayed to the user. This was because ResourceProperties.PROP_CONTENT_ENCODING was set to UTF-8, independently from the real encoding of the file (i.e ISO-8859-1). That's why we use icu4j tools to detect file encoding and set ResourceProperties.PROP_CONTENT_ENCODING to the detected encoding.
            However, I've tried to reproduce it at qa1-nl a few minutes ago and I haven't been able to do it, though we have that problem in our 2.6 production environment.

            Show
            David Roldán Martínez added a comment - We've found that sometimes, when you uploaded a file (i.e., text or html) containing international characters (i.e España) they were not well displayed to the user. This was because ResourceProperties.PROP_CONTENT_ENCODING was set to UTF-8, independently from the real encoding of the file (i.e ISO-8859-1). That's why we use icu4j tools to detect file encoding and set ResourceProperties.PROP_CONTENT_ENCODING to the detected encoding. However, I've tried to reproduce it at qa1-nl a few minutes ago and I haven't been able to do it, though we have that problem in our 2.6 production environment.
            Hide
            Sam Ottenhoff added a comment -

            Hi David,

            How can I replicate this locally? How do I set the encoding of the file to ISO-8859-1?

            Show
            Sam Ottenhoff added a comment - Hi David, How can I replicate this locally? How do I set the encoding of the file to ISO-8859-1?
            Hide
            Sam Ottenhoff added a comment -

            Okay here are the steps to replicate this issue:

            1) Go to a course site -> Resources
            2) Upload the ejemplo.htm file attached to this JIRA
            3) Note that pre 2.8 Sakai instances would serve the file as UTF-8 even though the file is encoded as ISO-8859-1

            Sakai 2.8.0 should attempt to detect the encoding of the file and should correctly serve the file as ISO-8859-1.

            Ways to test are Firebug (check the response headers) or curl.

            Show
            Sam Ottenhoff added a comment - Okay here are the steps to replicate this issue: 1) Go to a course site -> Resources 2) Upload the ejemplo.htm file attached to this JIRA 3) Note that pre 2.8 Sakai instances would serve the file as UTF-8 even though the file is encoded as ISO-8859-1 Sakai 2.8.0 should attempt to detect the encoding of the file and should correctly serve the file as ISO-8859-1. Ways to test are Firebug (check the response headers) or curl.
            Hide
            Anthony Whyte added a comment -

            1.1.x r95856.

            Show
            Anthony Whyte added a comment - 1.1.x r95856.
            Hide
            Stephen Marquard added a comment -

            All the related issues need merging to 1.1.x. This change has caused several serious bugs which are fixed in the related issues.

            Show
            Stephen Marquard added a comment - All the related issues need merging to 1.1.x. This change has caused several serious bugs which are fixed in the related issues.
            Hide
            David Horwitz added a comment -

            all related fixes now merged

            Show
            David Horwitz added a comment - all related fixes now merged
            Hide
            Anthony Whyte added a comment -

            Changing the fix version to 1.1.13 to better express what version of kernel-1.1 contains the fixes that eliminate the regressions introduced by KNL-589.

            Show
            Anthony Whyte added a comment - Changing the fix version to 1.1.13 to better express what version of kernel-1.1 contains the fixes that eliminate the regressions introduced by KNL-589 .

              People

              • Assignee:
                David Horwitz
                Reporter:
                David Roldán Martínez
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Development