Uploaded image for project: 'Kernel'
  1. Kernel
  2. KNL-589

Automatic file encoding detection

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.11
    • Fix Version/s: 1.1.14, 1.2.1
    • Component/s: Impl
    • Labels:
      None
    • CLE Team Issue:
      Yes

      Description

      If you create a file with non standard characters (for example: España) and try to access it, you will get strange symbols. That's because charset is not set neither in ResourceProperties when you upload the file nor BaseContentService when serving the file. I attach the patch for our 2.6.x installation under kernel 1.0.12.

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Hide
              dhorwitz David Horwitz added a comment -

              MAINTANCE TEAM: Unassigned so these get reviewed by the Maintance Team

              Show
              dhorwitz David Horwitz added a comment - MAINTANCE TEAM: Unassigned so these get reviewed by the Maintance Team
              Hide
              apvargo Anthony Vargo added a comment -

              2.7.x, r73739

              Show
              apvargo Anthony Vargo added a comment - 2.7.x, r73739
              Hide
              darolmar@abierta.upv.es David Roldán Martínez added a comment -

              Hi David,

              We have found some issues in the patch I provided but I'm not sure if it fails only for resources stored before to apply the patch or not. If you create a new resource or replace an existing one, it works well.

              David

              Show
              darolmar@abierta.upv.es David Roldán Martínez added a comment - Hi David, We have found some issues in the patch I provided but I'm not sure if it fails only for resources stored before to apply the patch or not. If you create a new resource or replace an existing one, it works well. David
              Hide
              darolmar@abierta.upv.es David Roldán Martínez added a comment -

              The issues we found weren't related with this. The patch is 100 % functional.

              Show
              darolmar@abierta.upv.es David Roldán Martínez added a comment - The issues we found weren't related with this. The patch is 100 % functional.
              Hide
              dhorwitz David Horwitz added a comment -

              This causes the regression described in SAK-18557 - the input stream is read past its beginning causing file corruption

              Show
              dhorwitz David Horwitz added a comment - This causes the regression described in SAK-18557 - the input stream is read past its beginning causing file corruption
              Hide
              dhorwitz David Horwitz added a comment -

              looking at the code this may be tricky to fix at this level - may be better to do this in the kernel

              Show
              dhorwitz David Horwitz added a comment - looking at the code this may be tricky to fix at this level - may be better to do this in the kernel
              Hide
              arwhyte Anthony Whyte added a comment - - edited

              Reverted the fix in the 2.7.0 branch (David handled 2.7.x). I should note that the ibm dependency added r73739 has not been removed from the tool pom in either branch.

              <dependency>
              <groupId>com.ibm.icu</groupId>
              <artifactId>icu4j</artifactId>
              <version>4.2.1</version>
              </dependency>

              Show
              arwhyte Anthony Whyte added a comment - - edited Reverted the fix in the 2.7.0 branch (David handled 2.7.x). I should note that the ibm dependency added r73739 has not been removed from the tool pom in either branch. <dependency> <groupId>com.ibm.icu</groupId> <artifactId>icu4j</artifactId> <version>4.2.1</version> </dependency>
              Hide
              hedrick Charles Hedrick added a comment - - edited

              Just to be clear, the problem is that the detector reads data. So if you pass it a stream, when it's finished the first 8000 bytes of the stream have been read. Not surprisingly, the file as uploaded is missing the first 8000 bytes.

              If you think this is important, you can do something like

              if (stream.markSupported()) {
              try

              { stream.mark(32); byte[32] buffer; stream.read(buffer); detector.setText(buffer) encoding = detector.detect().getName(); }

              catch (Exception e)

              { encoding = "UTF-8"; }

              finally

              { stream.reset(); }

              }

              where 32 is some number large enough for the detector. I'm not sure what this is. However you'd need to verify that this works if end of file occurs while you're reading the 32 bytes.

              Is there a better approach then heuristic detection? Does the browser set an encoding when uploading? Is this reliable enough to use? (Given the state of browsers I'm prepared to believe that it's not.)

              Show
              hedrick Charles Hedrick added a comment - - edited Just to be clear, the problem is that the detector reads data. So if you pass it a stream, when it's finished the first 8000 bytes of the stream have been read. Not surprisingly, the file as uploaded is missing the first 8000 bytes. If you think this is important, you can do something like if (stream.markSupported()) { try { stream.mark(32); byte[32] buffer; stream.read(buffer); detector.setText(buffer) encoding = detector.detect().getName(); } catch (Exception e) { encoding = "UTF-8"; } finally { stream.reset(); } } where 32 is some number large enough for the detector. I'm not sure what this is. However you'd need to verify that this works if end of file occurs while you're reading the 32 bytes. Is there a better approach then heuristic detection? Does the browser set an encoding when uploading? Is this reliable enough to use? (Given the state of browsers I'm prepared to believe that it's not.)
              Hide
              smarquard Stephen Marquard added a comment -

              What's the status of this fix? It seems to have been committed but then reverted from 2-7-x and 2.7.0. Is the fix wrong?

              Show
              smarquard Stephen Marquard added a comment - What's the status of this fix? It seems to have been committed but then reverted from 2-7-x and 2.7.0. Is the fix wrong?
              Hide
              darolmar@abierta.upv.es David Roldán Martínez added a comment -

              The fix is OK. We have it in production environment since april 2010

              Show
              darolmar@abierta.upv.es David Roldán Martínez added a comment - The fix is OK. We have it in production environment since april 2010
              Hide
              dhorwitz David Horwitz added a comment -

              A couple of notes:

              the attached patch doesn't match the UPV commit to their branch - that patch def doesn't work.

              The code in the UVP branch looks like it needs to be reviewed both for the reading error and in that it apears to load the file into memory

              Show
              dhorwitz David Horwitz added a comment - A couple of notes: the attached patch doesn't match the UPV commit to their branch - that patch def doesn't work. The code in the UVP branch looks like it needs to be reviewed both for the reading error and in that it apears to load the file into memory
              Hide
              smarquard Stephen Marquard added a comment -

              Reverted the trunk commit in r81578 as it causes regressions as reported in SAK-18996.

              Show
              smarquard Stephen Marquard added a comment - Reverted the trunk commit in r81578 as it causes regressions as reported in SAK-18996 .
              Hide
              dhorwitz David Horwitz added a comment -

              A better place to do this would be in the kernel in the commitResource method - we already have logic for handling this kind of spooling behavior (in the antivirus code). This fix will only check files uploaded in the resources and not via other methods (dav etc)

              Show
              dhorwitz David Horwitz added a comment - A better place to do this would be in the kernel in the commitResource method - we already have logic for handling this kind of spooling behavior (in the antivirus code). This fix will only check files uploaded in the resources and not via other methods (dav etc)
              Hide
              dhorwitz David Horwitz added a comment -

              This patch proposes a minimalist approach based on the code in the UPV branch that both avoids the spooling problem reported my chuck and incorporating his suggestion for only limited text being loaded into arrays to avoid problems identified earlier.

              David - would yo be able to help test this?

              Show
              dhorwitz David Horwitz added a comment - This patch proposes a minimalist approach based on the code in the UPV branch that both avoids the spooling problem reported my chuck and incorporating his suggestion for only limited text being loaded into arrays to avoid problems identified earlier. David - would yo be able to help test this?
              Hide
              dhorwitz David Horwitz added a comment -

              Tested and working ok.

              My concern is, what happens with content previously uploaded? Do you think it will be needed to run a script to update resource properties?

              David
              ________________________________________
              De: David Horwitz [david.horwitz@uct.ac.za]
              Enviado el: viernes, 17 de septiembre de 2010 12:33
              Para: DAVID ROLDAN MARTINEZ
              Asunto: SAK-17558 - file encoding detection

              Hi David,

              I have added a minimalist patch to SAK-17558 based on your work. Would
              you be able to help test/review this?

              Thanks

              David

              Show
              dhorwitz David Horwitz added a comment - Tested and working ok. My concern is, what happens with content previously uploaded? Do you think it will be needed to run a script to update resource properties? David ________________________________________ De: David Horwitz [david.horwitz@uct.ac.za] Enviado el: viernes, 17 de septiembre de 2010 12:33 Para: DAVID ROLDAN MARTINEZ Asunto: SAK-17558 - file encoding detection Hi David, I have added a minimalist patch to SAK-17558 based on your work. Would you be able to help test/review this? Thanks David
              Hide
              ottenhoff Sam Ottenhoff added a comment -

              I believe this issue may have resulted in the regression described in KNL-653 (email sent out to users despite setting notification to zero). Can the watchers on this ticket take a quick look at that issue?

              Show
              ottenhoff Sam Ottenhoff added a comment - I believe this issue may have resulted in the regression described in KNL-653 (email sent out to users despite setting notification to zero). Can the watchers on this ticket take a quick look at that issue?
              Hide
              jean-francois.leveque@upmc.fr Jean-François Lévêque added a comment -

              I cannot find anything obvious in the changes, Sam. Any other suspect?

              Show
              jean-francois.leveque@upmc.fr Jean-François Lévêque added a comment - I cannot find anything obvious in the changes, Sam. Any other suspect?
              Hide
              dhorwitz David Horwitz added a comment -

              The second update done when fixing the encoding was certainly using the wrong notification options. This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8)

              Show
              dhorwitz David Horwitz added a comment - The second update done when fixing the encoding was certainly using the wrong notification options. This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8)
              Hide
              ottenhoff Sam Ottenhoff added a comment -

              > This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8)

              We could replicate the incorrect notifications with any type of file upload: images, PDFs, etc.

              Show
              ottenhoff Sam Ottenhoff added a comment - > This certainly could have triggered updates in situations where Sakai corrected the file encoding (i.e. the attached file was not UTF8) We could replicate the incorrect notifications with any type of file upload: images, PDFs, etc.
              Hide
              ottenhoff Sam Ottenhoff added a comment -

              Can someone explain the original bug in the description or in the comments? The original description said "If you create a file with non standard characters (for example: España) and try to access it, you will get strange symbols." I am not able to replicate that bug in a 2.7.x instance. Can someone provide steps to replicate the original bug or explain what the original bug was?

              Show
              ottenhoff Sam Ottenhoff added a comment - Can someone explain the original bug in the description or in the comments? The original description said "If you create a file with non standard characters (for example: España) and try to access it, you will get strange symbols." I am not able to replicate that bug in a 2.7.x instance. Can someone provide steps to replicate the original bug or explain what the original bug was?
              Hide
              darolmar@abierta.upv.es David Roldán Martínez added a comment -

              We've found that sometimes, when you uploaded a file (i.e., text or html) containing international characters (i.e España) they were not well displayed to the user. This was because ResourceProperties.PROP_CONTENT_ENCODING was set to UTF-8, independently from the real encoding of the file (i.e ISO-8859-1). That's why we use icu4j tools to detect file encoding and set ResourceProperties.PROP_CONTENT_ENCODING to the detected encoding.
              However, I've tried to reproduce it at qa1-nl a few minutes ago and I haven't been able to do it, though we have that problem in our 2.6 production environment.

              Show
              darolmar@abierta.upv.es David Roldán Martínez added a comment - We've found that sometimes, when you uploaded a file (i.e., text or html) containing international characters (i.e España) they were not well displayed to the user. This was because ResourceProperties.PROP_CONTENT_ENCODING was set to UTF-8, independently from the real encoding of the file (i.e ISO-8859-1). That's why we use icu4j tools to detect file encoding and set ResourceProperties.PROP_CONTENT_ENCODING to the detected encoding. However, I've tried to reproduce it at qa1-nl a few minutes ago and I haven't been able to do it, though we have that problem in our 2.6 production environment.
              Hide
              ottenhoff Sam Ottenhoff added a comment -

              Hi David,

              How can I replicate this locally? How do I set the encoding of the file to ISO-8859-1?

              Show
              ottenhoff Sam Ottenhoff added a comment - Hi David, How can I replicate this locally? How do I set the encoding of the file to ISO-8859-1?
              Hide
              ottenhoff Sam Ottenhoff added a comment -

              Okay here are the steps to replicate this issue:

              1) Go to a course site -> Resources
              2) Upload the ejemplo.htm file attached to this JIRA
              3) Note that pre 2.8 Sakai instances would serve the file as UTF-8 even though the file is encoded as ISO-8859-1

              Sakai 2.8.0 should attempt to detect the encoding of the file and should correctly serve the file as ISO-8859-1.

              Ways to test are Firebug (check the response headers) or curl.

              Show
              ottenhoff Sam Ottenhoff added a comment - Okay here are the steps to replicate this issue: 1) Go to a course site -> Resources 2) Upload the ejemplo.htm file attached to this JIRA 3) Note that pre 2.8 Sakai instances would serve the file as UTF-8 even though the file is encoded as ISO-8859-1 Sakai 2.8.0 should attempt to detect the encoding of the file and should correctly serve the file as ISO-8859-1. Ways to test are Firebug (check the response headers) or curl.
              Hide
              arwhyte Anthony Whyte added a comment -

              1.1.x r95856.

              Show
              arwhyte Anthony Whyte added a comment - 1.1.x r95856.
              Hide
              smarquard Stephen Marquard added a comment -

              All the related issues need merging to 1.1.x. This change has caused several serious bugs which are fixed in the related issues.

              Show
              smarquard Stephen Marquard added a comment - All the related issues need merging to 1.1.x. This change has caused several serious bugs which are fixed in the related issues.
              Hide
              dhorwitz David Horwitz added a comment -

              all related fixes now merged

              Show
              dhorwitz David Horwitz added a comment - all related fixes now merged
              Hide
              arwhyte Anthony Whyte added a comment -

              Changing the fix version to 1.1.13 to better express what version of kernel-1.1 contains the fixes that eliminate the regressions introduced by KNL-589.

              Show
              arwhyte Anthony Whyte added a comment - Changing the fix version to 1.1.13 to better express what version of kernel-1.1 contains the fixes that eliminate the regressions introduced by KNL-589 .

                People

                • Assignee:
                  dhorwitz David Horwitz
                  Reporter:
                  darolmar@abierta.upv.es David Roldán Martínez
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  4 Start watching this issue

                  Dates

                  • Created:
                    Updated:
                    Resolved:

                    Git Source Code