click here for details... Sakai Executive Director Position Search now open
Issue Details (XML | Word | Printable)

Key: SAK-17013
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Zhen Qian
Reporter: Charles Hedrick
Votes: 0
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Sakai

assignment download all problem generates 2 GB of output in catalina.out

Created: 23-Sep-2009 11:07   Updated: 01-Oct-2009 15:46
Component/s: Assignments
Affects Version/s: 2.6.0
Fix Version/s: 2.6.1

Time Tracking:
Not Specified

File Attachments: 1. Zip Archive bashrc.zip (0.5 kB)


2.6.x Status: Closed
2.5.x Status: None
2.4.x Status: None


 Description  « Hide
Found a thread in a loop generating error messages like this:

2009-09-22 11:24:59,371 WARN http-443-Processor486 org.sakaiproject.assignment.impl.BaseAssignmentService - org.sakaiproject.assignment.impl.DbAssignmentService@3f35fd95 zipSubmissions --IOException unable to create the zip file for userhwk 2/Hixson, Jeremy(kradakar)_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29
_30_31_32_33_34_35_36_37_38_39_40_41_42_43_44_45_46_47_48_49_50_51_52_53_54_55_56_57_58_59_
60_61_62_63_64_65_66_67_68_69_70_71_72_73_74_75_76_77_78_79_80_81_82_83_84_85_86_87_88_89_
90_91_92_93_94_95_96_97_98_99_100_101_102_103_104_105_106_107_108_109_110_111_112_113_114_1
15_116_117_118_119_120_121_122_123_124_125_126_127_128_129_130_131_132_133_134_135_136_137_
138_139_140_141_142_143_144_145_146_147_148_149_150_151_152_153_154_155_156_157_158_159_16
0_161_162_163_164_165_166_167_168_169_170_171_172_173_174_175_176_177_178_179_180_181_182_
183_184_185_186_187_188_189_190_191_192_193_194_195_196_197_198_199_200_201_202_203_204_20
5_206_207_208_209_210_211_212_213_214_215_216^C

Looking at the code for generating zip files, I see a number of problems. If there's an I/O error writing to the ZIP file, it tries again, concatenating a number to the name. First, they probably intended to use _1, _2, etc, not _1, _1_2, _1_2_3, etc. Second the loop has no termination. Third, why are they doing this in the first place? The only obvious way to get an I/O error when writing to the network is if the network connection breaks. In that case trying again won't help.

I believe the correct fix is to abort on error. But if it isn't, then the name concatenation has to be fixed, and there has to be a limit on retries.

I consider this a critical error because it can result in a situation where you have to restart the app server.



 All   Comments   Work Log   Change History   Subversion Commits   git Commits      Sort Order: Ascending order - Click to sort in descending order
Zhen Qian added a comment - 23-Sep-2009 12:22
r 66766

I couldn't not remember why the looping was there in the first place. Doesn't seem right to me, either. I will remove it for now....

Sam Ottenhoff added a comment - 23-Sep-2009 12:26
Sorry to upgrade this to blocker, but this bug will have the end effect of losing student and faculty work. We have seen a catalina.out grow to 33gb in an hour. Once the disk is full, students will be unable to submit assignments as Sakai will attempt to write the upload to tmp space first.

Zhen Qian added a comment - 23-Sep-2009 12:34
Sam,

Was the bad looping inside assignment download_all process caused the problem in your instance? Is there a limit on catalina.out file size for daily basis?

Sam Ottenhoff added a comment - 23-Sep-2009 12:44
Hi Zhen, Yes, our catalina.out was full of the same text as Charles reported. We use logrotate to truncate the catalina.out on a daily basis, but if the catalina.out is able to grow without bound before logrotate runs, users will be unable to upload files once the disk is full.

Thanks for the quick fix. I will try to replicate and test on trunk.

Zhen Qian added a comment - 23-Sep-2009 13:18
Just to confirm that, we have a server failure here in UM last hour because of the same problem!

From Matthew Jones:

"Potpie going down looks to be related to 2009-09-23 15:38:30,027 [TP-Processor101] WARN org.sakaiproject.assignment.impl.BaseAssignmentService - org.sakaiproject.assignment.impl.DbAssignmentService@48262730 zipSubmissions --IOException unable to create the zip file for userIOE 265_ Lab 1/Chatlapalli . . .

At least that's the tail of the log. The file is too big to get off at the moment.

The error looks pretty crazy! (Heres a screenshot, the numbers go on until 11803). The log file is almost 2 gigabytes, with the other log files being in the double digit megabyte range."

Sam Ottenhoff added a comment - 23-Sep-2009 14:27
I have tested on trunk and Download All continues to work as expected. I have not been able to replicate issue on latest trunk.

Zhen Qian added a comment - 24-Sep-2009 07:33
merged r66766 into 2-6-x in r66794

Zhen Qian added a comment - 24-Sep-2009 10:01
Matthew Jones has provide a script that can be used for testing. Basically it slows down the http traffic. So for testing the download all feature, one can run the script from one's computer and force the IOException on zip downloading.

The instruction from Matthew is as follows:

"-Uncompress it sudo to root and move it to root's home directory.
(/var/root/.bashrc)

- Now you need to change roots shell to use bash (As root chsh, edit the line to say Shell: /bin/bash instead of Shell: /bin/sh)
- Source the file (source /var/root/.bashrc)
- Now these commands will be available either as root or via sudo <command>

It has a few commands currently in it that are active on that terminal (You can use 0 for the bit/s or you can use something with a kilo value [like 300k or 1500k])
#oraset <speed> = Sets oracle to a certain speed in bit/s
#httpset <speed> = Slows down http traffic (port 80 and 443) to speed
#orachange <speed> = Changes speed of oracle
#httpchange <speed> = Changes speed of http
#pipereset = Resets everything back to default"

Note: httpset 0 will actually unlimit the bit/s; so "httpset 1" should be good enough for slowing down the transfer.

The related bash file is attached

Charles Hedrick added a comment - 24-Sep-2009 12:25
The change should work. But it's not ideal. If a problem occurs, it stops output for that user but continues with the next. The problem is that you then have an incomplete ZIP file, but the user doesn't know it. I'd abort at the first failure, and produce an error that is visible to the user.

Zhen Qian added a comment - 24-Sep-2009 12:48
Based on the thread below, I will change the code to (1) abort the thread after the first IOException; and (2) generate a UI alert, same as the log message.

"On Sep 24, 2009, at 3:42 PM, Charles Hedrick wrote:

I agree it's likely that they can't. I still think it's safest to abort at the first error.

On Sep 24, 2009, at 3:34 PM, Zhen Qian wrote:

Chuck:

I think the problem happens most likely when the user has network connection problems. So do you think the user will be able to view the alert message within the tool?

Thanks,

- Zhen

Zhen Qian added a comment - 25-Sep-2009 08:22
So r66830 is to abort the zip process right after the first Exception. And a log message will be produced to address the error.

Currently the system is not able to generate an alert message to UI in case of this IO failure.