Affects Version/s: 19.2
Fix Version/s: None
We recently had major issues with a large concurrent assignment. A professor scheduled their midterm exam in the Assignments tool for almost 1000 students. The exam started at 9 am and ended at 12 pm, so the students all started at almost exactly the same moment. They were meant to download a zip file that was in the assignment instructions, make some changes, and submit it.
Students complained of the system being very slow and having trouble getting access to the document, waiting a long time while Sakai seemed to be loading, and never being able to retrieve the file at all. Apparently over half the students experienced these issues so it was a bit of a disaster. Many students later complained of having issues submitting, and seeing a stacktrace in the portal.
On our 3 clustered servers we have activated the database leakDetectionThreshold at 30s, and across the three server logs starting at exactly 9am we start seeing hundreds of "connection leak detected" messages (findAssignment-conn-leak.txt) all concerning the same method in Assignments, AssignmentRepositoryImpl.findAssignment() which is eventually called by AssignmentAction.doAccept_assignment_honor_pledge(). To be sure, the assignment is configured to present the honor pledge to students before allowing them to access it.
In talking to our DBAs, they confirmed that as of 9 am our active database session count exploded into the hundreds, as did the number of active sessions waiting for a lock to release. They generated a report for that period indicating that the number one SQL query by elapsed time was the following :
By activating the hibernate show_sql property I was able to confirm that this query is executed when a student accepts the honor pledge, I believe because a submission is generated at that time. If the honor pledge is disabled, this query is executed when the student submits. I think since the students were all accepting the honor pledge at the same time, they all had to wait for the others' locks to resolve causing a massive slow down and server load.
Another strange error was later seen many times in the logs: an UnresolvableObjectException.txt on AssignmentRepositoryImpl.deleteSubmission() in findSubmissionForUser(). It looks like the system finds >1 submission for the user and then tries to delete all but one, but the delete fails because the submission id is not found. I figure this might be from students closing their tab and trying again, and I assume some kind of dirty read is making the tool try to delete a submission that is not yet committed. Presumable this is caused by the previous issues. This is seen from AssignmentAction.doAccept_assignment_honor_pledge() as well as /vm/assignment/chef_assignments_list_assignments.vm, and was never seen before in our logs archive.