chat took down our system this evening. On some of the front ends, we found ClusterEventTracking at 100% and not much else happening. We also had continuing GCs, which I assume are related, though I can't prove it. Note that I'm just guessing that the following patch will solve the problem. It's an unambiguously good idea, but I find it hard to believe that this alone caused all the problems. As far as I can see, one ChatDelivery is created for each message, and I don't think we have that high a rate of message deliveries for this alone to kill us, but I've found that fixing the problems I know about makes it easier to find the ones I don't know about...
The problem was in org.sakaiproject.courier.impl.BasicCourierService.deliver(BasicCourierService.java:137). This code inserts items to be delivered (ChatDelivery objects) into a list for each destination. It checks to see if a given message is already in the list, using contains. Contains uses equal. One thinks of equal as low-overhead, but in this case it it not.
ChatDelivery objects have the message to be delivered. It is either the ID of a message or the actual object. Equals on ChatDelivery objects checks whether the underlying message ID's are equal. Unfortunately it does this using getMessage().getId(). This causes the message to be loaded if what you currently have is just the ID number. Since all we want is the ID number in the first place, this is unnecessary. The problem is that this code is called from inside the event handling systems's event delivery code. That code is single threaded, and has to be fast. It shouldn't be doing DB queries, particularly via Hibernate.
The solution is obvious. Add a method getMessageId which returns the ID if we have it, and does getId() if what we have is the object. So no DB queries.