Affects Version/s: 2.9.x
Fix Version/s: 10.0
Previous Issue Keys: KNL-1011
There are various, rather serious performance issues with Sites at scale (size and usage). They are exposed on very common actions such as logging in (initial site visit, not actual login), all site visits, and tool requests for those that can merge content (announcements, schedule, etc.).
The most fundamental issue is that SiteService.getSites is called in very many places throughout the code base to retrieve the list of all sites for the current user, and this is never cached. On some basic requests like site visits, this is called upwards of three or four times by the portal (for finding the user's default site [after SAK-22386], calculating tabs, subsites, and so on).
At small scale (few concurrent users, few sites per user, short descriptions), the penalty is not obvious. However, it can be dramatic if, for example, users belong to hundreds of sites or site descriptions are long (as they easily become when content is pasted from Word, for example). The long site descriptions are not the problem in and of themselves; they are just retrieved far too often, and on operations that every other activity depends on.
The getSites operation should be optimized to provide a way to retrieve sites without costly data not relevant to how they will be used, as in the list of site titles used for rendering navigational tabs. There should also be some caching to account for other areas of the code base that may call getSites repeatedly.
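As a rough illustration of the caching idea above, the following sketch memoizes a lightweight per-user list of site titles so that repeated calls within the portal hit memory instead of the database. The class `UserSiteListCache`, its `loader` function, and the `invalidate` method are hypothetical names chosen for this example; they are not part of SiteService, and a real implementation would also need expiry and invalidation on membership changes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper for illustration; not an actual Sakai class.
final class UserSiteListCache {
    // userId -> immutable list of site titles (no descriptions, so entries stay small)
    private final Map<String, List<String>> titlesByUser = new ConcurrentHashMap<>();
    // Assumed to run the expensive query; must not return null.
    private final Function<String, List<String>> loader;

    UserSiteListCache(Function<String, List<String>> loader) {
        this.loader = loader;
    }

    // Loads the list on the first call for a user; later calls reuse the cached copy.
    List<String> getSiteTitles(String userId) {
        return titlesByUser.computeIfAbsent(userId,
                id -> Collections.unmodifiableList(new ArrayList<>(loader.apply(id))));
    }

    // Drops the cached list, for example after the user joins or leaves a site.
    void invalidate(String userId) {
        titlesByUser.remove(userId);
    }
}
```

With a cache like this, the three or four lookups made during a single site visit would cost one query rather than several.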
Upon investigation, it was also found that the portal spends significant time (equivalent to all duplicative retrieval in load testing) escaping site information for HTML. This occurs regardless of any caching since it is at the time of viewing. This HTML-safe text should be precalculated and exposed as part of the site information.
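A minimal sketch of the precalculation idea, assuming a simplified site record: the HTML-safe form is computed once on first access and retained, so rendering no longer escapes the same text on every request. `SiteInfo` and its local `escapeHtml` are stand-ins written for this example only; the actual change would live on Site and use the platform's own escaping utility.

```java
// Hypothetical stand-in for a site record; not the actual Sakai Site class.
final class SiteInfo {
    private final String description;
    private String htmlDescription; // escaped form, computed lazily and then retained

    SiteInfo(String description) {
        this.description = description;
    }

    String getDescription() {
        return description;
    }

    // Escapes on the first call only; later calls return the retained value.
    synchronized String getHtmlDescription() {
        if (htmlDescription == null && description != null) {
            htmlDescription = escapeHtml(description);
        }
        return htmlDescription;
    }

    // Minimal escaper for illustration; real code would call the existing utility.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Because the escaped value is stored on the object, it benefits from whatever caching holds the site itself.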
The University of Michigan has invested significant energy in profiling and analyzing these issues after observing some performance degradation. In the original scenario, there was also some unknown degradation of connections within the database pool. Specifically, borrowed connections were transferring CLOB data (site descriptions, announcement bodies, others) extremely slowly, while new connections created on the affected servers performed at expected rates. It is not clear what the exact trigger was for this condition, but the pool (very outdated DBCP) or long-established database sessions are likely involved. This degradation has not yet been reproduced outside of the original window, so energy has been directed at resolving the glaring efficiency issues in SiteService and Portal.
SiteHandler goes through SiteNeighbourhoodServiceImpl.getAllSites, which retrieves all accessible sites regardless of the active context (sites, tabs, or other) because getSitesAtNode delegates unconditionally.
- Leave SiteService.getSites behavior unchanged for existing calls (except potentially faster because of caching)
- Add a SiteService.getSites signature to retrieve records without processing descriptions
- Implement a four-stage approach in SiteService.getSites: get the site IDs, check the cache for each ID, query by ID for only those not cached, and cache the newly loaded sites
- For sites loaded without descriptions, lazily load descriptions on loadAll and getSite (which retrieves all pages, etc. and caches a single site fully)
- Add a new SiteService method, getUserSites, that caches the complete list of sites for a user
- Call getUserSites from SiteNeighbourhoodServiceImpl.getAllSites to take advantage of cache
- Call getUserSites from MergedList in site/mergedlist-util (used for announcements, etc.)
- Add new methods to Site, getHtmlShortDescription and getHtmlDescription, which escape the descriptions for HTML and retain the escaped values internally
- Call the new HTML-safe description getters from PortalSiteHelperImpl instead of Web.escapeHtml on the plaintext descriptions
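The four-stage lookup proposed in the list above might look roughly like the following. This is a sketch under stated assumptions, not the SiteService implementation: `CachedSiteLookup` and the `queryByIds` function are hypothetical, the caller is assumed to have already fetched the ordered site IDs (stage one), and the query function is assumed to return only non-null values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper for illustration; S stands for whatever site record is cached.
final class CachedSiteLookup<S> {
    private final Map<String, S> cache = new ConcurrentHashMap<>();
    // Assumed to load the given IDs in one query and return them keyed by ID.
    private final Function<List<String>, Map<String, S>> queryByIds;

    CachedSiteLookup(Function<List<String>, Map<String, S>> queryByIds) {
        this.queryByIds = queryByIds;
    }

    // siteIds is the result of stage 1 (the ordered IDs for the current user).
    List<S> getSites(List<String> siteIds) {
        Map<String, S> found = new HashMap<>();
        List<String> missing = new ArrayList<>();
        // Stage 2: check the cache for each ID.
        for (String id : siteIds) {
            S hit = cache.get(id);
            if (hit != null) {
                found.put(id, hit);
            } else {
                missing.add(id);
            }
        }
        if (!missing.isEmpty()) {
            // Stage 3: query by ID for only the uncached sites.
            Map<String, S> loaded = queryByIds.apply(missing);
            // Stage 4: cache the newly loaded sites.
            cache.putAll(loaded);
            found.putAll(loaded);
        }
        // Return results in the original ID order, skipping IDs that were not found.
        List<S> result = new ArrayList<>();
        for (String id : siteIds) {
            S site = found.get(id);
            if (site != null) {
                result.add(site);
            }
        }
        return result;
    }
}
```

Since users in the same course tend to share sites, a cache keyed by site ID lets one user's load serve later requests from others, while the ID query itself stays cheap.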