[SAK-9718] Quota Calculations cause all resources in a site to be loaded into memory, killing any put performance Created: 25-Apr-2007  Updated: 23-Oct-2008  Resolved: 26-Apr-2007

Status: Closed
Project: Sakai
Component/s: Content service (Pre-K1/2.6), WebDAV
Affects Version/s: 2.4.0, 2.4.1
Fix Version/s: 2.4.x, 2.5.0

Type: Bug Priority: Major
Reporter: Ian Boston Assignee: Unassigned
Resolution: Fixed Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depend
depends on SAK-3799 Finish move from CHEF XML-based stora... Closed

 Description   

When resourceCommitEdit() is called, the quota calculation loads all the resources in a site into memory to calculate the quota.

This is OK for small sites with 5-10 resources, but on larger sites where hundreds of files have been uploaded it causes massive garbage collection and kills performance. It is particularly bad with WebDAV access, where every PUT, however big, causes 100 or more getMembers() calls against every collection in the site (once per collection, so not cacheable).

The quota calculation should be maintained in one place only so it doesn't have to be recalculated every time.

It might be worth a look at other filesystems with quotas to see how it's done there.



 Comments   
Comment by Ian Boston [ 25-Apr-2007 ]



If I turn quota off on the site, DAV becomes perfectly usable on large sites.

Comment by Peter A. Knoop [ 25-Apr-2007 ]

Bumped this to Blocker for discussion at today's Release Meeting. This seems like it could potentially be a widespread problem, depending on how common WebDAV use is for an implementation.

Comment by Seth Theriault [ 25-Apr-2007 ]

XFS (Linux) stores quota information in non-visible files and "internally":
http://www.die.net/doc/linux/man/man8/quotaon.8.html

This might be inspirational.

In addition, ProFTPD, an FTP server, has a module that implements quotas independently of the filesystem. Check it out:

http://www.castaglia.org/proftpd/modules/mod_quotatab.html#QuotaTables
http://www.castaglia.org/proftpd/doc/devel-guide/advanced/Quotatab/

It implements this via "limit" and "tally" tables, so entries are updated solely by FTP commands, which is a drawback. In Sakai's case, since we totally control the "filesystem," this wouldn't be a limiting factor.
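To make the "limit" and "tally" idea concrete, here is a minimal in-memory sketch, assuming one limit and one running tally per site. The class and method names (QuotaTables, recordAdd, recordRemove) are hypothetical and belong to neither ProFTPD nor Sakai; a real implementation would presumably keep both tables in the database.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for "limit" and "tally" tables; names are illustrative only. */
class QuotaTables {
    // "limit" table: site id -> maximum bytes allowed (absent means no limit configured)
    private final Map<String, Long> limitBytes = new HashMap<>();
    // "tally" table: site id -> bytes currently in use
    private final Map<String, Long> tallyBytes = new HashMap<>();

    synchronized void setLimit(String siteId, long maxBytes) {
        limitBytes.put(siteId, maxBytes);
    }

    /** Records a write; returns false and leaves the tally unchanged if it would exceed the limit. */
    synchronized boolean recordAdd(String siteId, long bytes) {
        long used = tallyBytes.getOrDefault(siteId, 0L);
        Long limit = limitBytes.get(siteId);
        if (limit != null && used + bytes > limit) {
            return false;
        }
        tallyBytes.put(siteId, used + bytes);
        return true;
    }

    /** Records a delete; the tally never drops below zero. */
    synchronized void recordRemove(String siteId, long bytes) {
        long used = tallyBytes.getOrDefault(siteId, 0L);
        tallyBytes.put(siteId, Math.max(0L, used - bytes));
    }

    synchronized long used(String siteId) {
        return tallyBytes.getOrDefault(siteId, 0L);
    }
}
```

Because every add and remove goes through the tally, a quota check becomes a single lookup instead of a scan over all resources in the site.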

Comment by Peter A. Knoop [ 25-Apr-2007 ]

Ian, when you say kill performance, does this mean the app server overall is in trouble, causing problems for everyone else's sessions too? Do you know if this also affects 2.3.1, or if this is a regression resulting from more recent changes for 2.4?

Comment by Jim Eng (Inactive) [ 25-Apr-2007 ]

Significant improvement on the efficiency of quota calculations (SAK-9718) depends on refactoring the database (SAK-3799).

Comment by Ian Boston [ 26-Apr-2007 ]

I have a fix that I will commit.

This is a temporary fix that only applies to the calculation of the total size of the collection, and only when items are uploaded. I have tested it with 2 concurrent DAV sessions on 2 separate sites, and it works fine.

The process uses a concurrent hashmap to store a small object (2x longs) against the collection id of the site.

If no object is found for the site in question, a full scan is performed as before. From that point on, until the object expires, the object is used rather than scanning the site.

The object contains the current size of the site, and a timestamp for the object to expire. As content is added or removed the size counter is updated.
When the object expires, it is removed from the hashmap and a new scan is performed.

The object lifetime is set to 10 minutes between creation and expiry, and expiry scans are only performed when a found object expires or a new object is added.
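The mechanism described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed code: the names SiteSizeCache, Entry, currentSize and adjust are hypothetical, the fullScan function stands in for the existing scan of the site, and the clock is passed in as a parameter only to keep the example deterministic. Expiry here is checked lazily for the entry being looked up.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

/** Sketch of a per-site size cache with a fixed lifetime. All names are hypothetical. */
class SiteSizeCache {
    /** The small cached object: two longs (current size and expiry time). */
    static final class Entry {
        private long sizeBytes;
        private final long expiresAtMillis;

        Entry(long sizeBytes, long expiresAtMillis) {
            this.sizeBytes = sizeBytes;
            this.expiresAtMillis = expiresAtMillis;
        }

        synchronized long size() { return sizeBytes; }

        synchronized void add(long deltaBytes) { sizeBytes += deltaBytes; }
    }

    private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();
    private final long lifetimeMillis;
    private final ToLongFunction<String> fullScan; // stand-in for the expensive scan of the site

    SiteSizeCache(long lifetimeMillis, ToLongFunction<String> fullScan) {
        this.lifetimeMillis = lifetimeMillis;
        this.fullScan = fullScan;
    }

    /** Returns the cached size, running a full scan only if no live entry exists. */
    long currentSize(String collectionId, long nowMillis) {
        return lookup(collectionId, nowMillis).size();
    }

    /** Applies a size change (positive for added content, negative for removed content). */
    void adjust(String collectionId, long deltaBytes, long nowMillis) {
        lookup(collectionId, nowMillis).add(deltaBytes);
    }

    private Entry lookup(String collectionId, long nowMillis) {
        Entry entry = entries.get(collectionId);
        if (entry != null && nowMillis >= entry.expiresAtMillis) {
            entries.remove(collectionId, entry); // expired: drop it and fall through to a rescan
            entry = null;
        }
        if (entry == null) {
            Entry fresh = new Entry(fullScan.applyAsLong(collectionId), nowMillis + lifetimeMillis);
            Entry raced = entries.putIfAbsent(collectionId, fresh);
            entry = (raced != null) ? raced : fresh; // another thread may have filled it first
        }
        return entry;
    }
}
```

With a 10 minute lifetime (600000 ms), a burst of uploads to one site costs one scan at the start of the burst and then only counter updates until the entry expires. The caller is assumed to scan before the new content is stored, otherwise an adjustment right after a fresh scan would be counted twice.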


On a site with 3000 files in it, I see the number of GC lines drop from 15 per upload to one GC per 10-15 uploads.

This is a temporary fix that resolves the issue for 2.4 (if included), but it is not perfect: it is in memory, it does not synchronize between nodes in a cluster, and it requires 1 scan per site every 10 minutes while puts are being made to that site.

Comment by Ian Boston [ 26-Apr-2007 ]

This is a temporary fix that works here and removed the GC problems and performance issues.

It does not change the database or have a massive scope. I have tested on a single node with multiple clients accessing the same site at the same time from different machines.

This fix does not address the wider issues. If it is acceptable to those concerned, it could be included in 2.4 and the wider issues delayed until 2.5.

Please discuss with the others.

Comment by Jim Eng (Inactive) [ 02-May-2007 ]

Here are questions raised by Glenn Golden in email:

Does this work in a cluster? When the "threads" are in different app servers?

This is further dangerous because it is a memory cache. It will have an entry for every site, and could grow large. Does it have a timeout value and a cleaning thread to keep the size down? Does it register with the memory service so that when the admin sends the command to clear all caches, it correctly responds?

If we thought that this was a good approach, a cache could be devised that worked in the cluster as well. But... I'm not sure this is worth it.

I'm also not sure why we are considering this for 2.4 at this late date. Maybe we are not.

Comment by Jim Eng (Inactive) [ 02-May-2007 ]

It would be great if we could include something that patches this for 2.4, but I think we need to address the questions Glenn raised before adopting a temporary fix that may be risky. I had asked about contention for the cache, and on review of various emails Ian answered that question to my satisfaction.

Comment by Megan May [ 02-May-2007 ]

From Ian Boston:

There is no issue with concurrent threads on the same app server; a ConcurrentHashMap is used.

The cache is not communicated between app servers.

Each item in the cache uses 1 object + 2 longs + one 36-char string.

The items expire 10 minutes from the creation of the entry in the hashmap.

This is a temporary solution.

Ian

Comment by Megan May [ 03-May-2007 ]

2.4.0.014 bound

Comment by Megan May [ 03-May-2007 ]

TESTING GUIDANCE
=====================================
For a single node there is a very simple test:
get 4-5 WebDAV sessions uploading to the same site at the same time; you might multiply that for a few sites.
Then do the same but lower the quota to make it go over quota.
-----------
For a cluster, you need to repeat but with the sessions split between nodes.

(Preliminary testing of fix) I have done this for both clustered and non-clustered setups with a 400MB data set of files ranging from 10K to 5M. - Ian
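The single-node test above uses real WebDAV clients. As a rough complement, the same contention pattern can be simulated in-process, assuming a shared byte counter stands in for the site's usage. Everything in this sketch (ConcurrentUploadSimulation, run, tryReserve) is hypothetical, and no WebDAV traffic or Sakai code is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for concurrent uploads against one site quota. */
class ConcurrentUploadSimulation {
    /** Runs the given number of sessions in parallel and returns how many uploads were accepted. */
    static int run(int sessions, int uploadsPerSession, long fileBytes, long quotaBytes, AtomicLong usedBytes) {
        ExecutorService pool = Executors.newFixedThreadPool(sessions);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int s = 0; s < sessions; s++) {
                Callable<Integer> session = () -> {
                    int accepted = 0;
                    for (int i = 0; i < uploadsPerSession; i++) {
                        if (tryReserve(usedBytes, fileBytes, quotaBytes)) {
                            accepted++;
                        }
                    }
                    return accepted;
                };
                results.add(pool.submit(session));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    /** Atomically reserves space, refusing any upload that would push usage past the quota. */
    static boolean tryReserve(AtomicLong usedBytes, long bytes, long quotaBytes) {
        while (true) {
            long current = usedBytes.get();
            if (current + bytes > quotaBytes) {
                return false;
            }
            if (usedBytes.compareAndSet(current, current + bytes)) {
                return true;
            }
        }
    }
}
```

Running it once with a generous quota and once with a quota smaller than the total upload volume mirrors the two manual steps: all uploads accepted in the first case, and usage stopping at or below the quota in the second.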

Comment by Andrew Poland (Inactive) [ 03-May-2007 ]

merged to 2-4-x r29918

Comment by Megan May [ 18-Jul-2007 ]

updating fix version to include 2.4.x

Comment by Peter A. Knoop [ 30-Aug-2007 ]

Trunk missing as fix version even though it was checked-in, so adding.

Generated at Mon Dec 18 01:11:20 CST 2017 using JIRA 7.5.0#75005-sha1:fd8c849d4e278dd8bbaccc61e707a716ad697024.