The changes to the MessageChannel API are committed:

Sending message-api/api/src/java/org/sakaiproject/message/api/MessageChannel.java
Sending message-impl/impl/src/java/org/sakaiproject/message/impl/BaseMessageService.java
Transmitting file data ..
Committed revision 35188.

I believe that at this point all further work can be done in the MailArchive service and tool.
This is the basic idea - instead of pulling all messages into memory and sorting/filtering them there (thus crushing a server), we will do this in SQL. The problem is that the data model only has the message date broken out of the XML. The subject and from fields are still inside the XML.

The MailArchive tool will check how many messages are in the archive. Below a pre-determined limit (say 1000 messages) it will behave as it currently does - pulling the messages into memory and sorting/filtering them in memory. Above the limit, the sort by subject and sort by from will be suppressed. What will still work is paging and search. Paging will be done in SQL by wrapping the query and selecting a range of records from the row set - many other Sakai services take this approach. Search will be done with a LIKE clause set up to search from, subject, and body. This will do a full table scan but will not crush the app servers the way pulling 25000 messages into memory and then sorting them in memory does.

You may not like this approach and you may suggest that more be done, like refactoring all of the data - and if *you* want to do that - you are welcome to do so. However it has been over a year since we encountered this problem and we are still waiting for a solution from the rewrite attempts. This is a stopgap solution which drops in nicely.

If this goes well - then the next step would be to add some columns to the database, pull information from the XML into them, build a conversion, and then re-add the sort on from and subject. But perhaps by that point in time a complete rewrite will actually complete.

Fixed to Email Archive
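For illustration only, the size check, the wrapped paging query, and the LIKE search described above could be sketched as follows. The class name, the method names, the '!' escape character, and the exact SQL shapes are assumptions made for this sketch, not the code that was committed. The threshold is the "say 1000" figure from the comment.

```java
/** Hypothetical sketch of the stopgap approach; not the committed Sakai code. */
final class PagedQuerySketch {
    enum Dialect { MYSQL, ORACLE }

    /** Assumed threshold: archives below this size keep the in-memory sort/filter path. */
    static final int IN_MEMORY_LIMIT = 1000;

    static boolean useInMemoryPath(int messageCount) {
        return messageCount < IN_MEMORY_LIMIT;
    }

    /**
     * Wraps a base query so only records first..last (1-based, inclusive)
     * come back from the database.
     */
    static String wrapForPaging(String baseSql, Dialect dialect, int first, int last) {
        if (first < 1 || last < first) {
            throw new IllegalArgumentException("bad range: " + first + ".." + last);
        }
        switch (dialect) {
            case MYSQL:
                // MySQL can page with LIMIT/OFFSET appended to the query.
                return baseSql + " LIMIT " + (last - first + 1) + " OFFSET " + (first - 1);
            case ORACLE:
                // Oracle pages by numbering the rows of an inner query with ROWNUM.
                return "SELECT * FROM (SELECT q.*, ROWNUM rn FROM (" + baseSql
                        + ") q WHERE ROWNUM <= " + last + ") WHERE rn >= " + first;
            default:
                throw new IllegalArgumentException("unsupported dialect: " + dialect);
        }
    }

    /**
     * Builds the bind value for "column LIKE ? ESCAPE '!'" so that wildcard
     * characters typed by the user are matched literally.
     */
    static String likePattern(String search) {
        String escaped = search.replace("!", "!!").replace("%", "!%").replace("_", "!_");
        return "%" + escaped + "%";
    }
}
```

A caller would first check useInMemoryPath(count) and, for large archives, run the wrapped query for each page while binding likePattern(term) when a search string is present. Because from, subject, and body all live in the XML, the LIKE still has to scan every row. What this saves is the memory on the app server, not the database work.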
Upgraded to major Performance related

The other performance issue is search indexing. If you have large mail archives, search indexing can also run you out of memory. It would be good to talk to Ian Boston to see how the email archive API could be made to work with the search indexing a bit better. Otherwise, large mail archives also require you to disable searches.
Good point - I will make that a config option.
On Sep 24, 2007, at 11:28 AM, Steven Githens wrote:

This makes sense. You might want to consider just disabling sorting on Subject and From altogether to make it consistent. The archive doesn't need to be terribly large before it starts really slowing things down.

s the binary serialization in storage might reduce memory usage from 400K per entity to 4K per entity and 2ms per parse to 0.06ms per parse.
Add caching to that using the memory service with an ehcache backend and you might get acceptable performance. I am not saying that doing things in columns is bad, in fact it's good, but the biggest problem with mail archive is the XML, followed shortly by loading too much data. We also have a JSR-170 Mail archive implementation that we are intending to put into production here before Christmas, and it scales. sakai-dev responds in < 1s

Very very initial commit of the new work - very very rough. Just committing some setup code that will allow me to start working on the SQL bits.
Transmitting file data .
Committed revision 39631.

This is committed in a branch - not trunk.

I have some code that I would like to get some further testing done on. The ideal situation would be where someone has access to a copy of production data on a server where they can install some test code from my branch and do some tests and tell me how things go. I need some performance feedback before I can finish it up.
This code is not ready to run at all - it is full of debug messages and little instrumentation bits. I would like some testing on MySQL and Oracle if possible. The code for this is in message and in mailarchive:

https://source.sakaiproject.org/svn/mailarchive/branches/SAK-11544/
https://source.sakaiproject.org/svn/message/branches/SAK-11544/

These replace the mailarchive and message directories in the Sakai source directories. Recompile and deploy - voila.

Test Scenario for this interim version - RUN this test and send me catalina.out
Here are the needed clicks - keep track of which ones are dog slow and which ones are fast. None should hang the system or suck up vast amounts of memory - some may be slow.

1) Go to the mail tool in a site with lots of messages - do not enter a search value
2) Press next page 2 times
3) Press back page
4) Press last page
5) Press back page twice
6) Press first page

The above should all be blazingly fast - the rest might be slow - but should not crash your system. They may read through a lot of data in a record set but will not put it all into a big List.

7) Put in a search string that is likely to be found in many messages and press search
8) Press page ahead and page back (these may take a while because I have not optimized caching counts)
9) Press last page (this might take a while)
10) Press first page
11) Put in a search string that is unlikely to be found (this might take a while)
12) If you get some results back press page forward
13) Send me back the catalina.out - thanks muchly

I wrote this for another purpose (testing search) but it may be useful for anyone testing mailarchive. It just sends a bunch of email messages to the same address. Run it with
./makeupmail.pl 10000 somesite@sakai.domain

to get a populated mail archive.

http://source.cet.uct.ac.za/svn/sakai/scripts/trunk/loadtest/makeupmail.pl

This is a log associated with Tony's testing on collab/MySQL.
I can see very clear progress, this run was pretty good.

1) Go to the mail tool in a site with lots of messages - do not enter a search value
Wow! Less than 10 seconds for 24,936 messages.
2) Press next page 2 times
About 5 seconds per on these.
3) Press back page
Around 2 seconds on this one.
4) Press last page
Around 5 seconds on this one.
5) Press back page twice
Around 5 seconds per on these.
6) Press first page
Around 2 seconds on this one.
7) Put in a search string that is likely to be found in many messages and press search
Around 3 minutes for the word "experience" (1549 results).
8) Press page ahead and page back
Around 40 seconds for page ahead, around 35 for back.
9) Press last page
Around 55 seconds for this one.
10) Press first page
Around 40 seconds for this one.
11) Put in a search string that is unlikely to be found
A couple minutes total for "beer", which returned 24 hits... :)
12) If you get some results back press page forward
Around 2 seconds for this one.
13) Send me back the catalina.out
See attached.

I am going to make a fix for scenarios (8) and (10) - this requires me to make a change to BasicSqlService - so you will also need that code from this point forward. I made it so that it detects when the BasicSqlService mod is not present and simply works somewhat less efficiently.
So in addition to the branch for message and mailarchive - you need this as well

https://source.sakaiproject.org/svn/db/branches/SAK-11544/

Checked out over top of your db directory.

After this, there are no optimizations that can be done without data model change. Or at least none that I can think of. Avoiding the SAX overhead for search was pretty significant I think. Perhaps once we are done I should do one last check to see if my hand-parsing and decoding of the messages for purposes of search is faster than SAX.

Stephen - thanks for makeupmail.pl - what I need though is something that calls web services and fills up the archive without going through send mail.
But your idea is much better than the hack I have in the MailArchive tool to insert messages.

OK, here are the results of this pass, interspersed with the previous results for comparison:

On Feb 2, 2008 12:21 PM, <duhrer@gmail.com> wrote:

> 1) Go to the mail tool in a site with lots of messages - do not enter a search value
> Wow! Less than 10 seconds for 24,936 messages.
About the same, but probably a second faster.

> 2) Press next page 2 times
> About 5 seconds per on these.
About 4 seconds per on these.

> 3) Press back page
> Around 2 seconds on this one.
Right about 2 seconds, but just short.

> 4) Press last page
> Around 5 seconds on this one.
About the same.

> 5) Press back page twice
> Around 5 seconds per on these.
About the same.

> 6) Press first page
> Around 2 seconds on this one.
About the same.

> 7) Put in a search string that is likely to be found in many messages and press search
> Around 3 minutes for the word "experience" (1549 results).
Right about 1 minute for this one.

> 8) Press page ahead and page back
> Around 40 seconds for page ahead, around 35 for back.
Around 30 seconds for page ahead, around 25 for back.

> 9) Press last page
> Around 55 seconds for this one.
Around 50 seconds for this one.

> 10) Press first page
> Around 40 seconds for this one.
I got a tool error the first time I ran this ("java.sql.SQLException: Already closed.", see attached log). A tool restart threw the same error when repeating the initial search, so I had to restart. After a restart and repeat, it was around 30 seconds for this one.

> 11) Put in a search string that is unlikely to be found
> A couple minutes total for "beer", which returned 24 hits... :)
Just over a minute this time.

> 12) If you get some results back press page forward
> Around 2 seconds for this one.
Around 2 seconds for this one.

Not sure what happened with the tool error, let me know if you think it's worth trying to test it to destruction.

Tony

Tony - thanks.
The code to skip out of the SQLRead loop worked - but since the word experience was not early in the corpus - the performance improvement was not as much as I had hoped - but the code worked as expected so if there is a common word in the corpus so that it finds 20 hits early enough - it will run much faster.
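To illustrate why the position of the hits in the corpus matters, here is a minimal sketch of a read loop that stops as soon as it has enough matches. It uses a plain Iterator as a stand-in for the SQL result set. The class and method names are hypothetical and this is not the code in the branch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of an early-exit scan; not the code from the SAK-11544 branch. */
final class EarlyExitScan {

    /** The hits collected plus the number of rows that had to be read to find them. */
    static final class Result<T> {
        final List<T> hits;
        final int rowsRead;

        Result(List<T> hits, int rowsRead) {
            this.hits = hits;
            this.rowsRead = rowsRead;
        }
    }

    /** Reads rows in order and stops once maxHits matching rows have been collected. */
    static <T> Result<T> scan(Iterator<T> rows, Predicate<T> matches, int maxHits) {
        List<T> hits = new ArrayList<>();
        int rowsRead = 0;
        while (rows.hasNext() && hits.size() < maxHits) {
            T row = rows.next();
            rowsRead++;
            if (matches.test(row)) {
                hits.add(row);
            }
        }
        return new Result<>(hits, rowsRead);
    }
}
```

With maxHits set to a page size such as 20, a common word fills the page after reading only a small part of the archive, while a rare word still forces the loop through every row.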
There is nothing else to tune that I can think of - unless we put a time limit on the search - this would actually be pretty easy to do - right in the filter. But the UI for this might be nasty. It would basically say something like "search timed out". And we would need to propagate that information back up the call tree somehow and then show something in the UI to the user - and then we would probably need a button that said - do a long search - without the timeout. By the time all that UI stuff was done - I should probably re-do the data model. Perhaps all that can reasonably be done is to put in a timeout like 30 seconds and put up a message like "not all records were searched - time limit exceeded". Thoughts?

Chuck, don't know if it would be easy or not to grab the "in progress..." code from another place in Sakai, but perhaps you could use it for when someone does a Search?
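A rough sketch of the time-limit idea floated above: the scan checks a deadline before reading each row and reports whether it stopped early, so the caller can show a message like "not all records were searched - time limit exceeded". Everything here (the class, the Outcome type, the injected clock) is hypothetical and only meant to show how the flag could travel back up the call tree.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Predicate;

/** Hypothetical sketch of a time-limited search filter; not part of the committed code. */
final class TimeLimitedSearch {

    /** The hits found so far and whether the scan stopped because the limit was reached. */
    static final class Outcome {
        final List<String> hits;
        final boolean timedOut;

        Outcome(List<String> hits, boolean timedOut) {
            this.hits = hits;
            this.timedOut = timedOut;
        }
    }

    /**
     * Scans rows until they run out or limitMillis has elapsed. The clock is
     * passed in (for example System::currentTimeMillis) so it can be faked in tests.
     */
    static Outcome search(Iterator<String> rows, Predicate<String> matches,
                          long limitMillis, LongSupplier clockMillis) {
        long deadline = clockMillis.getAsLong() + limitMillis;
        List<String> hits = new ArrayList<>();
        while (rows.hasNext()) {
            if (clockMillis.getAsLong() >= deadline) {
                return new Outcome(hits, true); // partial result, flagged for the UI
            }
            String row = rows.next();
            if (matches.test(row)) {
                hits.add(row);
            }
        }
        return new Outcome(hits, false);
    }
}
```

This only bounds the time spent reading and filtering rows in the app server. It does not cancel a query that the database is already executing.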
Peter - that is on my list of things to do. Thanks for the reminder.
I don't think there's any value to timing out the search. Performance-wise, the query has been launched, so as far as the impact on the db is concerned, the damage has been done.
Long-term, searching anything using LIKE in db queries is just bad, and we should get rid of it in favour of integration with the search service.

Chuck,

http://source.cet.uct.ac.za/svn/sakai/scripts/trunk/loadtest/EmailArchive.jws

Call it with

http://source.cet.uct.ac.za/svn/sakai/scripts/trunk/loadtest/mailarchive.pl

It is certainly a lot faster than sending them in via SMTP.

On Thu, 7 Feb 2008, csev wrote:
I need MailArchive tool to put up a "Searching..." once someone hits the "Search" button. Could you do this? I assume that all you need to change is the vm files? No rush.

/Chuck

r41096, r41097
http://bugs.sakaiproject.org/jira/browse/SAK-12929

I did a few more minor things while I was there.

-Gonzalo

Oracle support is completed.

Initial back-port to 2.4 - for testing is completed.

Instructions for testing

Check out a copy of 2-4-x

Remove the directories: db message mailarchive

rm -rf db message mailarchive
svn co https://source.sakaiproject.org/svn/message/branches/SAK-11544-post-2-4/ message
svn co https://source.sakaiproject.org/svn/mailarchive/branches/SAK-11544-post-2-4/ mailarchive
svn co https://source.sakaiproject.org/svn/db/branches/SAK-11544-post-2-4/ db

Compile and deploy with Maven as usual.

Stephen Marquard pointed out that search re-indexing needs to be fixed as well or a reindex crashes the server. Tony Atkins pointed this out as well.
Stephen provided a nice traceback - below.

Hi Ian,

I want to run a search index build test with a large number of items (500K), and the easiest way to do this is email archive. So I ran a script that called a .jws that populated an email archive with 500K items, so far so good, then tried a global index rebuild, which seems to want to load all the messages into memory for the search adapter. Do your changes to email archive improve this situation? If so could you merge them into trunk? :-) Otherwise I will have to restructure to create 100 sites with 5000 messages each or something.

Cheers
Stephen

"Timer-2" daemon prio=1 tid=0x00002aaab0d878c0 nid=0x1322 runnable [0x0000000042350000..0x0000000042351ca0]
    at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.createChunk(DeferredDocumentImpl.java:1966)
    at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.ensureCapacity(DeferredDocumentImpl.java:1858)
    at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.createNode(DeferredDocumentImpl.java:1876)
    at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.createDeferredDocument(DeferredDocumentImpl.java:225)
    at com.sun.org.apache.xerces.internal.parsers.AbstractDOMParser.startDocument(AbstractDOMParser.java:845)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.startDocument(XMLDTDValidator.java:701)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.startEntity(XMLDocumentScannerImpl.java:540)
    at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.startDocumentParsing(XMLVersionDetector.java:170)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:250)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
    at org.sakaiproject.util.Xml.readDocumentFromString(Xml.java:160)
    at org.sakaiproject.util.BaseDbDoubleStorage.readResource(BaseDbDoubleStorage.java:675)
    at org.sakaiproject.util.BaseDbDoubleStorage.getAllResources(BaseDbDoubleStorage.java:769)
    at org.sakaiproject.mailarchive.impl.DbMailArchiveService$DbStorage.getMessages(DbMailArchiveService.java:275)
    at org.sakaiproject.message.impl.BaseMessageService$BaseMessageChannelEdit.findMessages(BaseMessageService.java:3122)
    at org.sakaiproject.message.impl.BaseMessageService$BaseMessageChannelEdit.findFilterMessages(BaseMessageService.java:3213)
    at org.sakaiproject.message.impl.BaseMessageService$BaseMessageChannelEdit.getMessages(BaseMessageService.java:2450)
    at org.sakaiproject.search.component.adapter.message.MessageContentProducer$1.nextIterator(MessageContentProducer.java:517)
    at org.sakaiproject.search.component.adapter.message.MessageContentProducer$1.hasNext(MessageContentProducer.java:492)
    at org.sakaiproject.search.indexer.impl.SearchBuilderQueueManager.rebuildIndex(SearchBuilderQueueManager.java:840)
    at org.sakaiproject.search.indexer.impl.SearchBuilderQueueManager.findPendingAndLock(SearchBuilderQueueManager.java:305)
    at org.sakaiproject.search.indexer.impl.SearchBuilderQueueManager.open(SearchBuilderQueueManager.java:224)
    at org.sakaiproject.search.transaction.impl.IndexTransactionImpl.fireOpen(IndexTransactionImpl.java:359)
    at org.sakaiproject.search.transaction.impl.IndexTransactionImpl.open(IndexTransactionImpl.java:75)
    at org.sakaiproject.search.indexer.impl.TransactionIndexManagerImpl.openTransaction(TransactionIndexManagerImpl.java:80)
    at org.sakaiproject.search.indexer.impl.TransactionIndexManagerImpl.openTransaction(TransactionIndexManagerImpl.java:44)
    at org.sakaiproject.search.indexer.impl.TransactionalIndexWorker.process(TransactionalIndexWorker.java:123)
    at org.sakaiproject.search.indexer.impl.ConcurrentSearchIndexBuilderWorkerImpl.runOnce(ConcurrentSearchIndexBuilderWorkerImpl.java:265)
    at org.sakaiproject.search.journal.impl.IndexManagementTimerTask.run(IndexManagementTimerTask.java:135)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)

Stephen Marquard, Learning Technologies Co-ordinator
Centre for Educational Technology, University of Cape Town
http://www.cet.uct.ac.za
Email/IM/XMPP: stephen.marquard@uct.ac.za
Phone: +27-21-650-5037 Cell: +27-83-500-5290

Code for the post-2-5 / trunk-patch version is now checked in as:
https://source.sakaiproject.org/svn/search/branches/SAK-11544/

Simply replace the search folder with this

I still need to back port this to 2.4 and build a post

Instructions for patching 2.5 or trunk

rm -rf db message mailarchive search
svn co https://source.sakaiproject.org/svn/message/branches/SAK-11544/ message
svn co https://source.sakaiproject.org/svn/mailarchive/branches/SAK-11544/ mailarchive
svn co https://source.sakaiproject.org/svn/db/branches/SAK-11544/ db
svn co https://source.sakaiproject.org/svn/search/branches/SAK-11544/ search

Committed to trunk:
charles-severances-macbook-air:sakai csev$ cd announcement/
charles-severances-macbook-air:announcement csev$ svn commit
Sending announcement-impl/impl/src/java/org/sakaiproject/announcement/impl/DbAnnouncementService.java
Transmitting file data .
Committed revision 41783.
charles-severances-macbook-air:announcement csev$ cd ..
charles-severances-macbook-air:sakai csev$ cd chat
charles-severances-macbook-air:chat csev$ svn commit
Sending chat-impl/impl/src/java/org/sakaiproject/chat/impl/DbChatService.java
Transmitting file data .
Committed revision 41784.
charles-severances-macbook-air:chat csev$ cd ..
charles-severances-macbook-air:sakai csev$ cd db
charles-severances-macbook-air:db csev$ svn commit
Sending db-util/storage/src/java/org/sakaiproject/util/BaseDbDoubleStorage.java
Sending db-util/storage/src/java/org/sakaiproject/util/DoubleStorageSql.java
Sending db-util/storage/src/java/org/sakaiproject/util/DoubleStorageSqlDefault.java
Sending db-util/storage/src/java/org/sakaiproject/util/DoubleStorageSqlHSql.java
Sending db-util/storage/src/java/org/sakaiproject/util/DoubleStorageSqlMySql.java
Sending db-util/storage/src/java/org/sakaiproject/util/DoubleStorageSqlOracle.java
Transmitting file data ......
Committed revision 41785.
charles-severances-macbook-air:db csev$ cd ..
charles-severances-macbook-air:sakai csev$ cd mailarchive/
charles-severances-macbook-air:mailarchive csev$ svn commit
Sending mailarchive-impl/impl/src/java/org/sakaiproject/mailarchive/impl/DbMailArchiveService.java
Sending mailarchive-tool/tool/src/java/org/sakaiproject/mailarchive/tool/MailboxAction.java
Transmitting file data ..
Committed revision 41786.
charles-severances-macbook-air:mailarchive csev$

Add the getCount() and getPagedMessages() methods to the message API and
add simple versions of these methods to the Base class.
Currently no tools use these methods and for efficiency they should
be overridden in the real implementations. If tools
do start to use these methods without their services overriding the
methods they will work as well as the rest of the implementations
in the Base class.
The first consumer of these will likely be the Mail Archive tool and service.
This is not a complete solution - it is getting the APIs right so that
the implementations can be built in Mail Archive to do this more
efficiently.
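
As a sketch of what "simple versions of these methods in the Base class" might mean: the method names getCount() and getPagedMessages() come from the commit message, but the parameter lists, the String message type, and the interface and class names below are assumptions made for illustration and are not the actual Sakai signatures. The base versions simply fall back on loading everything, which is why real implementations should override them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the channel API; the real Sakai signatures may differ. */
interface ChannelSketch {
    /** Existing behaviour: loads every message in the channel. */
    List<String> getMessages();

    /** Number of messages in the channel. */
    int getCount();

    /** Messages at positions first..last (1-based, inclusive). */
    List<String> getPagedMessages(int first, int last);
}

/** Base class with simple, inefficient defaults built on getMessages(). */
abstract class BaseChannelSketch implements ChannelSketch {

    @Override
    public int getCount() {
        // Works for any service, but still reads the whole channel into memory.
        return getMessages().size();
    }

    @Override
    public List<String> getPagedMessages(int first, int last) {
        List<String> all = getMessages();
        if (first < 1 || last < first || first > all.size()) {
            return Collections.emptyList();
        }
        int end = Math.min(last, all.size());
        return new ArrayList<>(all.subList(first - 1, end));
    }
}
```

A storage-backed service would override both methods with a COUNT(*) query and a range query so that only the requested page is ever read into memory.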