
Key: SAK-13752
Type: Task
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Ian Boston
Reporter: Adam Marshall
Votes: 0
Watchers: 2
Sakai

update search to use POI 3.5 to index OOXML docs (docx, etc.)

Created: 13-Jun-2008 08:43   Updated: 28-Sep-2009 16:06
Component/s: Search
Affects Version/s: 2.5.2
Fix Version/s: 2.6.0, 2.7.0

Time Tracking:
Not Specified

Issue Links:
Relate
 

2.6.x Status: Closed
2.5.x Status: None
2.4.x Status: None


 Description
POI v3.1 (or is it 3.5?) is due for release during summer 2008; it will support the new Office Open XML formats.

Indexing such documents in 2.6 would be rather spiffing.

Stephen Marquard added a comment - 10-Jan-2009 06:59 - edited
Seems we could bump poi and pdfbox to latest versions. Not sure if poi-ooxml will require code changes to support OOXML digesting.

Index: search-impl/impl/pom.xml
===================================================================
--- search-impl/impl/pom.xml (revision 56007)
+++ search-impl/impl/pom.xml (working copy)
@@ -155,21 +155,27 @@
       <type>jar</type>
     </dependency>
     <dependency>
- <groupId>poi</groupId>
+ <groupId>org.apache.poi</groupId>
       <artifactId>poi</artifactId>
- <version>3.0-alpha3-20070301</version>
+ <version>3.5-beta4</version>
       <type>jar</type>
     </dependency>
     <dependency>
- <groupId>poi</groupId>
+ <groupId>org.apache.poi</groupId>
+ <artifactId>poi-ooxml</artifactId>
+ <version>3.5-beta4</version>
+ <type>jar</type>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.poi</groupId>
       <artifactId>poi-scratchpad</artifactId>
- <version>3.0-alpha3-20070301</version>
+ <version>3.5-beta4</version>
       <type>jar</type>
     </dependency>
     <dependency>
       <groupId>pdfbox</groupId>
       <artifactId>pdfbox</artifactId>
- <version>0.7.1</version>
+ <version>0.7.3</version>
       <type>jar</type>
     </dependency>
     <dependency>

Stephen Marquard added a comment - 10-Jan-2009 07:50
It looks like getting .docx etc. digesting to work may require a new digester that uses the POI code for this (as opposed to textmining, which handles Word docs). Another issue: when testing with FF3 on Mac, uploading a .docx into Resources results in a MIME type of application/binary rather than application/msword, so the search digester should probably check the file extension in addition to the MIME type.
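
A minimal sketch of the extension fallback suggested above, assuming a standalone helper. The class name `OoxmlTypeCheck` and its methods are hypothetical and not part of the Sakai search API; only the `application/vnd.openxmlformats-officedocument.` MIME prefix is standard. A real digester would presumably hook a check like this into whatever "can I handle this resource?" decision it already makes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not an existing Sakai class. */
final class OoxmlTypeCheck {

    // Registered MIME types for OOXML documents all share this prefix.
    private static final String OOXML_MIME_PREFIX =
            "application/vnd.openxmlformats-officedocument.";

    // Extensions treated as OOXML when the browser sends an unhelpful MIME type.
    private static final Set<String> OOXML_EXTENSIONS =
            new HashSet<>(Arrays.asList("docx", "xlsx", "pptx"));

    private OoxmlTypeCheck() {
    }

    /** Returns the lower-cased extension of fileName, or "" if it has none. */
    static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * Returns true if either the MIME type or the file extension marks the
     * resource as OOXML, so a generic type such as application/binary does
     * not hide a .docx upload from the digester.
     */
    static boolean isOoxml(String mimeType, String fileName) {
        if (mimeType != null
                && mimeType.toLowerCase(Locale.ROOT).startsWith(OOXML_MIME_PREFIX)) {
            return true;
        }
        return OOXML_EXTENSIONS.contains(extensionOf(fileName));
    }
}
```

With this check, the FF3-on-Mac case (`application/binary` plus a `.docx` name) is still recognised as OOXML, while a legacy `.doc` with `application/msword` is left to the existing Word digester.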

Stephen Marquard added a comment - 10-Jan-2009 12:26
Another library for OOXML:
http://www.openxml4j.org/


Stephen Marquard added a comment - 12-Jan-2009 13:00
Initial support in r56076

David Horwitz added a comment - 14-Jan-2009 03:54
tested on local build

Anthony Whyte added a comment - 28-Sep-2009 16:06
in 2.6.0.