Lucene Extensions and Issues


{{Obsolete}}

Alfresco replaced Lucene with Solr in 4.0, and Lucene was removed entirely in 5.0.b.

This page accumulates the outstanding issues we have with Lucene, our extensions to Lucene, and our workarounds and solutions.

We are currently based on the 1.4.3 release of Lucene, and all our modifications are backwards compatible with this release. The modified source zip is available in the distribution so you can see these changes. Because we have one API extension we cannot use the stock 1.4.3 release, and we still have the issues below.

NOTE: This is somewhat dated information. The version of Lucene used in Alfresco CE 3.3 and 3.4 is Lucene 2.4.1. The information below may or may not represent current problems.


Lucene's use of thread locals


Stress testing showed the memory footprint of the repository slowly increasing over time, eventually producing an out-of-memory error. The only objects that were accumulating were Lucene thread locals; removing them fixed the issue.

Lucene uses thread locals in a number of places. Objects held in thread locals are garbage collected in an ad hoc way. Lucene is essentially caching information; it should be held in thread locals as SoftReferences so that unused cached information can be garbage collected when required.

Others also use thread locals, and the interaction of Lucene with other thread-local users seems to be the issue. The problem can be shown to hold memory in Lucene alone, but that does not produce the monotonic increase in memory use. I have yet to reproduce the issue in a simple test.

It could be argued that we create too many IndexReaders when they could be reused. We have a task to do this; it would mitigate the issue but not resolve it.

See LUCENE-529 and the related discussion on the Lucene developer mailing list.

We have fixed this by clearing the thread local on close.
We will also fix this to use soft references; I have been testing this before adding it.
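
As an illustration, here is a minimal sketch of the soft-reference approach, written against a modern JDK; the class and method names are hypothetical, not the actual patch:

    import java.lang.ref.SoftReference;

    // Minimal sketch: hold per-thread cached state behind a SoftReference
    // so the JVM may reclaim it under memory pressure. Hypothetical class,
    // not the actual Lucene patch.
    public class SoftThreadLocal<T> {
        private final ThreadLocal<SoftReference<T>> local =
                new ThreadLocal<SoftReference<T>>();

        public T get() {
            SoftReference<T> ref = local.get();
            return ref == null ? null : ref.get(); // null if unset or collected
        }

        public void set(T value) {
            local.set(new SoftReference<T>(value));
        }

        public void close() {
            local.remove(); // clear the thread local on close, as in our fix
        }
    }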




Standard tokenisation of numeric and date fields


We submitted our approach for this as a contribution, and were referred to how it is done in the Solr spin-off. We could use that approach, but it would require a patch to update the tokens we already hold in the index.

See LUCENE-530



Here is how Solr did it:
http://svn.apache.org/viewcvs.cgi/incubator/solr/trunk/src/java/org/apache/solr/util/NumberUtils.java?rev=382610&view=markup

It's a binary representation transformed to sort correctly and fit into chars:
a 4-byte int or float is transformed into 3 Java chars, and
an 8-byte long or double is transformed into 5 Java chars.
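
A minimal sketch of this encoding for a 4-byte int, in the spirit of the Solr NumberUtils class linked above (the exact bit layout shown is illustrative):

    // Sketch of the sortable encoding for a 4-byte int, in the spirit of
    // Solr's NumberUtils (linked above). Bit layout is illustrative.
    public final class SortableNumbers {

        // Encode an int into 3 chars that compare in numeric order.
        public static String int2sortableStr(int val) {
            int v = val + Integer.MIN_VALUE; // flip the sign bit so negatives sort first
            char[] out = new char[3];
            out[0] = (char) (v >>> 24);           // top 8 bits
            out[1] = (char) ((v >>> 12) & 0xfff); // middle 12 bits
            out[2] = (char) (v & 0xfff);          // low 12 bits
            return new String(out);
        }

        // Decode the 3-char form back to the original int.
        public static int sortableStr2int(String s) {
            int v = (s.charAt(0) << 24) | (s.charAt(1) << 12) | s.charAt(2);
            return v - Integer.MIN_VALUE;
        }
    }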




Index corruption


If the JVM crashes or is terminated while a segment is being written, the new segment file is left behind. When the index next writes a new segment it reuses this existing file, which leads to a corrupted segment. A clean/empty file should be used each time.

See LUCENE-415

We have fixed this by zeroing the file length of new segment files.

Lucene could also check whether the segment file already exists.
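
A sketch of the workaround; the class and method names are illustrative, not the actual patch:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch of the fix: truncate any leftover segment file to zero length
    // before writing, so partial data from a crashed JVM is never reused.
    public final class SegmentFiles {
        public static void zeroLength(File segmentFile) throws IOException {
            RandomAccessFile raf = new RandomAccessFile(segmentFile, "rw");
            try {
                raf.setLength(0); // discard any partial data left by a crash
            } finally {
                raf.close();
            }
        }
    }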




Stale file handles under Windows


There was an issue when creating new segment files: we found stale file handles when writing to a file we had just created.
This does not happen if we immediately open a channel to the file.

See LUCENE-415

Fixed by creating a channel immediately.
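
A sketch of the fix; names are illustrative, not the actual patch:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.channels.FileChannel;

    // Sketch: open a channel on the new file immediately after creating it,
    // which avoided the stale-handle errors we saw on Windows.
    public final class SegmentFileCreator {
        public static FileChannel create(File segmentFile) throws IOException {
            FileOutputStream out = new FileOutputStream(segmentFile);
            return out.getChannel(); // hold a channel from the moment of creation
        }
    }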




Lock File may not be deleted


The Lock implementation uses File.delete() and does not check that the lock file has actually been removed.
This can leave a lock file behind.

NOT YET RAISED

We check that the file is deleted and retry; if it is not deleted within a reasonable time we throw an error, so we know we have left the lock file behind.
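
A sketch of the check-and-retry logic; the retry count and sleep time are illustrative:

    import java.io.File;
    import java.io.IOException;

    // Sketch: retry the delete, and fail loudly if the lock file cannot be
    // removed in a reasonable time, so we know it has been left behind.
    public final class LockFiles {
        public static void delete(File lockFile) throws IOException {
            for (int attempt = 0; attempt < 10; attempt++) {
                if (lockFile.delete() || !lockFile.exists()) {
                    return; // lock file is gone
                }
                try {
                    Thread.sleep(100); // give the OS time to release the file
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            throw new IOException("Could not delete lock file: " + lockFile);
        }
    }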




Lock file is not appropriate for IPC signalling


See the Javadoc for File.createNewFile() and its warning about using it for locking/signalling.

This mechanism is not reliable, so we reuse a lock object with synchronisation.
This means in-JVM locks are secure, but inter-process locks will not be safe; that would also
require a channel-level lock. WE DO NOT SUPPORT INTER-PROCESS LOCKS. All clients must implement the same lock mechanism.



Others have raised this or related issues and seem to have been ignored.




Lock mechanism could reuse objects


Many lock objects are created. We do not create many; instead we register them with the single instance of FSDirectory and use them for lock synchronisation.

NOT YET RAISED
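
A sketch of the idea as a standalone registry; in our code the lock objects are registered with the single FSDirectory instance, so the class here is illustrative:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: hand out one lock object per lock name instead of creating a
    // new lock object for each operation, so all threads synchronise on the
    // same instance. Illustrative, standalone version.
    public class LockRegistry {
        private final Map<String, Object> locks = new HashMap<String, Object>();

        public synchronized Object getLock(String name) {
            Object lock = locks.get(name);
            if (lock == null) {
                lock = new Object();
                locks.put(name, lock);
            }
            return lock;
        }
    }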




IndexReader.indexExists() can report incorrect results


This tests for the existence of the segments file. During an index commit this file may briefly not exist while the new file is copied into place; the operation does not appear to be atomic. The index is then reported as not existing and we try to create a new one, which produces all sorts of errors because the file exists and cannot be deleted. It is possible (and has happened once) that we would delete the index as a result. This arises from the semantics of IndexWriter and creating indexes.

Lucene should use a simple file, which never changes, to indicate that an index exists.

Seen once using FTP to upload many files.

NOT YET RAISED
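
A sketch of the suggested marker-file approach; the file name 'index.exists' is hypothetical, not a real Lucene file:

    import java.io.File;
    import java.io.IOException;

    // Sketch of the suggestion: a marker file that is written once and never
    // changes, so existence checks cannot race with segment-file commits.
    public final class IndexMarker {
        public static void markIndexCreated(File indexDir) throws IOException {
            new File(indexDir, "index.exists").createNewFile();
        }

        public static boolean indexExists(File indexDir) {
            return new File(indexDir, "index.exists").exists();
        }
    }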


Combining indexes is expensive


The target index is optimised both before and after the standard merge, which is very expensive. We want to be able to control index optimisation.
We would like an algorithm similar to the one used when adding documents to the index; ideally this would give a compromise between read speed and the speed of merging an index.
(This is how we do transactional updates to the index.)



Others have raised this and been ignored.



We have extended the IndexWriter to merge indexes in a way more suitable for us.

This is the only issue that stops us being backwards compatible with 1.4.3 (unless we test for the extended method being supported using reflection).
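
For reference, later Lucene releases (2.3 and newer, including the 2.4.1 mentioned in the note at the top) added IndexWriter.addIndexesNoOptimize(), which merges without the forced optimise. A minimal usage sketch with illustrative paths:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Merge delta indexes into the target without the optimise-before-and-
    // after that plain addIndexes() performs. Paths are illustrative.
    public final class IndexMerger {
        public static void mergeDeltas() throws IOException {
            Directory target = FSDirectory.getDirectory("/path/to/main-index");
            Directory[] deltas = { FSDirectory.getDirectory("/path/to/delta-index") };
            IndexWriter writer = new IndexWriter(target, new StandardAnalyzer(),
                    false, IndexWriter.MaxFieldLength.UNLIMITED);
            try {
                writer.addIndexesNoOptimize(deltas);
            } finally {
                writer.close();
            }
        }
    }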
