The Alfresco Content Tracker


a_gazzarini

I was recently involved in an investigation of the Alfresco Tracker Subsystem, with a specific focus on the Content Tracker. This post is the output of that analysis, which can also be found in the SearchServices repository, under the documentation folder.

The ContentTracker is part of the Alfresco Tracker Subsystem, which is composed of the following members:

  • ModelTracker: for listening on model changes
  • ContentTracker: described in this post
  • MetadataTracker: for tracking changes in metadata nodes
  • AclTracker: for listening on ACL changes
  • CascadeTracker: which manages cascade updates (i.e. updates related to the node hierarchy)
  • CommitTracker: which provides commit and rollback capabilities

Each Solr instance in your search infrastructure (regardless of whether it holds a monolithic core or a shard) has a singleton instance of each tracker type, which is registered, configured and scheduled at SolrCore startup. The SolrCore holds a TrackerRegistry for maintaining a list of all "active" tracker instances.

The ContentTracker queries for documents with "unclean" content (i.e. data whose content has been modified in Alfresco), and then updates them. Periodically, at a configurable frequency, the ContentTracker checks for transactions containing nodes that have been marked as "Dirty" (changed) or "New". Then,

  • it retrieves the cached version of that data from the ContentStore
  • it retrieves the corresponding (text) content from Alfresco
  • it updates the ContentStore
  • it re-indexes the data in the hosting Solr instance

Later, the CommitTracker will persist those changes. 

The Tracking Subsystem

The class diagram below provides a high-level overview of the main classes involved in the Tracker Subsystem.

As you can see from the diagram, there's an abstract interface definition (Tracker) which declares what is expected by a Tracker and a Layer Supertype (AbstractTracker) which adopts a TemplateMethod [1] approach. It provides the common behavior and features inherited by all trackers, mainly in terms of:

  • Configuration
  • State definition (e.g. isSlave, isMaster, isInRollbackMode, isShutdown)
  • Constraints (e.g. there must be only one running instance of a given tracker type in a given moment)
  • Locking: the two Semaphore instances depicted in the diagram are used for a) implementing the constraint described in the previous point and b) providing an inter-tracker synchronisation mechanism.

The Tracker behavior is defined in the track() method that each tracker must implement. As said above, the AbstractTracker forces a common behaviour on all trackers by declaring a final version of that method, and then it delegates the specific logic to the concrete trackers (subclasses) by requiring them to implement the doTrack() method.
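
To make the TemplateMethod idea concrete, here is a minimal sketch of a final track() method guarded by a single-permit Semaphore. It is not the actual SearchServices code: the class names (SketchAbstractTracker, SketchContentTracker, TrackerTemplateDemo) and the skip-the-cycle behaviour when the lock is busy are assumptions made for illustration only.

```java
import java.util.concurrent.Semaphore;

// Hypothetical, simplified stand-in for the AbstractTracker described above.
abstract class SketchAbstractTracker {
    // One permit: only one run of this tracker can be active at a time.
    private final Semaphore runLock = new Semaphore(1);
    private volatile boolean running;

    // Template method: final, so subclasses cannot skip the common steps.
    public final boolean track() {
        if (!runLock.tryAcquire()) {
            return false; // another run is still active: skip this cycle (assumed policy)
        }
        try {
            running = true;
            doTrack(); // tracker-specific logic
            return true;
        } finally {
            running = false;
            runLock.release();
        }
    }

    public boolean isRunning() {
        return running;
    }

    // Each concrete tracker supplies its own logic here.
    protected abstract void doTrack();
}

// Hypothetical concrete tracker that only counts its cycles.
class SketchContentTracker extends SketchAbstractTracker {
    int cycles;

    @Override
    protected void doTrack() {
        cycles++;
    }
}

class TrackerTemplateDemo {
    public static void main(String[] args) {
        SketchContentTracker tracker = new SketchContentTracker();
        tracker.track();
        tracker.track();
        System.out.println("cycles = " + tracker.cycles);
    }
}
```

Because track() is final, the locking and state handling cannot be bypassed by a subclass; the only extension point is doTrack().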

Each tracker is a stateful object which is initialized, registered in a TrackerRegistry and scheduled at startup in the SolrCoreLoadRegistration class. The other relevant classes depicted in the diagram are:

  • SolrCore: the dashed dependency relationship means that a Tracker doesn't hold a stable reference to the SolrCore: it obtains that reference each time it's needed.
  • ThreadHandler: The ThreadExecutionPool manager which holds a pool of threads needed for scheduling asynchronous tasks (i.e. unclean content reindexing)
  • TrackerState: being a shared instance across all trackers, it would have been better named something like TrackersState or TrackerSubsystemState. It is used for holding the trackers state (e.g. lastTxIdOnServer, trackerCycles, lastStartTime)
  • TrackerStats: maintains global statistics about all trackers. Like the TrackerState, it is a shared instance, so the name is a little bit misleading because it relates to all trackers
  • SOLRAPIClient: this is the HTTP proxy / facade towards Alfresco REST API: in the sequence diagrams these interactions are depicted in green
  • SolrInformationServer: The Solr binding for the InformationServer interface, which defines the abstract contract of the underlying search infrastructure

Startup and Shutdown

The Trackers startup and registration flow is depicted in the following sequence diagram:

Solr provides, through the SolrEventListener interface, a notification mechanism for registering custom plugins during the SolrCore lifecycle. The Tracker Subsystem is initialized, configured and scheduled in the SolrCoreLoadListener, which delegates the concrete work to SolrCoreLoadRegistration. Here, a new instance of each tracker is created, configured, registered and then scheduled by means of a Quartz Scheduler. Trackers can share a common frequency (as defined in the alfresco.cron property) or they can have a specific configuration (e.g. alfresco.content.tracker.cron).
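
The "specific configuration wins over the common one" rule can be sketched as a small lookup. This is only an illustration: the TrackerCronSketch class and its cronFor method are hypothetical, and the alfresco.<name>.tracker.cron key pattern is an assumption generalised from the alfresco.content.tracker.cron example above. The cron values used in main are arbitrary samples.

```java
import java.util.Properties;

// Hypothetical helper: resolves the cron expression to use for a tracker.
final class TrackerCronSketch {
    static final String COMMON_KEY = "alfresco.cron";

    // Returns the tracker-specific expression if present, otherwise the
    // common one; null when neither property is defined.
    static String cronFor(Properties props, String trackerName) {
        // Assumed key pattern, e.g. alfresco.content.tracker.cron
        String specific = props.getProperty("alfresco." + trackerName + ".tracker.cron");
        if (specific != null && !specific.trim().isEmpty()) {
            return specific.trim();
        }
        return props.getProperty(COMMON_KEY);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("alfresco.cron", "0/10 * * * * ? *");
        props.setProperty("alfresco.content.tracker.cron", "0/30 * * * * ? *");
        System.out.println("content  -> " + cronFor(props, "content"));
        System.out.println("metadata -> " + cronFor(props, "metadata"));
    }
}
```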

The SolrCoreLoadRegistration also registers a shutdown hook which makes sure all registered trackers follow the lifecycle of the hosting SolrCore.

Content Tracking

The sequence diagram below details what happens in a single tracking task executed by the ContentTracker: 

At a given frequency (which, again, can be the same for each tracker or overridden per tracker type) the Quartz Scheduler invokes the doTrack() method of the ContentTracker. Prior to that, the logic in the AbstractTracker is executed following the TemplateMethod [1] described above; specifically the "Running" lock is acquired and the tracker is put in a "Running" state.

Then the ContentTracker does the following:

  • it gets the documents with "unclean" content
  • if that list is not empty, each document is scheduled (asynchronously) to be updated, both in the content store and in the index

In order to do that, the ContentTracker never directly uses the proxy towards ACS (i.e. the SOLRAPIClient instance); instead, it delegates that logic to the SolrInformationServer class. The first step (getDocsWithUncleanContent) searches the local index for all transactions which are associated with documents that have been marked as "Dirty" or "New". The field where this information is recorded is FTSSTATUS; it can have one of the following values:

  • Dirty: content has been updated / changed
  • New: content is new
  • Clean: content is up to date, there's no need to refresh it

The "Dirty" documents are returned as triples containing the tenant, the ACL identifier and the DB identifier.

NOTE: this first phase uses only the local Solr index, no remote call is involved.

If the list of Tenant/ACLID/DBID triples is not empty, that means we need to fetch and update the text content of the corresponding documents. In order to do that, each document is wrapped in a Runnable object and submitted to a thread pool executor. That makes each document content processing asynchronous.
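
The "wrap each triple in a Runnable and submit it to a pool" step could look like the sketch below. All names here (ContentSchedulingSketch, DocRef, ContentUpdater, processAll) are hypothetical. The sketch also creates a short-lived pool and waits for it to drain so that it can report how many documents were processed, which is a simplification made for the demo and not necessarily how the tracker manages its threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ContentSchedulingSketch {

    // Hypothetical holder for a Tenant/ACLID/DBID triple.
    static final class DocRef {
        final String tenant;
        final long aclId;
        final long dbId;

        DocRef(String tenant, long aclId, long dbId) {
            this.tenant = tenant;
            this.aclId = aclId;
            this.dbId = dbId;
        }
    }

    // Hypothetical stand-in for the component that performs the actual update.
    interface ContentUpdater {
        void updateContent(DocRef ref);
    }

    // Submits one task per unclean document and returns how many completed.
    static int processAll(List<DocRef> unclean, int threads, ContentUpdater updater) {
        if (unclean.isEmpty()) {
            return 0; // nothing to refresh in this cycle
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        for (DocRef ref : unclean) {
            pool.submit(() -> {
                updater.updateContent(ref);
                done.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            // Demo only: wait so the caller can read the final count.
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        List<DocRef> unclean = Arrays.asList(new DocRef("", 1L, 10L), new DocRef("", 1L, 11L));
        int processed = processAll(unclean, 2,
                ref -> System.out.println("updating DBID " + ref.dbId));
        System.out.println("processed = " + processed);
    }
}
```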

The ContentIndexWorkerRunnable, once executed, delegates the actual update to the SolrInformationServer which, as said above, contains the logic needed for dealing with the underlying Solr infrastructure; specifically:

  • the document that needs to be refreshed, uniquely identified by the tenant and the DB identifier, is retrieved from the local content store. If the cached document cannot be found in the content store, the /api/solr/metadata remote API is contacted in order to rebuild the document (metadata only) from scratch.
  • the /api/solr/textContent remote API is called in order to fetch the text content associated with the node, plus the transformation metadata (e.g. status, exception, elapsed time)
  • if the alfresco.fingerprint configuration property is set to true and the retrieved text is not empty, the fingerprint is computed and stored in the MINHASH field of the document
  • the content fields are set
  • the document is marked as clean (i.e. FTSSTATUS = "Clean") since its content is now up to date
  • the cached version is overwritten in the content store with the up to date definition
  • the document (which is a SolrInputDocument instance) is indexed in Solr
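
The steps above can be summarised with an in-memory sketch. It deliberately avoids the real Solr and Alfresco APIs: the content store is a plain Map, the two remote calls are passed in as functions, the "content" field name is hypothetical, and the fingerprint step is left out. Only the FTSSTATUS field and its "Clean" value come from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class ContentUpdateSketch {

    // Refreshes the content of one document and returns the version that
    // would then be sent to the index.
    static Map<String, Object> updateContent(
            String tenant,
            long dbId,
            Map<String, Map<String, Object>> contentStore,    // stand-in for the ContentStore
            Function<Long, Map<String, Object>> metadataApi,  // stand-in for /api/solr/metadata
            Function<Long, String> textContentApi) {          // stand-in for the text content API

        String key = tenant + ":" + dbId;

        // 1. Cached version, or a rebuild from metadata when it is missing.
        Map<String, Object> cached = contentStore.get(key);
        Map<String, Object> doc = new HashMap<>(cached != null ? cached : metadataApi.apply(dbId));

        // 2. Fetch the text and set the content field (hypothetical field name).
        doc.put("content", textContentApi.apply(dbId));

        // 3. The content is now up to date.
        doc.put("FTSSTATUS", "Clean");

        // 4. Overwrite the cached version; the caller would index the result.
        contentStore.put(key, doc);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> store = new HashMap<>();
        Map<String, Object> doc = updateContent("", 42L, store,
                id -> {
                    Map<String, Object> metadata = new HashMap<>();
                    metadata.put("DBID", id);
                    metadata.put("FTSSTATUS", "New");
                    return metadata;
                },
                id -> "text of node " + id);
        System.out.println(doc.get("FTSSTATUS") + " / " + doc.get("content"));
    }
}
```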

Rollback

The Rollback sequence diagram illustrates how the rollback process works:

The commit/rollback process is a responsibility of the CommitTracker, so the ContentTracker is involved in these processes only indirectly.

When it is executed, the CommitTracker acquires the execution locks from the MetadataTracker and the AclTracker. Then it checks if one of them is in a rollback state. As we can imagine, that check will return true if some unhandled exception has occurred during indexing.

If one of the two trackers above reports an active rollback state, the CommitTracker lists all trackers, invalidates their state and issues a rollback command to Solr. That means any update sent to Solr by any tracker will be reverted.
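
The decision taken by the CommitTracker can be sketched as follows. This is a simplified illustration, not the real implementation: the lock acquisition is omitted, and RollbackSketch, TrackerFlag and IndexCommands are hypothetical names. The commit in the "no rollback" branch is also an assumption, based on the CommitTracker being described as the component that persists changes.

```java
import java.util.Arrays;
import java.util.List;

final class RollbackSketch {

    // Hypothetical, minimal view of a tracker's state.
    static final class TrackerFlag {
        final String name;
        boolean rollback;    // set when indexing hit an unhandled exception
        boolean invalidated; // set when the tracker state has been reset

        TrackerFlag(String name) {
            this.name = name;
        }
    }

    // Hypothetical stand-in for the commands sent to Solr.
    interface IndexCommands {
        void commit();
        void rollback();
    }

    // Returns true when changes were committed, false when they were rolled back.
    static boolean commitOrRollback(TrackerFlag metadata, TrackerFlag acl,
                                    List<TrackerFlag> allTrackers, IndexCommands index) {
        if (metadata.rollback || acl.rollback) {
            for (TrackerFlag tracker : allTrackers) {
                tracker.invalidated = true; // every tracker is affected, not only the failing one
            }
            index.rollback();
            return false;
        }
        index.commit();
        return true;
    }

    public static void main(String[] args) {
        TrackerFlag metadata = new TrackerFlag("metadata");
        TrackerFlag acl = new TrackerFlag("acl");
        TrackerFlag content = new TrackerFlag("content");
        acl.rollback = true;
        boolean committed = commitOrRollback(metadata, acl, Arrays.asList(metadata, acl, content),
                new IndexCommands() {
                    public void commit() { System.out.println("commit"); }
                    public void rollback() { System.out.println("rollback"); }
                });
        System.out.println("committed = " + committed
                + ", content invalidated = " + content.invalidated);
    }
}
```

Note how the ContentTracker state is invalidated even though only the MetadataTracker and the AclTracker are inspected.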

How does the ContentTracker work in shard mode?

The only source that the ContentTracker checks in order to determine the "unclean" content that needs to be updated is the local index. As a consequence, the ContentTracker behavior is the same regardless of the shape of the search infrastructure and the context where the hosting Solr instance lives. That is, if we are running a standalone Solr instance there will be one ContentTracker for each core, watching the corresponding (monolithic) index. If instead we are in a sharded scenario, each shard will have a ContentTracker instance that uses the local shard index.

How does the ContentTracker work in Master/Slave mode?

In order to work properly in a Master/Slave infrastructure, the Tracker Subsystem (not only the ContentTracker) needs to be

  • enabled on Master(s)
  • disabled on Slaves

The only exceptions to that rule are:

  • The MetadataTracker: only if the search infrastructure uses dynamic sharding [2], the MetadataTracker is in charge of registering the Solr instance (the Shard) with Alfresco so it will be included in subsequent queries. The tracker itself, in this scenario, won't track anything.
  • The ModelTracker: each Solr instance pulls, by means of this tracker, the custom models from Alfresco, so it must be enabled in any case.

The document in the SearchServices repository provides an additional paragraph with the configuration attributes related to the Tracker Subsystem. I didn't put that long table in this post because it doesn't add any information: if you need to configure the trackers just have a look at the end of that document.

What's next?

The Tracker Subsystem is one of the main areas where the Search Team is devoting analysis and investigation efforts: that should make room for introducing further improvements in the architecture.

-------

[1] https://en.wikipedia.org/wiki/Template_method_pattern 

[2] http://docs.alfresco.com/5.1/concepts/solr-shard-config.html
