When accessing the SOLR Web Admin Console, both the alfresco and archive cores provide an Overview section that includes relevant data.
In this case, the alfresco core contains 5,928 documents and uses 104.82 MB of disk storage.
A full search of every document in the core can be run by using the /select handler with an asterisk (*) as the search criterion.
This query returns 2,825 documents, far fewer than the 5,928 documents reported in the Overview section.
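The asterisk search can also be issued from a script. The following is a minimal sketch, not the way the figures in this post were obtained: the base URL (host, port and core path) is an assumption, and a secured Alfresco SOLR endpoint may additionally require client certificates or a shared secret that this sketch does not handle.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed location of the core; adjust host, port and core name to your setup.
SOLR_CORE_URL = "http://localhost:8983/solr/alfresco"


def build_select_url(core_url, query="*", rows=0):
    """Build a /select URL that asks only for the hit count (rows=0)."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{core_url}/select?{params}"


def num_found(response_body):
    """Extract the total hit count from a SOLR JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


def count_documents(core_url=SOLR_CORE_URL):
    """Run the asterisk query against the core and return the hit count."""
    with urlopen(build_select_url(core_url)) as resp:
        return num_found(resp.read().decode("utf-8"))
```

Calling count_documents() against the core described above would be expected to return the same count as the Admin Console query.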
Finding the missing documents
Alfresco SOLR indexes store Nodes from the Alfresco Repository (2,825 in this case), but they also include additional documents required for tracking and searching operations. Every document in an Alfresco SOLR index includes a property named DOC_TYPE that describes the type of the document. Exploring this field in the Schema option gives the breakdown of the total document count (5,928):
- 3,036 Tx documents to track Repository database transactions
- 2,825 Node documents to track Alfresco Repository Nodes (folders, files and other types of nodes)
- 58 Acl documents to track permissions
- 7 AclTx documents to track ACL transactions
- 2 State documents to track internal status of the SOLR Core
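The breakdown above can be checked with a few lines of arithmetic. The sketch below also includes a helper that converts a SOLR facet result into a dictionary, assuming the default flat JSON layout (value, count, value, count, ...); whether a facet query on DOC_TYPE exposes all document types through the Alfresco query handlers is an assumption, so the counts are hard-coded from the list above.

```python
def facet_list_to_counts(flat_facet):
    """Turn SOLR's flat facet list [value, count, value, count, ...] into a dict."""
    return dict(zip(flat_facet[0::2], flat_facet[1::2]))


# Counts taken from the DOC_TYPE breakdown listed above.
doc_type_counts = facet_list_to_counts(
    ["Tx", 3036, "Node", 2825, "Acl", 58, "AclTx", 7, "State", 2]
)

total_docs = sum(doc_type_counts.values())
hidden_docs = total_docs - doc_type_counts["Node"]

print(f"total documents:  {total_docs}")   # 5928, matching the Overview section
print(f"hidden documents: {hidden_docs}")  # 3103 (Tx + Acl + AclTx + State)
print(f"hidden share:     {hidden_docs / total_docs:.1%}")
```

More than half of the documents in this index are therefore tracking documents rather than Repository Nodes.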
Estimating the storage size for hidden documents
Using tools like Luke, these documents (Tx, Acl, AclTx and State) can be removed from the Index in order to calculate the disk storage required by them.
Once all these documents have been deleted, leaving only the 2,825 Node documents for Alfresco Repository Nodes, the document count and the storage size reflect the raw information for Alfresco Nodes.
Around 4 MB have been removed from the original storage size (104.82 MB), so this is the storage used by the hidden documents in this SOLR index.
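As a rough illustration of what these figures imply, the sketch below derives the storage left for Node documents and an average size per document. The 4 MB value is the approximate figure quoted above, so the results are estimates only; the function names are hypothetical helpers for this example.

```python
INDEX_MB = 104.82   # storage reported in the Overview section
HIDDEN_MB = 4.0     # approximate storage freed by removing hidden documents
NODE_DOCS = 2825
HIDDEN_DOCS = 3036 + 58 + 7 + 2  # Tx + Acl + AclTx + State


def node_storage_mb(index_mb, hidden_mb):
    """Storage that remains once the hidden documents are removed."""
    return index_mb - hidden_mb


def avg_kb_per_doc(storage_mb, doc_count):
    """Average storage per document, in KB (1 MB taken as 1024 KB)."""
    return storage_mb * 1024 / doc_count


node_mb = node_storage_mb(INDEX_MB, HIDDEN_MB)
print(f"node storage:        {node_mb:.2f} MB")
print(f"avg per Node doc:    {avg_kb_per_doc(node_mb, NODE_DOCS):.1f} KB")
print(f"avg per hidden doc:  {avg_kb_per_doc(HIDDEN_MB, HIDDEN_DOCS):.1f} KB")
```

The hidden documents outnumber the Node documents but account for only a small fraction of the disk usage.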
Recap
- Don't use these techniques on a live Alfresco SOLR index, as you can corrupt it and it will no longer work with the Alfresco Repository
- This information is partially replicated on every SOLR shard, so when splitting your SOLR index into shards you need to provide more storage than the expected Index Storage Size / # of Shards
- When splitting into shards there is also additional replicated information for term vectors and doc values that depends on your custom content model
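The sharding point can be made concrete with a small sketch. It is only an illustration under a strong assumption: that the Node storage is split evenly across shards while the whole hidden-document storage is repeated on each shard. The post only says the information is partially replicated, and the extra term vector and doc values overhead is not modelled here. The function name and its parameters are hypothetical.

```python
def per_shard_storage_mb(node_mb, replicated_mb, shards):
    """Estimate storage per shard: an even share of the Node storage
    plus the storage assumed to be repeated on every shard."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    return node_mb / shards + replicated_mb


shards = 4
naive = 104.82 / shards                               # Index Storage Size / # of Shards
estimate = per_shard_storage_mb(100.82, 4.0, shards)  # figures from this post

print(f"naive per-shard size:     {naive:.2f} MB")
print(f"estimate with replicated: {estimate:.2f} MB")
```

With a small index the difference is minor, but it grows with the number of shards because the replicated part does not shrink as shards are added.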