Zend_Search_Lucene Datasource for CakePHP

13 Jan 2010

Major update January 22/10: much of the content of this article has been updated to reflect the changes to the datasource, the latest version of which you can download on Github.

Just out of the oven – a Zend_Search_Lucene datasource for CakePHP (built with 1.2 but probably works just fine in 1.3) that I originally wrote for an in-house CMS site search plugin. I can’t release the plugin itself (and there’s so much CMS-specific code that it would need a lot of work to make it generic anyway), but I thought that someone might find the datasource itself useful. It’s pretty basic at this point and doesn’t implement some of the fancier Zend_Search_Lucene features such as custom sorting (it just returns results sorted by score, which is probably what you want anyway).

Zend_Search_Lucene is a text-based search index system for developers who don’t want to (or can’t) use a database for search indexing.

Download the current version of the ZendSearchLuceneDatasource from my Github repository.

I won’t go into detail about how to add data into the Lucene database since the Zend Framework documentation is so good (CakePHP should be jealous!). You’ll find all the info you need there. There are also a couple of older articles out there that show how you can integrate Zend_Search_Lucene into CakePHP.

Setup

First, copy zend_search_lucene.php to models/datasources.

Then, you’ll need to download the Zend_Search_Lucene library from the Zend Framework website and put some files into your /vendors directory:

  • Zend/Search (the directory and all of its contents)
  • Zend/Exception.php

You’ll also need to update your include path to include the vendors directory, since the Zend Framework loads a lot of classes on its own. I also made a little autoload function to make the loading of Zend Framework classes easier. Put the following code somewhere common, such as app/config/bootstrap.php:

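// Add the vendors directory to the include path so Zend's own includes resolve.
ini_set('include_path', ini_get('include_path') . PATH_SEPARATOR . CAKE_CORE_INCLUDE_PATH . DS . 'vendors');

// Autoload Zend Framework classes, e.g. Zend_Search_Lucene -> Zend/Search/Lucene.php.
function zendAutoload($class) {
	if (substr($class, 0, 5) == 'Zend_') {
		include str_replace('_', '/', $class) . '.php';
	}
}
spl_autoload_register('zendAutoload');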
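To make the class-to-file mapping concrete, here is a minimal sketch of the translation the autoload function performs. The helper name zendClassToPath is hypothetical and exists only for illustration; it isn't part of the datasource or of Zend Framework.

```php
<?php
// Hypothetical helper (illustration only): translates a Zend-style class
// name into the relative path the autoloader would try to include.
function zendClassToPath($class) {
	// Only Zend Framework classes are handled; anything else is ignored.
	if (substr($class, 0, 5) != 'Zend_') {
		return null;
	}
	return str_replace('_', '/', $class) . '.php';
}

echo zendClassToPath('Zend_Search_Lucene_Document'), "\n";
// prints "Zend/Search/Lucene/Document.php"
```

This is also why the files have to sit under a Zend/ directory inside a folder on the include path: the relative path is resolved against each include path entry in turn.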

You also need to put the DB config for the datasource in config/database.php (updated Jan 20/2010 for better DebugKit compatibility):

var $zendSearchLucene = array(
	'datasource' => 'ZendSearchLucene',
	'indexFile' => 'lucene', // stored in the cache dir.
	'driver' => '',
	'source' => 'search_indices'
);

Then, in the model that’ll act as your search index (say, for example, SearchIndex), specify the DB config:

<?php
class SearchIndex extends AppModel {
	var $useDbConfig = 'zendSearchLucene';
}
?>

Saving/Indexing

I’ve tried to keep the datasource functions as simple and familiar as possible. When saving an item to the index, the datasource expects a multidimensional array for each item. For compatibility with CakePHP’s datasource code, the ‘meat’ of the data is nested in the third level of the array. Each sub-array contains information about a field to be stored. For example:

$saveData = array('SearchIndex' => array(
	'document' => array(
		array(
			'key' => 'name',
			'value' => $record[$Model->alias][$this->settings[$Model->alias]['name']],
			'type' => 'Text'
		),
		array(
			'key' => 'description',
			'value' => $record[$Model->alias][$this->settings[$Model->alias]['description']],
			'type' => 'Text'
		),
		array(
			'key' => 'url',
			'value' => $this->__constructUrl($Model, $record),
			'type' => 'Text'
		)
	)
));
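If your fields arrive as a flat key/value map, a small wrapper can produce the nested structure for you. This is only a sketch: buildSearchDocument is a hypothetical helper, and the sample field values and the default type of 'Text' are assumptions made for the example.

```php
<?php
// Hypothetical helper (illustration only): wraps a flat field => value map
// in the nested array structure described above.
function buildSearchDocument($modelAlias, $fields, $type = 'Text') {
	$document = array();
	foreach ($fields as $key => $value) {
		// One sub-array per field: its name, its value, and its Lucene field type.
		$document[] = array(
			'key' => $key,
			'value' => $value,
			'type' => $type
		);
	}
	return array($modelAlias => array('document' => $document));
}

// Sample values are made up for the example.
$saveData = buildSearchDocument('SearchIndex', array(
	'name' => 'Getting started with CakePHP',
	'url' => '/articles/view/1'
));
print_r($saveData);
```

The resulting array has the same shape as the hand-written one, so it can be handed to the model's save() call in the same way.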

Passing that data in a Model::save() call will in turn execute the following Zend code (more or less – this is a very simplified version of the actual ZendSearchLuceneSource saving code):

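// Open the index at the path set in the DB config.
$index = Zend_Search_Lucene::open('/path/to/the/index/set/in/dbConfig');
$doc = new Zend_Search_Lucene_Document();
foreach ($data as $field) {
	// $field['type'] names a static factory on Zend_Search_Lucene_Field, e.g. Text().
	$factory = array('Zend_Search_Lucene_Field', $field['type']);
	$doc->addField(call_user_func($factory, $field['key'], $field['value']));
}
$index->addDocument($doc);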
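Since the 'type' value ends up being used as a method name, it's worth checking it before dispatching. The sketch below shows one way to do that. FakeLuceneField is a stand-in class so the example runs without Zend Framework installed, and makeField is a hypothetical helper; neither is part of the datasource. The allowed names are the field factories listed in the Zend_Search_Lucene documentation (Keyword, UnIndexed, Binary, Text, UnStored).

```php
<?php
// Stand-in for Zend_Search_Lucene_Field (assumption: illustration only).
class FakeLuceneField {
	public static function Text($name, $value) {
		return array('type' => 'Text', 'name' => $name, 'value' => $value);
	}
	public static function UnIndexed($name, $value) {
		return array('type' => 'UnIndexed', 'name' => $name, 'value' => $value);
	}
}

// Hypothetical helper: builds a field from one entry of the 'document' array,
// rejecting type names that are not known field factories.
function makeField($fieldClass, $field) {
	$allowed = array('keyword', 'unindexed', 'binary', 'text', 'unstored');
	// PHP method names are case-insensitive, so compare in lower case.
	if (!in_array(strtolower($field['type']), $allowed, true)) {
		throw new InvalidArgumentException('Unknown field type: ' . $field['type']);
	}
	return call_user_func(array($fieldClass, $field['type']), $field['key'], $field['value']);
}

$field = makeField('FakeLuceneField', array(
	'key' => 'name',
	'value' => 'Hello',
	'type' => 'Text'
));
print_r($field);
```

With the real library you would pass 'Zend_Search_Lucene_Field' as the class name instead of the stand-in.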

Obviously that’s a basic example; you’ll probably send a whole bunch of dynamic info to the indexer. But that’s the gist of it anyway.

Querying

You can search for records just like you would with a regular datasource. Pass the search terms as a “query” condition. If you want the search terms to be highlighted in the returned results, pass ‘highlight’ => true in the array of options. Note that only indexed fields will be highlighted.

You can find all results:

function search($term) {
	// e.g. $term = 'best cakephp tutorials'
	$results = $this->SearchIndex->find('all', array('highlight' => true, 'conditions' => array('query' => $term)));
}

You can mimic Google’s I’m Feeling Lucky with find(‘first’):

function search($term) {
	// e.g. $term = 'best cakephp tutorials'
	$topResult = $this->SearchIndex->find('first', array('conditions' => array('query' => $term)));
}

You can even paginate:

function search($term) {
	// e.g. $term = 'best CakePHP tutorials'
	$this->paginate = array(
		'limit' => 10,
		'conditions' => array('query' => $term),
		'highlight' => true
	);

	$results = $this->paginate();
}
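All three calls above take the same options shape, so you could build it in one place. Here's a rough sketch; buildSearchOptions is a hypothetical helper, not something the datasource provides.

```php
<?php
// Hypothetical helper (illustration only): builds the options array used
// with find() or $this->paginate in the examples above.
function buildSearchOptions($term, $highlight = false, $limit = null) {
	// The search terms always go in the 'query' condition.
	$options = array('conditions' => array('query' => $term));
	if ($highlight) {
		$options['highlight'] = true;
	}
	if ($limit !== null) {
		$options['limit'] = $limit;
	}
	return $options;
}

$options = buildSearchOptions('best cakephp tutorials', true, 10);
print_r($options);

// In a controller you would then write, for example:
//   $results = $this->SearchIndex->find('all', $options);
// or
//   $this->paginate = $options;
```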

Results are returned in the expected CakePHP way, as a multidimensional array – $results[0]['MyModelAlias'] for multiple records, $results['MyModelAlias'] for one (i.e. with find(‘first’)).
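As a rough illustration of working with that structure, the sketch below loops over a find('all') style result and builds plain links. The sample data, the alias 'SearchIndex', and the field names 'name' and 'url' are assumptions carried over from the indexing example; renderSearchResults is a hypothetical helper. It escapes its output, so it suits results fetched without highlighting.

```php
<?php
// Hypothetical helper (illustration only): turns a find('all') style
// result array into a list of HTML links.
function renderSearchResults($results, $alias) {
	$links = array();
	foreach ($results as $row) {
		// Each record is nested under the model alias.
		$item = $row[$alias];
		$links[] = '<a href="' . htmlspecialchars($item['url']) . '">'
			. htmlspecialchars($item['name']) . '</a>';
	}
	return $links;
}

// Made-up sample data shaped like the results described above.
$results = array(
	array('SearchIndex' => array('name' => 'CakePHP basics', 'url' => '/articles/1')),
	array('SearchIndex' => array('name' => 'Lucene tips', 'url' => '/articles/2'))
);

$links = renderSearchResults($results, 'SearchIndex');
echo implode("\n", $links), "\n";
```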

There you go – enjoy! As always, comments and suggestions are welcomed.

I used the RSS Feed datasource by Loadsys as a guide to good datasource design. I may have borrowed a function or two. ;)

Neil Crookes’ Searchable plugin also helped.

16 Responses to Zend_Search_Lucene Datasource for CakePHP

Guillaume

January 13th, 2010 at 3:34 pm

Good work! I wanted to do it for a long time… too much procrastination!

Neil Crookes

January 14th, 2010 at 12:15 pm

Nice one Jamie, thanks for sharing (and the link). How is Lucene’s search algorithm? I know MySQL FullText search can be adjusted, but I find it a bag o’ sh1te!

Jamie

January 15th, 2010 at 1:53 pm

Thanks Neil. Yeah, MySQL FullText is just way too slow and limited. Lucene’s indexing methods make searching and sorting way faster than a FullText search. MySQL has some benefits, like the ability to cross-join between indexes and the ability to modify existing records (since in Lucene you don’t update, you just delete and re-enter), but the speed of Lucene and its additional features like proximity searching, fuzzy searching, and term boosting/weighting just make it a better choice all around for larger search indexes.

Abba Bryant

January 16th, 2010 at 5:25 pm

I am working with this code and it isn’t indexing anything. I use Luke to look at the index the datasource creates and the tool says the fields don’t exist and no records are available.

What should the inside of the search index folder look like under ‘tmp’ if, for example, I tell the datasource to use an indexFile value of ‘search’?

Making a save call to a Search model using the datasource from the afterSave callback of a Document model seems to create a segments.gen, segments_1, and some .lock files in the tmp/search folder currently. Nothing else. If you like you can email me so we can take this out of the comments until we find out what I am doing wrong.

Jamie

January 17th, 2010 at 9:29 am

Hi Abba – it’s difficult to know what the problem is without seeing your code. I can’t promise that I’ll have a lot of time to help, but if you paste in the relevant saving code (along with a sample of the data) then I can give it a look.

Major Update / Make-Workage to Zend_Search_Lucene Datasource for CakePHP | Jamie Nay

January 22nd, 2010 at 2:41 pm

[...] datasource for CakePHP. You can find the latest version on Github, and I’ve also updated the tutorial to reflect the [...]

Abba Bryant

January 25th, 2010 at 8:26 am

Let me play around with the updated datasource and see if I can spot what I am doing wrong before I post any code.

I was reading about .seg files and Lucene and some issues are known with 32bit php builds. I might be running into that as my local dev seems to break whereas remotely I can (sometimes) get it to index.

Jamie

January 25th, 2010 at 8:46 am

Well, I can tell you that the first version of the datasource was woefully broken in a few different ways – you should have more success with the current version. Let me know if you’re still having trouble getting it to work.

David Umoh

February 23rd, 2010 at 3:13 am

Thanks Jamie! but i am having some difficulties implementing…i am kinda new to cakephp…i had tried the other tutorial: ‘Integrating Zend Framework Lucene with your Cake Application’ before seeing yours…i had gone as far as successfully creating the index…it worked…the challenge i had was the querying…the code doesn’t seem to work with the latest version of cakephp (which is what i am using)…Yours is more recent and better organised…Please how can i pass the index information from mysql database into your datasource….

I could show you the code i am using to index, if that would help….

Thank you very much

Jamie

February 24th, 2010 at 12:41 pm

Sure, David – what does your code look like?

Caio Gouveia

February 25th, 2010 at 6:15 am

You save my day Jamie!!
thanks a lot dude !

David Umoh

March 1st, 2010 at 6:18 am

thanks Jamie, sorry i am reply now, i was kinda on holiday…Pls find below the code i am using to index: I need to know how to use it with your datasource, that is, read the info from the database and pass the results to your indexer:
addField(Zend_Search_Lucene_Field::UnIndexed('document_id', $document->id));

$doc->addField(Zend_Search_Lucene_Field::Text('document_title', $document->company_name));
$doc->addField(Zend_Search_Lucene_Field::Text('document_description', $document->description));

// Add the document to the index
$index->addDocument($doc);

}

// Commit the index
$index->commit();

?>

David Umoh

March 1st, 2010 at 6:25 am

i found out its truncating my code so let me put the relevant part:
//above we do the necessary bootstraping,
//run the database query to get the records then index:
$companies_rec = mysql_query($sql);

while ($document = mysql_fetch_object($companies_rec)) {
	// print_r($document);
	// Create a new searchable document instance
	$doc = new Zend_Search_Lucene_Document();
	// Add some information
	$doc->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', $document->id));
	$doc->addField(Zend_Search_Lucene_Field::Text('document_title', $document->company_name));
	$doc->addField(Zend_Search_Lucene_Field::Text('document_description', $document->description));
	// Add the document to the index
	$index->addDocument($doc);
}
// Commit the index
$index->commit();

?>

Joseph Le Brech

June 21st, 2010 at 2:16 am

I’m struggling with this error message.

Unable to import DataSource class .ZendsearchluceneSource [CORE\cake\libs\model\connection_manager.php, line 186]

have i missed a step somehow?

I’m using 1.3 for this.

Jamie

June 21st, 2010 at 7:23 pm

Joseph – can you post your code from config/database.php, and also any relevant code in your models?
