Brainmaker

Nanos gigantium humeris insidentes!

Lucene

  • August 8, 2012 8:49 pm

http://kalanir.blogspot.com/2008/06/creating-search-index-in-database.html

 

Creating Lucene Index in a Database – Apache Lucene

My previous post, Indexing a database and searching the content using Lucene, shows how to index records (or stored files) in a database. In that case the index is created in the local file system. In real scenarios, however, most applications run in clustered environments, which raises the question of where to create the search index.

Creating the index in the local file system does not work in that situation, because the index must be synchronized and shared by every node. One solution is to cluster the JVM while using a Lucene RAMDirectory (keep in mind that it disappears after a node failure) instead of an FSDirectory. The Terracotta framework can be used to cluster the JVM. This blog entry shows a code snippet.

Anyway, I decided not to go that far and to create the index in the database instead, so that it can be shared by every node. Lucene’s Directory abstraction makes this possible, but a JDBC-backed implementation is not shipped with Lucene itself. I found a third-party one: the Compass project provides JdbcDirectory. (There is no need to worry about Compass configuration; JdbcDirectory can be used with pure Lucene, without the rest of Compass.)

Here is a simple example

// you need to include the Lucene and JDBC jars
import org.apache.lucene.store.jdbc.JdbcDirectory;
import org.apache.lucene.store.jdbc.dialect.MySQLDialect;
import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;


// code snippet to create the index
MysqlDataSource dataSource = new MysqlDataSource();
dataSource.setUser("root");
dataSource.setPassword("password");
dataSource.setDatabaseName("test");
dataSource.setEmulateLocators(true); // This is important because we are dealing with a blob type data field
JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "indexTable");
jdbcDir.create(); // creates the indexTable in the DB (test). No need to create it manually


// code snippet for indexing
StandardAnalyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(jdbcDir, analyzer, true);
indexDocs(writer, dataSource.getConnection());
System.out.println("Optimizing...");
writer.optimize();
writer.close();

static void indexDocs(IndexWriter writer, Connection conn)
        throws Exception {
    String sql = "select id, name, color from pet";
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(sql);
    while (rs.next()) {
        Document d = new Document();
        d.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NO));
        d.add(new Field("name", rs.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
        d.add(new Field("color", rs.getString("color"), Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(d);
    }
}

This is the indexing part. The searching part is the same as the one in my previous post.

Desktop Search Features

  • August 5, 2012 1:06 am
  • all files in sdcard — plain text 
  • user’s text message
  • User’s email
  • Third party
    • dropbox
    • google drive
    • pocket (formerly read it later)

Android

  • August 5, 2012 12:57 am
  • FileObserver

Lucene in 5 minutes

  • August 2, 2012 11:28 pm

http://www.lucenetutorial.com/lucene-in-5-minutes.html

Lucene makes it easy to add full-text search capability to your application. In fact, it’s so easy, I’m going to show you how in 5 minutes!

1. Index

For this simple case, we’re going to create an in-memory index from some strings.

// The same analyzer is used for indexing and, later, for searching.
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);

IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action");
addDoc(w, "Lucene for Dummies");
addDoc(w, "Managing Gigabytes");
addDoc(w, "The Art of Computer Science");
w.close();

addDoc() takes a string and adds it to the index:

private static void addDoc(IndexWriter w, String value) throws IOException {
    Document doc = new Document();
    // Store the title so it can be displayed, and analyze it so it can be searched.
    doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}

 

2. Query

We read the query from the command line (falling back to “lucene” when no argument is given), parse it and build a Lucene Query out of it.

String querystr = args.length > 0 ? args[0] : "lucene";
Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(querystr);

 

3. Search

Using the Query, we create an IndexSearcher to search the index, then instantiate a TopScoreDocCollector to collect the top 10 scoring hits.

int hitsPerPage = 10;
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

 

4. Display

Now that we have results from our search, we display the results to the user.

System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println((i + 1) + ". " + d.get("title"));
}

Here’s the app in its entirety (HelloLucene.java).

 

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import java.io.IOException;

public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

        // 1. create the index
        Directory index = new RAMDirectory();

        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);

        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "Lucene in Action");
        addDoc(w, "Lucene for Dummies");
        addDoc(w, "Managing Gigabytes");
        addDoc(w, "The Art of Computer Science");
        w.close();

        // 2. query
        String querystr = args.length > 0 ? args[0] : "lucene";

        // the "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(querystr);

        // 3. search
        int hitsPerPage = 10;
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("title"));
        }

        // searcher can only be closed when there
        // is no need to access the documents any more.
        searcher.close();
    }

    private static void addDoc(IndexWriter w, String value) throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }
}

To use this app from the command line, type java HelloLucene <query>

Basic Concepts

  • August 2, 2012 11:20 pm

http://www.lucenetutorial.com/basic-concepts.html

Lucene is a full-text search library which makes it easy to add search functionality to an application or website.

It does so by adding content to a full-text index. It then searches this index and returns results ranked by either the relevance to the query or by an arbitrary field such as a document’s last modified date.

 

Searching and Indexing

Lucene is able to achieve fast search responses because it searches an index rather than the text itself. This is the equivalent of finding the pages in a book related to a keyword by looking in the index at the back of the book, as opposed to scanning the words on every page.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
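To make the page->words versus word->pages idea concrete, here is a minimal sketch in plain Java. It is only an illustration of the concept: the class and method names are made up for this example, the splitting rule is a crude stand-in for an analyzer, and Lucene's real index structures are far more elaborate.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each word to the set of page numbers containing it.
// Illustration only; this is not how Lucene stores its index.
class InvertedIndexSketch {

    // Inverts page -> text into word -> sorted set of pages.
    static Map<String, Set<Integer>> build(Map<Integer, String> pages) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> page : pages.entrySet()) {
            // Lowercase and split on non-letters: a crude stand-in for an analyzer.
            for (String word : page.getValue().toLowerCase().split("[^a-z]+")) {
                if (word.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return index;
    }

    // Looks a word up directly instead of scanning every page.
    static Set<Integer> lookup(Map<String, Set<Integer>> index, String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<Integer, String> pages = new LinkedHashMap<>();
        pages.put(1, "Lucene in Action");
        pages.put(2, "Lucene for Dummies");
        pages.put(3, "Managing Gigabytes");

        Map<String, Set<Integer>> index = build(pages);
        System.out.println(lookup(index, "lucene"));    // [1, 2]
        System.out.println(lookup(index, "gigabytes")); // [3]
    }
}
```

The lookup cost depends on the number of distinct words, not on the total amount of text, which is the point of inverting the structure.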

Documents

In Lucene, a Document is the unit of search and index.

An index consists of one or more Documents.

Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.

A Lucene Document doesn’t necessarily have to be a document in the common English usage of the word. For example, if you’re creating a Lucene index of a database table of users, then each user would be represented in the index as a Lucene Document.

Fields

A Document consists of one or more Fields. A Field is simply a name-value pair. For example, a Field commonly found in applications is title. In the case of a title Field, the field name is title and the value is the title of that content item.

Indexing in Lucene thus involves creating Documents comprising one or more Fields, and adding these Documents to an IndexWriter.
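The Document and Field model can be pictured without Lucene at all. The sketch below uses a plain map of field name to field value as a stand-in for a Document, and a list as a stand-in for the IndexWriter; the users table, its columns and all names here are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the Document/Field model; these are not Lucene classes.
class DocumentModelSketch {

    // A "document" here is just an ordered map of field name -> field value.
    static Map<String, String> toDocument(String id, String name, String email) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("name", name);
        doc.put("email", email);
        return doc;
    }

    public static void main(String[] args) {
        // Hypothetical rows from a users table: {id, name, email}.
        String[][] rows = {
            {"1", "Ada", "ada@example.com"},
            {"2", "Linus", "linus@example.com"}
        };

        // The list stands in for an IndexWriter: one document per user row.
        List<Map<String, String>> index = new ArrayList<>();
        for (String[] row : rows) {
            index.add(toDocument(row[0], row[1], row[2]));
        }

        System.out.println(index.size() + " documents; first name field: "
                + index.get(0).get("name"));
    }
}
```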

Searching

Searching requires an index to have already been built. It involves creating a Query (usually via a QueryParser) and handing this Query to an IndexSearcher, which returns a list of Hits.

Queries

Lucene has its own mini-language for performing searches. Read more about the Lucene Query Syntax

The Lucene query language allows the user to specify which field(s) to search on, to give some fields more weight than others (boosting), to perform boolean queries (AND, OR, NOT), and more.
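A query such as title:lucene AND title:action can be thought of as a set operation on the documents that match each term. The following plain-Java sketch shows that idea only; it does not parse Lucene query syntax, the document ids are invented, and Lucene's own evaluation and scoring work differently.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Boolean operators expressed as set operations on document-id sets.
// A conceptual sketch only; it is not Lucene's query evaluation.
class BooleanQuerySketch {

    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.retainAll(b); // keep ids present in both sets
        return out;
    }

    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.addAll(b); // ids present in either set
        return out;
    }

    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.removeAll(b); // ids in a but not in b
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical matches: ids of documents whose title contains each word.
        Set<Integer> lucene = new TreeSet<>(Arrays.asList(1, 2));
        Set<Integer> action = new TreeSet<>(Arrays.asList(1, 4));

        System.out.println(and(lucene, action));    // [1]
        System.out.println(or(lucene, action));     // [1, 2, 4]
        System.out.println(andNot(lucene, action)); // [2]
    }
}
```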

Your First Lucene Project

  • August 2, 2012 11:08 pm

http://www.lucenetutorial.com/your-first-project.html

1. Start with your search results page

Pay attention to things like what data is to be displayed and how you’d like the results ranked.

2. Map your application to the Lucene model

From the search results page, determine what steps need to be taken to get your data into Lucene.

First, determine what Fields there are in a Document.

Then, if your data is in a database for example, you would determine which database tables and columns need to be accessed, and what SQL select statements need to be executed.

3. Write the indexing code

Whether it’s files or a database that needs to be indexed, start by writing your indexer. Start out simple; don’t worry about efficiency or performance for now.

When the first index has been created, browse it using Luke and make sure it looks right, i.e. all the fields are there, all documents that should be indexed have been indexed, etc.

 

4. Write the searching code, in a separate class

It’s always a good idea to separate the searching from the indexing. The searcher should accept a query string and return a list of hits.

After you’ve implemented the most basic functionality, add functionality such as limiting the number of results displayed per page and moving between pages. Do add some field boosts where you see fit to emphasize certain fields over others.
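Limiting results per page and moving between pages mostly comes down to slicing the ranked hit list. Here is a minimal sketch in plain Java, assuming the hits are already in ranked order; the class and method names are invented for this example and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Paging over an already-ranked list of hits. Names are hypothetical.
class PaginationSketch {

    // Returns the hits for a 1-based page number, or an empty list past the end.
    static <T> List<T> page(List<T> hits, int pageNumber, int hitsPerPage) {
        if (pageNumber < 1 || hitsPerPage < 1) {
            throw new IllegalArgumentException("pageNumber and hitsPerPage must be >= 1");
        }
        // Use long arithmetic so large page numbers cannot overflow.
        long from = (long) (pageNumber - 1) * hitsPerPage;
        if (from >= hits.size()) {
            return Collections.emptyList();
        }
        long to = Math.min(from + hitsPerPage, hits.size());
        return hits.subList((int) from, (int) to);
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(page(hits, 1, 2)); // [a, b]
        System.out.println(page(hits, 3, 2)); // [e]
        System.out.println(page(hits, 4, 2)); // []
    }
}
```

With a collector that only keeps the top N hits, as in the 5-minute example, N would need to cover at least pageNumber * hitsPerPage results for the requested page to be available.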

5. Implement additional search functionality

By now, you have a really basic search app which takes a query from the user and spits out a list of results. You’ll now want to implement any required search functionality such as filtering by permissions, sorting by date, etc.
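As a rough picture of what filtering by permissions and sorting by date mean for a list of hits, here is a plain-Java sketch that post-processes results. The Hit class, its fields and the group-based permission rule are all assumptions made up for this example; in a real Lucene app this kind of restriction and ordering would more likely be applied at search time rather than afterwards.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Post-processing of search hits in plain Java. The Hit class and its
// fields are hypothetical and not part of Lucene.
class FilterSortSketch {

    static final class Hit {
        final String title;
        final String group;       // group allowed to see this item
        final LocalDate modified; // last modified date

        Hit(String title, String group, LocalDate modified) {
            this.title = title;
            this.group = group;
            this.modified = modified;
        }
    }

    // Keeps only hits visible to the user's groups, newest first.
    static List<Hit> visibleNewestFirst(List<Hit> hits, Set<String> userGroups) {
        List<Hit> visible = new ArrayList<>();
        for (Hit hit : hits) {
            if (userGroups.contains(hit.group)) {
                visible.add(hit);
            }
        }
        visible.sort(Comparator.comparing((Hit hit) -> hit.modified).reversed());
        return visible;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("Budget draft", "finance", LocalDate.of(2012, 8, 1)),
            new Hit("Team handbook", "staff", LocalDate.of(2012, 7, 15)),
            new Hit("Release notes", "staff", LocalDate.of(2012, 8, 5)));

        // A user in the "staff" group sees two hits, most recent first.
        for (Hit hit : visibleNewestFirst(hits, Collections.singleton("staff"))) {
            System.out.println(hit.modified + " " + hit.title);
        }
    }
}
```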

6. Ensure your search results make sense

Since you don’t want to look silly in front of your boss, quickly run through some sample queries, ensuring that hits are returning when they should, and that the order in which results are ranked makes sense to the user. You shouldn’t have to go in-depth into query explanations at this stage.

Lucene Peek

  • August 1, 2012 11:44 pm

http://www.codeproject.com/Articles/272309/Lucene-Search-Programming

Lucene Architecture

 

5. Core Lucene APIs

Lucene has four core packages (APIs), listed below:

  • org.apache.lucene.document
  • org.apache.lucene.analysis
  • org.apache.lucene.index
  • org.apache.lucene.search

http://www.ibm.com/developerworks/web/library/wa-lucene2/
Web search engine architecture
http://www.ibm.com/developerworks/opensource/library/os-apache-lucenesearch/
Steps in building applications using Lucene
http://www.lucenetutorial.com/lucene-in-5-minutes.html

Kernel Support Vector Machines for Classification and Regression in C#

  • May 13, 2012 1:49 pm

http://crsouza.blogspot.com/2010/04/kernel-support-vector-machines-for.html

Some New Words

  • September 11, 2011 4:46 pm

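consonant: a speech sound that is not a vowel
plateau: a stable phase
onset: beginning
mnemonics: memory techniques
prevocalic: occurring before a vowel
viz: namely; that is to say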

Watson's architecture

  • June 23, 2011 3:09 am

The high-level architecture of IBM's DeepQA used in Watson