Lucene 3.0.0: A Simple Introduction

nello

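Official demo: http://lucene.apache.org/java/3_0_0/api/demo/index.html
Official site: http://lucene.apache.org
Concepts: http://www.ibm.com/developerworks/cn/java/j-lo-lucene1/ (mostly conceptual; some of the interfaces it describes no longer apply to 3.0.0)
An English-language tutorial site that is fairly easy to follow: http://www.lucenetutorial.com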

Required JARs:
lucene-core-3.0.0.jar -- the Lucene core library
lucene-smartcn-3.0.0.jar -- a Chinese word-segmentation library; you can substitute another segmentation JAR

A working example: TxtFileIndexer.java. To keep things short there is no separate search class; index creation and searching both live in the same class.

 /**
 *
 */
package com.spell;

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

public class TxtFileIndexer {
 public static void main(String[] args) throws Exception {
  // Build the index first; a production system would have its own indexing strategy
  createIndexes();

  // Run the search
  search("修改");
 }

 public static void createIndexes() throws Exception {
  // Directory where the index files are stored
  String index_path = "E:" + File.separator + "lucene" + File.separator
    + "index";

  // Directory containing the files to index
  String doc_path = "E:" + File.separator + "lucene" + File.separator
    + "doc";

  File INDEX_DIR = new File(index_path);
  // Store the index on disk; it could also be kept in memory
  Directory dir = new SimpleFSDirectory(INDEX_DIR);
  File doc_dir = new File(doc_path);
  // Use the smart Chinese analyzer for word segmentation
  Analyzer luceneAnalyzer = new SmartChineseAnalyzer(
    Version.LUCENE_CURRENT);
  File[] dataFiles = doc_dir.listFiles();
  // Index writer; passing true creates a new index, overwriting any existing one
  IndexWriter indexWriter = new IndexWriter(dir, luceneAnalyzer, true,
    IndexWriter.MaxFieldLength.UNLIMITED);
  long startTime = new Date().getTime();
  for (int i = 0; i < dataFiles.length; i++) {
   if (dataFiles[i].isFile()
     && dataFiles[i].getName().endsWith(".txt")) {
    System.out.println("Indexing file "
      + dataFiles[i].getCanonicalPath());
    Document document = new Document();
    Reader txtReader = new FileReader(dataFiles[i]);
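    // Store the canonical path un-analyzed so it can be read back from search hits
    document.add(new Field("path", dataFiles[i].getCanonicalPath(),
      Field.Store.YES, Field.Index.NOT_ANALYZED));
    // The contents come from the Reader: tokenized and indexed, but not stored
    document.add(new Field("contents", txtReader));
    // Add the document to the index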
    indexWriter.addDocument(document);
   }
  }
  indexWriter.optimize();
  indexWriter.close();
  long endTime = new Date().getTime();

  System.out.println("It takes " + (endTime - startTime)
    + " milliseconds to create index for the files in directory "
    + doc_dir.getPath());
 }

 public static void search(String searchStr) throws Exception {
  System.out.println("===== Search keyword: " + searchStr);
  // Directory where the index files are stored
  String index_path = "E:" + File.separator + "lucene" + File.separator
    + "index";
  File INDEX_DIR = new File(index_path);
  // Check that the index exists before trying to open it
  if (!INDEX_DIR.exists()) {
   System.out.println("The Lucene index does not exist");
   return;
  }
  // Open the index directory
  FSDirectory directory = FSDirectory.open(INDEX_DIR);
  // Index searcher
  IndexSearcher searcher = new IndexSearcher(directory);
  QueryParser parser = new QueryParser(Version.LUCENE_CURRENT,
    "contents", new SmartChineseAnalyzer(Version.LUCENE_CURRENT));
  Query query = parser.parse(searchStr);
  // Fetch only the top 1000 hits, which we treat as the most useful ones
  TopDocs topDocs = searcher.search(query, 1000);
  System.out.println("total:" + topDocs.totalHits);

  // Print hits 3 to 5 only; extending this idea gives paging.
  // The upper bound is clamped so that fewer than 6 hits cannot
  // cause an ArrayIndexOutOfBoundsException.
  int end = Math.min(6, topDocs.scoreDocs.length);
  for (int i = 3; i < end; i++) {
   // Look up the document by its index ID
   Document tempDoc = searcher.doc(topDocs.scoreDocs[i].doc);
   System.out.println(topDocs.scoreDocs[i].doc + ":--"
     + tempDoc.getField("path").stringValue());
  }
  // Close the searcher first, then the directory it reads from
  searcher.close();
  directory.close();
 }
}
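
The loop over a fixed range of hits in search() hints at paging. The index arithmetic might look roughly like the sketch below, which is plain Java with no Lucene dependency. The class name PageBounds and the parameters page (1-based), pageSize and totalHits are hypothetical names introduced here for illustration; the two results would serve as the loop bounds over topDocs.scoreDocs.

```java
public class PageBounds {
    /** First index (inclusive) of a 1-based page, clamped to the number of hits. */
    public static int start(int page, int pageSize, int totalHits) {
        long s = (long) (page - 1) * pageSize;
        return (int) Math.min(Math.max(s, 0), totalHits);
    }

    /** End index (exclusive) of a 1-based page, clamped to the number of hits. */
    public static int end(int page, int pageSize, int totalHits) {
        long e = (long) page * pageSize;
        return (int) Math.min(Math.max(e, 0), totalHits);
    }

    public static void main(String[] args) {
        int hits = 5; // pretend the search returned 5 hits
        // Page 2 with 3 hits per page covers indices 3 and 4 only
        for (int i = start(2, 3, hits); i < end(2, 3, hits); i++) {
            System.out.println("hit index " + i);
        }
    }
}
```

Clamping both bounds to the hit count means a page past the end of the results simply yields an empty loop instead of an out-of-bounds access.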

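This is only a first example. In a news system, for instance, rebuilding the whole index every time an article is added would be wasteful.
The right approach is to add index entries only for the new article, and to update or delete the corresponding index entries whenever an article is updated or deleted.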

Let's look at the IndexWriter API:

void addDocument(Document doc)
    Adds a document to this index.
void addDocument(Document doc, Analyzer analyzer)
    Adds a document to this index, using the provided analyzer instead of the value of getAnalyzer().
void deleteAll()
    Delete all documents in the index.
void deleteDocuments(Query... queries)
    Deletes the document(s) matching any of the provided queries.
void deleteDocuments(Query query)
    Deletes the document(s) matching the provided query.
void deleteDocuments(Term... terms)
    Deletes the document(s) containing any of the terms.
void deleteDocuments(Term term)
    Deletes the document(s) containing term.
void updateDocument(Term term, Document doc)
    Updates a document by first deleting the document(s) containing term and then adding the new document.
void updateDocument(Term term, Document doc, Analyzer analyzer)
    Updates a document by first deleting the document(s) containing term and then adding the new document.

With these methods in view, the approach should be clear.
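
One way to apply these methods to the news-system scenario is sketched below. It is not from the original example: the class NewsIndexUpdater, the helper newsDoc and the "id" field are hypothetical, and it assumes every article carries a unique id that is indexed un-analyzed so a Term can match it exactly. RAMDirectory and StandardAnalyzer are used only to keep the sketch self-contained; the example above uses an on-disk index with SmartChineseAnalyzer.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NewsIndexUpdater {
    // Hypothetical helper: one document per article, keyed by an un-analyzed "id" field
    static Document newsDoc(String id, String body) {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", body, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // In-memory index, an assumption made only to keep the sketch self-contained
        Directory dir = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // create=true because this index starts empty; pass false to append to an existing index
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);

        // Article added: index just that one document
        writer.addDocument(newsDoc("1", "first version of the article"));
        writer.addDocument(newsDoc("2", "another article"));

        // Article edited: delete the old document with id=1, then add the new one
        writer.updateDocument(new Term("id", "1"), newsDoc("1", "edited article"));

        // Article removed: delete every document whose id is 2
        writer.deleteDocuments(new Term("id", "2"));

        // Commit so the changes are visible to newly opened searchers
        writer.commit();
        System.out.println("documents in index: " + writer.numDocs());
        writer.close();
    }
}
```

The key design point is the un-analyzed id field: updateDocument and deleteDocuments match on exact terms, so the key must be indexed as a single token rather than run through the analyzer.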

2011-07-09 19:11