Building a Database Full-Text Search Engine with Open-Source Lucene

falconer 2008. 7. 30. 09:24

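I have been interested in search lately and am collecting material on it.
While looking for .NET-related content I came across a good write-up, so I am reposting it here.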

-------------------------------------------------------------------------------------------------


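For the first time in my life, I took on a freelance job by myself.
At the end of December 2007 I was asked whether I would like to develop a search engine using the open-source Lucene library. I was already busy with several projects and considered turning it down, but the prospect of building something with Java and open source was too exciting. In the end I chose the work despite the time pressure, but because I still have a lot to learn, I carried out the project at a much smaller scope than originally planned.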

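The project is titled "Development of a Database Full-Text Search Engine Using Open-Source Lucene". Through it I learned how search engines work and became well acquainted with Lucene. I also came to appreciate how much research must have gone into portals such as Google, Naver, and Daum.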

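The project ran for two months, from January to February 2008, at two to three days a week. Seeing indexing and DB crawling work well, I find the result rather endearing and have grown attached to it. My one regret is that the development period was too short to test it thoroughly. I also completed a manual, whose table of contents is shown below. Because the manual was written for a company project I cannot post it in full, but I will later write up the important features of specific parts. If anyone is curious about a particular part, I will write that up as well.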

1 Project Overview
1.1 Project Overview
1.2 Project Lead Details
1.3 Participating Staff
1.4 System Requirements Definition

2 Introduction to Search Engines and Lucene
2.1 How Search Engines Work
2.2 Introduction to Lucene

3 Search Engine Implementation
3.1 Simple Korean Morphological Analyzer and Tag Remover
3.2 DBLoader (Database Crawler)
3.3 Query Generator
3.4 Search Pool
3.5 Multi-Thread Searching
3.6 Index Manager 0.1 GUI Application
3.7 Logging with Log4j

4 Search Engine Installation
4.1 Search Engine Installation
4.2 IIS / Tomcat Integration

5 References
5.1 Lucene Subprojects
5.2 Lucene Sandbox
5.3 Nutch
5.4 Bibliography
5.5 Miscellaneous



The main features are as follows.

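1.1 Project Overview
  - A LIKE search (%keyword%) in a database is very slow because it scans the entire text of every row. If the contents of the DB are instead split into words and indexed, and searches run against that index, a full-text search over millions of records or more can complete in under a second. Some DBMSs, including SQL Server and MySQL, already support full-text search in addition to LIKE search.
  The original intent of the overall project went further: crawl documents from several kinds of sources (web pages as well as DBMSs), cluster the crawled documents appropriately, and at query time search the best-matching cluster quickly and with good result quality (with semantic and ontology features on top). That scope was too large and too hard to implement, and I am about to start my fourth year at university, so within what was feasible my part covered only DBMS crawling and search.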
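
To make the contrast between a LIKE-style scan and an index lookup concrete, here is a minimal sketch in plain Java. It is not Lucene's data structure or the project's code; the class name `TinyInvertedIndex` and the whitespace tokenization are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; not Lucene's API or the original project's code.
final class TinyInvertedIndex {
    private final List<String> docs = new ArrayList<>();
    // term -> ids of the documents that contain it, in ascending order
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index time: split the text into words once and record where each word occurs.
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // LIKE '%word%' style: every stored text is scanned on every query.
    List<Integer> scan(String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (docs.get(id).toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(id);
            }
        }
        return hits;
    }

    // Index style: one map lookup; the cost follows the number of hits, not the number of rows.
    List<Integer> lookup(String word) {
        Set<Integer> ids = postings.getOrDefault(
                word.toLowerCase(Locale.ROOT), Collections.emptySet());
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("open source search engine");
        index.add("database full-text search");
        System.out.println(index.scan("search"));
        System.out.println(index.lookup("search"));
    }
}
```

Note that `lookup` only matches whole indexed terms, while `scan` also matches substrings. That gap is why the quality of word splitting matters so much for what an index-based search can find.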

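3.1 Simple Korean Morphological Analyzer and Tag Remover
  - Anyone who has built a search engine with Lucene ends up struggling with the morphological analyzer, because there is no suitable open Korean morphological analyzer that can be used with Lucene. This project did not solve that problem; it implements only a simple analyzer that removes HTML tags, Unicode, and suffixes. I wanted to develop a more advanced analyzer, but given the time constraints I had to settle for this. Building a morphological analyzer requires not only knowledge of morphological analysis but also a word dictionary. Effective morphological analysis improves search accuracy.
  - There is an open-source project called UNITEX. It provides DECO, a Korean dictionary that is not perfect, and a Korean morphological analyzer. I have not yet looked closely at the quality of the analysis UNITEX provides, but using only its Korean electronic dictionary DECO, or building a Lucene analyzer on top of the UNITEX morphological analyzer, looks like a quick way to develop an analyzer in a short time.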
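
The tag and suffix removal described above can be sketched roughly as follows. The original analyzer's rules are not published, so everything here is an assumption: the class name `SimpleKoreanTokenizer`, the short particle list, and the reading of "Unicode" as HTML character references such as `&nbsp;`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; the rules and the particle list are assumptions, not the original code.
final class SimpleKoreanTokenizer {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern CHAR_REFERENCE = Pattern.compile("&#?[A-Za-z0-9]+;");
    // Deliberately tiny list of common particles; two-syllable ones are tried first.
    private static final List<String> SUFFIXES = Arrays.asList(
            "에서", "으로", "은", "는", "이", "가", "을", "를", "의", "에", "로");

    // Replace HTML tags and character references with spaces.
    static String stripMarkup(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return CHAR_REFERENCE.matcher(withoutTags).replaceAll(" ");
    }

    // Drop one trailing particle, keeping at least one character of the word.
    static String stripSuffix(String word) {
        for (String suffix : SUFFIXES) {
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Markup removal, whitespace splitting, then suffix stripping per word.
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        for (String word : stripMarkup(html).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(stripSuffix(word));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<p>루씬에서 검색엔진을 개발</p>"));
    }
}
```

A rule this naive also cuts nouns that merely end in one of those syllables (for example 나이 becomes 나), which illustrates why a real analyzer needs a dictionary.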

3.2 DBLoader (Database Crawler)
  - DBLoader lets the user specify, through simple configuration, which columns of which table in which DB to crawl. Regardless of the DBMS (currently Oracle, MySQL, and SQL Server; more can be added), it crawls the contents and indexes them into a unique directory.
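
The configuration idea can be sketched as a small value object that yields the SELECT statement to crawl with and a per-table index directory. This is not the original DBLoader API; the class `CrawlTarget`, its methods, and the directory naming scheme are all hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical configuration object; names and naming scheme are illustrative only.
final class CrawlTarget {
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    private final String database;
    private final String table;
    private final List<String> columns;

    CrawlTarget(String database, String table, String... columns) {
        if (columns.length == 0) {
            throw new IllegalArgumentException("at least one column is required");
        }
        this.database = requireIdentifier(database);
        this.table = requireIdentifier(table);
        for (String column : columns) {
            requireIdentifier(column);
        }
        this.columns = Arrays.asList(columns.clone());
    }

    // Names are concatenated into SQL below, so only plain identifiers are accepted.
    private static String requireIdentifier(String name) {
        if (!IDENTIFIER.matcher(name).matches()) {
            throw new IllegalArgumentException("not a plain identifier: " + name);
        }
        return name;
    }

    // A plain SELECT with no vendor-specific syntax, so the same text runs on each DBMS.
    String selectSql() {
        return "SELECT " + String.join(", ", columns) + " FROM " + table;
    }

    // One directory per (database, table) pair so that indexes never collide.
    Path indexDirectory(Path baseDir) {
        return baseDir.resolve(
                database.toLowerCase(Locale.ROOT) + "_" + table.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        CrawlTarget target = new CrawlTarget("shop", "articles", "id", "title", "body");
        System.out.println(target.selectSql());
        System.out.println(target.indexDirectory(Paths.get("index")));
    }
}
```

In a real crawler the SQL would be executed over JDBC with the driver for the chosen DBMS, and each row would become one indexed document.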

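3.3 Query Generator
  - Because the current morphological analyzer is not good enough, a plain term search finds only about 50% of what a LIKE search finds. To compensate, the engine uses prefix queries, which raise this to about 75% of the LIKE search results. The main purpose of a search engine is to return the top N results the user wants, and I think 75% coverage is enough to deliver a reasonable top N.
  The query generator builds Boolean queries and prefix queries that can search across several columns, depending on the options.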
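
One way to picture the generator is as a builder of query strings in Lucene's classic query parser syntax, where `field:term*` is a prefix query and `OR` / `AND` combine clauses. The sketch below is an assumption about the shape of such a generator, not the original code; `QueryStringBuilder` and its option flag are hypothetical, and it only produces a string rather than Lucene query objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch that emits classic query parser syntax as plain text.
final class QueryStringBuilder {
    // Characters that have a special meaning in the classic query parser syntax.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    // Backslash-escape special characters so user input is treated as literal text.
    static String escape(String term) {
        StringBuilder escaped = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    // Each word may match in any field (OR); all words must match (AND).
    static String build(List<String> fields, String userInput, boolean usePrefix) {
        List<String> clauses = new ArrayList<>();
        for (String word : userInput.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String term = escape(word) + (usePrefix ? "*" : "");
            List<String> perField = new ArrayList<>();
            for (String field : fields) {
                perField.add(field + ":" + term);
            }
            clauses.add("(" + String.join(" OR ", perField) + ")");
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("title", "body"), "lucene search", true));
    }
}
```

A prefix query still matches only from the start of an indexed term, whereas `LIKE '%word%'` matches anywhere inside the text, which may be part of why the coverage stays below that of LIKE.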

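3.4 Search Pool
  - Creating a search object for each incoming request and destroying it after use is a waste of resources. If requests suddenly pile up, a huge number of objects are created and immediately destroyed, which can strain the server. With the search pool, search objects are created on demand, up to a user-configured maximum of N; once created they are not destroyed but are managed through the pool.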
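
A bounded pool with lazy creation, as described above, might look like the following sketch. It is generic over the pooled type `T`, which stands in for whatever search object the engine pools (for example a Lucene `IndexSearcher`); the class `SearcherPool` and its method names are hypothetical.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of a bounded, lazily filled object pool.
final class SearcherPool<T> {
    private final Queue<T> idle = new ConcurrentLinkedQueue<>();
    private final Semaphore permits;
    private final Supplier<T> factory;
    private final AtomicInteger created = new AtomicInteger();

    SearcherPool(int maxSize, Supplier<T> factory) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.permits = new Semaphore(maxSize);
        this.factory = factory;
    }

    // Blocks while all maxSize objects are in use (uninterruptible, for brevity).
    T acquire() {
        permits.acquireUninterruptibly();
        T searcher = idle.poll(); // reuse an idle object if there is one
        if (searcher == null) {
            try {
                searcher = factory.get(); // otherwise create one, still within the limit
                created.incrementAndGet();
            } catch (RuntimeException e) {
                permits.release(); // do not leak the slot if creation fails
                throw e;
            }
        }
        return searcher;
    }

    // Returns the object to the pool instead of destroying it.
    void release(T searcher) {
        idle.offer(searcher);
        permits.release();
    }

    int createdCount() {
        return created.get();
    }

    public static void main(String[] args) {
        SearcherPool<StringBuilder> pool = new SearcherPool<>(2, StringBuilder::new);
        StringBuilder first = pool.acquire();
        pool.release(first);
        StringBuilder second = pool.acquire(); // the idle instance is handed out again
        System.out.println(first == second);
        System.out.println(pool.createdCount());
    }
}
```

Callers would normally wrap the search in `try { ... } finally { pool.release(searcher); }` so that an exception during a search does not permanently remove an object from the pool.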

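3.5 Multi-Thread Searching
  - When a user requests a search across multiple tables, one thread is created per table so that the indexes are searched concurrently and the results are returned together, which reduces response time.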
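
The one-thread-per-table idea can be sketched with an `ExecutorService`. The actual index lookup is passed in as a function, because the original search code is not shown; `ParallelSearch`, `searchAll`, and the `searchOne` parameter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

// Hypothetical sketch: search several per-table indexes at once and merge the hits.
final class ParallelSearch {
    // searchOne stands in for "open the index of this table and run the query".
    static List<String> searchAll(List<String> tables, String query,
                                  BiFunction<String, String, List<String>> searchOne) {
        if (tables.isEmpty()) {
            return new ArrayList<>();
        }
        // One worker thread per table, as in the description above.
        ExecutorService executor = Executors.newFixedThreadPool(tables.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String table : tables) {
                futures.add(executor.submit(() -> searchOne.apply(table, query)));
            }
            // Collect in table order so the merged result is deterministic.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                merged.addAll(future.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("search failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<String> hits = searchAll(Arrays.asList("articles", "comments"), "lucene",
                (table, query) -> Arrays.asList(table + " hit for " + query));
        System.out.println(hits);
    }
}
```

Creating a thread pool per request keeps the sketch short; a long-running server would more likely share one executor across requests.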

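3.6 Index Manager 0.1 GUI Application
  - Lets the user index the tables of a chosen DB through a GUI, and runs a scheduler that reads new posts from the DB at the configured time and adds, updates, or deletes index entries accordingly.
  - It is currently implemented in Swing, but I plan to move it to SWT or a JSP interface later.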
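
How the original scheduler is implemented is not stated. As one possible approach, a daily run at a fixed time of day can be built on `ScheduledExecutorService`; the class `IndexScheduler` and the `indexUpdate` task below are hypothetical, and daylight-saving shifts are ignored.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a "run the index update every day at HH:mm" scheduler.
final class IndexScheduler {
    // Time to wait from `now` until the next occurrence of `runAt`.
    static Duration delayUntil(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // today's slot has already passed
        }
        return Duration.between(now, next);
    }

    // Schedules indexUpdate at the next runAt and then every 24 hours.
    static ScheduledFuture<?> scheduleDaily(ScheduledExecutorService executor,
                                            LocalTime runAt, Runnable indexUpdate) {
        long initialDelayMillis = delayUntil(LocalDateTime.now(), runAt).toMillis();
        return executor.scheduleAtFixedRate(indexUpdate, initialDelayMillis,
                TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // Only prints the waiting time; a real application would call scheduleDaily.
        System.out.println(delayUntil(LocalDateTime.now(), LocalTime.of(3, 0)));
    }
}
```

The `indexUpdate` task would be where new, changed, and deleted rows are read from the DB and applied to the index.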

3.7 Logging with Log4j
  - Log4j is used to log exceptions and the actions performed by the scheduler. This matters a great deal for later maintenance.

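4.2 IIS / Tomcat Integration
  - Because the search engine is written in Java, it runs on Tomcat (a Java servlet container) rather than directly under IIS. If the web pages are written in a language served by IIS, they cannot use the search engine directly. For that reason the Jakarta Redirector was installed to connect IIS with Tomcat, so that ASP or PHP pages can also use the Java search engine.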

Source: http://cherrykyun.tistory.com/171