Index of /~harley/sw/qts

      Name                    Last modified       Size  Description

[DIR] Parent Directory        25-Aug-2008 22:28      -
[   ] qts-20080825.tar.gz     25-Aug-2008 22:56     8k
[   ] qts-20080826.tar.gz     26-Aug-2008 21:24     8k
[   ] qts-20080827.tar.gz     27-Aug-2008 21:22     8k
[   ] qts-20080904.tar.gz     04-Sep-2008 08:52     9k
[   ] qts-20080905.tar.gz     05-Sep-2008 00:04     9k
[   ] qts-20080909.tar.gz     09-Sep-2008 23:51     9k
[   ] qts-20080924.tar.gz     24-Sep-2008 12:12    10k
[   ] qts.tar.gz              24-Sep-2008 12:12    10k

#
# ~/projects/python/qts/README ---
#
# $Id: README,v 1.10 2008/09/04 01:55:49 harley Exp $
#

ABOUT ==

QTS (quick text search) is a simple document searcher for
quickly indexing and searching collections of text.  

The latest version is available from:
http://www.mahalito.net/~harley/sw/qts/


QUICK START ==

* Index a bunch of stuff into a database.

  ./qts [--db DBNAME] -i files-or-directories

* Now search the db with:

  ./qts [--db DBNAME] terms ...


TERMS ==

* The simplest term is just a word. Documents with the term
are added to the result set.

* Terms may have an optional float weight.  The syntax is
"word,weight".  Weights may be negative. The default weight
is 1.0.

* Terms may have a "+" modifier.  A document not matching a
"+" term is removed from the result set.

* "path:[+]RE[,weight]" selects documents by their
filesystem path.  The RE is matched against the entire
pathname.

* The score of a document is the sum of its term scores.
The score of each term is its count times its weight.
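
The term rules above can be sketched in a few lines of
Python.  This is only an illustration, not the code in qts:
the names Term, parse_term and score are made up here, the
"+" is assumed to be a prefix on plain words as it is for
"path:" terms, a matching path is assumed to count once,
and the RE is assumed to be searched for anywhere in the
full pathname.

```python
import re
from dataclasses import dataclass


@dataclass
class Term:
    word: str               # the word, or the RE for a "path:" term
    weight: float = 1.0     # default weight, per the README
    required: bool = False  # True when the "+" modifier is present
    is_path: bool = False   # True for "path:" terms


def parse_term(text):
    """Parse 'word', 'word,weight', '+word' or 'path:[+]RE[,weight]'."""
    is_path = text.startswith("path:")
    if is_path:
        text = text[len("path:"):]
    required = text.startswith("+")
    if required:
        text = text[1:]
    # Assumption: the weight follows the last comma, if it parses as a float.
    weight = 1.0
    head, sep, tail = text.rpartition(",")
    if sep:
        try:
            weight = float(tail)
            text = head
        except ValueError:
            pass  # not a weight; keep the comma as part of the term
    return Term(text, weight, required, is_path)


def score(terms, counts, path):
    """Return the document score, or None if a "+" term does not match.

    counts maps each word to its number of occurrences in the document.
    """
    total = 0.0
    for term in terms:
        if term.is_path:
            # Assumption: a matching path counts as one occurrence.
            n = 1 if re.search(term.word, path) else 0
        else:
            n = counts.get(term.word, 0)
        if term.required and n == 0:
            return None  # document is removed from the result set
        total += n * term.weight
    return total
```

For example, with the terms "foo", "bar,2.5" and "+baz,-1",
a document containing foo twice, bar twice and baz once
scores 2*1.0 + 2*2.5 + 1*(-1.0) = 6.0, while a document
without baz is dropped.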

REPORTS ==

* There are three reports available. (-r [1,2,3])

* Results are sorted by score.

* The number (-n) and min score (-m) options trim the result set.

* The "--cmd CMD" option runs CMD on the results.  These two
commands to view the documents are almost the same:

   ./qts --cmd less  foo bar

and

   less $(./qts foo bar)

* A document can be reassembled from the words stored in
the db.  Use "--cat-ids DOCID" to see it rebuilt.


LICENSE ==

Copyright 2008 by James Harley Gorrell


TODO ==

* write usage blurb
* extractor rewrite
* per doc-type extractors
* how to deal with keys
* clean up
* Scan docs to see if they have been updated in the file system
* Choice of abs paths, relative paths, or filenames.

* Steps of tokenization
  * line by line
  * match the line to regexps to "fix it"
    "foo at bar.com" => "foo@bar.com"
  * split into words.
    varied patterns.
  * toss bad patterns.
  * remember byte-pos of line starts and the tok pos of it.
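
  A rough sketch of these steps, to make the plan concrete.
  Nothing here is taken from qts itself: the FIXUPS table,
  the word pattern and the rule for "bad" tokens are all
  assumptions chosen for illustration.

```python
import re

# Hypothetical fix-up rules: (pattern, replacement) pairs applied per line.
FIXUPS = [(re.compile(r"(\w+) at (\w+(?:\.\w+)+)"), r"\1@\2")]
# Assumption: a word is a run of word characters, "@" and ".".
WORD_RE = re.compile(r"[\w@.]+")
# Assumption: a token with no letters or digits is "bad".
BAD_RE = re.compile(r"^[\W_]+$")


def tokenize(data):
    """Return (tokens, line_starts) for a document given as bytes.

    line_starts holds one (byte_pos, tok_pos) pair per line: the byte
    offset where the line starts and the index of its first token.
    """
    tokens, line_starts = [], []
    byte_pos = 0
    for raw in data.splitlines(keepends=True):
        line_starts.append((byte_pos, len(tokens)))
        byte_pos += len(raw)
        line = raw.decode("utf-8", errors="replace")
        for pattern, replacement in FIXUPS:
            line = pattern.sub(replacement, line)
        for word in WORD_RE.findall(line):
            word = word.strip(".")  # drop sentence-ending dots
            if word and not BAD_RE.match(word):
                tokens.append(word)
    return tokens, line_starts
```

  Keeping (byte_pos, tok_pos) for each line is what would
  let a report seek back to the line holding a given token.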

* Flags to enable-disable

* DocType to hold settings:
   *.cpp *.h => CppDoc
   CppDoc => filter comments, etc

* TextDoc
  Skip text docs bigger than a couple mb.
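
  One way the DocType idea might look, with settings kept as
  class attributes.  The class names follow the notes above;
  the DOC_TYPES table, the function names and the 2 MB limit
  (standing in for "a couple mb") are assumptions.

```python
import fnmatch
import os


class TextDoc:
    """Default document type; settings live on the class."""
    filter_comments = False
    max_bytes = 2 * 1024 * 1024  # assumption for "a couple mb"


class CppDoc(TextDoc):
    """C++ sources and headers: comments are filtered."""
    filter_comments = True


# Hypothetical pattern table, checked in order.
DOC_TYPES = [("*.cpp", CppDoc), ("*.h", CppDoc)]


def doc_type_for(path):
    """Pick a doc type from the filename, falling back to TextDoc."""
    name = os.path.basename(path)
    for pattern, cls in DOC_TYPES:
        if fnmatch.fnmatchcase(name, pattern):
            return cls
    return TextDoc


def should_index(path, size):
    """Return the doc type to use, or None if the file is too big."""
    cls = doc_type_for(path)
    return cls if size <= cls.max_bytes else None
```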

* EmailDoc =>
  * "Loose filename": maildir changes the suffix (for
    example ":2,RS") to reflect the state of the message.
    Store the base name and probe for the file when the
    filename is needed.
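
  A sketch of the store-and-probe idea.  The function names
  are made up, and the sketch only handles a changed flag
  suffix; a message moved between maildir subdirectories is
  not covered.

```python
import glob
import os


def base_name(path):
    """Strip the maildir ':2,FLAGS' suffix; this is what gets stored."""
    head, sep, _flags = path.rpartition(":2,")
    return head if sep else path


def probe(base):
    """Find the current filename for a stored base name, or None."""
    if os.path.exists(base):
        return base  # file has no flag suffix
    matches = glob.glob(glob.escape(base) + ":2,*")
    return matches[0] if matches else None
```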

* run time binding of classes and reports.

* Grep style report "-C context"

* upper/lower case query terms.

* pathname matching.

* check if the db indexes could be loosened?
  * doc,tok should be needed but tok,doc is scanned for every doc.

* function to print doc based on tokens in db.

* object-db mapping with "db.save(obj)"
  * __setattr__ to mark dirty values.
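
  A minimal sketch of dirty tracking using Python's
  __setattr__ hook.  Record and FakeDb are hypothetical
  names; FakeDb only records what a real "db.save(obj)"
  would have to write.

```python
class Record:
    """Base class that remembers which attributes changed since the last save."""

    def __init__(self, **fields):
        # Bypass our own __setattr__ so initial values are not marked dirty.
        object.__setattr__(self, "_dirty", set())
        for name, value in fields.items():
            object.__setattr__(self, name, value)

    def __setattr__(self, name, value):
        self._dirty.add(name)  # mark the attribute as needing a write
        object.__setattr__(self, name, value)


class FakeDb:
    """Stand-in for the real db; collects the values that would be written."""

    def __init__(self):
        self.writes = []

    def save(self, obj):
        if obj._dirty:
            self.writes.append({n: getattr(obj, n) for n in sorted(obj._dirty)})
            obj._dirty.clear()
```

  With this, saving an unchanged object writes nothing, and
  saving after "doc.size = 2" writes only the size field.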