Index of /~harley/sw/qts
Name Last modified Size Description
Parent Directory 25-Aug-2008 22:28 -
qts-20080825.tar.gz 25-Aug-2008 22:56 8k
qts-20080826.tar.gz 26-Aug-2008 21:24 8k
qts-20080827.tar.gz 27-Aug-2008 21:22 8k
qts-20080904.tar.gz 04-Sep-2008 08:52 9k
qts-20080905.tar.gz 05-Sep-2008 00:04 9k
qts-20080909.tar.gz 09-Sep-2008 23:51 9k
qts-20080924.tar.gz 24-Sep-2008 12:12 10k
qts.tar.gz 24-Sep-2008 12:12 10k
#
# ~/projects/python/qts/README ---
#
# $Id: README,v 1.10 2008/09/04 01:55:49 harley Exp $
#
ABOUT ==
QTS (quick text search) is a simple document searcher for
quickly indexing and searching collections of text.
The latest version is available from:
http://www.mahalito.net/~harley/sw/qts/
QUICK START ==
* Index a bunch of stuff into a database.
./qts [--db DBNAME] -i files-or-directories
* Now search the db with:
./qts [--db DBNAME] terms ...
TERMS ==
* The simplest term is just a word. Documents with the term
are added to the result set.
* Terms may have an optional float weight. The syntax is
"word,weight". Weights may be negative. The default weight
is 1.0.
* Terms may have a "+" modifier. A document not matching a
"+" term is removed from the result set.
* "path:[+]RE[,weight]" selects documents by their
filesystem path. The RE is matched against the entire
pathname.
* The score of a document is the sum of its term scores.
The score of each term is its count in the document times
its weight.
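
The term syntax and scoring rule above can be sketched in a few
lines of Python. This is only an illustration, not the code in qts
itself: the names parse_term and score_document are hypothetical,
"path:" terms are left out, and it assumes a document that matches
none of the terms is not in the result set at all.

```python
def parse_term(term):
    """Split "[+]word[,weight]" into (word, weight, required)."""
    required = term.startswith("+")
    if required:
        term = term[1:]
    word, sep, weight = term.rpartition(",")
    if not sep:
        # No comma: the whole term is the word, with the default weight.
        word, weight = term, "1.0"
    return word, float(weight), required


def score_document(counts, terms):
    """Return the score of one document, or None if it is not a result.

    counts maps each word to its number of occurrences in the document
    (an assumed representation, not the qts db layout).
    """
    total = 0.0
    matched = False
    for term in terms:
        word, weight, required = parse_term(term)
        count = counts.get(word, 0)
        if required and count == 0:
            return None  # a "+" term is missing: drop the document
        if count:
            matched = True
        total += count * weight  # term score = count times weight
    return total if matched else None


counts = {"foo": 3, "bar": 1}
print(score_document(counts, ["foo", "bar,2.5"]))  # 5.5
print(score_document(counts, ["foo", "+baz"]))     # None
```

A negative weight simply subtracts from the sum, so "foo,-1" pushes
documents that mention foo down the ranking without removing them.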
REPORTS ==
* There are three reports available. (-r [1,2,3])
* Results are sorted by score.
* The number (-n) and min score (-m) options trim the result set.
* The "--cmd CMD" option runs CMD on the results. These two
commands to view the documents are almost the same:
./qts --cmd less foo bar
and
less $(./qts foo bar)
* A document can be reassembled from the words stored in
the db. Use "--cat-ids DOCID" to see it rebuilt.
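
The sorting and trimming described for the reports might look
roughly like the sketch below. It is not taken from qts: the
function name trim_results is hypothetical, and it assumes that
the highest score comes first and that the -m cut-off is inclusive,
neither of which this README states.

```python
def trim_results(scores, limit=None, min_score=None):
    """Sort {path: score} by descending score, then apply the cut-offs.

    limit plays the role of -n and min_score the role of -m.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        # Assumed inclusive: a score equal to the minimum is kept.
        ranked = [(path, score) for path, score in ranked if score >= min_score]
    if limit is not None:
        ranked = ranked[:limit]
    return ranked


scores = {"a.txt": 2.0, "b.txt": 5.5, "c.txt": 0.5}
for path, score in trim_results(scores, limit=2, min_score=1.0):
    print(path, score)
```

Applying the minimum score before the count means -n limits the
documents that survive the score cut, which is one possible order
of the two options.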
LICENSE ==
Copyright 2008 by James Harley Gorrell
TODO ==
* write usage blurb
* extractor rewrite
* per doc-type extractors
* how to deal with keys
* clean up
* Scan docs to see if they have been updated in the file system
* Choice of abs paths, relative paths, or filenames.
* Steps of tokenization
* line by line
* match the line to regexps to "fix it"
"foo at bar.com" => "foo@bar.com"
* split into words.
varied patterns.
* toss bad patterns.
* remember byte-pos of line starts and the tok pos of it.
* Flags to enable-disable
* DocType to hold settings:
*.cpp *.h => CppDoc
CppDoc => filter comments, etc
* TextDoc
Skip text docs bigger than a couple mb.
* EmailDoc =>
* "Loose filename" maildir changes the suffix ":2,RS" for
the state of the message. store the base name and Probe
when the filename is needed.
* run time binding of classes and reports.
* Grep style report "-C context"
* upper lower case query terms.
* pathname matching.
* check if the db indexes could be loosened?
* doc,tok should be needed but tok,doc is scanned for every doc.
* function to print doc based on tokens in db.
* object-db mapping with "db.save(obj)"
* __setattr__ to mark dirty values.
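
As a starting point for the "Steps of tokenization" item above,
the steps could be sketched as follows. Everything here is an
assumption made for illustration: the names FIXUPS, WORD_RE, BAD_RE
and tokenize are hypothetical, the word and bad-token patterns are
guesses, and byte offsets assume the file is UTF-8 text read
without newline translation.

```python
import re

# Hypothetical fix-up rules, applied to each line before splitting.
FIXUPS = [
    # "foo at bar.com" => "foo@bar.com"
    (re.compile(r"(\w+) at (\w+(?:\.\w+)+)"), r"\1@\2"),
]
WORD_RE = re.compile(r"[\w@.]+")   # assumed word pattern
BAD_RE = re.compile(r"^[\W_]+$")   # tokens with no letters or digits


def tokenize(text):
    """Return (tokens, line_starts) for one document.

    line_starts has one (byte_offset, token_index) pair per line, so a
    token position can later be mapped back to a line of the file.
    """
    tokens = []
    line_starts = []
    offset = 0
    for line in text.splitlines(keepends=True):      # line by line
        line_starts.append((offset, len(tokens)))
        offset += len(line.encode("utf-8"))
        for pattern, replacement in FIXUPS:          # regexp fix-ups
            line = pattern.sub(replacement, line)
        for word in WORD_RE.findall(line):           # split into words
            word = word.strip(".")
            if word and not BAD_RE.match(word):      # toss bad patterns
                tokens.append(word)
    return tokens, line_starts


tokens, starts = tokenize("mail foo at bar.com now\n--- ___ ...\nsecond line\n")
print(tokens)
print(starts)
```

The "Flags to enable-disable" item could then become keyword
arguments that switch individual steps on or off.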