     GLIMPSEINDEX(l) UNIX System V (October 11, 1995)  GLIMPSEINDEX(l)

     NAME
	  glimpseindex 3.0 - index whole file systems to be searched
	  by glimpse

     OVERVIEW
	  Glimpse (which stands for GLobal IMPlicit SEarch) is an
	  indexing and query system that allows you to search through
	  all your files very quickly.	Glimpseindex is the indexing
	  program for glimpse.	Glimpse supports most of agrep's
	  options (agrep is our powerful version of grep) including
	  approximate matching (e.g., finding misspelled words),
	  Boolean queries, and even some limited forms of regular
	  expressions. It is used in the same way, except that you
	  don't have to specify file names.  So, if you are looking
	  for a needle anywhere in your file system, all you have to
	  do is say glimpse needle and all lines containing needle
	  will appear preceded by the file name.  See man glimpse for
	  details on how to use glimpse.

	  Glimpseindex provides three indexing options: a tiny index
	  (2-3% of the total size of all files), a small index (7-9%)
	  and a medium-size index (20-30%).  Search times are normally
	  better with larger indexes.  To index all your files, you
	  say glimpseindex ~ for a tiny index (where ~ stands for the
	  home directory), glimpseindex -o ~ for a small index, and
	  glimpseindex -b ~ for a medium-size index.
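
	  As a rough illustration of these percentages, the following
	  Python sketch estimates the index size range for a given
	  amount of text.  The function name and the table are
	  assumptions made for this example (they are not part of
	  glimpseindex); the 7-9% range for the small index is the
	  figure given under the -o option.  Actual sizes will vary.

```python
# Hypothetical helper, not part of glimpse: approximate index size ranges
# based on the percentages quoted in this manual page.
INDEX_PERCENT = {
    "tiny": (2, 3),      # default index
    "small": (7, 9),     # glimpseindex -o
    "medium": (20, 30),  # glimpseindex -b
}


def estimate_index_size(total_mb, index_type):
    """Return (low, high) estimated index size in MB for total_mb of files."""
    low_pct, high_pct = INDEX_PERCENT[index_type]
    return (total_mb * low_pct / 100, total_mb * high_pct / 100)
```

	  For example, estimate_index_size(100, "tiny") gives a range
	  of 2 to 3 MB for 100 MB of files.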

	  Mail glimpse-request@cs.arizona.edu to be added to the
	  glimpse mailing list.	 Mail glimpse@cs.arizona.edu to report
	  bugs, ask questions, discuss tricks for using glimpse, etc.
	  (this is a moderated mailing list with very little traffic,
	  mostly announcements).  An HTML version of these manual
	  pages can be found at
	  http://glimpse.cs.arizona.edu:1994/glimpseindexhelp.html
	  Also, see the glimpse developers' home page at
	  http://glimpse.cs.arizona.edu:1994/

     SYNOPSIS
	  glimpseindex [ -abBEfFiInosz -w number -dD filename(s) -H
	  directory -M number -S number ] directory_name[s]

     INTRODUCTION
	  Glimpseindex builds an index of all text files in all the
	  directories specified and all their subdirectories
	  (recursively).  It is also possible to build several
	  separate indexes (possibly even overlapping).	 The simplest
	  way to index your files is to say

	  glimpseindex ~

	  The index consists of several files (described in detail
	  below), all with the prefix .glimpse_ stored in the user's
	  home directory (unless otherwise specified with the -H
	  option).  Files with one of the following suffixes are not
	  indexed: ".o", ".gz", ".Z", ".z", ".hqx", ".zip", ".tar".
	  (Unless the -z option is used, see below.)  In addition,
	  glimpseindex attempts to determine whether a file is a text
	  file and does not index files that it thinks are not text
	  files.  Numbers are not indexed unless the -n option is
	  used.	 It is possible to prevent specified files from being
	  indexed by adding their names to the .glimpse_exclude file
	  (described below).  The -o option builds a larger index
	  (typically by a factor of 2-3), allowing for a faster search
	  (1-5 times faster).  The -b option builds an even larger
	  index and allows an even faster search.  There is an
	  incremental indexing option -f, which updates an existing
	  index by determining which files have been created or
	  modified since the index was built and adding them to the
	  index (see -f).
	  Glimpseindex is reasonably fast, taking about 20 minutes to
	  index 100MB from scratch (on a SUN Sparc 5) and 2-4 minutes
	  to update an existing index. (Your mileage may vary.)	 It is
	  also possible to increment the index by adding a specific
	  file (the -a option).

	  Once an index is built, searching for a pattern is as easy
	  as saying

	  glimpse pattern

	  (See man glimpse for all glimpse's options and features.)

     A DETAILED DESCRIPTION OF GLIMPSEINDEX
	  Glimpse does not automatically index files.  You have to
	  tell it to do it.  This can be done manually, but a better
	  way is to set it to run every night.	It is probably a good
	  idea to run glimpseindex manually for the first time to be
	  sure it works properly.  The following is a simple script to
	  run glimpseindex every night.	 We assume that this script is
	  stored in a file called glimpse.script:

	  glimpseindex -w 1000 ~ >& .glimpse_out
	  at -m 0300 glimpse.script

	  (It might be interesting to collect all the outputs of
	  glimpseindex by changing >& to >>& so that the file
	  .glimpse_out maintains a history.  In this case the file
	  must be created before the first time >>& is used.  The >&
	  syntax is for csh; if you use sh or ksh, write
	  '> .glimpse_out 2>&1' instead.)

	  Glimpseindex stores the names of all the files that it
	  indexed in the file .glimpse_filenames.  Each file is listed
	  by its full path name as obtained at the time the files were
	  indexed.  For example, /usr1/udi/file1.  Glimpse uses this
	  full name when it performs the search, so the name must
	  match the current name.  This may become a problem when the
	  indexing and the search are done from different machines
	  (e.g., through NFS), which may cause the path names to be
	  different.  For example, /tmp_mnt/R/xxx/xxx/usr1/udi/file1.
	  (The same is true for several other .glimpse files.  See
	  below.)

	  Glimpseindex does not follow symbolic links unless they are
	  explicitly included in the .glimpse_include file (described
	  below).

	  Glimpseindex makes an effort to identify non-text files such
	  as binary files, compressed files, uuencoded files,
	  postscript files, binhex files, etc.  These files are
	  automatically excluded from the index.  In addition, all
	  files whose names end with `.o', `.gz', `.Z', `.z', `.hqx',
	  `.zip', or `.tar' will not be indexed (unless they are
	  specifically included in .glimpse_include - see below).

	  The options for glimpseindex are as follows:

	  -a   adds the given file[s] and/or directories to an
	       existing index.	Any given directory will be traversed
	       recursively and all files will be indexed (unless they
	       appear in .glimpse_exclude; see below).	Using this
	       option is generally much faster than indexing
	       everything from scratch, although in rare cases the
	       index may not be as good. If for some reason the index
	       is full (which can happen unless -o or -b are used)
	       glimpseindex -a will produce an error message and will
	       exit without changing the original index.

	  -b   builds a medium-size index (20-30% of the size of all
	       files), allowing faster search.	This option forces
	       glimpseindex to store an exact (byte level) pointer to
	       each occurrence of each word (except for some very
	       common words belonging to the stop list).

	  -B   uses a hash table that is 4 times bigger (256K entries
	       instead of 64K) to speed up indexing.  The memory usage
	       will increase typically by about 2 MB.  This option is
	       only for indexing speed; it does not affect the final
	       index.

	  -d filename(s)
	       deletes the given file(s) from the index.

	  -D filename(s)
	       deletes the given file(s) from the list of file names,
	       but not from the index.	This is much faster than -d,
	       and the file(s) will not be found by glimpse.  However,
	       the index itself will not become smaller.

	  -E   does not run a check on file types.  Glimpseindex
	       normally attempts to exclude non-text files, but this
	       attempt is not always perfect.  With -E, glimpseindex
	       indexes all files, except those that are specifically
	       excluded in .glimpse_exclude and those whose file names
	       end with one of the excluded suffixes.

	  -f   incremental indexing.  glimpseindex scans all files and
	       adds to the index only those files that were created or
	       modified after the current index was built.  If there
	       is no current index or if this procedure fails,
	       glimpseindex automatically reverts to the default mode
	       (which is to index everything from scratch).  This
	       option may create an inefficient index for several
	       reasons, one of which is that deleted files are not
	       really deleted from the index.  Unless changes are
	       small, mostly additions, and -o is used, we suggest
	       using the default mode as much as possible.

	  -F   Glimpseindex receives the list of files to index from
	       standard input.

	  -H directory
	       Put or update the index and all other .glimpse files
	       (listed below) in "directory".  The default is the home
	       directory.  When glimpse is run, the -H option must be
	       used to direct glimpse to this directory, because
	       glimpse assumes that the index is in the home directory
	       (see also the -H option in glimpse).

	  -i   Make .glimpse_include (see GLIMPSEINDEX FILES) take
	       precedence over .glimpse_exclude, so that, for example,
	       one can exclude everything (by putting *) and then
	       explicitly include files.

	  -I   Instead of indexing, only show (print to standard out)
	       the list of files that would be indexed.	 It is useful
	       for filtering purposes.	("glimpseindex -I dir |
	       glimpseindex -F" is the same as "glimpseindex dir".)

	  -M x Tells glimpseindex to use x MB of memory for temporary
	       tables.  The more memory you allow, the faster
	       glimpseindex will run.  The default is x=2.  The value
	       of x must be a positive integer.	 Glimpseindex will
	       need more memory than x for other things, and
	       glimpseindex may perform some 'forks', so you'll have
	       to experiment if you want to use this option.  WARNING:
	       If x is too large you may run out of swap space.

	  -n   Index numbers as well as text.  The default is not to
	       index numbers.  This is useful when searching for dates
	       or other identifying numbers, but it may make the index
	       very large if there are lots of numbers.	 In general,
	       glimpseindex strips away any non-alphabetic character.
	       For example, the string abc123 will be indexed as abc
	       if the -n option is not used and as abc123 if it is
	       used.  Glimpseindex provides warnings (in
	       .glimpse_messages) for all files in which more than
	       half the words that were added to the index from that
	       file had digits in them (this is an attempt to identify
	       data files that should probably not be indexed).  One
	       can use the .glimpse_exclude file to exclude data files
	       or any other files.  (See GLIMPSEINDEX FILES.)

	  -o   Build a small index rather than tiny (meaning 7-9% of
	       the sizes of all files - your mileage may vary)
	       allowing faster search.	This option forces
	       glimpseindex to allocate one block per file (a block
	       usually contains many files).  A detailed explanation
	       of how blocks affect glimpse can be found in the
	       glimpse article.	 (See also LIMITATIONS.)

	  -s   supports structured queries.  This option was added to
	       support the Harvest project and it is applicable mostly
	       in that context.	 See STRUCTURED QUERIES below for more
	       information and also http://harvest.cs.colorado.edu for
	       more information about the Harvest project.

	  -S k The number k determines the size of the stop-list.  The
	       stop-list consists of words that are too common and are
	       not indexed (e.g., 'the' or 'and').  Instead of having
	       a fixed stop-list, glimpseindex figures out the words
	       that are too common for every index separately.	The
	       rules are different for the different indexing options.
	       The tiny index contains all words (the savings from a
	       stop-list are too small to bother).  For the small
	       index (-o), the number k is a percentage threshold.  A
	       word will be in the stop-list if it appears in at least
	       k% of all files.  The default value is 80%.  (If there
	       are fewer than 256 files, then the stop-list is not
	       maintained.)  The medium index (-b) counts all
	       occurrences of all words, and a word is added to the
	       stop-list if it appears at least k times per MByte.
	       The default value is 500.  A query that includes a
	       stop-list word is of course less efficient.  (See also
	       LIMITATIONS below.)

	  -w k Glimpseindex does a reasonable, but not a perfect, job
	       of determining which files should not be indexed.
	       Sometimes a large text file should not be indexed; for
	       example, a dictionary may match most queries.  The -w
	       option stores in a file called .glimpse_messages (in
	       the same directory as the index) the list of all files
	       that contribute at least k new words to the index.  The
	       user can look at this list of files and decide which
	       should or should not be indexed.	 The file
	       .glimpse_exclude contains files that will not be
	       indexed (see more below).  We recommend setting k to
	       about 1000.  This is not an exact measure.  For
	       example, if the same file appears twice, then the
	       second copy will not contribute any new words to the
	       dictionary (but if you exclude the first copy and index
	       again, the second copy will contribute).

	  -z   Allow customizable filtering, using the file
	       .glimpse_filters to run the programs listed there on
	       each matching file.  The best example is
	       compress/decompress.  If .glimpse_filters includes the
	       line
	       *.Z   uncompress <
	       (separated by tabs) then before indexing any file that
	       matches the pattern "*.Z" (same syntax as the one for
	       .glimpse_exclude) the command listed is executed first
	       (assuming input is from stdin, which is why uncompress
	       needs <) and its output (assuming it goes to stdout) is
	       indexed.	 The file itself is not changed (i.e., it
	       stays compressed).  Then if glimpse -z is used, the
	       same program is used on these files on the fly.	Any
	       program can be used (we run 'exec').  For example, one
	       can filter out parts of files that should not be
	       indexed.	 Glimpseindex tries to apply all filters in
	       .glimpse_filters in the order they are given.  For
	       example, if you want to uncompress a file and then
	       extract some part of it, put the uncompress command
	       (the example above) first and then another line that
	       specifies the extraction.  Note that this can slow down
	       the search because the filters need to be run before
	       files are searched.
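
	  The stop-list rules described under -S can be summarized
	  with a short Python sketch.  This is only an illustration of
	  the thresholds stated above, not the code glimpseindex
	  actually runs; the function name and its parameters are
	  hypothetical, and the sketch ignores the sort-related case
	  mentioned under LIMITATIONS.

```python
def is_stop_word(index_type, k=None, files_with_word=0, total_files=0,
                 occurrences=0, total_mbytes=0.0):
    """Apply the -S thresholds described in this page (illustrative only)."""
    if index_type == "tiny":
        # The tiny index keeps every word; no stop-list is used.
        return False
    if index_type == "small":
        # -o: k is a percentage of files; the documented default is 80.
        threshold = 80 if k is None else k
        if total_files < 256:
            # With fewer than 256 files the stop-list is not maintained.
            return False
        return 100 * files_with_word >= threshold * total_files
    if index_type == "medium":
        # -b: k is occurrences per MByte; the documented default is 500.
        threshold = 500 if k is None else k
        return total_mbytes > 0 and occurrences >= threshold * total_mbytes
    raise ValueError("unknown index type: %r" % (index_type,))
```

	  Under these assumptions, a word found in 900 of 1000 files
	  is a stop-list word for a small index, while the same ratio
	  in a collection of 200 files is not.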

     GLIMPSEINDEX FILES
	  All files used by glimpse are located in the directory
	  where the index is stored and have .glimpse_ as a prefix.
	  Three of them (.glimpse_exclude, .glimpse_filters, and
	  .glimpse_include) are optionally supplied by the user.  The
	  other files are built by glimpseindex and read by glimpse.

	  .glimpse_exclude
	       contains a list of files that glimpseindex is
	       explicitly told to ignore. In general, the syntax of
	       .glimpse_exclude/include is the same as that of agrep
	       (or any other grep).  The lines in the .glimpse_exclude
	       file are matched to the file names, and if they match,
	       the files are excluded.	Notice that agrep matches to
	       parts of the string!  e.g., agrep /ftp/pub will match
	       /home/ftp/pub and /ftp/pub/whatever.  So, if you want
	       to exclude /ftp/pub/core, you just list it, as is, in
	       the .glimpse_exclude file.  If you put
	       "/home/ftp/pub/cdrom" in .glimpse_exclude, every file
	       name that matches that string will be excluded, meaning
	       all files below it.  You can use ^ to indicate the
	       beginning of a file name, and $ to indicate the end of
	       one, and you can use * and ? in the usual way.  For
	       example, /ftp/*html will exclude /ftp/pub/foo.html, but
	       will also exclude /home/ftp/pub/html/whatever; if you
	       want to exclude files that start with /ftp and end with
	       html, use ^/ftp*html$.  Notice that putting a * at the
	       beginning or at the end is redundant (in fact, in this
	       case glimpseindex will remove the * when it does the
	       indexing).  No other meta characters are allowed in
	       .glimpse_exclude (e.g., don't use .* or # or |).	 Lines
	       with * or ? must have no more than 30 characters.
	       Notice that, although the index itself will not be
	       indexed, the list of file names (.glimpse_filenames)
	       will be indexed unless it is explicitly listed in
	       .glimpse_exclude.

	  .glimpse_filters
	       See the description above for the -z option.

	  .glimpse_include
	       contains a list of files that glimpseindex is
	       explicitly told to include in the index even though
	       they may look like non-text files.  Symbolic links are
	       followed by glimpseindex only if they are specifically
	       included here.  The syntax is the same as the one for
	       .glimpse_exclude (see there).  If a file is in both
	       .glimpse_exclude and .glimpse_include it will be
	       excluded unless -i is used.

	  .glimpse_filenames
	       contains the list of all indexed file names, one per
	       line.  This is an ASCII file that can also be used with
	       agrep to search for a file name leading to a fast find
	       command.	 For example,
	       glimpse 'count#\.c$' ~/.glimpse_filenames
	       will output the names of all (indexed) .c files that
	       have 'count' in their name (including anywhere on the
	       path from the index).  Setting the following alias in
	       the .login file may be useful:
	       alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'

	  .glimpse_index
	       contains the index.  The index consists of lines, each
	       starting with a word followed by a list of block
	       numbers (unless the -o or -b options are used, in which
	       case each word is followed by an offset into the file
	       .glimpse_partitions where all pointers are kept).  The
	       block/file numbers are stored in binary form, so this
	       is not an ASCII file.

	  .glimpse_messages
	       contains the output of the -w option (see above).

	  .glimpse_partitions
	       contains the partition of the indexed space into blocks
	       and, when the index is built with the -o or -b options,
	       some part of the index.	This file is used internally
	       by glimpse and it is a non-ASCII file.

	  .glimpse_statistics
	       contains some statistics about the makeup of the index.
	       Useful for some advanced applications and customization
	       of glimpse.
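
	  The matching rules described for .glimpse_exclude (substring
	  matching, ^ and $ anchors, * and ?) can be approximated by
	  the following Python sketch.  It is an assumption-laden
	  model for experimenting with patterns, not glimpseindex's
	  own matcher: the function names are hypothetical, * is
	  taken to match any run of characters (including "/", as the
	  /ftp/*html example suggests), ? is taken to match a single
	  character, and the 30-character limit is not enforced.

```python
import re


def exclude_pattern_to_regex(pattern):
    """Translate one .glimpse_exclude-style line into a compiled regex."""
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    core = pattern[1:] if anchored_start else pattern
    core = core[:-1] if anchored_end else core
    parts = []
    for ch in core:
        if ch == "*":
            parts.append(".*")   # assumed: any run of characters
        elif ch == "?":
            parts.append(".")    # assumed: exactly one character
        else:
            parts.append(re.escape(ch))
    prefix = "^" if anchored_start else ""
    suffix = r"\Z" if anchored_end else ""
    return re.compile(prefix + "".join(parts) + suffix)


def is_excluded(path, patterns):
    """True if any pattern matches some part of the path."""
    return any(exclude_pattern_to_regex(p).search(path) for p in patterns)
```

	  With this model, the line /ftp/*html excludes both
	  /ftp/pub/foo.html and /home/ftp/pub/html/whatever, while
	  ^/ftp*html$ excludes only the first.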

     STRUCTURED QUERIES
	  Glimpse can search for Boolean combinations of
	  "attribute=value" terms by using the Harvest SOIF parser
	  library (in glimpse/libtemplate). To search this way, the
	  index must be made by using the -s option of glimpseindex
	  (this can be used in conjunction with other glimpseindex
	  options). For glimpse and glimpseindex to recognize
	  "structured" files, they must be in SOIF format. In this
	  format, each value is prefixed by an attribute-name with the
	  size of the value (in bytes) present in "{}" after the name
	  of the attribute.  For example, the following lines are part
	  of an SOIF file:

	  type{17}:	  Directory-Listing
	  md5{32}:	  3858c73d68616df0ed58a44d306b12ba

	  Any string can serve as an attribute name.  The command
	  glimpse "pattern;type=Directory-Listing" will search for
	  "pattern" only in files whose type is "Directory-Listing".
	  The file itself is considered to be one "object" and its
	  name/url appears as the first attribute with an "@" prefix;
	  e.g., @FILE { http://xxx... }  The scope of Boolean operations
	  changes from records (lines) to whole files when structured
	  queries are used in glimpse (since individual query terms
	  can look at different attributes and they may not be
	  "covered" by the record/line).  Note that glimpse can only
	  search for patterns in the value parts of the SOIF file:
	  there are some attributes (like the TTL, MD5, etc.) that are
	  interpreted by Harvest's internal routines.  See
	  http://harvest.cs.colorado.edu/harvest/user-manual/ for more
	  detailed information about the SOIF format.
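
	  To make the "name{size}: value" layout concrete, here is a
	  minimal Python sketch that parses one attribute line and
	  checks the declared size against the value.  It is not the
	  Harvest SOIF parser: the names are hypothetical, and it
	  assumes the value fits on a single line, whereas the size
	  field exists so that values can be read by byte count.

```python
import re

# One single-line attribute, e.g. "type{17}:\t  Directory-Listing".
ATTRIBUTE_LINE = re.compile(r"^([^{}\s]+)\{(\d+)\}:[ \t]*(.*)$")


def parse_attribute(line):
    """Return (name, declared_size, value) for one attribute line."""
    match = ATTRIBUTE_LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not an attribute line: %r" % (line,))
    return match.group(1), int(match.group(2)), match.group(3)


def size_matches(line):
    """True if the declared size equals the value's length in bytes."""
    _name, size, value = parse_attribute(line)
    return size == len(value.encode("utf-8"))
```

	  Both example lines above pass size_matches: the value
	  "Directory-Listing" is 17 bytes long and the md5 value is 32.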

     HOW TO DETERMINE THE INDEX TYPE
	  If you want to determine the type of an existing index,
	  check the first 3 lines of the file ".glimpse_index" (which
	  can be obtained by running "head -3 .glimpse_index").	 These
	  lines always begin with "%".	If the first line has the
	  string "1234567890" after the "%", it means that numbers
	  were indexed (glimpseindex -n); otherwise, it means that
	  numbers were not indexed.  If the second line has a 0 after
	  the "%", then a tiny (default) index was created by
	  glimpseindex; if there is a negative integer after the "%",
	  then a medium sized index was created (glimpseindex -b); if
	  there is a positive integer after the "%", then a small
	  index was created (glimpseindex -o). In the latter two
	  cases, the absolute value of the integer tells you the
	  number of files that were indexed. On the third line, if the
	  "-s" option of glimpseindex was used to build an index for
	  structured queries, the positive integer after the "%" tells
	  you the number of attributes that were found; if not, the
	  third line just contains a "%0".
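
	  The rules above can be turned into a small Python sketch
	  that interprets the three header lines.  The function name
	  and the returned dictionary are assumptions made for this
	  example, and the sketch assumes that nothing but the
	  integer follows the "%" on the second and third lines.

```python
def describe_index(header_lines):
    """Interpret the first three lines of .glimpse_index (illustrative)."""
    if len(header_lines) < 3:
        raise ValueError("need the first three lines of the index")
    lines = [line.strip() for line in header_lines[:3]]
    if not all(line.startswith("%") for line in lines):
        raise ValueError("header lines must begin with '%'")
    first, second, third = (line[1:] for line in lines)

    size_code = int(second)
    if size_code == 0:
        index_type, file_count = "tiny", None       # default index
    elif size_code < 0:
        index_type, file_count = "medium", -size_code  # glimpseindex -b
    else:
        index_type, file_count = "small", size_code    # glimpseindex -o

    attribute_count = int(third)
    return {
        "numbers_indexed": "1234567890" in first,   # glimpseindex -n
        "index_type": index_type,
        "file_count": file_count,
        "structured": attribute_count > 0,          # glimpseindex -s
        "attribute_count": attribute_count,
    }
```

	  The three lines can be obtained with "head -3
	  .glimpse_index" as described above and passed in as a list
	  of strings.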

     REFERENCES
	  1.   U. Manber and S. Wu, "GLIMPSE: A Tool to Search Through
	       Entire File Systems," Usenix Winter 1994 Technical
	       Conference, San Francisco (January 1994), pp. 23-32.
	       Also, Technical Report #TR 93-34, Dept. of Computer
	       Science, University of Arizona, October 1993 (a
	       postscript file is available by anonymous ftp at
	       cs.arizona.edu:reports/1993/TR93-34.ps).

	  2.   S. Wu and U. Manber, "Fast Text Searching Allowing
	       Errors," Communications of the ACM 35 (October 1992),
	       pp. 83-91.

     SEE ALSO
	  agrep(1), ed(1), ex(1), glimpse(1), glimpseserver(1),
	  grep(1V), sh(1), csh(1).

     LIMITATIONS
	  The index of glimpse is word based.  A pattern that contains
	  more than one word cannot be found in the index.  The way
	  glimpse overcomes this weakness is by splitting any multi-
	  word pattern into its set of words and looking for all of
	  them in the index.  For example, glimpse 'linear
	  programming' will first consult the index to find all files
	  containing both linear and programming, and then apply agrep
	  to find the combined pattern.	 This is usually an effective
	  solution, but it can be slow for cases where both words are
	  very common, but their combination is not.
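
	  The two-stage strategy for multi-word patterns can be
	  sketched in Python with a toy in-memory index.  This is a
	  simplified model, not glimpse's implementation: the names
	  are hypothetical, the "index" maps whole lower-cased words
	  to file names, and a plain substring test stands in for
	  agrep.

```python
import re


def build_index(files):
    """Map each lower-cased word to the set of file names containing it."""
    index = {}
    for name, text in files.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index


def search(pattern, files, index):
    """Stage 1: intersect per-word file sets. Stage 2: scan the candidates."""
    words = re.findall(r"[a-z]+", pattern.lower())
    if not words:
        return []
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Only candidate files are scanned for the full pattern.
    return sorted(name for name in candidates if pattern in files[name])
```

	  If two files both contain "linear" and "programming" but
	  only one contains the phrase, stage 1 selects both and
	  stage 2 keeps one, which is why common words with a rare
	  combination are slow.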

	  The index of glimpse stores all patterns in lower case.
	  When glimpse searches the index it first converts all
	  patterns to lower case, finds the appropriate files, and
	  then searches the actual files using the original patterns.
	  So, for example, glimpse ABCXYZ will first find all files
	  containing abcxyz in any combination of lower and upper
	  cases, and then searches these files directly, so only the
	  right cases will be found.  One problem with this approach
	  is discovering misspellings that are caused by wrong cases.
	  For example, glimpse -B abcXYZ will first search the index
	  for the best match to abcxyz (because the pattern is
	  converted to lower case); it will find that there are
	  matches with no errors, and will go to those files to search
	  them directly, this time with the original upper cases. If
	  the closest match is, say AbcXYZ, glimpse may miss it,
	  because it doesn't expect an error.  Another problem is
	  speed.  If you search for "ATT", it will look at the index
	  for "att".  Unless you use -w to match the whole word,
	  glimpse may have to search all files containing, for
	  example, "Seattle" which has "att" in it.

	  There is no size limit for simple patterns and simple
	  patterns with Boolean AND.  More complicated patterns are
	  currently limited to approximately 30 characters.  Lines are
	  limited to 1024 characters.  Records are limited to 48K, and
	  may be truncated if they are larger than that.  The limit of
	  record length can be changed by modifying the parameter
	  Max_record in agrep.h.

	  Each line in .glimpse_exclude or .glimpse_include that
	  contains a * or a ? must not exceed 30 characters in length.

	  Glimpseindex does not index words longer than 64 characters.

	  A medium-size index (-b) may actually lead to slower query
	  times if the files are all very small.

	  Under -b, it may be impossible to make the stop list empty.
	  Glimpseindex uses the "sort" routine, and all occurrences of
	  a word appear at some point on one line.  Sort limits the
	  size of lines it can handle (the value depends on the
	  platform; ours is 16KB).  If the lines are too big, the word
	  is added to the stop list.

     BUGS
	  Please send bug reports or comments to
	  glimpse@cs.arizona.edu.

     AUTHORS
	  Udi Manber and Burra Gopal, Department of Computer Science,
	  University of Arizona, and Sun Wu, the National Chung-Cheng
	  University, Taiwan. (Email:  glimpse@cs.arizona.edu)

     Page 10					     (printed 11/3/95)
