Glimpse Manual


NAME

glimpse 2.1 - search quickly through entire file systems


OVERVIEW

Glimpse (which stands for GLobal IMPlicit SEarch) is an indexing and query system that allows you to search through all your files very quickly. For example, finding 296 lines containing `whitehouse' in 8750 files occupying 104MB took 6 seconds on a SUN Sparc 5. Glimpse supports most of agrep's options (agrep is our powerful version of grep) including approximate matching (e.g., finding misspelled words), Boolean queries, and even some limited forms of regular expressions. It is used in the same way, except that you don't have to specify file names. So, if you are looking for a needle anywhere in your file system, all you have to do is say glimpse needle and all lines containing needle will appear preceded by the file name.

To use glimpse you first need to index your files with glimpseindex, which is typically run every night. glimpseindex -o ~ will index everything at or below your home directory. See man glimpseindex for more details.

Glimpse includes all of agrep and can be used instead of agrep by giving one or more file names at the end of the command. This causes glimpse to ignore the index and run agrep as usual. For example, glimpse -1 pattern file is the same as agrep -1 pattern file. We added a new option to agrep: -r recursively searches the directory and everything below it (see agrep options below); it is used only when glimpse reverts to agrep.

Mail to be added to the glimpse mailing list. Mail to report bugs, ask questions, discuss tricks for using glimpse, etc. (this is a moderated mailing list). HTML version of these manual pages can be found in
Also, see the glimpse developers home page in


SYNOPSIS

glimpse [ -(agrep's options) -C -F file_pattern -H directory -J host_name -K port_number -L x -N -T directory -V -W -z ] pattern


INTRODUCTION

We start with simple ways to use glimpse and describe all the options in detail later on. Once an index is built with glimpseindex, searching for pattern is as easy as saying glimpse pattern.

The output of glimpse is similar to that of agrep (or any other grep), except that the name of the file containing the match appears at the beginning of the line by default. The pattern can be any legal agrep pattern, including a regular expression or a Boolean query (e.g., searching for Tucson AND Arizona is done by glimpse 'Tucson;Arizona').

The speed of glimpse depends mainly on the number and sizes of the files that contain a match and only to a second degree on the total size of all indexed files. If the pattern is reasonably uncommon, then all matches will be reported in a few seconds even if the indexed files total 200MB or more. Some information on how glimpse works and a reference to a detailed article are given below.

Most of agrep's (and other greps') options are supported, including approximate matching. For example,

glimpse -1 'Tuson;Arezona'

will output all lines containing both patterns allowing one spelling error in any of the patterns (either insertion, deletion, or substitution), which in this case is definitely needed.
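As a rough illustration of what "one spelling error" means here, the sketch below computes the classic Levenshtein distance between two whole words, counting each insertion, deletion, or substitution as one error. It is not agrep's algorithm (agrep matches approximately inside lines and is far faster); the function name edit_distance is an assumption made for this example only.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for typed, intended in [("Tuson", "Tucson"), ("Arezona", "Arizona")]:
        print(typed, "->", intended, ":", edit_distance(typed, intended))
```

Both misspellings in the query above are one edit away from the intended words, which is why -1 is enough to find them.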

glimpse -w -i 'parent'

specifies a case-insensitive search (-i) that matches only complete words (-w). So `Parent' and `PARENT' will match, `parent/child' will match, but `parenthesis' or `parents' will not match.
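The word-boundary rule can be mimicked with an ordinary regular expression. The sketch below is only an approximation written for this manual: it assumes that "word" means a run of ASCII letters and digits, and the helper name matches_whole_word is hypothetical.

```python
import re


def matches_whole_word(word: str, line: str) -> bool:
    """True if word occurs in line with no letter or digit directly around it."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(word) + r"(?![A-Za-z0-9])"
    return re.search(pattern, line, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    for text in ["Parent", "PARENT", "parent/child", "parenthesis", "parents"]:
        print(text, matches_whole_word("parent", text))
```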

The -F option provides a pattern that must match the file name. For example,

glimpse -F '\.c$' needle

will find the pattern needle in all files whose name ends with .c. (Glimpse will first check its index to determine which files may contain the pattern and then run agrep on the file names to further limit the search.) The -F option should not be put at the end after the main pattern (e.g., "glimpse needle -F hay" is incorrect).


OPTIONS

The use of glimpse is similar to that of agrep (or any other grep), except that there is no need to specify file names. Most of agrep's (and other greps') options are supported. Keep in mind that the search is over many files, so very common patterns may lead to a huge number of matches. Running glimpse a will work, but will take a long time and will probably output all of the indexed files. We start with the new options, and then list all of agrep's original options (with some additional comments when relevant).

The New Options of Glimpse

-a
prints attribute names. This option applies only to structured data (used with glimpseindex -s); this option was added to support the Harvest project.

-C
tells glimpse to send its queries to glimpseserver. See man glimpseserver for more details.

-F file_pattern
limits the search to those files whose name (including the whole path) matches file_pattern. If file_pattern matches a directory, then all files with this directory on their path will be considered. To limit the search to actual file names, use $ at the end of the pattern. file_pattern can be a regular expression and even a Boolean pattern. (Glimpse simply runs agrep file_pattern on the list of file names obtained from the index to filter the list.) For example,

glimpse -F 'src#\.c$' needle

will search for needle in all .c files with src somewhere along the path. The -F file_pattern must appear before the search pattern (e.g., glimpse needle -F '\.c$' will not work). It is possible to use some of agrep's options when matching file names. In this case all options as well as the file_pattern should be in quotes. (-B and -v do not work very well as part of a file_pattern.) For example,

glimpse -F '-1 gopherc' pattern

will allow one spelling error when matching gopherc to the file names (so "gopherrc" and "gopher" will be considered as well).

glimpse -F '-v \.c$' counter

will search for `counter' in all files except for .c files.
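The file-name filtering step can be pictured as a plain filter over the list of indexed paths. The following sketch uses Python's re module rather than agrep, so the wild card # is written as .* and only the -v (invert) behaviour is modelled; the function name filter_paths and the sample paths are made up for illustration.

```python
import re


def filter_paths(paths, file_pattern, invert=False):
    """Keep paths matching file_pattern, or the non-matching ones if invert is set."""
    regex = re.compile(file_pattern)
    return [p for p in paths if (regex.search(p) is not None) != invert]


if __name__ == "__main__":
    indexed = [
        "/home/u/src/main.c",
        "/home/u/src/notes.txt",
        "/home/u/doc/util.c",
        "/home/u/src/lib/io.c",
    ]
    # Roughly what -F 'src#\.c$' selects: .c files with src along the path.
    print(filter_paths(indexed, r"src.*\.c$"))
    # Roughly what -F '-v \.c$' selects: everything except .c files.
    print(filter_paths(indexed, r"\.c$", invert=True))
```

Only the files that survive this filter are then searched for the main pattern.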

-H directory_name
searches for the index and the other .glimpse files in directory_name. The default is the home directory. This option is useful, for example, if several different indexes are maintained for different archives (e.g., one for mail messages, one for source code, one for articles).

-J host_name
used in conjunction with glimpseserver (-C) to connect to one particular server. See man glimpseserver for more details.

-K port_number
used in conjunction with glimpseserver (-C) to connect to one particular server at the specified TCP port number. See man glimpseserver for more details.

-L x
outputs only the first x matches. If -l is used (i.e., only file names are sought), then the limit is on the number of files; otherwise, the limit is on the number of records.

-N
searches only the index (so the search is faster). If the index was built with glimpseindex -o or -b, then the result is the number of files that have a potential match plus a prompt to ask if you want to see the file names. (If -y is used, then there is no prompt and the names of the files will be shown.) This could be a way to get the matching file names without even having access to the files themselves. However, because only the index is searched, some potential matches may not be real matches. In other words, with -N you will not miss any file but you may get extra files. For example, since the index stores everything in lower case, a case-sensitive query may match a file that has only a case-insensitive match. Boolean queries may match a file that has all the keywords but not in the same line (indexing with -b allows glimpse to figure out whether the keywords are close, but it cannot figure out from the index whether they are exactly on the same line or in the same record without looking at the file). If the index was not built with -o or -b, then this option outputs the number of blocks matching the pattern. This is useful as an indication of how long the search will take. All files are partitioned into usually 200-250 blocks. The file .glimpse_statistics contains the total number of blocks (or glimpse -N a will give a pretty good estimate; only blocks with no occurrences of `a' will be missed).

-T directory
Use directory as a place where temporary files are built. (Glimpse produces some small temporary files usually in /tmp.) This option is useful mainly in the context of structured queries for the Harvest project, where the temporary files may be non-trivial.

-V
prints the current version of glimpse.

-W
The default for Boolean AND queries is that they cover one record (the default for a record is one line) at a time. For example, glimpse 'good;bad' will output all lines containing both `good' and `bad'. The -W option changes the scope of Booleans to be the whole file. Within a file glimpse will output all matches to any of the patterns. So, glimpse -W 'good;bad' will output all lines containing `good' or `bad', but only in files that contain both patterns. For structured queries, the scope is always the whole attribute or file.
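The difference between the two scopes can be sketched in a few lines. This is an illustrative model only, using plain substring tests on one file's text; the function names and_by_line and and_by_file are assumptions of this example, not part of glimpse.

```python
def and_by_line(text, words):
    """Default scope: a line is output only if it contains every word."""
    return [ln for ln in text.splitlines() if all(w in ln for w in words)]


def and_by_file(text, words):
    """-W style scope: the file must contain every word somewhere;
    then every line containing any of the words is output."""
    if not all(w in text for w in words):
        return []
    return [ln for ln in text.splitlines() if any(w in ln for w in words)]


if __name__ == "__main__":
    sample = "good news\nbad news\ngood and bad\nneutral"
    print(and_by_line(sample, ["good", "bad"]))
    print(and_by_file(sample, ["good", "bad"]))
```

On the sample text the line-scoped query reports one line, while the file-scoped query reports three.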

-z
Allow customizable filtering, using the file .glimpse_filters to run the programs listed there for each match. The best example is compress/decompress. If .glimpse_filters includes the line
*.Z uncompress <
(separated by tabs) then before indexing any file that matches the pattern "*.Z" (same syntax as the one for .glimpse_exclude) the command listed is executed first (assuming input is from stdin, which is why uncompress needs <) and its output (assuming it goes to stdout) is indexed. The file itself is not changed (i.e., it stays compressed). Then if glimpse -z is used, the same program is used on these files on the fly. Any program can be used (we run `exec'). For example, one can filter out parts of files that should not be indexed. Glimpseindex tries to apply all filters in .glimpse_filters in the order they are given. For example, if you want to uncompress a file and then extract some part of it, put the compression command (the example above) first and then another line that specifies the extraction. Note that this can slow down the search because the filters need to be run before files are searched. (See also glimpseindex.)

The Options of Agrep Supported by Glimpse

-#
# is an integer between 1 and 8 specifying the maximum number of errors permitted in finding the approximate matches (the default is zero). Generally, each insertion, deletion, or substitution counts as one error. It is possible to adjust the relative cost of insertions, deletions and substitutions (see -I -D and -S options). Since the index stores only lower case characters, errors of substituting upper case with lower case may be missed (see LIMITATIONS).

-c
Display only the count of matching records. Only files with count > 0 are displayed.

-d `delim'
Define delim to be the separator between two records. The default value is `$', namely a record is by default a line. delim can be a string of size at most 8 (with possible use of ^ and $), but not a regular expression. Text between two delim's, before the first delim, and after the last delim is considered as one record. For example, -d '$$' defines paragraphs as records and -d '^From ' defines mail messages as records. glimpse matches each record separately. This option does not currently work with regular expressions. The -d option is especially useful for Boolean AND queries, because the patterns need not appear in the same line but in the same record. For example, glimpse -F mail -d '^From ' 'glimpse;arizona;announcement' will output all mail messages (in their entirety) that have the 3 patterns anywhere in the message (or the header), assuming that files with `mail' in their name contain mail messages. If you want to output a whole file that matches a Boolean pattern, you can use -d 'O9g1Xs' (or another garbage pattern). If the delimiter doesn't appear anywhere, the whole file is one record (there is a limit, however, to the size of records, see LIMITATIONS). Glimpse warning: Use this option with care. If the delimiter is set to match mail messages, for example, and glimpse finds the pattern in a regular file, it may not find the delimiter and will therefore output the whole file. (The -t option - see below - can be used to put the delim at the end of the record.)
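To make the notion of a record concrete, here is a small sketch that splits mailbox-like text into records that each begin with a line starting with "From ", and then keeps the records containing all query words. It only models the behaviour described above with Python's re module; split_records and matching_records are hypothetical helper names, and the sample text is invented.

```python
import re


def split_records(text, prefix="From "):
    """Split text so that each record starts at a line beginning with prefix.
    Any text before the first such line becomes a record of its own."""
    parts = re.split("^(?=" + re.escape(prefix) + ")", text, flags=re.MULTILINE)
    return [p for p in parts if p]


def matching_records(records, words):
    """Boolean AND at record scope: keep records containing every word."""
    return [r for r in records if all(w in r for w in words)]


if __name__ == "__main__":
    mbox = (
        "From alice\nglimpse release\n"
        "From bob\narizona announcement\nabout glimpse\n"
    )
    records = split_records(mbox)
    for rec in matching_records(records, ["glimpse", "arizona", "announcement"]):
        print(rec, end="")
```

The three words sit on different lines of the second message, so a line-scoped query would miss it, while the record-scoped query returns the whole message.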

-e pattern
Same as a simple pattern argument, but useful when the pattern begins with a `-'.

-h
Do not display filenames.

-i
Case-insensitive search - e.g., "A" and "a" are considered equivalent. Glimpse's index stores all patterns in lower case (see LIMITATIONS below).

-k
No symbol in the pattern is treated as a meta character. For example, glimpse -k 'a(b|c)*d' will find the occurrences of a(b|c)*d whereas glimpse 'a(b|c)*d' will find substrings that match the regular expression `a(b|c)*d'. (The only exception is ^ at the beginning of the pattern and $ at the end of the pattern, which are still interpreted in the usual way. Use \^ or \$ if you need them verbatim.)

-l
Output only the names of files that contain a match.

-n
Each matching record (line) is prefixed by its record (line) number in the file.

-r
(This option is valid only when a file name is given and glimpse is used as agrep; it is a new agrep option.) If the file name is a directory name, glimpse will search (recursively) the whole directory and everything below it. Glimpse will not use its index.

-s
Work silently, that is, display nothing except error messages. This is useful for checking the error status.

-t
Output the record starting from the end of delim to (and including) the next delim. This is useful for cases where delim should come at the end of the record. (See warning for the -d option.)

-w
Search for the pattern as a word - i.e., surrounded by non-alphanumeric characters. For example, glimpse -w -1 car will match cars, but not characters and not car10. The non-alphanumeric characters must surround the match; they cannot be counted as errors. This option does not work with regular expressions.

-x
The pattern must match the whole line. (This option is translated to -w when the index is searched and it is used only when the actual text is searched. It is of limited use in glimpse.)

-y
Do not prompt. Proceed with the match as if the answer to any prompt is y.

-B
Best match mode. (Warning: -B sometimes misses matches. It is safer to specify the number of errors explicitly.) When -B is specified and no exact matches are found, glimpse will continue to search until the closest matches (i.e., the ones with minimum number of errors) are found, at which point the following message will be shown: "the best match contains x errors, there are y matches, output them? (y/n)" This message refers to the number of matches found in the index. There may be many more matches in the actual text (or there may be none if -F is used to filter files). When the -#, -c, or -l options are specified, the -B option is ignored. In general, -B may be slower than -#, but not by very much. Since the index stores only lower case characters, errors of substituting upper case with lower case may be missed (see LIMITATIONS).

-D k
Set the cost of a deletion to k (k is a positive integer). This option does not currently work with regular expressions.

-G
Output the (whole) files that contain a match.

-I k
Set the cost of an insertion to k (k is a positive integer). This option does not currently work with regular expressions.

-S k
Set the cost of a substitution to k (k is a positive integer). This option does not currently work with regular expressions.

The characters `$', `^', `*', `[', `]', `|', `(', `)', `!', and `\' can cause unexpected results when included in the pattern, as these characters are also meaningful to the shell. To avoid these problems, enclose the entire pattern in single quotes, i.e., 'pattern'. Do not use double quotes (").


PATTERNS

glimpse supports a large variety of patterns, including simple strings, strings with classes of characters, sets of strings, wild cards, and regular expressions (see LIMITATIONS).

Strings are any sequence of characters, including the special symbols `^' for beginning of line and `$' for end of line. The following special characters ( `$', `^', `*', `[', `^', `|', `(', `)', `!', and `\' ) as well as the following meta characters special to glimpse (and agrep): `;', `,', `#', `<', `>', `-', and `.', should be preceded by `\' if they are to be matched as regular characters. For example, \^abc\\ corresponds to the string ^abc\, whereas ^abc corresponds to the string abc at the beginning of a line.

Classes of characters
a list of characters inside [] (in order) corresponds to any character from the list. For example, [a-ho-z] is any character between a and h or between o and z. The symbol `^' inside [] complements the list. For example, [^i-n] denotes any character in the character set except the characters `i' to `n'. The symbol `^' thus has two meanings, but this is consistent with egrep. The symbol `.' (don't care) stands for any symbol (except for the newline symbol).

Boolean operations
Glimpse supports an `AND' operation denoted by the symbol `;', an `OR' operation denoted by the symbol `,', or any combination. For example, glimpse 'pizza;cheeseburger' will output all lines containing both patterns. glimpse -F 'gnu;\.c$' 'define;DEFAULT' will output all lines containing both `define' and `DEFAULT' (anywhere in the line, not necessarily in order) in files whose name contains `gnu' and ends with .c. glimpse '{political,computer};science' will match `political science' or `science of computers'.

Wild cards
The symbol `#' is used to denote a sequence of any number (including 0) of arbitrary characters (see LIMITATIONS). The symbol # is equivalent to .* in egrep. In fact, .* will work too, because it is a valid regular expression (see below), but unless this is part of an actual regular expression, # will work faster. (Currently glimpse is experiencing some problems with #.)

Combination of exact and approximate matching
Any pattern inside angle brackets <> must match the text exactly even if the match is with errors. For example, <mathemat>ics matches mathematical with one error (replacing the last s with an a), but mathe<matics> does not match mathematical no matter how many errors are allowed. (This option is buggy at the moment.)

Regular expressions
Since the index is word based, a regular expression must match words that appear in the index for glimpse to find it. Glimpse first strips all non-alphabetic characters from the regular expression and searches the index for all remaining words. It then applies the regular expression matching algorithm to the files found in the index. For example, glimpse 'abc.*xyz' will search the index for all files that contain both `abc' and `xyz', and then search directly for `abc.*xyz' in those files. (If you use glimpse -w 'abc.*xyz', then `abcxyz' will not be found, because glimpse will think that abc and xyz need to match whole words.) The syntax of regular expressions in glimpse is in general the same as that for agrep. The union operation `|', Kleene closure `*', and parentheses () are all supported. Currently `+' is not supported. Regular expressions are currently limited to approximately 30 characters (generally excluding meta characters). Some options (-d, -w, -t, -x, -D, -I, -S) do not currently work with regular expressions. The maximal number of errors for regular expressions that use `*' or `|' is 4. (See LIMITATIONS.)


EXAMPLES

(Run "glimpse '^glimpse' this-file" to get a list of all examples, some of which were given earlier.)

glimpse -F 'haystack.h$' needle
finds all needles in all haystack.h files.

glimpse -2 -F html Anestesiology
outputs all occurrences of Anestesiology with two errors in files with html somewhere in their full name.

glimpse -l -F '.c$' variablename
lists the names of all .c files that contain variablename (the -l option lists file names rather than output the matched lines).

glimpse -F 'mail;1993' 'windsurfing;Arizona'
finds all lines containing windsurfing and Arizona in all files having `mail' and `1993' somewhere in their full name.

glimpse -F mail 't.j@#uk'
finds all mail addresses (search only files with mail somewhere in their name) from the uk, where the login name ends with t.j, where the . stands for any one character. (This is very useful to find a login name of someone whose middle name you don't know.)

glimpse -F mbox -h -G . > MBOX
concatenates all files whose name matches `mbox' into one big one.


SEARCHING IN COMPRESSED FILES

Glimpse version 2.1 includes an optional new compression program, called cast, which allows glimpse (and agrep) to search the compressed files without having to decompress them. The search is actually significantly faster when the files are compressed. However, we have not tested cast as thoroughly as we would have liked, and a mishap in a compression algorithm can cause loss of data, so at this point we recommend using cast very carefully. (Unless you specifically use cast, the default is to ignore it.)


FILES

All files used by glimpse are located in the directory (or directories) where the index (or indexes) is stored and have .glimpse_ as a prefix. The first two files (.glimpse_exclude and .glimpse_include) are optionally supplied by the user. The other files are built and read by glimpse.

.glimpse_exclude
contains a list of files that glimpseindex is explicitly told to ignore. The files in this list must either appear with their complete path names or be specified with wild cards (such as *). When in doubt as to how to write the complete path, use -w to get a list of big files and use the same path (or look at .glimpse_filenames). Notice that, although the index itself will not be indexed, the list of file names (.glimpse_filenames) will be indexed unless it is explicitly listed in .glimpse_exclude.

.glimpse_filters
See the description above for the -z option.

.glimpse_include
contains a list of files that glimpseindex is explicitly told to include in the index even though they may look like non-text files. Symbolic links are followed by glimpseindex only if they are specifically included here. Again, the names of these files must include the complete path (or *). If a file is in both .glimpse_exclude and .glimpse_include it will be excluded.

.glimpse_filenames
contains the list of all indexed file names, one per line. This is an ASCII file that can also be used with agrep to search for a file name, which gives a fast find command. For example, glimpse 'count#\.c$' ~/.glimpse_filenames will output the names of all (indexed) .c files that have `count' in their name (including anywhere on the path from the index). Setting the following alias in the .login file may be useful: alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'

.glimpse_index
contains the index. The index consists of lines, each starting with a word followed by a list of block numbers (unless the -o or -b options are used, in which case each word is followed by an offset into the file .glimpse_partitions where all pointers are kept). The block/file numbers are stored in binary form, so this is not an ASCII file.

.glimpse_messages
contains the output of the -w option (see above).

.glimpse_partitions
contains the partition of the indexed space into blocks and, when the index is built with the -o or -b options, some part of the index. This file is used internally by glimpse and it is a non-ASCII file.

.glimpse_statistics
contains some statistics about the makeup of the index. Useful for some advanced applications and customization of glimpse.


REFERENCES

1. U. Manber and S. Wu, "GLIMPSE: A Tool to Search Through Entire File Systems," Usenix Winter 1994 Technical Conference, San Francisco (January 1994), pp. 23-32. Also, Technical Report #TR 93-34, Dept. of Computer Science, University of Arizona, October 1993 (a postscript file is available by anonymous ftp at

2. S. Wu and U. Manber, "Fast Text Searching Allowing Errors," Communications of the ACM 35 (October 1992), pp. 83-91.


SEE ALSO

glimpseindex(1), glimpseserver(1),


LIMITATIONS

The index of glimpse is word based. A pattern that contains more than one word cannot be found in the index. The way glimpse overcomes this weakness is by splitting any multiword pattern into its set of words and looking for all of them in the index. For example, glimpse 'linear programming' will first consult the index to find all files containing both linear and programming, and then apply agrep to find the combined pattern. This is usually an effective solution, but it can be slow for cases where both words are very common, but their combination is not.
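The two-phase search described above can be sketched with a toy in-memory index. This is a simplified model, not glimpse's data structure: the real index points to blocks of files and is stored in binary form, whereas this sketch maps each lower-cased word directly to file names. The names build_index and search_phrase and the sample files are assumptions made for this example.

```python
import re


def build_index(files):
    """Map every lower-cased word to the set of file names containing it."""
    index = {}
    for name, text in files.items():
        for word in re.findall(r"[A-Za-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index


def search_phrase(files, index, phrase):
    """Phase 1: intersect the index entries of each word in the phrase.
    Phase 2: scan only the candidate files for the exact phrase."""
    candidates = set(files)
    for word in re.findall(r"[A-Za-z0-9]+", phrase.lower()):
        candidates &= index.get(word, set())
    hits = [(name, line)
            for name in sorted(candidates)
            for line in files[name].splitlines()
            if phrase in line]
    return sorted(candidates), hits


if __name__ == "__main__":
    files = {
        "a.txt": "We study linear programming here.\n",
        "b.txt": "A linear model.\nProgramming is fun.\n",
        "c.txt": "Nothing relevant.\n",
    }
    candidates, hits = search_phrase(files, build_index(files), "linear programming")
    print("candidates:", candidates)
    print("hits:", hits)
```

In the sample, b.txt passes the index phase because it contains both words, but it is rejected once the file itself is scanned. This is the extra work that makes common word pairs slow.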

As was mentioned in the section on PATTERNS above, some characters serve as meta characters for glimpse and need to be preceded by `\' to search for them. The most common examples are the characters `.' (which stands for a wild card), and `*' (the Kleene closure). So, "glimpse ab.de" will match abcde, but "glimpse ab\.de" will not, and "glimpse ab*de" will not match ab*de, but "glimpse ab\*de" will. The meta character - is translated automatically to a hyphen unless it appears between [] (in which case it denotes a range of characters).

The index of glimpse stores all patterns in lower case. When glimpse searches the index it first converts all patterns to lower case, finds the appropriate files, and then searches the actual files using the original patterns. So, for example, glimpse ABCXYZ will first find all files containing abcxyz in any combination of lower and upper cases, and then search these files directly, so only the right cases will be found. One problem with this approach is discovering misspellings that are caused by wrong cases. For example, glimpse -B abcXYZ will first search the index for the best match to abcxyz (because the pattern is converted to lower case); it will find that there are matches with no errors, and will go to those files to search them directly, this time with the original upper cases. If the closest match is, say, AbcXYZ, glimpse may miss it, because it doesn't expect an error. Another problem is speed. If you search for "ATT", it will look at the index for "att". Unless you use -w to match the whole word, glimpse may have to search all files containing, for example, "Seattle", which has "att" in it.

There is no size limit for simple patterns and simple patterns within Boolean expressions. More complicated patterns, such as regular expressions, are currently limited to approximately 30 characters. Lines are limited to 1024 characters. Records are limited to 48K, and may be truncated if they are larger than that. The limit of record length can be changed by modifying the parameter Max_record in agrep.h.

Glimpseindex does not index words of size > 24.


BUGS

A Boolean AND query that includes two patterns one of which is a prefix of the other (or equal to the other) may not work correctly. Essentially glimpse will find the smallest pattern first, but will not backtrack to try to check again if it matches another pattern. (We are not sure whether this is a bug or a feature, because there is no apparent reason to have patterns like that.)

A Boolean query with a pattern of length 1 (i.e., one character only) may miss matches.

In some rare cases, regular expressions using * or # may not match correctly.

A query that contains no alphanumeric characters is not recommended (unless glimpse is used as agrep and the file names are provided). This is an understatement.

Please send bug reports or comments to


DIAGNOSTICS

Exit status is 0 if any matches are found, 1 if none, 2 for syntax errors or inaccessible files.
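A script that calls glimpse can branch on these three values. The sketch below is one possible wrapper, assuming a glimpse executable is on the PATH and an index already exists; the names describe_status and run_glimpse are hypothetical. It uses -s so that only the status is reported, and -e so that a pattern starting with `-' is not taken for an option.

```python
import subprocess

# Meanings of the exit statuses as documented above.
STATUS_MEANING = {
    0: "match found",
    1: "no match",
    2: "syntax error or inaccessible file",
}


def describe_status(code: int) -> str:
    """Translate a glimpse exit status into a short message."""
    return STATUS_MEANING.get(code, f"unexpected exit status {code}")


def run_glimpse(pattern: str) -> str:
    """Run glimpse silently on pattern and describe the outcome.
    Assumes glimpse is installed; raises FileNotFoundError otherwise."""
    result = subprocess.run(["glimpse", "-s", "-e", pattern])
    return describe_status(result.returncode)


if __name__ == "__main__":
    for code in (0, 1, 2):
        print(code, describe_status(code))
```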


AUTHORS

Udi Manber and Burra Gopal, Department of Computer Science, University of Arizona, and Sun Wu, the National Chung-Cheng University, Taiwan. (Email: