Finding Things
Overview
Teaching: 30 min
Exercises: 20 min
Questions
How can I find files?
How can I find things in files?
Objectives
Use grep to select lines from text files that match simple patterns.
Use find to find files and directories whose names match simple patterns.
Use the output of one command as the command-line argument(s) to another command.
Explain what is meant by ‘text’ and ‘binary’ files, and why many common tools don’t handle the latter well.
In the same way that many of us now use ‘Google’ as a verb meaning ‘to find’, Unix programmers often use the word ‘grep’. ‘grep’ is a contraction of ‘global/regular expression/print’, a common sequence of operations in early Unix text editors. It is also the name of a very useful command-line program.
grep finds and prints lines in files that match a pattern.
For our examples,
we will use a file that contains three haiku taken from a
1998 competition in Salon magazine (credit to authors Joy Rothke, Howard Korder, and
Margaret Segall, respectively; see Haiku Error Messages, archived as Page 1 and Page 2).
For this set of examples,
we’re going to be working in the writing subdirectory:
$ cd
$ cd Desktop/shell-lesson-data/exercise-data/writing
$ cat haiku.txt
The Web site you seek
cannot be located but
endless others exist.
With searching comes loss
and the presence of absence:
"My Thesis" not found.
Yesterday it worked
Today it is not working
Software is like that.
Let’s find lines that contain the word ‘not’:
$ grep not haiku.txt
cannot be located but
"My Thesis" not found.
Today it is not working
Here, not is the pattern we’re searching for.
The grep command searches through the file, looking for matches to the pattern specified.
To use it, type grep, then the pattern we’re searching for, and finally
the name of the file (or files) we’re searching in.
The output is the three lines in the file that contain the letters ‘not’.
By default, grep searches for a pattern in a case-sensitive way. In addition, the search pattern we have selected does not have to form a complete word, as we will see in the next example.
Let’s search for the pattern: ‘The’.
$ grep The haiku.txt
The Web site you seek
"My Thesis" not found.
This time, two lines that include the letters ‘The’ are printed, one of which contains our search pattern within a larger word, ‘Thesis’.
To restrict matches to lines containing the word ‘The’ on its own,
we can give grep the -w option.
This will limit matches to word boundaries.
Later in this lesson, we will also see how we can change the search behavior of grep with respect to its case sensitivity.
$ grep -w The haiku.txt
The Web site you seek
Note that a ‘word boundary’ includes the start and end of a line,
not just the spaces around a word.
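To see what counts as a word boundary, here is a small sketch. It builds a throwaway file (the name words.txt and its contents are invented for illustration and are not part of the lesson data), so it can be run from any directory:

```shell
# Scratch file; words.txt and its four lines are made up for this sketch.
workdir=$(mktemp -d)
printf '%s\n' 'The end' 'Thesis' 'Other: The' '"The" in quotes' > "$workdir/words.txt"

# Without -w, every line containing the letters 'The' is counted.
plain=$(grep -c The "$workdir/words.txt")

# With -w, 'Thesis' drops out; the start of a line, the end of a line,
# and punctuation such as quotes all count as word boundaries.
whole=$(grep -c -w The "$workdir/words.txt")

echo "without -w: $plain, with -w: $whole"
```

Here -c is used only to print a count of matching lines instead of the lines themselves.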
Sometimes we don’t
want to search for a single word, but a phrase. This is also easy to do with
grep by putting the phrase in quotes.
$ grep -w "is not" haiku.txt
Today it is not working
We’ve now seen that you don’t have to have quotes around single words, but it is useful to use quotes when searching for multiple words. It also helps to make it easier to distinguish between the search term or phrase and the file being searched. We will use quotes in the remaining examples.
Another useful option is -n, which numbers the lines that match:
$ grep -n "it" haiku.txt
1:The Web site you seek
5:With searching comes loss
9:Yesterday it worked
10:Today it is not working
Here, we can see that lines 1, 5, 9, and 10 contain the letters ‘it’.
We can combine options (i.e. flags) as we do with other Unix commands.
For example, let’s find the lines that contain the word ‘the’.
We can combine the option -w, to match ‘the’ only as a whole word,
with -n, to number the lines that match:
$ grep -n -w "the" haiku.txt
6:and the presence of absence:
Now we want to use the option -i to make our search case-insensitive:
$ grep -n -w -i "the" haiku.txt
1:The Web site you seek
6:and the presence of absence:
Now, we want to use the option -v to invert our search, i.e., we want to output
the lines that do not contain the word ‘the’.
$ grep -n -w -v "the" haiku.txt
1:The Web site you seek
2:cannot be located but
3:endless others exist.
4:
5:With searching comes loss
7:"My Thesis" not found.
8:
9:Yesterday it worked
10:Today it is not working
11:Software is like that.
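Single-letter options can usually be bundled behind one dash, so -n -w -v can also be written -nwv. A minimal sketch, assuming a scratch copy of the first few haiku lines (the file name sample.txt is made up for this example):

```shell
# Scratch file holding the first six lines of the haiku text.
workdir=$(mktemp -d)
printf '%s\n' 'The Web site you seek' 'cannot be located but' \
    'endless others exist.' '' 'With searching comes loss' \
    'and the presence of absence:' > "$workdir/sample.txt"

# The same search written with separate and with bundled options.
separate=$(grep -n -w -v "the" "$workdir/sample.txt")
bundled=$(grep -nwv "the" "$workdir/sample.txt")

if [ "$separate" = "$bundled" ]; then
    echo "separate and bundled options give the same output"
fi
```

Which form to use is a matter of taste; the separate form is easier to read when a command is being explained, as in this lesson.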
If we use the -r (recursive) option,
grep can search for a pattern recursively through a set of files in subdirectories.
Let’s search recursively for Yesterday in the shell-lesson-data/exercise-data/writing directory:
$ grep -r Yesterday .
./haiku.txt:Yesterday it worked
./LittleWomen.txt:"Yesterday, when Aunt was asleep and I was trying to be as still as a
./LittleWomen.txt:Yesterday at dinner, when an Austrian officer stared at us and then
./LittleWomen.txt:Yesterday was a quiet day spent in teaching, sewing, and writing in my
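When only the file names matter, -r can be paired with -l, which prints the name of each file that contains a match rather than the matching lines. The sketch below uses a made-up directory tree (notes/ and its files are hypothetical) so that it does not depend on the lesson data:

```shell
# Build a small hypothetical tree with one nested directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/notes/old"
printf 'Yesterday it worked\n' > "$workdir/notes/a.txt"
printf 'Today it is not working\n' > "$workdir/notes/b.txt"
printf 'Yesterday was quiet\n' > "$workdir/notes/old/c.txt"

cd "$workdir/notes"
# -r descends into subdirectories; -l lists file names only.
# The output is sorted because the traversal order is not guaranteed.
matches=$(grep -r -l "Yesterday" . | LC_ALL=C sort)
echo "$matches"
```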
grep has lots of other options. To find out what they are, we can type:
$ grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c
Regexp selection and interpretation:
-E, --extended-regexp PATTERN is an extended regular expression (ERE)
-F, --fixed-strings PATTERN is a set of newline-separated fixed strings
-G, --basic-regexp PATTERN is a basic regular expression (BRE)
-P, --perl-regexp PATTERN is a Perl regular expression
-e, --regexp=PATTERN use PATTERN for matching
-f, --file=FILE obtain PATTERN from FILE
-i, --ignore-case ignore case distinctions
-w, --word-regexp force PATTERN to match only whole words
-x, --line-regexp force PATTERN to match only whole lines
-z, --null-data a data line ends in 0 byte, not newline
Miscellaneous:
... ... ...
Using grep
Which command would result in the following output:
and the presence of absence:
1. grep "of" haiku.txt
2. grep -E "of" haiku.txt
3. grep -w "of" haiku.txt
4. grep -i "of" haiku.txt
Solution
The correct answer is 3, because the -w option looks only for whole-word matches. The other options will also match ‘of’ when part of another word (in this case, the word Software).
Wildcards
grep’s real power doesn’t come from its options, though; it comes from the fact that patterns can include wildcards. (The technical name for these is regular expressions, which is what the ‘re’ in ‘grep’ stands for.) Regular expressions are both complex and powerful; if you want to do complex searches, please look at the lesson on our website. As a taster, we can find lines that have an ‘o’ in the second position like this:
$ grep -E "^.o" haiku.txt
Today it is not working
Software is like that.
We use the -E option and put the pattern in quotes to prevent the shell from trying to interpret it. (If the pattern contained a *, for example, the shell would try to expand it before running grep.) The ^ in the pattern anchors the match to the start of the line. The . matches a single character (just like ? in the shell), while the o matches an actual ‘o’.
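The pieces of this pattern can be tried out on a scratch file. In the sketch below, the file lines.txt and its contents are invented for illustration:

```shell
# Scratch file; lines.txt is a made-up name.
workdir=$(mktemp -d)
printf '%s\n' 'Today' 'Software' 'not' 'anchor to' > "$workdir/lines.txt"

# ^ anchors at the start of the line, . matches any one character,
# and o is a literal 'o': only lines with 'o' in second position match.
second_o=$(grep -E "^.o" "$workdir/lines.txt")

# Without the anchor, '.o' matches any character followed by 'o'
# anywhere in the line, so 'anchor to' is counted as well.
anywhere=$(grep -c -E ".o" "$workdir/lines.txt")

echo "$second_o"
echo "lines with '.o' anywhere: $anywhere"
```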
Little Women
You and your friend, having just finished reading Little Women by Louisa May Alcott, are in an argument. Of the four sisters in the book, Jo, Meg, Beth, and Amy, your friend thinks that Jo was the most mentioned. You, however, are certain it was Amy. Luckily, you have a file LittleWomen.txt containing the full text of the novel (shell-lesson-data/exercise-data/writing/LittleWomen.txt). Using a for loop, how would you tabulate the number of times each of the four sisters is mentioned?
Hint: one solution might employ the commands grep and wc and a |, while another might utilize grep options. There is often more than one way to solve a programming task, so a particular solution is usually chosen based on a combination of yielding the correct result, elegance, readability, and speed.
Solutions
for sis in Jo Meg Beth Amy
do
    echo $sis:
    grep -ow $sis LittleWomen.txt | wc -l
done
Alternative, slightly inferior solution:
for sis in Jo Meg Beth Amy
do
    echo $sis:
    grep -ocw $sis LittleWomen.txt
done
This solution is inferior because grep -c only reports the number of lines matched. The total number of matches reported by this method will be lower if there is more than one match per line.
Perceptive observers may have noticed that character names sometimes appear in all-uppercase in chapter titles (e.g. ‘MEG GOES TO VANITY FAIR’). If you wanted to count these as well, you could add the -i option for case-insensitivity (though in this case, it doesn’t affect the answer to which sister is mentioned most frequently).
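The difference between counting matching lines and counting matches is easy to check on a tiny file. The sketch below uses a hypothetical file, names.txt, that is not part of the lesson data:

```shell
# Scratch file; line 1 mentions Jo twice, line 3 only contains 'Joanna'.
workdir=$(mktemp -d)
printf '%s\n' 'Jo and Amy met Jo' 'Amy left' 'Joanna stayed' > "$workdir/names.txt"

# -c counts matching lines: only line 1 contains the whole word 'Jo'.
line_count=$(grep -cw Jo "$workdir/names.txt")

# -o prints each match on its own line, so wc -l counts every occurrence.
match_count=$(grep -ow Jo "$workdir/names.txt" | wc -l)

echo "lines: $line_count"
echo "matches:" $match_count
```

The second echo leaves $match_count unquoted on purpose, because some versions of wc pad their output with spaces.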
While grep finds lines in files,
the find command finds files themselves.
Again,
it has a lot of options;
to show how the simplest ones work, we’ll use the shell-lesson-data/exercise-data
directory tree shown below.
.
├── numbers.txt
├── populations/
│   ├── bowerbird.txt
│   ├── dunnock.txt
│   ├── python.txt
│   ├── shark.txt
│   ├── six-species.csv
│   ├── toad.txt
│   └── wildcat.txt
└── writing/
    ├── haiku.txt
    └── LittleWomen.txt
The exercise-data directory contains one file, numbers.txt, and two directories:
populations and writing containing various files.
For our first command,
let’s run find . (remember to run this command from the shell-lesson-data/exercise-data folder).
$ find .
.
./numbers.txt
./populations
./populations/bowerbird.txt
./populations/dunnock.txt
./populations/python.txt
./populations/shark.txt
./populations/six-species.csv
./populations/toad.txt
./populations/wildcat.txt
./writing
./writing/haiku.txt
./writing/LittleWomen.txt
As always, the . on its own means the current working directory,
which is where we want our search to start.
find’s output is the names of every file and directory
under the current working directory.
This can seem useless at first but find has many options
to filter the output and in this lesson we will discover some
of them.
The first option in our list is
-type d, which means ‘things that are directories’.
Sure enough, find’s output is the names of the three directories (including .):
$ find . -type d
.
./populations
./writing
Notice that the objects find finds are not listed in any particular order.
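If a predictable order matters, one option is to pipe find’s output through sort. This is a sketch on a made-up directory tree (the directory and file names are borrowed from the lesson but created from scratch here):

```shell
# Build a small scratch tree with two directories and one file.
workdir=$(mktemp -d)
mkdir -p "$workdir/writing" "$workdir/populations"
touch "$workdir/numbers.txt"

cd "$workdir"
# -type d keeps only directories; sort puts them in a stable order.
# LC_ALL=C makes the ordering independent of the local language settings.
dirs=$(find . -type d | LC_ALL=C sort)
echo "$dirs"
```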
If we change -type d to -type f,
we get a listing of all the files instead:
$ find . -type f
./numbers.txt
./populations/bowerbird.txt
./populations/dunnock.txt
./populations/python.txt
./populations/shark.txt
./populations/six-species.csv
./populations/toad.txt
./populations/wildcat.txt
./writing/haiku.txt
./writing/LittleWomen.txt
Now let’s try matching by name:
$ find . -name *.txt
./numbers.txt
We expected it to find all the text files,
but it only prints out ./numbers.txt.
The problem is that the shell expands wildcard characters like * before commands run.
Since *.txt in the current directory expands to numbers.txt,
the command we actually ran was:
$ find . -name numbers.txt
find did what we asked; we just asked for the wrong thing.
To get what we want,
let’s do what we did with grep:
put *.txt in quotes to prevent the shell from expanding the * wildcard.
This way,
find actually gets the pattern *.txt, not the expanded filename numbers.txt:
$ find . -name "*.txt"
./numbers.txt
./populations/bowerbird.txt
./populations/dunnock.txt
./populations/python.txt
./populations/shark.txt
./populations/toad.txt
./populations/wildcat.txt
./writing/haiku.txt
./writing/LittleWomen.txt
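The effect of quoting can be reproduced in a scratch directory. In this sketch the names top.txt, sub and inner.txt are invented; the point is only that one .txt file sits in the starting directory and another sits below it:

```shell
# One .txt file at the top level, one in a subdirectory.
workdir=$(mktemp -d)
mkdir "$workdir/sub"
touch "$workdir/top.txt" "$workdir/sub/inner.txt"
cd "$workdir"

# Unquoted: the shell replaces *.txt with top.txt before find runs,
# so find only looks for files named exactly top.txt.
unquoted=$(find . -name *.txt)

# Quoted: find receives the pattern itself and applies it at every depth.
quoted=$(find . -name "*.txt" | LC_ALL=C sort)

echo "unquoted: $unquoted"
echo "quoted:"
echo "$quoted"
```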
Listing vs. Finding
ls and find can be made to do similar things given the right options, but under normal circumstances, ls lists everything it can, while find searches for things with certain properties and shows them.
As we said earlier,
the command line’s power lies in combining tools.
We’ve seen how to do that with pipes;
let’s look at another technique.
As we just saw,
find . -name "*.txt" gives us a list of all text files in or below the current directory.
How can we combine that with wc -l to count the lines in all those files?
The simplest way is to put the find command inside $():
$ wc -l $(find . -name "*.txt")
5 ./numbers.txt
3 ./populations/bowerbird.txt
11 ./populations/dunnock.txt
1 ./populations/python.txt
18 ./populations/shark.txt
20 ./populations/toad.txt
4 ./populations/wildcat.txt
11 ./writing/haiku.txt
21022 ./writing/LittleWomen.txt
21095 total
When the shell executes this command,
the first thing it does is run whatever is inside the $().
It then replaces the $() expression with that command’s output.
Since the output of find is the nine filenames ending in .txt – ./numbers.txt, ./populations/bowerbird.txt,
./populations/dunnock.txt, and so on – the shell constructs the command:
$ wc -l ./numbers.txt ./populations/bowerbird.txt ./populations/dunnock.txt ./populations/python.txt ./populations/shark.txt ./populations/toad.txt ./populations/wildcat.txt ./writing/haiku.txt ./writing/LittleWomen.txt
which is what we wanted.
This expansion is exactly what the shell does when it expands wildcards like * and ?,
but lets us use any command we want as our own ‘wildcard’.
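One way to see what the shell builds is to put echo in front of the command, which prints the expanded command line instead of running it. The sketch below uses made-up files (one.txt and two.txt) and assumes file names without spaces, since the shell splits the output of $() on whitespace:

```shell
# Two scratch files with two lines and one line respectively.
workdir=$(mktemp -d)
printf 'a\nb\n' > "$workdir/one.txt"
printf 'c\n' > "$workdir/two.txt"
cd "$workdir"

# echo shows the command line after $() has been replaced by find's output.
# The names are sorted only to make the printed order predictable.
expanded=$(echo wc -l $(find . -name "*.txt" | LC_ALL=C sort))
echo "$expanded"

# Running the command for real: the last line of wc's output is the total.
total=$(wc -l $(find . -name "*.txt") | tail -n 1)
echo $total
```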
It’s very common to use find and grep together.
The first finds files that match a pattern;
the second looks for lines inside those files that match another pattern.
Here, for example, we can find .txt files that contain the word “searching”
by looking for the string ‘searching’ in all the .txt files in or below the current directory:
$ grep "searching" $(find . -name "*.txt")
./writing/haiku.txt:With searching comes loss
./writing/LittleWomen.txt:sitting on the top step, affected to be searching for her book, but was
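The division of labour between the two commands can be checked on scratch data: find selects files by name, and grep then selects lines by content. In this sketch all file names are hypothetical, and table.csv is included only to show that it is never searched:

```shell
# Two .txt files (one containing the word) and one .csv that also contains it.
workdir=$(mktemp -d)
mkdir "$workdir/docs"
printf 'With searching comes loss\n' > "$workdir/docs/poem.txt"
printf 'nothing relevant here\n' > "$workdir/docs/other.txt"
printf 'searching,1\n' > "$workdir/table.csv"
cd "$workdir"

# find passes only the .txt files to grep, so table.csv is skipped
# even though it contains the search string.
hits=$(grep "searching" $(find . -name "*.txt"))
echo "$hits"
```

As with the wc example, this assumes file names without spaces.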
Matching and Subtracting
The -v option to grep inverts pattern matching, so that only lines which do not match the pattern are printed. Given that, which of the following commands will find all .txt files in populations except toad.txt? Once you have thought about your answer, you can test the commands in the shell-lesson-data/exercise-data directory.
1. find populations -name "*.txt" | grep -v toad
2. find populations -name *.txt | grep -v toad
3. grep -v "toad" $(find populations -name "*.txt")
4. None of the above.
Solution
Option 1 is correct. Putting the match expression in quotes prevents the shell from expanding it, so it gets passed to the find command.
Option 2 would also work in this instance if there were no *.txt files in the current directory. In this case, the shell tries to expand *.txt but finds no match, so the wildcard expression gets passed to find. (We first encountered this in Episode 3.)
Option 3 is incorrect because it searches the contents of the files for lines which do not match ‘toad’, rather than searching the file names.
Binary Files
We have focused exclusively on finding patterns in text files. What if your data is stored as images, in databases, or in some other format?
A handful of tools extend grep to handle a few non-text formats. But a more generalizable approach is to convert the data to text, or extract the text-like elements from the data. Alternatively, we might recognize that the shell and text processing have their limits, and use another programming language.
find Pipeline Reading Comprehension
Write a short explanatory comment for the following shell script:
wc -l $(find . -name "*.csv") | sort -n
Solution
1. Find all files with a .csv extension recursively from the current directory
2. Count the number of lines each of these files contains
3. Sort the output from step 2 numerically
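The three steps can be watched in action on scratch data. This sketch creates two hypothetical files, big.csv and small.csv, and uses awk only to pull the file name out of the first output line:

```shell
# Two scratch CSV files with three lines and one line respectively.
workdir=$(mktemp -d)
printf '1\n2\n3\n' > "$workdir/big.csv"
printf '1\n' > "$workdir/small.csv"
cd "$workdir"

# Steps 1 and 2: find the .csv files and count their lines.
# Step 3: sort numerically, so the shortest file comes first
# and the grand total printed by wc ends up last.
sorted=$(wc -l $(find . -name "*.csv") | sort -n)
echo "$sorted"

# The second field of the first line is the name of the shortest file.
first_file=$(echo "$sorted" | head -n 1 | awk '{print $2}')
echo "shortest file: $first_file"
```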
Key Points
find finds files with specific properties that match patterns.
grep selects lines in files that match patterns.
--help is an option supported by many Bash commands, and programs that can be run from within Bash, to display more information on how to use these commands or programs.
man [command] displays the manual page for a given command.
$([command]) inserts a command’s output in place.