Dynamic Text Processing
Revision as of 05:10, 11 September 2017
As hinted in our Introduction to the Command Line, we actually have more power at our fingertips than one might expect thanks to the command line’s ability to pass a coherent stream of data from one command to another. On this page, we cover two commands that lend themselves particularly well to this approach: grep and sed.
Finding Text: grep
grep finds specific text within its input data according to some pattern. Unfortunately, explaining the name is too complicated for now, so let’s just leave it at grep:
grep "<pattern>"
This will try to find the desired pattern in the lines that you type. If a line matches, it will repeat that line. If it doesn’t match, it will just wait for the next line until you hit Control-d to end your input.
Try this:
grep "Romance"
Then, type any number of lines. Include the word Romance in some but not others. Notice that the only lines that repeat are the ones with Romance in them. Notice also that the matching is case-sensitive, i.e., romance will not match.
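If you would rather not type the lines by hand, here is a minimal sketch that feeds grep a few sample lines through a pipe instead. The sample text and the variable name matches are made up for illustration.

```shell
# Three made-up lines stand in for what you would otherwise type.
matches=$(printf 'A Fine Romance\nno love here\nromance in lowercase\n' | grep "Romance")

# Only the line with a capital-R "Romance" comes back.
printf '%s\n' "$matches"   # A Fine Romance
```

The lowercase line is skipped because the match is case-sensitive.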
Non-Exact Matches
Exact matches are interesting, but most other everyday applications can do this without a problem. Note how we said that grep can match a pattern and not just search text. It turns out that grep can “understand” a wide variety of symbols that represent different patterns of text.
A period (.) represents any single character. Thus, this pattern:
grep "st..r"
...produces all lines that have “st” and “r” with any two symbols in between. So lines with steer or Fred Astaire will match, but store or restart will not.
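To check the four examples above in one go, you could pipe them into grep like this. It is only a sketch; the variable name matches is ours.

```shell
# The four sample lines from the text, one per line.
matches=$(printf 'steer\nFred Astaire\nstore\nrestart\n' | grep "st..r")

# "store" and "restart" drop out: the character two places after "st" is not an r.
printf '%s\n' "$matches"
# steer
# Fred Astaire
```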
Here are some other patterns that you’ll find useful. Needless to say, this is just the tip of the iceberg; as you get more comfortable with grep, you can learn more and more variations for text patterns.
[<characters>]        Matches lines that have any of the characters listed in <characters>
^pattern              Matches lines that start with the given pattern
pattern$              Matches lines that end with the given pattern
pattern*              Matches zero or more repetitions of the given pattern
^[^<characters>]*$    Matches lines that do not have the characters listed in <characters>
Note the dual use of ^; when within brackets [ ] this means “do not match the characters” but when it is the first symbol of the pattern, it represents the start of a line.
As mentioned, there are many more, but this is a start.
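Here is a small sketch of the ^, $, and * patterns at work. The sample lines and variable names are invented for the occasion.

```shell
# A few made-up lines to search through.
fruit='apple pie
pineapple
hot apple'

starts=$(printf '%s\n' "$fruit" | grep "^apple")    # lines that start with "apple"
ends=$(printf '%s\n' "$fruit" | grep "apple$")      # lines that end with "apple"

# l, then zero or more o's, then l, and nothing else on the line.
repeats=$(printf 'll\nlol\nlooool\nlul\n' | grep "^lo*l$")

printf '%s\n' "$starts"    # apple pie
printf '%s\n' "$ends"      # pineapple, hot apple
printf '%s\n' "$repeats"   # ll, lol, looool
```

Note that ll matches too, since * allows zero repetitions.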
A Few Examples
It’s the patterns that truly reveal grep’s potential power. For example, try this:
grep "[qz]"
Here’s what appears on the screen if the user types “hello world,” “quit bugging me,” “Quit bugging me,” “what's up,” “Zounds!,” “zoundz!,” then Control-d:
hello world
quit bugging me
quit bugging me
Quit bugging me
what's up
Zounds!
zoundz!
zoundz!
Since only “quit bugging me” and “zoundz!” match the [qz] pattern, only those lines are repeated by grep.
Negations ([^ ]) may seem unintuitive at first but after some consideration their behavior does make sense:
grep "[^qz]"
At first, one might think that this will match data that have neither q nor z within. However, this is not the case:
hello world
hello world
quit bugging me
quit bugging me
Quit bugging me
Quit bugging me
what's up
what's up
Zounds!
Zounds!
zoundz!
zoundz!
That’s because if a line has any character that is neither a q nor a z, then grep considers that to be a match. Only data that consists entirely of qs and zs will not match:
qqqq
zzzzzzz
qzqzzzq
qq
The key to matching data that don’t contain those characters at all is to combine the negation with ^, *, and $:
grep "^[^qz]*$"
This pattern says that no character from the beginning to the end of the line may be a q or a z:
hello world
hello world
quit bugging me
Quit bugging me
Quit bugging me
what's up
what's up
Zounds!
Zounds!
zoundz!
Remember again, though, that we are case-sensitive by default, so you need to include Q and Z in your pattern if you want to factor in capital letters.
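As a sketch of that last point, this variation lists both cases of each letter inside the brackets. The sample lines and the variable name quiet are made up.

```shell
# Adding Q and Z to the brackets rules out the capitalized lines as well.
quiet=$(printf 'hello world\nquit bugging me\nQuit bugging me\nZounds!\nzoundz!\n' | grep "^[^qQzZ]*$")

# Only the line with no q, Q, z, or Z anywhere survives.
printf '%s\n' "$quiet"   # hello world
```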
Modifying Text: sed
If grep is equivalent to the Find feature on many everyday applications, then sed is like Search and Replace, but on steroids. sed stands for stream editor—a name that makes sense because we’ve talked about data as something that “flows” from one program to another via pipes.
sed’s main function is to take a line of text, then modify it according to some rule that you give it. As with grep, there are lots of rules that you can specify, but for this section we will highlight just a subset.
This is how a sed invocation looks:
sed "<rule>"
As you might expect, this sed invocation is intended to be used as part of a pipe, such as:
cat <filename> | grep "<pattern>" | sed "<rule>"
The above pipe will extract the lines in filename that match pattern, then make sed modify those lines according to the given rule.
When used by itself, as you should be expecting by now, sed "<rule>" then just modifies the lines that you type in on the fly, until you hit Control-d.
- Note: sed by itself does not make any permanent changes to files, so don’t worry about messing up any of the data that you’re using. All examples just display the edited text. Saving to a file is a matter of using output redirection (>).
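To see both the pipe and the note about redirection in action, here is a minimal sketch. It assumes the mktemp command is available for creating scratch files; the file contents and variable names are made up, and the s/Hello/Goodbye/g rule is the substitution covered in the next section.

```shell
# Scratch files, so the sketch does not touch any real data.
original=$(mktemp)
edited=$(mktemp)
printf 'Hello Alice\nGoodnight Bob\nHello Carol\n' > "$original"

# Keep only the lines with "Hello", rewrite them, and save the result with >.
cat "$original" | grep "Hello" | sed "s/Hello/Goodbye/g" > "$edited"

cat "$edited"     # Goodbye Alice, Goodbye Carol
cat "$original"   # still the three lines we started with
```

The original file comes through untouched; only the redirected copy holds the edits. Remove the scratch files with rm when you are done.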
Replacing Text That Fits a Pattern
Perhaps the most commonly used modification rule in sed is s/<pattern>/<replacement>/g. The use of the term pattern is no accident: the patterns that sed recognizes for matching text are nearly identical to those recognized by grep. replacement then takes the place of those matched patterns of text:
sed "s/<pattern>/<replacement>/g"
Give this a shot:
sed "s/Hello/Goodbye/g"
If you then type Hello, hi, bye, and Hello World! as individual lines, sed interjects its output to produce this:
Hello
Goodbye
hi
hi
bye
bye
Hello World!
Goodbye World!
Observe that, unlike grep, sed does repeat every line you type, regardless of whether or not that line matches the pattern. The difference is that, when there is a match, sed performs the specified replacement. Thus, Hello becomes Goodbye, and Hello World! becomes Goodbye World!
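One detail worth a quick experiment is the g at the end of the rule: it tells sed to replace every match on a line, not just the first one. The sketch below compares the two forms on a made-up line; the variable names are ours.

```shell
# Without the trailing g, only the first match on the line is replaced.
first_only=$(printf 'Hello Hello\n' | sed "s/Hello/Goodbye/")

# With the trailing g, every match on the line is replaced.
every_one=$(printf 'Hello Hello\n' | sed "s/Hello/Goodbye/g")

printf '%s\n' "$first_only"   # Goodbye Hello
printf '%s\n' "$every_one"    # Goodbye Goodbye
```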
The power behind this search-and-replace functionality comes from the patterns that you can use, again very similar to those used by grep. In addition to exact matches (like the one shown above), you can do:
.                     Matches a single character
[<characters>]        Matches lines that have any of the characters listed in <characters>
^pattern              Matches lines that start with the given pattern
pattern$              Matches lines that end with the given pattern
pattern*              Matches zero or more repetitions of the given pattern
[^<characters>]*      Matches characters that are not listed in <characters>
Note how, when we are doing text replacement, the use of [^ ] is a little more intuitive:
sed "s/[^FLIMVSPTAYHQNKDECWRG-]/*/g"
...replaces every character that is not one of those between the brackets with an asterisk (*), period. There is no need to use ^ and $ to indicate the beginning and end of the line.
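As a quick check of that rule, here is a sketch with a made-up input line in which X and B are the only characters missing from the bracketed list.

```shell
# X and B are not in the list, so each one turns into an asterisk.
masked=$(printf 'MKV-XLB\n' | sed "s/[^FLIMVSPTAYHQNKDECWRG-]/*/g")

printf '%s\n' "$masked"   # MKV-*L*
```

The hyphen at the very end of the list counts as an ordinary character, which is why the - in the input is left alone.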
In terms of replacement, in addition to replacement with an exact piece of text, you can also delete the matched text; this is a matter of placing nothing between the second and third slashes (/):
sed "s/Evil//g"
...deletes the text “Evil” from any input lines that have it. Of course, you can use non-exact patterns, such as this:
sed "s/^..//g"
The above rule deletes the first two characters of each input line, no matter what those characters are. (A line with fewer than two characters has nothing to match, so it passes through unchanged.)
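Both deletion rules can be tried on a few made-up lines, as in this sketch; the variable names are ours.

```shell
# Deleting an exact piece of text leaves everything around it in place,
# including the space that followed "Evil".
no_evil=$(printf 'Evil Genius\n' | sed "s/Evil//g")

# Deleting the first two characters; the one-character line has no match.
trimmed=$(printf 'abcdef\nx\n' | sed "s/^..//g")

printf '[%s]\n' "$no_evil"   # [ Genius]
printf '%s\n' "$trimmed"     # cdef, then x
```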
Perhaps even more powerful, you can also include the matched text—even though you don’t know what that is per line—in the replacement. Do this by including an ampersand (&) in the replacement text. The matched text replaces the ampersand in the final output. For example:
sed "s/.../& /g"
...this will replace any three characters with the same characters plus a space. Since those three characters will differ from line to line (and in fact many lines will have more than one set of three characters), having & available lets you keep those while adding a space.
As another example:
sed "s/[aeiou]/&&/g"
...will double up every lowercase vowel in the input text.
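Here is what the two ampersand rules do to a couple of made-up lines. This is just a sketch, with variable names of our own choosing.

```shell
# Every full group of three characters gets a space after it;
# the leftover "gh" is too short to match and stays as it is.
grouped=$(printf 'abcdefgh\n' | sed "s/.../& /g")

# Each lowercase vowel is replaced by two copies of itself.
doubled=$(printf 'hello world\n' | sed "s/[aeiou]/&&/g")

printf '%s\n' "$grouped"   # abc def gh
printf '%s\n' "$doubled"   # heelloo woorld
```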
Try those search-and-replace operations on Microsoft Word 😅 Not impossible, but probably harder than doing it with sed (once you learn the patterns).
Gathering Up a Bunch of Rules in a Single File
What if you want to perform a whole bunch of search/replace activities on some text data? On the one hand, you can type multiple sed commands in a pipe. For example, changing all “The”s to “Them” then changing “Bones” to “Brains” may be done this way:
sed "s/The /Them /g" | sed "s/Bones/Brains/g"
When you have a lot of substitutions to do, it would be a pain to write out a long pipe. For precisely this reason, sed allows a variation that does not include the actual rule in the command, but reads the rules from a separate file:
sed -f <file with rules>
The file with rules is a simple text file, with one sed substitution rule per line. Invoking sed -f <file with rules> on a stream of text data is equivalent to performing sed, sequentially, once for every rule in file with rules. It’s mainly a time saver, but a significant one.
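As a minimal sketch, the two rules from the pipe above can be placed in a rules file and applied with a single sed. It assumes mktemp is available for the scratch file; the input line and variable names are made up.

```shell
# Write the two substitution rules to a scratch file, one per line.
rules=$(mktemp)
printf 's/The /Them /g\ns/Bones/Brains/g\n' > "$rules"

# One sed invocation now applies both rules in order.
result=$(printf 'The Bones\n' | sed -f "$rules")

printf '%s\n' "$result"   # Them Brains
rm -f "$rules"
```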
Replacing Characters With Another Set of Characters
As powerful as s/<pattern>/<replacement>/g is, it actually has limitations. For example, what if you wanted to do something similar to a “secret decoder ring,” where, say, every letter becomes the letter after it, and “z” cycles back to “a”? You might think that including a sequence of s/<pattern>/<replacement>/g rules in a file will do this, but it won’t (for simplicity, we’re only including lowercase):
s/a/b/g
s/b/c/g
s/c/d/g
...
s/z/a/g
This won’t work: since the replacements are done in sequence, a word like “adios” becomes “bdios” after the first substitution (i.e., “b” for “a”). Then, when “c” is substituted for “b”, “bdios” becomes “cdios”, which isn’t what you want. Carried through to the last rule, every lowercase letter ends up as “a”.
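You can watch the problem happen with a cut-down, three-letter version of the ring, where “c” cycles back to “a”. This sketch chains the rules in a pipe, which behaves the same as listing them in a rules file; the input and the variable name are made up.

```shell
# Each rule sees the output of the one before it:
# abc -> bbc -> ccc -> aaa
chained=$(printf 'abc\n' | sed "s/a/b/g" | sed "s/b/c/g" | sed "s/c/a/g")

# We wanted "bca", but every letter has been swept along to "a".
printf '%s\n' "$chained"   # aaa
```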
What we need is a different rule, one that replaces each of several letters with a different letter in one fell swoop. This rule does exist in sed, and that is:
y/<original characters>/<new characters>/
Because the replacement must be one-to-one, there must be as many characters in <original characters> as there are in <new characters>. With the y/<original characters>/<new characters>/ rule, the “secret decoder ring” becomes possible:
sed "y/abcdefghijklmnopqrstuvwxyz/bcdefghijklmnopqrstuvwxyza/"
As you might expect, this sed command will “decode” the message produced by the one above:
sed "y/bcdefghijklmnopqrstuvwxyza/abcdefghijklmnopqrstuvwxyz/"
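Inclusion of uppercase letters, plus any other substitutions, is left to you for practice. Do note, however, how y/<original characters>/<new characters>/ is materially different from s/<pattern>/<replacement>/g: the former maps individual characters one-to-one in a single pass, while the latter replaces text matched by a pattern.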
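To convince yourself that the two commands really undo each other, here is a small round-trip sketch. The sample word and variable names are ours.

```shell
# Encode: every lowercase letter becomes the one after it.
encoded=$(printf 'adios\n' | sed "y/abcdefghijklmnopqrstuvwxyz/bcdefghijklmnopqrstuvwxyza/")

# Decode: swap the two character lists to map each letter back.
decoded=$(printf '%s\n' "$encoded" | sed "y/bcdefghijklmnopqrstuvwxyza/abcdefghijklmnopqrstuvwxyz/")

printf '%s\n' "$encoded"   # bejpt
printf '%s\n' "$decoded"   # adios
```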