Difference between revisions of "Dynamic Text Processing"

From LMU BioDB 2017
(Start transcribing text processing page.)
 
(Write up some examples.)

Revision as of 04:53, 11 September 2017

As hinted in our Introduction to the Command Line, we actually have more power at our fingertips than one might expect thanks to the command line’s ability to pass a coherent stream of data from one command to another. On this page, we cover two commands that lend themselves particularly well to this approach: grep and sed.

Finding Text: grep

grep finds specific text within its input data according to some pattern. Unfortunately, explaining the name is too complicated for now, so let’s just leave it at grep:

grep "<pattern>"

This will try to find the desired pattern in the lines that you type. If a line matches, it will repeat that line. If it doesn’t match, it will just wait for the next line until you hit Control-d to end your input.

Try this:

grep "Romance"

Then, type any number of lines. Include the word Romance in some but not others. Notice that the only lines that repeat are the ones with Romance in them. Notice also that the matching is case-sensitive—i.e., romance will not match.
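The same experiment can also be run without typing at the keyboard, by passing lines to grep from another command. This is a minimal sketch, assuming a POSIX shell with printf available; the three sample lines and the variable name romance_out are made up for illustration:

```shell
# Feed three made-up sample lines to grep instead of typing them.
# Only the line containing "Romance" with a capital R is kept.
romance_out=$(printf '%s\n' 'A Romance novel' 'a romance novel' 'No match here' | grep "Romance")
printf '%s\n' "$romance_out"
```

Because the input comes from printf rather than the keyboard, only the matching line appears on the screen.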

Non-Exact Matches

Exact matches are interesting, but most other everyday applications can do this without a problem. Note how we said that grep can match a pattern and not just search text. It turns out that grep can “understand” a wide variety of symbols that represent different patterns of text.

A period (.) represents any single character. Thus, this pattern:

grep "st..r"

...produces all lines that have “st” and “r” with any two symbols in between. So lines with steer or Fred Astaire will match, but store or restart will not.
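The four words mentioned above can be checked in one go. This is only a sketch, assuming a POSIX shell; the variable name dot_out is chosen for illustration:

```shell
# Each argument to printf becomes one input line for grep.
# "st..r" needs "st", then exactly two characters of any kind, then "r".
dot_out=$(printf '%s\n' 'steer' 'Fred Astaire' 'store' 'restart' | grep "st..r")
printf '%s\n' "$dot_out"
```

Only steer and Fred Astaire are printed. In store and restart, the character that comes two positions after "st" is not an r.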

Here are some other patterns that you’ll find useful. Needless to say, this is just the tip of the iceberg; as you get more comfortable with grep, you can learn more and more variations for text patterns.

[<characters>] Matches lines that have any of the characters listed in <characters>
^pattern Matches lines that start with the given pattern
pattern$ Matches lines that end with the given pattern
pattern* Matches zero or more repetitions of the given pattern
^[^<characters>]*$ Matches lines that do not have any of the characters listed in <characters>

Note the dual use of ^: when it is the first symbol inside brackets [ ], it means “match any character except these,” but when it is the first symbol of the whole pattern, it represents the start of a line.
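The start-of-line, end-of-line, and repetition symbols can be tried on a small word list. The following is a sketch under assumptions: the word list and the variable names (sample, start_out, end_out, star_out) are invented for this illustration, and a POSIX shell is assumed:

```shell
# A made-up list of input lines, one word or phrase per line.
sample=$(printf '%s\n' 'apple pie' 'pineapple' 'apple' 'aple' 'grape')

# ^apple : lines that start with "apple"
start_out=$(printf '%s\n' "$sample" | grep "^apple")

# apple$ : lines that end with "apple"
end_out=$(printf '%s\n' "$sample" | grep "apple$")

# ^ap*le : "a" at the start, then zero or more "p", then "le"
star_out=$(printf '%s\n' "$sample" | grep "^ap*le")

printf 'start:\n%s\n\nend:\n%s\n\nstar:\n%s\n' "$start_out" "$end_out" "$star_out"
```

Note that the * applies only to the single symbol right before it (here, the p), so "^ap*le" also accepts the misspelled aple.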

As mentioned, there are many more, but this is a start.

A Few Examples

It’s the patterns that truly reveal grep’s potential power. For example, try this:

grep "[qz]"

Here’s what appears on the screen if the user types “hello world,” “quit bugging me,” “Quit bugging me,” “what's up,” “Zounds!,” “zoundz!,” then Control-d:

hello world
quit bugging me
quit bugging me
Quit bugging me
what's up
Zounds!
zoundz!
zoundz!

Since only “quit bugging me” and “zoundz!” match the [qz] pattern, only those lines are repeated by grep. “Quit bugging me” and “Zounds!” are not repeated because the pattern lists only the lowercase letters.
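To see just the matching lines without the typed input mixed in, the same six lines can be piped into grep. This is a sketch assuming a POSIX shell; the variable name qz_out is chosen for illustration:

```shell
# The six lines from the example, supplied by printf instead of the keyboard.
qz_out=$(printf '%s\n' 'hello world' 'quit bugging me' 'Quit bugging me' \
    "what's up" 'Zounds!' 'zoundz!' | grep "[qz]")
printf '%s\n' "$qz_out"
```

Only the two lines containing a lowercase q or z are printed.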

Negations ([^ ]) may seem unintuitive at first, but after some consideration their behavior does make sense:

grep "[^qz]"

At first, one might think that this will match data that have neither q nor z within. However, this is not the case:

hello world
hello world
quit bugging me
quit bugging me
Quit bugging me
Quit bugging me
what's up
what's up
Zounds!
Zounds!
zoundz!
zoundz!

That’s because if a line has any character that is neither a q nor a z, then grep considers that to be a match. Only data that consist entirely of qs and zs will not match:

qqqq
zzzzzzz
qzqzzzq
qq
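A quick check of this behavior, as a sketch assuming a POSIX shell; the mix of input lines and the variable name neg_out are chosen for illustration:

```shell
# Two lines made up only of q and z, and two lines with other characters.
# [^qz] needs just one character that is neither q nor z somewhere in the line.
neg_out=$(printf '%s\n' 'hello world' 'qqqq' 'zoundz!' 'qzqzzzq' | grep "[^qz]")
printf '%s\n' "$neg_out"
```

Here zoundz! still matches even though it contains z, because its other characters (o, u, n, d, !) satisfy the pattern. The lines qqqq and qzqzzzq are dropped.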

The key to matching data that don’t contain those characters at all is to combine the negation with ^, *, and $:

grep "^[^qz]*$"

This pattern says that no character from the beginning to the end of the line may be a q or a z:

hello world
hello world
quit bugging me
Quit bugging me
Quit bugging me
what's up
what's up
Zounds!
Zounds!
zoundz!
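The same result can be reproduced with piped input, which shows only the lines that grep keeps. This is a sketch assuming a POSIX shell; the variable name none_out is chosen for illustration:

```shell
# ^ and $ pin the pattern to the whole line, and [^qz]* allows any number
# of characters as long as none of them is a lowercase q or z.
none_out=$(printf '%s\n' 'hello world' 'quit bugging me' 'Quit bugging me' \
    "what's up" 'Zounds!' 'zoundz!' | grep "^[^qz]*$")
printf '%s\n' "$none_out"
```

Four lines survive: hello world, Quit bugging me, what's up, and Zounds!. The two with a capital Q or Z pass because the pattern lists only lowercase letters.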

Remember again, though, that grep is case-sensitive by default, so you need to include Q and Z in your pattern if you want to factor in capital letters.
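Adding the capital letters to the brackets might look like the following. This is a sketch assuming a POSIX shell; the variable name nocase_out is chosen for illustration:

```shell
# Listing Q and Z alongside q and z excludes lines with either case.
nocase_out=$(printf '%s\n' 'hello world' 'quit bugging me' 'Quit bugging me' \
    "what's up" 'Zounds!' 'zoundz!' | grep "^[^qzQZ]*$")
printf '%s\n' "$nocase_out"
```

With this pattern, only hello world and what's up remain.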