Difference between revisions of "PostgreSQL Tutorial"

From LMU BioDB 2013
(Write up creation instructions.)
(Final cleanup, streamlining.)
 
(16 intermediate revisions by one user not shown)
Line 28: Line 28:
 
To start using a database, click on its icon.  The red '''x''' disappears from the database icon and you should now be able to work.
 
  
== An Introduction to SQL ==
+
== Walkthrough: Loading the Sample Movie Table Into Your Database ==
  
Of course, “true” database activities are triggered via ''SQL'' commands.  The previous ''SQL'' PDF handout gives you more of a reference/nutshell view of SQL; this page walks you through some commands step-by-step, using the kerfuffle database that has been preloaded into the Keck lab for you.  Then, you’ll practice making your own tables within that database.
+
Before we can dive into SQL, we need to set up some information that we can access.  In doing this, you will see how data in a plain text file can find its way into a full-fledged relational table.
  
=== Select Queries ===
+
We will load up the ''movie_titles.txt'' file in ''~dondi/xmlpipedb/data'' into your own database.  If you '''cat''' that file, you will see that it looks like this:
  
The ''R'' in CRUD, “retrieve,” is, somewhat inconsistently, performed by the ''select'' command. The ''select'' command is the SQL “kitchen sink” for retrieving information from a database. Its general, basic form is:
+
17761,2003,Levity
 +
17762,1997,Gattaca
 +
  17763,1978,Interiors
 +
17764,1998,Shakespeare in Love
 +
17765,1969,Godzilla's Revenge
 +
17766,2002,Where the Wild Things Are and Other Maurice Sendak Stories
 +
  17767,2004,Fidel Castro: American Experience
 +
17768,2000,Epoch
 +
17769,2003,The Company
 +
17770,2003,Alien Hunter
  
select <columns> from <tables> where <conditions>;
+
(this is from the end of the file)
  
As you will see, ''select'' can do even more, but let’s start simple.
+
Looking at the information, we recognize that this file consists of a movie ID, a year, and a title.  Thus, we need to prepare a table with these columns in our database.
  
For this portion, make sure that you are connected to the ''kerfuffle'' database.
+
=== Create the Movie Table ===
  
==== Basics ====
+
Switching back to '''pgAdmin III''', click the SQL button in the toolbar.  A new window with an '''SQL Editor''' tab appears.  The following command will create your ''movie'' table; type this into that tab:
  
The simple (though large) ''kerfuffle'' database consists of two tables: ''movie'' and ''rating''. The ''movie'' table is derived from the same data as the ''movie_titles.txt'' file in the ''~xmlpipedb/data'' directory, and as such holds the same information: numeric movie IDs, release years, and movie titles.
+
  create table movie (id int primary key, year int, title varchar)
  
Due to “impurities” in the data (i.e., some years are marked as “NULL”), the release year had to be loaded as ''text'' and not as ''numbers''.  This prevents us from using numeric conditions like '''>''' and '''<''' on release years, but, as you will see, there are ways around that.  Ideally, data should be “cleaned up” before it’s loaded into a database &mdash; something that may or may not be possible, so we may as well get used to working with “imperfect” data.
+
As always, watch out for typos!  When ready, click on the '''Execute query''' button in the toolbar (it looks like a green play button).
  
The simplest type of ''select'' queries involve getting records from an individual table based on relatively simple conditions.  For example:
+
Upon executing the query, the following should appear in the '''Messages''' tab of the ''Output Pane'' in the bottom half of the window:
  
  select year from movie where title = 'Metropolis';
+
  NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "movie_pkey" for table "movie"
 +
Query returned successfully with no result in 101 ms.
  
...will retrieve the release years of movies whose titles are ''exactly'' “Metropolis.”  If you try this query, you should see two rows: one for 2001 and another for 1927.
+
To assure yourself that the ''movie'' table is indeed there, type and execute this query:
  
In addition to equality, ''like'' lends further flexibility. The ''like'' comparison allows for pattern matching, similar but not identical to '''grep'''.  In SQL, the percent sign (“'''%'''”) is a “wildcard” that can represent any number of letters and symbols.  ''like'' and '''%''' can be combined for broader queries, such as this one, which retrieves all movie titles that have the word “Vampire” in them:
+
  select * from movie
  
select title from movie where title like '%Vampire%';
+
The '''Data Output''' tab of the ''Output Pane'' should now show an empty tabular display with headings for '''id integer''', '''year integer''', and '''title character varying'''.
  
'''psql'''’s data display works a lot like '''more''': if there’s more data than can fit in the command window, '''psql''' will pause and recognize the following keys:
+
=== Prepare Data for Insertion into the Movie Table ===
* '''Enter''' moves through the data one line at a time
+
* The '''space bar''' moves through the data one window at a time
+
* '''q''' exits the data display
+
  
If you want ''select'' to display ''all'' columns of a database record, use the asterisk ('''*'''):
+
At this point, you have a ''movie'' table, but no data: that information currently resides in ''movie_titles.txt'' over in the Keck lab's ''my.cs.lmu.edu'' server.  How do we get that data into our ''movie'' table?
  
select * from movie where year like '196%';
+
Recall that the SQL command for adding data has this format:
  
This will display ''every'' column/field of movie records whose year starts with “196” &mdash; effectively, all movies released in the 1960s.
+
insert into '''''table''''' ('''''columns''''') values ('''''values''''')
  
Conditions can be combined via ''and'' and ''or''; for example, retrieving movies whose title contains ''either'' “DNA” or “Bio” can be done with:
+
In other words, this text data:
  
  select * from movie where title like '%DNA%' or title like '%Bio%';
+
  17761,2003,Levity
 +
17762,1997,Gattaca
 +
17763,1978,Interiors
 +
17764,1998,Shakespeare in Love
 +
17765,1969,Godzilla's Revenge
 +
17766,2002,Where the Wild Things Are and Other Maurice Sendak Stories
 +
17767,2004,Fidel Castro: American Experience
 +
17768,2000,Epoch
 +
17769,2003,The Company
 +
17770,2003,Alien Hunter
  
When using more than two conditions, watch out for how conditions are grouped together; this query, for example, may yield unexpected results:
+
...must be made to look like this:
  
  select * from movie where title like '%DNA%' or title like '%Bio%' and year like '200%';
+
  insert into movie(id, year, title) values (17761,2003,'Levity');
 +
insert into movie(id, year, title) values (17762,1997,'Gattaca');
 +
insert into movie(id, year, title) values (17763,1978,'Interiors');
 +
insert into movie(id, year, title) values (17764,1998,'Shakespeare in Love');
 +
insert into movie(id, year, title) values (17765,1969,'Godzilla''s Revenge');
 +
insert into movie(id, year, title) values (17766,2002,'Where the Wild Things Are and Other Maurice Sendak Stories');
 +
insert into movie(id, year, title) values (17767,2004,'Fidel Castro: American Experience');
 +
insert into movie(id, year, title) values (17768,2000,'Epoch');
 +
insert into movie(id, year, title) values (17769,2003,'The Company');
 +
insert into movie(id, year, title) values (17770,2003,'Alien Hunter');
  
At face value, the above query may read like “movies whose titles have either DNA or Bio which were released in the 2000s;” in reality, the ''year'' criterion is only ''and''-ed with movies that have “Bio” in the title.  Thus, the database actually interprets this query as “movies whose titles have DNA, ''or'' whose titles have Bio which were released in the 2000s.”
+
Note a few rules here that might not be very obvious:
 +
* Text values need to be enclosed between single quotes; numbers don't.
 +
* Note, in the case of movie 17765, ''Godzilla's Revenge'', that the ''movie title itself'' contains a single quote (apostrophe).  To distinguish an “in-text” apostrophe from a “wrapper” apostrophe, we double it up; that's why the SQL above shows ''Godzilla's Revenge'' with the apostrophe converted into two.
 +
* When performing multiple SQL queries, the semicolon (;) is used to distinguish one query from another.  Think of the semicolon as playing the same role that periods (.) do in plain English sentences.
  
To eliminate any ambiguities, use parentheses to group conditions together:
+
Hmmmmm...do we have a program that can do this?  We need to take the lines in the ''movie_titles.txt'' file and convert them into valid SQL ''insert'' commands.  Why yes we do, and you have already used it: '''sed'''.
  
  select * from movie where (title like '%DNA%' or title like '%Bio%') and year like '200%';
+
Since this is a tutorial, we won't spend time to explain exactly how we come up with the '''sed''' command below. But you should be able to see the pieces:
 +
* We need to turn all single apostrophes into doubles.
 +
* We then need to “wrap” the titles at the end in single quotes.
 +
* Finally, we prepend the ''insert'' command to each line, and
 +
* end each line with a parenthesis and semicolon.
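
The steps listed above can also be sketched in Python, for readers who would like to see each transformation spelled out.  This is only an illustration of the same idea, not the method this tutorial uses (that remains '''sed'''); the function name <code>line_to_insert</code> is made up for this example.

```python
def line_to_insert(line):
    """Turn one 'id,year,title' line into an SQL insert statement."""
    # Split on the first two commas only, since a title may contain commas itself.
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    # Double every apostrophe so that it survives inside the quoted title.
    title = title.replace("'", "''")
    # Wrap the title in single quotes, put the insert command in front,
    # and close the statement with a parenthesis and a semicolon.
    return "insert into movie(id, year, title) values ({},{},'{}');".format(
        movie_id, year, title
    )


print(line_to_insert("17765,1969,Godzilla's Revenge"))
```

Running this on the ''Godzilla's Revenge'' line prints the same ''insert'' statement shown earlier for movie 17765.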
  
These parentheses force the database to pick out movies with either title first, and ''then'' check if these movies have years beginning with “200.”
+
Finally, we somehow need to get the data into the PostgreSQL server running on your workstation.  For this, we will use the built-in web server of the ''my.cs.lmu.edu'' host: we will deposit the results of '''sed''' into a file that a ''web browser'' can then display.  From the web browser, we can copy and paste the ''insert'' statements into the '''pgAdmin III''' SQL window.
  
The ''like'' comparator can do simple text matches, but it does ''not'' use regular expressions (i.e., the search patterns recognized by '''grep''' and '''sed''').  This area is a little shaky in SQL-land; there is an official ''similar to'' comparator which is the official way to make regular expression comparisons, but the format for those expressions is not the same as the format used by '''grep''' and '''sed'''.
+
All that said (no pun intended), this is the command that you want (invoke it from ''~dondi/xmlpipedb/data''):
  
Fortunately, PostgreSQL has a specific, PostgreSQL-only comparator that ''does'' match the same patterns used by '''grep''' and '''sed''': the tilde ('''~'''). Comparing with '''~''' is equivalent to a '''grep'''- or '''sed'''-like comparison:
+
cat movie_titles.txt | sed "s/'/''/g" | sed "s/,/,'/2" | sed "s/^/insert into movie(id,year,title) values(/g" | sed "s/$/');/g" > ~/public_html/movie.sql.txt
  
select * from movie where title ~ 'Vamp[iy]re';
+
At this point, you can probably work out the '''sed''' commands. The portion we will explain just a little bit more is the last section, <code>&gt; ~/public_html/movie.sql.txt</code>.  We have not needed to use the '''>''' symbol before, but now it is just what we need: it “sends” the result of the prior '''sed''' commands into a file. That file is placed in your ''public_html'' folder, which, if you recall, is visible on the web as http://my.cs.lmu.edu/~your_username/.
  select * from movie where title ~ 'End$';
+
  select * from movie where title ~ 'Colou?r';
+
  
The caveat here is that '''~''' is a PostgreSQL-specific feature: if you move to other database systems (such as Microsoft Access), that feature may either be done differently or missing completely, since it is not part of the official SQL standard.
+
=== Insert! ===
  
==== Sorting ====
+
Finally, we can feed these 17,770 ''insert'' statements (quick, how did we know this?) into PostgreSQL.  Open a new browser tab or window and go to http://my.cs.lmu.edu/~username/movie.sql.txt (remember to substitute ''username'' with your Keck lab ''ssh''/''PuTTY'' login).  You should see your fresh '''sed''' product in the browser.  From here, '''Select All''' and '''Copy''' the commands.
  
As you play with various queries on the ''movie'' table, you’ve probably noticed that results are returned in no particular order; if you’d like to sort the results in some way, tack on an ''order by'' clause at the end of the ''select'' query:
+
Switch to the '''pgAdmin III''' SQL window, empty out the '''SQL Editor''' tab, then '''Paste''' your ''insert'' statements into the tab.  Finally, execute the query (the green play button, remember?) and let it work.  With 17,770 records, this takes a ''little'' bit longer than prior commands that you have run.
  
select * from movie where year like '200%' order by year;
+
In the end, the '''Messages''' tab in the ''Output Pane'' should say something like:
  
This will display the records/rows for movies released from 2000 to 2009, sorted by year. You can add more fields for a very specific sort order:
+
  Query returned successfully: one row affected, 2325 ms execution time.
  
select * from movie where year like '200%' order by year, title;
+
(exact execution times will vary)
  
The above query returns the same records, but this time sorted by year first, then by title ''within'' each year.
+
Once more, check your work; re-execute this:
  
Sort order is ascending by default (e.g., A to Z, 0 to 9); for the reverse order, add ''desc'' to the field(s) that you’d like to see sorted in reverse:
+
select * from movie
  
select * from movie where year like '200%' order by year desc, title;
+
This time, you should see a fully-populated '''Data Output''' tab, with...17,770 rows.
  
This will display records with years displayed most recent first; within each year, however, titles will still be sorted in ascending order.
+
== Walkthrough: Adding a Few More Tables ==
  
==== Joins ====
+
To make our database a little more interesting, we will create a couple more tables and load them with some data.  The rest of the tutorial will be based on these tables and their content.
  
Thus far, we’ve only been working with one table in the ''kerfuffle'' database: ''movie''.  A second (huge) table adds some further interest: ''rating''.  This table stores ratings, from 1 to 5, made by individual members on movies they have seen.  Quick recall: what command should you type in order to see the structure (schema) of the ''rating'' table?
+
At this stage, you’ve already seen these commands in some form, so we’ll skip the explanation and go right to the commands.  Copy, paste, and run this (you successfully copied and pasted 17,770 ''insert'' commands previously, so this should be no sweat!):
  
Examination of the ''rating'' table reveals that it has ''movie'', ''member'', ''rating'', and ''rating_date'' fields. Thus, each record consists of a single rating, made by a particular member for a particular movie, and the date of that rating’s submission. Staying with one table for now, this query will list all of the ratings submitted by member no. 6:
+
create table member (id int primary key, name varchar);
 +
create table rating (movie int references movie(id), member int references member(id), rating int);
 +
insert into member(id, name) values (6, 'Natalie');
 +
insert into member(id, name) values (8, 'Boomer');
 +
insert into member(id, name) values (42, 'Doug');
 +
  insert into rating(movie, member, rating) values(209, 6, 5);
 +
insert into rating(movie, member, rating) values(2040, 6, 5);
 +
insert into rating(movie, member, rating) values(6908, 6, 2);
 +
  insert into rating(movie, member, rating) values(2610, 6, 1);
 +
insert into rating(movie, member, rating) values(8809, 6, 2);
 +
insert into rating(movie, member, rating) values(37, 6, 4);
 +
insert into rating(movie, member, rating) values(113, 6, 2);
 +
insert into rating(movie, member, rating) values(8687, 8, 1);
 +
insert into rating(movie, member, rating) values(9628, 8, 5);
 +
insert into rating(movie, member, rating) values(10877, 8, 2);
 +
insert into rating(movie, member, rating) values(12513, 8, 1);
 +
insert into rating(movie, member, rating) values(15923, 8, 2);
 +
insert into rating(movie, member, rating) values(15532, 8, 3);
 +
insert into rating(movie, member, rating) values(4006, 8, 3);
 +
insert into rating(movie, member, rating) values (30, 42, 4);
 +
insert into rating(movie, member, rating) values (113, 42, 5);
 +
insert into rating(movie, member, rating) values (37, 42, 5);
 +
insert into rating(movie, member, rating) values (4765, 42, 5);
 +
insert into rating(movie, member, rating) values (6762, 42, 3);
 +
insert into rating(movie, member, rating) values (6853, 42, 4);
 +
insert into rating(movie, member, rating) values (10176, 42, 4);
 +
insert into rating(movie, member, rating) values (13847, 42, 5);
 +
insert into rating(movie, member, rating) values (15127, 42, 3);
 +
insert into rating(movie, member, rating) values (15532, 42, 1);
  
  select * from rating where member = 6;
+
You should see the customary success messages once the commands are done. Time to play!
  
This returns a good chunk of data, but you may have noticed that the result isn’t quite as meaningful to us, since we get movie ''IDs'' back instead of ''titles''.  These movie titles, however, are in the ''movie'' table, not ''rating''.  We thus need to ''join'' the two tables.  As you might have seen when looking at the schema of the ''rating'' table, the ''movie'' field is a ''foreign key'' to the ''id'' field in the ''movie'' table.  Thus, every ''movie'' value in the ''rating'' table matches some ''id'' in the ''movie'' table, thus leading us to that movie’s title.
+
== An Introduction to SQL Select Queries ==
  
An SQL ''join'' uses the same basic ''select'' command, but requires the tables being joined (in this case ''movie'' and ''rating''), as well as the fields/columns to use for “joining” records. “Joining” records means that their fields are combined to create a new “virtual” record.  All other parts of the ''select'' command retain the same meanings as before:
+
Database activities are triggered via ''SQL'' commands.  The previous ''SQL'' PDF handout gives you more of a reference/nutshell view of SQL; this page walks you through some commands step-by-step, using the sample movie database that you set up earlier in this page.
  
select <columns> from <table1> inner join <table2> on (<join condition>) where <conditions>;
+
The ''select'' command is the SQL “kitchen sink” for retrieving information from a database.  Its general, basic form is:
  
Thus, the previous query, modified to retrieve the titles and ratings of the movies rated by member no. 6, looks like this:
+
select '''''columns''''' from '''''tables''''' where '''''conditions'''''
  
select title, rating from movie inner join rating on (movie.id = rating.movie) where member = 6;
+
As you will see, ''select'' can do even more, but let’s start simple.
  
As in the basic ''select'' command, you can tailor the ''where'' conditions as you need to pull ratings based on other criteria.  For example, ratings for a particular movie, as opposed to a particular member, can be retrieved by changing the ''where'' clause.  The query below displays all ratings for the movie(s) whose title is ''Love Story'':
+
=== Basics ===
  
select rating from movie inner join rating on(movie.id = rating.movie) where title = 'Love Story';
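
To see what a join produces without touching the real database, the member no. 6 query can be tried against a tiny throwaway database.  The sketch below uses Python's built-in <code>sqlite3</code> module as a stand-in for PostgreSQL (an assumption made only so that the example runs anywhere); the tables are simplified and their rows are hypothetical, not taken from ''kerfuffle''.

```python
import sqlite3

# An in-memory SQLite database stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table movie (id int primary key, year int, title varchar);
    create table rating (movie int references movie(id), member int, rating int);
    -- Hypothetical rows, for illustration only.
    insert into movie(id, year, title) values (1, 1999, 'First Film');
    insert into movie(id, year, title) values (2, 2005, 'Second Film');
    insert into rating(movie, member, rating) values (1, 6, 5);
    insert into rating(movie, member, rating) values (2, 6, 2);
    insert into rating(movie, member, rating) values (1, 8, 3);
""")

# Each rating row is matched to the movie row whose id equals rating.movie.
# "order by title" is added only to make the output order predictable.
joined = conn.execute(
    "select title, rating from movie inner join rating "
    "on (movie.id = rating.movie) where member = 6 order by title"
).fetchall()
print(joined)
```

Only the two ratings made by member 6 come back, each paired with the title of the movie it refers to.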
+
The simplest type of ''select'' query involves getting records from an individual table based on relatively simple conditions. For example:
  
==== Aggregate Queries ====
+
select year from movie where title = 'Metropolis';
  
There are ''lots'' of ratings in the ''kerfuffle'' database &mdash; so many that just displaying the individual ratings records may not be very useful.  For large databases, SQL provides ''aggregate'' (a.k.a. “grouping”) queries that summarize multiple records in different ways.
+
...will retrieve the release years of movies whose titles are ''exactly'' “Metropolis.”  If you try this query, you should see two rows: one for 2001 and another for 1927.
  
The simplest form of summary is counting: ''how many'' records were retrieved? A simple overall count is done by using ''count(*)'' as the thing to select (or ''project'', in formal relational algebraic terms):
+
In addition to equality, ''like'' lends further flexibility. The ''like'' comparison allows for pattern matching, similar but not identical to '''grep'''.  In SQL, the percent sign (“'''%'''”) is a “wildcard” that can represent any number of letters and symbols.  ''like'' and '''%''' can be combined for broader queries, such as this one, which retrieves all movie titles that have the word “Vampire” in them:
  
  select count(*) from movie where year like '200%';
+
  select title from movie where title like '%Vampire%';
  
This will display the number of movies whose release years start with “200.”
+
If you want ''select'' to display ''all'' columns of a database record, use the asterisk ('''*'''):
  
Aside from grouping an ''overall'' set of records, subgroups can also be processed.  Consider this non-aggregating query:
+
select * from movie where title like '19%';
  
select rating from movie inner join rating on (id = movie) where title = 'The Godfather';
+
This will display ''every'' column/field of movie records whose title starts with “19”&mdash;these mostly appear to be films pertaining to a specific year in the 20th century.
  
This will give you a bunch of 1s, 2s, 3s, 4s, and 5s, as were submitted by members who have seen the film ''The Godfather''.  This is almost information overload &mdash; we don’t really care about the individual ratings, but ''how many'' of each rating was given.  So if you think about it, what we really want is:
+
Conditions can be combined via ''and'' and ''or''; for example, retrieving movies whose title contains ''either'' “DNA” or “Bio” can be done with:
# Gather all of the records for a particular rating (1, 2, 3, 4, or 5)
+
# For each of these ''groups'' of records, perform a count
+
When this type of ''grouping'' is desired, the query changes in these ways:
+
* The columns indicated in the ''select'' clause now list the value that is being grouped, plus the way those groups are ''aggregated'' (in this case, we’d like to count the number of records in each group)
+
* A new clause, the ''group by'' clause, is added after the ''where'' clause to indicate the field whose values should be used as the basis for grouping.
+
  
Thus, finding the number of ratings received by ''The Godfather'', for each of the five possible ratings, is done with this query:
+
select * from movie where title like '%DNA%' or title like '%Bio%';
  
select rating, count(rating) from movie inner join rating on (id = movie) where title = 'The Godfather' group by rating;
+
When using more than two conditions, watch out for how conditions are grouped together; this query, for example, may yield unexpected results:
<!-- Save for later
+
select rating, count(rating) from movie inner join rating on (id = movie) where title = 'The Godfather' group by rating having count(rating) > 8000;
+
-->
+
As before, you can experiment with queries such as these, but for different movies and/or ratings, before moving further down the page.  If you feel like exploring this type of query further, rest assured that there’s a lot more functionality available; either use '''\h select''' or look up the ''select'' command’s ''group by'' clause on the Web.
+
  
<!-- Save for later
+
select * from movie where title like '%DNA%' or title like '%Bio%' and title like '%:%';
==== Making PostgreSQL Explain Itself ====
+
-->
+
  
=== Creating Tables ===
+
At face value, the above query may read like “movies whose titles have either DNA or Bio which also have a colon (''':''');” in reality, the colon criterion is only ''and''-ed with movies that have “Bio” in the title.  Thus, the database actually interprets this query as “movies whose titles have DNA, ''or'' whose titles have Bio and a colon.”
  
Now that you’ve logged some time with ''retrieving'' data, you’re probably aware that someone must have ''put'' the data there in the first place.  First, the table(s) holding the data must be set up, or ''created'' &mdash; this is the ''C'' in CRUD.  Half of table creation actually occurs without a computer: you (or the database designer) must determine the ''schema'' for the table independently of the database.  This schema consists of the fields in the table, the kind of data each field will hold (numbers, text, dates, etc.), and any primary or foreign keys in the table.  Ideally, you should have a complete relational schema diagram, such as the one shown in the handout or worked on previously in class.
+
To eliminate any ambiguities, use parentheses to group conditions together:
  
With this table schema, creating the table in SQL is a matter of converting the diagram into a ''create table'' command:
+
select * from movie where (title like '%DNA%' or title like '%Bio%') and title like '%:%';
  
create table <tablename> (<columns, their data types, and the primary key, separated by commas>);
+
These parentheses force the database to pick out movies with either title first, and ''then'' check if these movies have a colon in their titles.
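
The difference between the two groupings is easiest to see on a handful of rows.  The sketch below is an illustration only: it uses Python's built-in <code>sqlite3</code> module as a stand-in for PostgreSQL (''and'' binds more tightly than ''or'' in both), and the four titles are made up for the example.

```python
import sqlite3

# An in-memory SQLite database stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("create table movie (id int primary key, year int, title varchar)")
conn.executemany(
    "insert into movie(id, year, title) values (?, ?, ?)",
    [
        (1, 2001, "DNA"),              # made-up rows, for illustration only
        (2, 2002, "Bio: A Story"),
        (3, 2003, "Bio"),
        (4, 2004, "DNA: The Sequel"),
    ],
)


def titles(where_clause):
    """Return the titles matching the given where clause, in id order."""
    rows = conn.execute(
        "select title from movie where " + where_clause + " order by id"
    )
    return [row[0] for row in rows]


# Without parentheses, the colon test only applies to the Bio titles.
ungrouped = titles("title like '%DNA%' or title like '%Bio%' and title like '%:%'")
# With parentheses, the colon test applies to every DNA or Bio title.
grouped = titles("(title like '%DNA%' or title like '%Bio%') and title like '%:%'")

print(ungrouped)
print(grouped)
```

The ungrouped query lets the plain “DNA” title through even though it has no colon; the grouped query keeps only the two titles that contain one.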
  
If you’re still connected to the ''kerfuffle'' database, disconnect via '''\q''', then run '''psql''' again, this time all by itself.  This should connect you to your individual database.
+
The ''like'' comparator can do simple text matches, but it does ''not'' use regular expressions (i.e., the search patterns recognized by '''grep''' and '''sed''').  This area is a little shaky in SQL-land; the SQL standard does define a ''similar to'' comparator for regular expression comparisons, but the format for those expressions is not the same as the format used by '''grep''' and '''sed'''.
  
Suppose you’ve designed a ''person'' table that looks like this:
+
Fortunately, PostgreSQL has a specific, PostgreSQL-only comparator that ''does'' match the same patterns used by '''grep''' and '''sed''': the tilde ('''~''').  Comparing with '''~''' is equivalent to a '''grep'''- or '''sed'''-like comparison:
  
:{| style="width: 10em; border: solid 1px"
+
select * from movie where title ~ 'Vamp[iy]re';
| style="background: lightgray; padding: 1ex" | id
+
select * from movie where title ~ 'End$';
|-
+
select * from movie where title ~ 'Colou?r';
| style="padding: 1ex" | firstname
+
|-
+
| style="padding-left: 1ex; padding-right: 1ex" | lastname
+
|-
+
| style="padding: 1ex" | dob
+
|}
+
  
...where ''id'' should be an integer, ''firstname'' and ''lastname'' can be any text, and ''dob'' is some date.  In SQL, an “integer” is indicated by ''int'', text is indicated by ''varchar'', and a date is indicated by (for once) ''date''.  Further, based on the shading, the primary key of this table is ''id''. Putting this all together yields the following commands:
+
The caveat here is that '''~''' is a PostgreSQL-specific feature: if you move to other database systems (such as Microsoft Access), that feature may either be done differently or missing completely, since it is not part of the official SQL standard.
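
To get a feel for what the three patterns above accept, they can be tried outside the database.  The sketch below uses Python's <code>re</code> module, whose syntax is close to, but not identical to, the regular expressions PostgreSQL understands; for these three simple patterns the two are assumed to behave the same.  The titles are illustrative and are not claimed to be in ''movie_titles.txt''.

```python
import re

# Illustrative titles only.
titles = [
    "Interview with the Vampire",
    "Vampyres",
    "The End",
    "Endless Summer",
    "The Color Purple",
    "Colour Me Kubrick",
]


def matching(pattern):
    """Return the titles in which the pattern is found anywhere.

    re.search looks for the pattern anywhere in the text, which mirrors
    how an unanchored pattern is matched by the tilde comparator.
    """
    return [title for title in titles if re.search(pattern, title)]


for pattern in ["Vamp[iy]re", "End$", "Colou?r"]:
    print(pattern, "->", matching(pattern))
```

Note how <code>End$</code> accepts “The End” but not “Endless Summer”: the dollar sign ties the match to the end of the title.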
  
create table person (id int primary key, firstname varchar, lastname varchar, dob date);
+
Numeric values, such as the ''year'' column, can be compared using ''='', ''<'', ''>'', ''<='' (less than or equal to), ''>='' (greater than or equal to), and ''<>'' (not equal):
  
Once done, you will now have an empty table called ''person'', showing the columns stated above.
+
select * from movie where year = 1960;
 +
select * from movie where year < 1930;
 +
select * from movie where year >= 1960 and year < 1970;
  
=== Adding Data ===
+
=== Sorting ===
  
Now that a table exists to ''receive'' data, the ''data itself'' has to be “handed” to the table.  The SQL ''insert'' command accomplishes this:
+
As you play with various queries on the ''movie'' table, you’ve probably noticed that results are returned in no particular order; if you’d like to sort the results in some way, tack on an ''order by'' clause at the end of the ''select'' query:
  
<pre>insert into <table>(<columns>) values(<specific column values, typically enclosed in apostrophes>);</pre>
+
select * from movie where year > 2000 order by year
  
For example, in order to add a person named John Smith to the new person table, with numeric ID 1000 and a birthdate of June 30, 1980, you would perform:
+
This will display the records/rows for movies released from 2001 onward, sorted by year.  You can add more fields for a very specific sort order:
  
<pre>insert into person(id, firstname, lastname, dob) values(1000, 'John', 'Smith', '6/30/1980');</pre>
+
select * from movie where year > 2000 order by year, title
  
Note how this ''insert'' command adds one record at a time; adding records ''en masse'' in this manner is a matter of forming a bunch of ''insert'' commands and having '''psql''' perform them all, one at a time.
+
The above query returns the same records, but this time sorted by year first, then by title ''within'' each year.
  
As much as you like, use the '''up''' arrow to retrieve, then edit the previous ''insert'' command so that it adds a new record.
+
Sort order is ascending by default (e.g., A to Z, 0 to 9); for the reverse order, add ''desc'' to the field(s) that you’d like to see sorted in reverse:
  
Note that listing the columns of the table is not redundant; this listing allows you to change the order used for the succeeding values (e.g., ''lastname'' first, then ''firstname'', ''id'', and ''dob'').  Make sure to change the value of ''id'' to something that has not been added to the table yet &mdash; it has to be unique to the ''entire'' table since it has been designated as the primary key.
+
select * from movie where year > 2000 order by year desc, title;
  
==== sed Redux ====
+
This will display records with years displayed most recent first; within each year, however, titles will still be sorted in ascending order.
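
The effect of mixing ''desc'' and the default ascending order can be checked on the ten sample rows shown at the start of this page.  The sketch below is an illustration only, using Python's built-in <code>sqlite3</code> module as a stand-in for PostgreSQL; the query itself is the one given above, restricted to the ''year'' and ''title'' columns.

```python
import sqlite3

# An in-memory SQLite database stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("create table movie (id int primary key, year int, title varchar)")
conn.executemany(
    "insert into movie(id, year, title) values (?, ?, ?)",
    [
        # The ten sample rows from the end of movie_titles.txt.
        (17761, 2003, "Levity"),
        (17762, 1997, "Gattaca"),
        (17763, 1978, "Interiors"),
        (17764, 1998, "Shakespeare in Love"),
        (17765, 1969, "Godzilla's Revenge"),
        (17766, 2002, "Where the Wild Things Are and Other Maurice Sendak Stories"),
        (17767, 2004, "Fidel Castro: American Experience"),
        (17768, 2000, "Epoch"),
        (17769, 2003, "The Company"),
        (17770, 2003, "Alien Hunter"),
    ],
)

# Most recent year first; titles ascending within each year.
ordered = conn.execute(
    "select year, title from movie where year > 2000 order by year desc, title"
).fetchall()
for year, title in ordered:
    print(year, title)
```

The three movies from 2003 appear together, after the one from 2004, and in alphabetical order among themselves.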
  
Text files such as ''movie_titles.txt'', with records corresponding to individual lines in those files, frequently serve as an initial data source for a relational database table.  Fortunately, '''psql''' has a useful property that allows us to use what we have learned previously in order to make database loading easier: it can participate in a ''pipe'', just like '''grep''', '''sed''', and others.  In other words, '''psql''' can be ''given'' lines of text from the outside via the vertical bar ('''|'''); if this is done, each line of text is expected to be a valid SQL command.  '''psql''' then performs each of these commands as if you had typed them in yourself.
+
=== Aggregate Queries ===
  
Thus, data loading from a text file can be done by:
+
For large databases, SQL provides ''aggregate'' (a.k.a. “grouping”) queries that summarize multiple records in different ways.
* Passing the text file through a series of '''sed''' commands so that each line of that file now looks like a valid SQL ''insert'' command.
+
* Adding, at the very end of this '''sed''' command sequence, a pipe into '''psql''':
+
<sequence of sed commands> | psql <database>
+
  
When '''psql''' starts, it will ask you for a password like it always does, then proceed to perform the SQL commands that are coming in exactly as if they had been typed in directly.
+
The simplest form of summary is counting: ''how many'' records were retrieved?  A simple overall count is done by using ''count(*)'' as the thing to select (or ''project'', in formal relational algebraic terms):
  
==== Dealing with Import Errors ====
+
select count(*) from movie where year > 2000
  
As you might have already seen, '''psql'''’s response to a correct insert statement is:
+
This will display the number of movies released from 2001 onward.
  
  INSERT 0 1
+
Of course, you can mix and match everything you have learned so far. To count the movies whose titles begin with a “B” that were released in 1975, you can query:
  
Any other message typically indicates an error. When piping a long sequence of ''insert'' commands through '''psql''', errors may fly by, and if your terminal window does not scroll back far enough, you may miss it.  If you do notice what may be error message during an import, remember that other tools are still available for piping &mdash; in particular, '''more'''.  Just like at any other time, piping through '''more''' will pause the text, one screenful at a time:
+
  select count(*) from movie where title ~ '^B' and year = 1975
  
  <sequence of sed commands> | psql <database> | more
+
Aggregators other than ''count'' are available, such as ''min'', ''max'', and ''avg'' (average or mean). Though somewhat odd-sounding, you can ask, for example, for the “average year” of movies with “London” in the title:
  
Of course, in a mixed error/no-error ''insert'' sequence, you’ll end up with some data in the target table, but not all of it. This can be easily verified through a combination of '''wc''' and SQL ''select''.  On the database side, you can count the number of records that got imported using:
+
  select average(year) from movie where title like '%London%'
  
<pre>select count(*) from <table name>;</pre>
+
Finally, you can aggregate ''multiple groups'' of data, so that you get different sets of statistics.  The ''group by'' keyword does this:
 +
 
 +
select year, count(*) from movie where year < 1935 group by year
 +
 
 +
The main rule with ''group by'' is that the column being grouped should also be part of the ''select'' clause.  This makes sense because otherwise, you wouldn’t be able to tell which group was which!  And of course, you can mix and match.  For example, the query above makes more sense if we arrange the data chronologically:
 +
 
 +
select year, count(*) from movie where year < 1935 group by year order by year
 +
 
 +
See any trends in that data?
 +
 
 +
=== Joins ===
 +
 
 +
Thus far, we’ve only been working with one table in the database: ''movie''.  Our other tables, ''member'' and ''rating'', add some further interest.  These tables store ratings, from 1 to 5, made by individual members on movies they have seen.
 +
 
 +
Examination of the ''rating'' table reveals that it has ''movie'', ''member'', and ''rating''.  Thus, each record consists of a single rating, made by a particular member for a particular movie.  Staying with one table for now, this query will list all of the ratings submitted by member no. 6:
 +
 
 +
select * from rating where member = 6
 +
 
 +
This returns the expected answer (based on the ''insert''s that you copy-pasted previously), but you may have noticed that the result isn’t quite as meaningful to us, since we get movie ''IDs'' back instead of ''titles''.  These movie titles, however, are in the ''movie'' table, not ''rating''.  We thus need to ''join'' the two tables.  As you might have seen when looking at the schema of the ''rating'' table, the ''movie'' field is a ''foreign key'' to the ''id'' field in the ''movie'' table.  Thus, every ''movie'' value in the ''rating'' table matches some ''id'' in the ''movie'' table, thus leading us to that movie’s title.
 +
 
 +
An SQL ''join'' uses the same basic ''select'' command, but requires the tables being joined (in this case ''movie'' and ''rating''), as well as the fields/columns to use for “joining” records.  “Joining” records means that their fields are combined to create a new “virtual” record.  All other parts of the ''select'' command retain the same meanings as before:
 +
 
 +
select '''columns''' from '''table1''' inner join '''table2''' on ('''join condition''') where '''conditions'''
 +
 
 +
Thus, the previous query, modified to retrieve the titles and ratings of the movies rated by member no. 6, looks like this:
  
You can then compare this number to the number of lines returned by '''wc''':
+
select title, rating from movie inner join rating on (movie.id = rating.movie) where member = 6
  
  cat <filename> | wc
+
As in the basic ''select'' command, you can tailor the ''where'' conditions as you need to pull ratings based on other criteria. For example, ratings for a particular movie, as opposed as for a particular member, can be retrieved by changing the ''where'' clause.  The query below displays all ratings for the movie(s) whose title has an apostrophe (see the last section below for more on this):
  
You may be one or two off, in case there are blank lines in the file. Larger discrepancies between ''count'' and '''wc''' are probably errors.
+
select rating from movie inner join rating on (movie.id = rating.movie) where title like '%<nowiki>''</nowiki>%'
  
Finally, after troubleshooting things, you may want to start overUse this last SQL command with caution, because it results in loss of data (which, in this one case, is what you ''want'' to happen &mdash; but just this time):
+
Note that the ''member'' column of the ''rating'' table is a foreign key as well—it refers to the ''id'' column of the ''member'' tableThe ''member'' table in our sample only includes member names, but in practice it can hold much more information.  We can join more than once; extending the query above, we can now produce all ratings for the movie(s) whose title has an apostrophe, ''and'' list the member names who gave those ratings:
  
<pre>delete from <table name>;</pre>
+
select name, rating from movie inner join rating on (movie.id = rating.movie) inner join member on (member.id = rating.member) where title like '%<nowiki>''</nowiki>%'
  
This command empties the table of all records, allowing you to start fresh.
+
Note how this approach helps avoid redundancy: instead of copying a member’s name over and over again, we can store the member’s name in one place, and use their ID as a reference to that name.  This elimination of redundancy is called ''normalization''—tables are ''normalized'' if they eliminate data redundancy in their columns (except, of course, for foreign keys).
  
==== The Notorious Apostrophe ====
+
=== The Notorious Apostrophe ===
  
 
You might have noticed that, because the apostrophe or single quote is used to indicate specific values in SQL (e.g., 'The Godfather', 'Smith', '6/30/1980', etc.), we run into a potential problem when the value itself should contain an apostrophe.  This is not as uncommon as one might think; for example, a good number of movie titles have apostrophes (''By Dawn's Early Light'', ''Zatoichi's Conspiracy'', ''Logan's Run'', and ''Dead Men Don't Wear Plaid'', to name a few), as do many names (“O'Malley,” “M'Benga,” “D'Angelo”).  An ''insert'' command such as the one below will result in an error, since the apostrophe will be misinterpreted as ending a piece of text rather than as part of the text itself:
 
You might have noticed that, because the apostrophe or single quote is used to indicate specific values in SQL (e.g., 'The Godfather', 'Smith', '6/30/1980', etc.), we run into a potential problem when the value itself should contain an apostrophe.  This is not as uncommon as one might think; for example, a good number of movie titles have apostrophes (''By Dawn's Early Light'', ''Zatoichi's Conspiracy'', ''Logan's Run'', and ''Dead Men Don't Wear Plaid'', to name a few), as do many names (“O'Malley,” “M'Benga,” “D'Angelo”).  An ''insert'' command such as the one below will result in an error, since the apostrophe will be misinterpreted as ending a piece of text rather than as part of the text itself:

Latest revision as of 05:23, 26 September 2013

This page gives you a tutorial-style walkthrough for using PostgreSQL. The walkthrough assumes that you’ve been set up for PostgreSQL use within the Keck lab infrastructure.

But first, a little leadoff cartoon: http://xkcd.com/327

Running PostgreSQL on the Keck Lab Windows Machines

  1. Log in to the computer as usual
  2. From the Start/Windows icon menu, launch pgAdmin III
  3. The pgAdmin III window starts with a hierarchical view on the left that starts with three layers:
    • Server Groups
      • Servers (1)
        • PostgreSQL 9.2 (localhost:5432)
  4. Double-click on PostgreSQL 9.2 (localhost:5432) to connect to the database server
  5. The password to start the server is simply keck

Creating a Database

Once the server is running, the red x disappears from the PostgreSQL 9.2 (localhost:5432) icon, and additional icons appear beneath it. If you click on the + button to the left of the Databases icon, you will see the databases that are currently available. Initially, you will see a single database called postgres.

To do your work and practice some SQL, it is recommended that you work on a database of your own. To create a database, right-click on the Databases icon and choose New Database... from the menu that appears. In the New Database dialog, the only information you need to supply is your new database's name. To avoid confusion in case multiple students use the same computer, use your Keck lab username as the name of your database.

When you click OK, you will return to the main pgAdmin III window and you should see your new database underneath the Databases icon.

Note that you only need to go through this creation process once; that database will remain available until it is explicitly deleted.

Connecting to a Database

To start using a database, click on its icon. The red x disappears from the database icon and you should now be able to work.

Walkthrough: Loading the Sample Movie Table Into Your Database

Before we can dive into SQL, we need to set up some information that we can access. In doing this, you will see how data in a plain text file can find its way into a full-fledged relational table.

We will load up the movie_titles.txt file in ~dondi/xmlpipedb/data into your own database. If you cat that file, you will see that it looks like this:

17761,2003,Levity
17762,1997,Gattaca
17763,1978,Interiors
17764,1998,Shakespeare in Love
17765,1969,Godzilla's Revenge
17766,2002,Where the Wild Things Are and Other Maurice Sendak Stories
17767,2004,Fidel Castro: American Experience
17768,2000,Epoch
17769,2003,The Company
17770,2003,Alien Hunter

(this is from the end of the file)

Looking at the information, we recognize that this file consists of a movie ID, a year, and a title. Thus, we need to prepare a table with these columns in our database.

Create the Movie Table

Switching back to pgAdmin III, click the SQL button in the toolbar. A new window with an SQL Editor tab appears. The following command will create your movie table; type this into that tab:

create table movie (id int primary key, year int, title varchar)

As always, watch out for typos! When ready, click on the Execute query button in the toolbar (it looks like a green play button).

Upon executing the query, the following should appear in the Messages tab of the Output Pane in the bottom half of the window:

NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "movie_pkey" for table "movie"
Query returned successfully with no result in 101 ms.

To assure yourself that the movie table is indeed there, type and execute this query:

select * from movie

The Data Output tab of the Output Pane should now show an empty tabular display with headings for id integer, year integer, and title character varying.

Prepare Data for Insertion into the Movie Table

At this point, you have a movie table, but no data: that information currently resides in movie_titles.txt over in the Keck lab's my.cs.lmu.edu server. How do we get that data into our movie table?

Recall that the SQL command for adding data has this format:

insert into table (columns) values (values)

In other words, this text data:

17761,2003,Levity
17762,1997,Gattaca
17763,1978,Interiors
17764,1998,Shakespeare in Love
17765,1969,Godzilla's Revenge
17766,2002,Where the Wild Things Are and Other Maurice Sendak Stories
17767,2004,Fidel Castro: American Experience
17768,2000,Epoch
17769,2003,The Company
17770,2003,Alien Hunter

...must be made to look like this:

insert into movie(id, year, title) values (17761,2003,'Levity');
insert into movie(id, year, title) values (17762,1997,'Gattaca');
insert into movie(id, year, title) values (17763,1978,'Interiors');
insert into movie(id, year, title) values (17764,1998,'Shakespeare in Love');
insert into movie(id, year, title) values (17765,1969,'Godzilla''s Revenge');
insert into movie(id, year, title) values (17766,2002,'Where the Wild Things Are and Other Maurice Sendak Stories');
insert into movie(id, year, title) values (17767,2004,'Fidel Castro: American Experience');
insert into movie(id, year, title) values (17768,2000,'Epoch');
insert into movie(id, year, title) values (17769,2003,'The Company');
insert into movie(id, year, title) values (17770,2003,'Alien Hunter');

Note a few rules here that might not be very obvious:

  • Text values need to be enclosed between single quotes; numbers don't.
  • Note, in the case of movie 17765, Godzilla's Revenge, that the movie title itself contains a single quote (apostrophe). To distinguish an “in-text” apostrophe from a “wrapper” apostrophe, we double it up; that's why the SQL above shows 'Godzilla''s Revenge', with the apostrophe converted into two.
  • When performing multiple SQL queries, the semicolon (;) is used to distinguish one query from another. Think of the semicolon as playing the same role that periods (.) do in plain English sentences.

Hmmmmm...do we have a program that can do this? We need to take the lines in the movie_titles.txt file and convert them into valid SQL insert commands. Why yes we do, and you have already used it: sed.

Since this is a tutorial, we won't spend time to explain exactly how we come up with the sed command below. But you should be able to see the pieces:

  • We need to turn all single apostrophes into doubles.
  • We then need to “wrap” the titles at the end in single quotes.
  • Finally, we prepend the insert command to each line, and
  • end each line with a parenthesis and semicolon.

Finally, we somehow need to get the data into the PostgreSQL server running on your workstation. For this, we will use the built-in web server of the my.cs.lmu.edu host: we will deposit the results of sed into a file that a web browser can then display. From the web browser, we can copy and paste the insert statements into the pgAdmin III SQL window.

All that said (no pun intended), this is the command that you want (invoke it from ~dondi/xmlpipedb/data):

cat movie_titles.txt | sed "s/'/''/g" | sed "s/,/,'/2" | sed "s/^/insert into movie(id,year,title) values(/g" | sed "s/$/');/g" > ~/public_html/movie.sql.txt

At this point, you can probably work out the sed commands. The portion we will explain just a little bit more is the last section, > ~/public_html/movie.sql.txt. We have not needed to use the > symbol before, but now it is just what we need: it “sends” the result of the prior sed commands into a file. That file is placed in your public_html folder, which, if you recall, is visible on the web as http://my.cs.lmu.edu/~your_username/.
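To see what each sed stage contributes, the same pipeline can be tried on a handful of lines before running it on the whole file. The following is a minimal sketch, assuming a Bourne-style shell; the helper name to_insert and the three-line sample are made up for illustration and are not part of the tutorial's setup.

```shell
# Three sample lines in the same id,year,title format as movie_titles.txt.
sample="17762,1997,Gattaca
17765,1969,Godzilla's Revenge
17767,2004,Fidel Castro: American Experience"

# to_insert is a hypothetical helper name; it chains the tutorial's four sed stages.
to_insert() {
  sed "s/'/''/g" |                                       # 1. double every apostrophe
  sed "s/,/,'/2" |                                       # 2. open a quote after the second comma
  sed "s/^/insert into movie(id,year,title) values(/" |  # 3. put the insert prefix before each line
  sed "s/\$/');/"                                        # 4. close the quote and end the statement
}

result=$(printf '%s\n' "$sample" | to_insert)
printf '%s\n' "$result"
```

The order of the stages matters: the opening quote is added at the second comma before the insert prefix is attached, because the prefix itself contains commas that would otherwise throw off the count.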

Insert!

Finally, we can feed these 17,770 insert statements (quick, how did we know this?) into PostgreSQL. Open a new browser tab or window and go to http://my.cs.lmu.edu/~username/movie.sql.txt (remember to substitute username with your Keck lab ssh/PuTTY login). You should see your fresh sed product in the browser. From here, Select All and Copy the commands.
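One way to check how many statements a generated file holds is to count its lines. This is only a sketch: it builds a small stand-in file with mktemp so that it can be run anywhere, and the variable names are made up for illustration. On the server, the same two commands would be pointed at your own movie.sql.txt instead.

```shell
# A small stand-in file; the real target would be your generated movie.sql.txt.
tmpfile=$(mktemp)
printf '%s\n' \
  "insert into movie(id,year,title) values(17768,2000,'Epoch');" \
  "insert into movie(id,year,title) values(17769,2003,'The Company');" \
  "insert into movie(id,year,title) values(17770,2003,'Alien Hunter');" > "$tmpfile"

# wc -l counts lines; reading from standard input keeps the file name out of the output.
total=$(wc -l < "$tmpfile" | tr -d ' ')
# grep -c counts only the lines that start with "insert into".
inserts=$(grep -c '^insert into' "$tmpfile")

echo "lines: $total, insert statements: $inserts"
rm -f "$tmpfile"
```

If the two numbers differ, some lines in the file are not insert statements (blank lines, for instance) and are worth a look before pasting.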

Switch to the pgAdmin III SQL window, empty out the SQL Editor tab, then Paste your insert statements into the tab. Finally, execute the query (the green play button, remember?) and let it work. With 17,770 records, this takes a little bit longer than prior commands that you have run.

In the end, the Messages tab in the Output Pane should say something like:

Query returned successfully: one row affected, 2325 ms execution time.

(exact execution times will vary)

Once more, check your work; re-execute this:

select * from movie

This time, you should see a fully-populated Data Output tab, with...17,770 rows.

Walkthrough: Adding a Few More Tables

To make our database a little more interesting, we will create a couple more tables and load them with some data. The rest of the tutorial will be based on these tables and their content.

At this stage, you’ve already seen these commands in some form, so we’ll skip the explanation and go right to the commands. Copy, paste, and run this (you successfully copied and pasted 17,770 insert commands previously, so this should be no sweat!):

create table member (id int primary key, name varchar);
create table rating (movie int references movie(id), member int references member(id), rating int);
insert into member(id, name) values (6, 'Natalie');
insert into member(id, name) values (8, 'Boomer');
insert into member(id, name) values (42, 'Doug');
insert into rating(movie, member, rating) values(209, 6, 5);
insert into rating(movie, member, rating) values(2040, 6, 5);
insert into rating(movie, member, rating) values(6908, 6, 2);
insert into rating(movie, member, rating) values(2610, 6, 1);
insert into rating(movie, member, rating) values(8809, 6, 2);
insert into rating(movie, member, rating) values(37, 6, 4);
insert into rating(movie, member, rating) values(113, 6, 2);
insert into rating(movie, member, rating) values(8687, 8, 1);
insert into rating(movie, member, rating) values(9628, 8, 5);
insert into rating(movie, member, rating) values(10877, 8, 2);
insert into rating(movie, member, rating) values(12513, 8, 1);
insert into rating(movie, member, rating) values(15923, 8, 2);
insert into rating(movie, member, rating) values(15532, 8, 3);
insert into rating(movie, member, rating) values(4006, 8, 3);
insert into rating(movie, member, rating) values (30, 42, 4);
insert into rating(movie, member, rating) values (113, 42, 5);
insert into rating(movie, member, rating) values (37, 42, 5);
insert into rating(movie, member, rating) values (4765, 42, 5);
insert into rating(movie, member, rating) values (6762, 42, 3);
insert into rating(movie, member, rating) values (6853, 42, 4);
insert into rating(movie, member, rating) values (10176, 42, 4);
insert into rating(movie, member, rating) values (13847, 42, 5);
insert into rating(movie, member, rating) values (15127, 42, 3);
insert into rating(movie, member, rating) values (15532, 42, 1);

You should see the customary success messages once the commands are done. Time to play!

An Introduction to SQL Select Queries

Database activities are triggered via SQL commands. The previous SQL PDF handout gives you more of a reference/nutshell view of SQL; this page walks you through some commands step-by-step, using the sample movie database that you set up earlier in this page.

The select command is the SQL “kitchen sink” for retrieving information from a database. Its general, basic form is:

select columns from tables where conditions

As you will see, select can do even more, but let’s start simple.

Basics

The simplest select queries involve getting records from an individual table based on relatively simple conditions. For example:

select year from movie where title = 'Metropolis';

...will retrieve the release years of movies whose titles are exactly “Metropolis.” If you try this query, you should see two rows: one for 2001 and another for 1927.

In addition to equality, like lends further flexibility. The like comparison allows for pattern matching, similar but not identical to grep. In SQL, the percent sign (“%”) is a “wildcard” that can represent any number of letters and symbols. like and % can be combined for broader queries, such as this one, which retrieves all movie titles that have the word “Vampire” in them:

select title from movie where title like '%Vampire%';

If you want select to display all columns of a database record, use the asterisk (*):

select * from movie where title like '19%';

This will display every column/field of movie records whose title starts with “19”—these mostly appear to be films pertaining to a specific year in the 20th century.

Conditions can be combined via and and or; for example, retrieving movies whose title contains either “DNA” or “Bio” can be done with:

select * from movie where title like '%DNA%' or title like '%Bio%';

When using more than two conditions, watch out for how conditions are grouped together; this query, for example, may yield unexpected results:

select * from movie where title like '%DNA%' or title like '%Bio%' and title like '%:%';

At face value, the above query may read like “movies whose titles have either DNA or Bio which also have a colon (:);” in reality, the colon criterion is only and-ed with movies that have “Bio” in the title. Thus, the database actually interprets this query as “movies whose titles have DNA, or whose titles have Bio and a colon.”

To eliminate any ambiguities, use parentheses to group conditions together:

select * from movie where (title like '%DNA%' or title like '%Bio%') and title like '%:%';

These parentheses force the database to pick out movies with either title first, and then check if these movies have a colon in their titles.

The like comparator can do simple text matches, but it does not use regular expressions (i.e., the search patterns recognized by grep and sed). This area is a little shaky in SQL-land: the SQL standard defines a similar to comparator for regular expression comparisons, but the format for those expressions is not the same as the format used by grep and sed.

Fortunately, PostgreSQL has a specific, PostgreSQL-only comparator that does match the same patterns used by grep and sed: the tilde (~). Comparing with ~ is equivalent to a grep- or sed-like comparison:

select * from movie where title ~ 'Vamp[iy]re';
select * from movie where title ~ 'End$';
select * from movie where title ~ 'Colou?r';

The caveat here is that ~ is a PostgreSQL-specific feature: if you move to other database systems (such as Microsoft Access), that feature may either be done differently or missing completely, since it is not part of the official SQL standard.
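Because ~ accepts grep-style patterns, a pattern can be tried out on the command line before it goes into a query. The sketch below assumes grep -E (extended patterns, which is what makes the ? in Colou?r optional); the helper name match and the sample titles are chosen for illustration and are not necessarily rows in the movie table.

```shell
# Sample titles for illustration only; they are not taken from the movie table.
titles="Interview with the Vampire
Vampyres
Journey's End
The Endless Summer
Colour Me Kubrick
The Color Purple"

# match is a hypothetical helper: it prints the sample titles that fit the given pattern.
match() { printf '%s\n' "$titles" | grep -E "$1"; }

vamp=$(match 'Vamp[iy]re')   # either spelling: Vampire or Vampyre
ends=$(match 'End$')         # titles that finish with "End"
colour=$(match 'Colou?r')    # the "u" is optional: Color or Colour

printf '%s\n' "$vamp" "$ends" "$colour"
```

Note that “The Endless Summer” is not matched by End$, since the $ anchors the pattern to the end of the line.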

Numeric values, such as the year column, can be compared using =, <, >, <= (less than or equal to), >= (greater than or equal to), and <> (not equal):

select * from movie where year = 1960;
select * from movie where year < 1930;
select * from movie where year >= 1960 and year < 1970;

Sorting

As you play with various queries on the movie table, you’ve probably noticed that results are returned in no particular order; if you’d like to sort the results in some way, tack on an order by clause at the end of the select query:

select * from movie where year > 2000 order by year

This will display the records/rows for movies released from 2001 onward, sorted by year. You can add more fields for a very specific sort order:

select * from movie where year > 2000 order by year, title

The above query returns the same records, but this time sorted by year first, then by title within each year.

Sort order is ascending by default (e.g., A to Z, 0 to 9); for the reverse order, add desc to the field(s) that you’d like to see sorted in reverse:

select * from movie where year > 2000 order by year desc, title;

This will display records with years displayed most recent first; within each year, however, titles will still be sorted in ascending order.

Aggregate Queries

For large databases, SQL provides aggregate (a.k.a. “grouping”) queries that summarize multiple records in different ways.

The simplest form of summary is counting: how many records were retrieved? A simple overall count is done by using count(*) as the thing to select (or project, in formal relational algebraic terms):

select count(*) from movie where year > 2000

This will display the number of movies released from 2001 onward.

Of course, you can mix and match everything you have learned so far. To count the movies whose titles begin with a “B” that were released in 1975, you can query:

select count(*) from movie where title ~ '^B' and year = 1975

Aggregators other than count are available, such as min, max, and avg (average or mean). Though somewhat odd-sounding, you can ask, for example, for the “average year” of movies with “London” in the title:

select avg(year) from movie where title like '%London%'

Finally, you can aggregate multiple groups of data, so that you get different sets of statistics. The group by keyword does this:

select year, count(*) from movie where year < 1935 group by year

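The main rule with group by is that the column being grouped should also be part of the select clause. This makes sense because otherwise, you wouldn’t be able to tell which group was which! Conversely, any column in the select clause that is not wrapped in an aggregator such as count should also be listed after group by; PostgreSQL will generally report an error otherwise. And of course, you can mix and match. For example, the query above makes more sense if we arrange the data chronologically: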

select year, count(*) from movie where year < 1935 group by year order by year

See any trends in that data?

Joins

Thus far, we’ve only been working with one table in the database: movie. Our other tables, member and rating, add some further interest. These tables store ratings, from 1 to 5, made by individual members on movies they have seen.

Examination of the rating table reveals that it has three columns: movie, member, and rating. Thus, each record consists of a single rating, made by a particular member for a particular movie. Staying with one table for now, this query will list all of the ratings submitted by member no. 6:

select * from rating where member = 6

This returns the expected answer (based on the inserts that you copy-pasted previously), but you may have noticed that the result isn’t quite as meaningful to us, since we get movie IDs back instead of titles. These movie titles, however, are in the movie table, not rating. We thus need to join the two tables. As you might have seen when looking at the schema of the rating table, the movie field is a foreign key to the id field in the movie table. Every movie value in the rating table therefore matches some id in the movie table, which leads us to that movie’s title.

An SQL join uses the same basic select command, but requires the tables being joined (in this case movie and rating), as well as the fields/columns to use for “joining” records. “Joining” records means that their fields are combined to create a new “virtual” record. All other parts of the select command retain the same meanings as before:

select columns from table1 inner join table2 on (join condition) where conditions

Thus, the previous query, modified to retrieve the titles and ratings of the movies rated by member no. 6, looks like this:

select title, rating from movie inner join rating on (movie.id = rating.movie) where member = 6

As in the basic select command, you can tailor the where conditions as you need to pull ratings based on other criteria. For example, ratings for a particular movie, as opposed to a particular member, can be retrieved by changing the where clause. The query below displays all ratings for the movie(s) whose title has an apostrophe (see the last section below for more on this):

select rating from movie inner join rating on (movie.id = rating.movie) where title like '%''%'

Note that the member column of the rating table is a foreign key as well—it refers to the id column of the member table. The member table in our sample only includes member names, but in practice it can hold much more information. We can join more than once; extending the query above, we can now produce all ratings for the movie(s) whose title has an apostrophe, and list the member names who gave those ratings:

select name, rating from movie inner join rating on (movie.id = rating.movie) inner join member on (member.id = rating.member) where title like '%''%'

Note how this approach helps avoid redundancy: instead of copying a member’s name over and over again, we can store the member’s name in one place, and use their ID as a reference to that name. This elimination of redundancy is called normalization—tables are normalized if they eliminate data redundancy in their columns (except, of course, for foreign keys).

The Notorious Apostrophe

You might have noticed that, because the apostrophe or single quote is used to indicate specific values in SQL (e.g., 'The Godfather', 'Smith', '6/30/1980', etc.), we run into a potential problem when the value itself should contain an apostrophe. This is not as uncommon as one might think; for example, a good number of movie titles have apostrophes (By Dawn's Early Light, Zatoichi's Conspiracy, Logan's Run, and Dead Men Don't Wear Plaid, to name a few), as do many names (“O'Malley,” “M'Benga,” “D'Angelo”). An insert command such as the one below will result in an error, since the apostrophe will be misinterpreted as ending a piece of text rather than as part of the text itself:

insert into person(id, firstname, lastname, dob) values(2000, 'Beverly', 'D'Angelo', '8/21/1960');

Fortunately, SQL has a solution: apostrophes inside text should be indicated via two consecutive apostrophes, or ''. When encountered, SQL converts this pair of apostrophes into a single one, and does not interpret these apostrophes as ending a piece of text. Thus, the above command will work if rewritten in this way:

insert into person(id, firstname, lastname, dob) values(2000, 'Beverly', 'D''Angelo', '8/21/1960');

While the solution does exist, it isn’t automatic: you need to be aware that apostrophes have to be written as “double apostrophes” before passing any text values on to SQL. Keep this in mind when trying to load data from a text file into a database table.
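When the text values come from a file or another command, the doubling can be done with the same sed substitution used earlier on this page. Here is a minimal sketch, assuming a Bourne-style shell; escape_quotes is a made-up helper name, and the person table is the one from the example above.

```shell
# escape_quotes is a hypothetical helper name: it doubles each apostrophe in its input.
escape_quotes() { sed "s/'/''/g"; }

# The raw value contains one apostrophe; after escaping it contains two.
lastname=$(printf '%s\n' "D'Angelo" | escape_quotes)

# Build the insert statement around the escaped value.
statement="insert into person(id, firstname, lastname, dob) values(2000, 'Beverly', '$lastname', '8/21/1960');"
printf '%s\n' "$statement"
```

The escaping is applied to the value alone, before the surrounding quotes are added; running it over the finished statement would double the wrapper quotes as well.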
