PostgreSQL Tutorial

From LMU BioDB 2013
Revision as of 23:17, 24 September 2013 by Dondi (Talk | contribs)

Jump to: navigation, search

This page gives you a tutorial-style walkthrough for using PostgreSQL. The walkthrough assumes that you’ve been set up for PostgreSQL use within the Keck lab infrastructure.

But first, a little leadoff cartoon: http://xkcd.com/327

Contents

Running PostgreSQL on the Keck Lab Windows Machines

  1. Login to the computer as usual
  2. From the Start/Windows icon menu, launch pgAdmin III
  3. The pgAdmin III window starts with a hierarchical view on the left that starts with three layers:
    • Server Groups
      • Servers (1)
        • PostgreSQL 9.2 (localhost:5432)
  4. Double-click on PostgreSQL 9.2 (localhost:5432) to connect to the database server
  5. The password to start the server is simply keck

An Introduction to SQL

Of course, “true” database activities are triggered via SQL commands. The previous SQL PDF handout gives you more of a reference/nutshell view of SQL; this page walks you through some commands step-by-step, using the kerfuffle database that has been preloaded into the Keck lab for you. Then, you’ll practice making your own tables within that database.

Select Queries

The R in CRUD, “retrieve,” is, somewhat inconsistently, performed by the select command. The select command is the SQL “kitchen sink” for retrieving information from a database. It’s general, basic form is:

select <columns> from <tables> where <conditions>;

As you will see, select can do even more, but let’s start simple.

For this portion, make sure that you are connected to the kerfuffle database.

Basics

The simple (though large) kerfuffle database consists of two tables: movie and rating. The movie table is derived from the same data as the movie_titles.txt file in the ~xmlpipedb/data directory, and as such holds the same information: numeric movie IDs, release years, and movie titles.

Due to “impurities” in the data (i.e., some years are marked as “NULL”), the release year had to be loaded as text and not as numbers. This prevents us from using numeric conditions like > and < on release years, but, as you will see, there are ways around that. Ideally, data should be “cleaned up” before it’s loaded into a database — something that may or may not be possible, so we may as well get used to working with “imperfect” data.

The simplest type of select queries involve getting records from an individual table based on relatively simple conditions. For example:

select year from movie where title = 'Metropolis';

...will retrieve the release years of movies whose titles are exactly “Metropolis.” If you try this query, you should see two rows: one for 2001 and another for 1927.

In addition to equality, like lends further flexibility. The like comparison allows for pattern matching, similar but not identical to grep. In SQL, the percent sign (“%”) is a “wildcard” that can represent any number of letters and symbols. like and % can be combined for broader queries, such as this one, which retrieves all movie titles that have the word “Vampire” in them:

select title from movie where title like '%Vampire%';

psql’s data display works a lot like more: if there’s more data than can fit in the command window, psql will pause and recognize the following keys:

  • Enter moves through the data one line at a time
  • The space bar moves through the data one window at a time
  • q exits the data display

If you want select to display all columns of a database record, use the asterisk (*):

select * from movie where year like '196%';

This will display every column/field of movie records whose year starts with “196” — effectively, all movies released in the 1960s.

Conditions can be combined via and and or; for example, retrieving movies whose title contains either “DNA” or “Bio” can be done with:

select * from movie where title like '%DNA%' or title like '%Bio%';

When using more than two conditions, watch out for how conditions are grouped together; this query, for example, may yield unexpected results:

select * from movie where title like '%DNA%' or title like '%Bio%' and year like '200%';

At face value, the above query may read like “movies whose titles have either DNA or Bio which were released on or after 2000;” in reality, the year criterion is only and-ed with movies that have “Bio” in the title. Thus, the database actually interprets this query as “movies whose titles have DNA, or whose titles have Bio which were released on or after 2000.”

To eliminate any ambiguities, use parentheses to group conditions together:

select * from movie where (title like '%DNA%' or title like '%Bio%') and year like '200%';

These parentheses force the database to pick out movies with either title first, and then check if these movies have years beginning with “200.”

The like comparator can do simple text matches, but it does not use regular expressions (i.e., the search patterns recognized by grep and sed). This area is a little shaky in SQL-land; there is an official similar to comparator which is the official way to make regular expression comparisons, but the format for those expressions is not the same as the format used by grep and sed.

Fortunately, PostgreSQL has a specific, PostgreSQL-only comparator that does match the same patterns used by grep and sed: the tilde (~). Comparing with ~ is equivalent to a grep- or sed-like comparison:

select * from movie where title ~ 'Vamp[iy]re';
select * from movie where title ~ 'End$';
select * from movie where title ~ 'Colou?r';

The caveat here is that ~ is a PostgreSQL-specific feature: if you move to other database systems (such as Microsoft Access), that feature may either be done differently or missing completely, since it is not part of the official SQL standard.

Sorting

As you play with various queries on the movie table, you’ve probably noticed that results are returned in no particular order; if you’d like to sort the results in some way, tack on an order by clause at the end of the select query:

select * from movie where year like '200%' order by year;

This will display the records/rows for movies released from 2000 to 2009, sorted by year. You can add more fields for a very specific sort order:

select * from movie where year like '200%' order by year, title;

The above query returns the same records, but this time sorted by year first, then by title within each year.

Sort order is ascending by default (e.g., A to Z, 0 to 9); for the reverse order, add desc to the field(s) that you’d like to see sorted in reverse:

select * from movie where year like '200%' order by year desc, title;

This will display records with years displayed most recent first; within each year, however, titles will still be sorted in ascending order.

Joins

Thus far, we’ve only been working with one table in the kerfuffle database: movie. A second (huge) table adds some further interest: rating. This table stores ratings, from 1 to 5, made by individual members on movies they have seen. Quick recall: what command should you type in order to see the structure (schema) of the rating table?

Examination of the rating table reveals that it has movie, member, rating, and rating_date fields. Thus, each record consists of a single rating, made by a particular member for a particular movie, and the date of that rating’s submission. Staying with one table for now, this query will list all of the ratings submitted by member no. 6:

select * from rating where member = 6;

This returns a good chunk of data, but you may have noticed that the result isn’t quite as meaningful to us, since we get movie IDs back instead of titles. These movie titles, however, are in the movie table, not rating. We thus need to join the two tables. As you might have seen when looking at the schema of the rating table, the movie field is a foreign key to the id field in the movie table. Thus, every movie value in the rating table matches some id in the movie table, thus leading us to that movie’s title.

An SQL join uses the same basic select command, but requires the tables being joined (in this case movie and rating), as well as the fields/columns to use for “joining” records. “Joining” records means that their fields are combined to create a new “virtual” record. All other parts of the select command retain the same meanings as before:

select <columns> from <table1> inner join <table2> on (<join condition>) where <conditions>;

Thus, the previous query, modified to retrieve the titles and ratings of the movies rated by member no. 6, looks like this:

select title, rating from movie inner join rating on (movie.id = rating.movie) where member = 6;

As in the basic select command, you can tailor the where conditions as you need to pull ratings based on other criteria. For example, ratings for a particular movie, as opposed as for a particular member, can be retrieved by changing the where clause. The query below displays all ratings for the movie(s) whose title is Love Story:

select rating from movie inner join rating on(movie.id = rating.movie) where title = 'Love Story';

Aggregate Queries

There are lots of ratings in the kerfuffle database — so many that just displaying the individual ratings records may not be very useful. For large databases, SQL provides aggregate (a.k.a. “grouping”) queries that summarize multiple records in different ways.

The simplest form of summary is counting: how many records were retrieved? A simple overall count is done by using count(*) as the thing to select (or project, in formal relational algebraic terms):

select count(*) from movie where year like '200%';

This will display the number of movies whose release years start with “200.”

Aside from grouping an overall set of records, subgroups can also be processed. Consider this non-aggregating query:

select rating from movie inner join rating on (id = movie) where title = 'The Godfather';

This will give you a bunch of 1s, 2s, 3s, 4s, and 5s, as were submitted by members who have seen the film The Godfather. This is almost information overload — we don’t really care about the individual ratings, but how many of each rating was given. So if you think about it, what we really want is:

  1. Gather all of the records for a particular rating (1, 2, 3, 4, or 5)
  2. For each of these groups of records, perform a count

When this type of grouping is desired, the query changes in these ways:

  • The columns indicated in the select clause now list the value that is being grouped, plus the way those groups are aggregated (in this case, we’d like to count the number of records in each group)
  • A new clause, the group by clause, is added after the where clause to indicate the field whose values should be used as the basis for grouping.

Thus, finding the number of ratings received by The Godfather, for each of the five possible ratings, is done with this query:

select rating, count(rating) from movie inner join rating on (id = movie) where title = 'The Godfather' group by rating;

As before, you can experiment with queries such as these, but for different movies and/or ratings, before moving further down the page. If you feel like exploring this type of query further, rest assured that there’s a lot more functionality available; either use \h select or look up the select command’s group by clause on the Web.


Creating Tables

Now that you’ve logged some time with retrieving data, you’re probably aware that someone must have put the data there in the first place. First, the table(s) holding the data must be set up, or created — this is the C in CRUD. Half of table creation actually occurs without a computer: you (or the database designer) must determine the schema for the table independently of the database. This schema consists of the fields in the table, the kind of data each field will hold (numbers, text, dates, etc.), and any primary or foreign keys in the table. Ideally, you should have a complete relational schema diagram, such as the one shown in the handout or worked on previously in class.

With this table schema, creating the table in SQL is a matter of converting the diagram into a create table command:

create table <tablename> (<columns, their data types, and the primary key, separated by commas>);

If you’re still connected to the kerfuffle database, disconnect via \q, then run psql again, this time all by itself. This should connect you to your individual database.

Suppose you’ve designed a person table that looks like this:

id
firstname
lastname
dob

...where id should be an integer, firstname and lastname can be any text, and dob is some date. In SQL, an “integer” is indicated by int, text is indicated by varchar, and a date is indicated by (for once) date. Further, based on the shading, the primary key of this table is id. Putting this all together yields the following commands:

create table person (id int primary key, firstname varchar, lastname varchar, dob date);

Once done, you will now have an empty table called person, showing the columns stated above.

Adding Data

Now that a table exists to receive data, the data itself has to be “handed” to the table. The SQL insert command accomplishes this:

insert into <table>(<columns>) values(<specific column values, typically enclosed in apostrophes>);

For example, in order to add a person named John Smith to the new person table, with numeric ID 1000 and a birthdate of June 30, 1980, you would perform:

insert into person(id, firstname, lastname, dob) values(1000, 'John', 'Smith', '6/30/1980');

Note how this insert command adds one record at a time; adding records en masse in this manner is a matter of forming a bunch of insert commands and having psql perform them all, one at a time.

As much as you like, use the up arrow to retrieve, then edit the previous insert command so that it adds a new record.

Note that listing the columns of the table is not redundant; this listing allows you to change the order used for the succeeding values (e.g., lastname first, then firstname, id, and dob). Make sure to change the value of id to something that has not been added to the record yet — it has to be unique to the entire table since it has been designated as the primary key.

sed Redux

Text files such as movie_titles.txt, with records corresponding to individual lines in those files, frequently serve as an initial data source for a relational database table. Fortunately, psql has a useful property that allows us to use what we have learned previously in order to make database loading easier: it can participate in a pipe, just like grep, sed, and others. In other words, psql can be given lines of text from the outside via the vertical bar (|); if this is done, each line of text is expected to be a valid SQL command. psql then performs each of these commands as if you had typed them in yourself.

Thus, data loading from a text file can be done by:

  • Passing the text file through a series of sed commands so that each line of that file now looks like a valid SQL insert command.
  • Adding, at the very end of this sed command sequence, a pipe into psql:
<sequence of sed commands> | psql <database>

When psql starts, it will ask you for a password like it always does, then proceed to perform the SQL commands that are coming in exactly as if they had been typed in directly.

Dealing with Import Errors

As you might have already seen, psql’s response to a correct insert statement is:

INSERT 0 1

Any other message typically indicates an error. When piping a long sequence of insert commands through psql, errors may fly by, and if your terminal window does not scroll back far enough, you may miss it. If you do notice what may be error message during an import, remember that other tools are still available for piping — in particular, more. Just like at any other time, piping through more will pause the text, one screenful at a time:

<sequence of sed commands> | psql <database> | more

Of course, in a mixed error/no-error insert sequence, you’ll end up with some data in the target table, but not all of it. This can be easily verified through a combination of wc and SQL select. On the database side, you can count the number of records that got imported using:

select count(*) from <table name>;

You can then compare this number to the number of lines returned by wc:

cat <filename> | wc

You may be one or two off, in case there are blank lines in the file. Larger discrepancies between count and wc are probably errors.

Finally, after troubleshooting things, you may want to start over. Use this last SQL command with caution, because it results in loss of data (which, in this one case, is what you want to happen — but just this time):

delete from <table name>;

This command empties the table of all records, allowing you to start fresh.

The Notorious Apostrophe

You might have noticed that, because the apostrophe or single quote is used to indicate specific values in SQL (e.g., 'The Godfather', 'Smith', '6/30/1980', etc.), we run into a potential problem when the value itself should contain an apostrophe. This is not as uncommon as one might think; for example, a good number of movie titles have apostrophes (By Dawn's Early Light, Zatoichi's Conspiracy, Logan's Run, and Dead Men Don't Wear Plaid, to name a few), as do many names (“O'Malley,” “M'Benga,” “D'Angelo”). An insert command such as the one below will result in an error, since the apostrophe will be misinterpreted as ending a piece of text rather than as part of the text itself:

insert into person(id, firstname, lastname, dob) values(2000, 'Beverly', 'D'Angelo', '8/21/1960');

Fortunately, SQL has a solution: apostrophes inside text should be indicated via two consecutive apostrophes, or ''. When encountered, SQL converts this pair of apostrophes into a single one, and does not interpret these apostrophes as ending a piece of text. Thus, the above command will work if rewritten in this way:

insert into person(id, firstname, lastname, dob) values(2000, 'Beverly', 'D''Angelo', '8/21/1960');

While the solution does exist, it isn’t automatic: you need to be aware that apostrophes have to be written as “double apostrophes” before passing any text values on to SQL. Keep this in mind when trying to load data from a text file into a database table.

Personal tools
Namespaces

Variants
Actions
Navigation
Toolbox