Difference between revisions of "PostgreSQL Tutorial"
(→Insert!: Write up insertion instructions.) |
(Remove join/ratings references.) |
||
Line 158: | Line 158: | ||
==== Basics ==== | ==== Basics ==== | ||
− | |||
− | |||
− | |||
− | |||
The simplest type of ''select'' queries involve getting records from an individual table based on relatively simple conditions. For example: | The simplest type of ''select'' queries involve getting records from an individual table based on relatively simple conditions. For example: | ||
Line 227: | Line 223: | ||
This will display records with years displayed most recent first; within each year, however, titles will still be sorted in ascending order. | This will display records with years displayed most recent first; within each year, however, titles will still be sorted in ascending order. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
==== Aggregate Queries ==== | ==== Aggregate Queries ==== | ||
− | + | For large databases, SQL provides ''aggregate'' (a.k.a. “grouping”) queries that summarize multiple records in different ways. | |
The simplest form of summary is counting: ''how many'' records were retrieved? A simple overall count is done by using ''count(*)'' as the thing to select (or ''project'', in formal relational algebraic terms): | The simplest form of summary is counting: ''how many'' records were retrieved? A simple overall count is done by using ''count(*)'' as the thing to select (or ''project'', in formal relational algebraic terms): | ||
Line 259: | Line 233: | ||
This will display the number of movies whose release years start with “200.” | This will display the number of movies whose release years start with “200.” | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
=== Creating Tables === | === Creating Tables === | ||
Line 327: | Line 278: | ||
Note that listing the columns of the table is not redundant; this listing allows you to change the order used for the succeeding values (e.g., ''lastname'' first, then ''firstname'', ''id'', and ''dob''). Make sure to change the value of ''id'' to something that has not been added to the record yet — it has to be unique to the ''entire'' table since it has been designated as the primary key. | Note that listing the columns of the table is not redundant; this listing allows you to change the order used for the succeeding values (e.g., ''lastname'' first, then ''firstname'', ''id'', and ''dob''). Make sure to change the value of ''id'' to something that has not been added to the record yet — it has to be unique to the ''entire'' table since it has been designated as the primary key. | ||
− | ==== | + | === Deleting Data === |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | Use this last SQL command with caution, because it results in loss of data (which, in this one case, is what you may ''want'' to happen—but just this time): | |
<pre>delete from <table name>;</pre> | <pre>delete from <table name>;</pre> | ||
− | This command empties the table of all records, allowing you to start fresh. | + | This command empties the table of all records, allowing you to start fresh. To delete records selectively, you may include a <code>where</code> clause just like the one used in <code>select</code>. |
==== The Notorious Apostrophe ==== | ==== The Notorious Apostrophe ==== |
Revision as of 00:46, 25 September 2013
This page gives you a tutorial-style walkthrough for using PostgreSQL. The walkthrough assumes that you’ve been set up for PostgreSQL use within the Keck lab infrastructure.
But first, a little leadoff cartoon: http://xkcd.com/327
Contents |
Running PostgreSQL on the Keck Lab Windows Machines
- Login to the computer as usual
- From the Start/Windows icon menu, launch pgAdmin III
- The pgAdmin III window starts with a hierarchical view on the left that starts with three layers:
- Server Groups
- Servers (1)
- PostgreSQL 9.2 (localhost:5432)
- Servers (1)
- Server Groups
- Double-click on PostgreSQL 9.2 (localhost:5432) to connect to the database server
- The password to start the server is simply keck
Creating a Database
Once the server is running, the red x disappears from the PostgreSQL 9.2 (localhost:5432) icon, and additional icons appear beneath it. If you click on the + button to the left of the Databases icon, you will see the databases that are currently available. Initially, you will see a single database called postgres.
To do your work and practice some SQL, it is recommended that you work on a database of your own. To create a database, right-click on the Databases icon and choose New Database... from the menu that appears. In the New Database dialog, the only information you need to supply is your new database's name. To avoid confusion in case multiple students use the same computer, use your Keck lab username as the name of your database.
When you click OK, you will return to the main pgAdmin III window and you should see your new database underneath the Databases icon.
Note that you only need to go through this creation process once; that database will remain available until it is explicitly deleted.
Connecting to a Database
To start using a database, click on its icon. The red x disappears from the database icon and you should now be able to work.
Walkthrough: Loading the Sample Movie Table Into Your Database
Before we can dive into SQL, we need to set up some information that we can access. In doing this, you will see how data in a plain text file can find its way into a full-fledged relational table.
We will load up the movie_titles.txt file in ~dondi/xmlpipedb/data into your own database. If you cat that file, you will see that it looks like this:
17761,2003,Levity 17762,1997,Gattaca 17763,1978,Interiors 17764,1998,Shakespeare in Love 17765,1969,Godzilla's Revenge 17766,2002,Where the Wild Things Are and Other Maurice Sendak Stories 17767,2004,Fidel Castro: American Experience 17768,2000,Epoch 17769,2003,The Company 17770,2003,Alien Hunter
(this is from the end of the file)
Looking at the information, we recognize that this file consists of a movie ID, a year, and a title. Thus, we need to prepare a table with these columns in our database.
Create the Movie Table
Switching back to pgAdmin III, click the SQL button in the toolbar. A new window with an SQL Editor tab appears. The following command will create your movie table; type this into that tab:
create table movie (id int primary key, year int, title varchar)
As always, watch out for typos! When ready, click on the Execute query button in the toolbar. (its button looks like a green play button)
Upon executing the query, the following should appear in the Messages tab of the Output Pane in the bottom half of the window:
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "movie_pkey" for table "movie" Query returned successfully with no result in 101 ms.
To assure yourself that the movie table is indeed there, type and execute this query:
select * from movie
The Data Output tab of the Output Pane should now show an empty tabular display with headings for id integer, year integer, and title character varying.
Prepare Data for Insertion into the Movie Table
At this point, you have a movie table, but no data: that information currently resides in movie_titles.txt over in the Keck lab's my.cs.lmu.edu server. How do we get that data into our movie table?
Recall that the SQL command for adding data has this format:
insert into table (columns) values (values)
In other words, this text data:
17761,2003,Levity 17762,1997,Gattaca 17763,1978,Interiors 17764,1998,Shakespeare in Love 17765,1969,Godzilla's Revenge 17766,2002,Where the Wild Things Are and Other Maurice Sendak Stories 17767,2004,Fidel Castro: American Experience 17768,2000,Epoch 17769,2003,The Company 17770,2003,Alien Hunter
...must be made to look like this:
insert into movie(id, year, title) values (17761,2003,'Levity'); insert into movie(id, year, title) values (17762,1997,'Gattaca'); insert into movie(id, year, title) values (17763,1978,'Interiors'); insert into movie(id, year, title) values (17764,1998,'Shakespeare in Love'); insert into movie(id, year, title) values (17765,1969,'Godzillas Revenge'); insert into movie(id, year, title) values (17766,2002,'Where the Wild Things Are and Other Maurice Sendak Stories'); insert into movie(id, year, title) values (17767,2004,'Fidel Castro: American Experience'); insert into movie(id, year, title) values (17768,2000,'Epoch'); insert into movie(id, year, title) values (17769,2003,'The Company'); insert into movie(id, year, title) values (17770,2003,'Alien Hunter');
Note a few rules here that might not be very obvious:
- Text values need to be enclosed between single quotes; numbers don't.
- Note, in the case of movie 17765, Godzilla's Revenge, that the movie title itself contains a single quote (apostrophe). To distinguish an "in-text" apostrophe from a "wrapper" apostrophe, we double it up; that's why the SQL above shows Godzilla's Revenge with the apostrophe converted into two.
- When performing multiple SQL queries, the semicolon (;) is used to distinguish one query from another. Think of the semicolon as playing the same role that periods (.) do in "normal" sentences.
Hmmmmm...do we have a program that can do this? We need to take the lines in the movie_titles.txt file and convert them into valid SQL insert commands. Why yes we do, and you have already used it: sed.
Since this is a tutorial, we won't spend time to explain exactly how we come up with the sed command below. But you should be able to see the pieces:
- We need to turn all single apostrophes into doubles.
- We then need to "wrap" the titles at the end around single quotes.
- Finally, we append the insert command before each line, and
- end each line with a parenthesis and semicolon.
Finally, we somehow need to get the data into the PostgreSQL server running on your workstation. For this, we will use the built-in web server of the my.cs.lmu.edu host: we will deposit the results of sed into a file that a web browser can then display. From the web browser, we can copy and paste the insert statements into the pgAdmin III SQL window.
All that said (no pun intended), this is the command that you want (invoke it from ~dondi/xmlpipedb/data):
cat movie_titles.txt | sed "s/'/''/g" | sed "s/,/,'/2" | sed "s/^/insert into movie(id,year,title) values(/g" | sed "s/$/');/g" > ~/public_html/movie.sql.txt
At this point, you can probably work out the sed commands. The portion we will explain just a little bit more is the last section, > ~/public_html/movie.sql.txt
. We have not needed to use the > symbol before, but now it is just what we need: it "sends" the result of the prior sed commands into a file. That file is placed in your public_html folder, which, if you recall, is visible on the web as http://my.cs.lmu.edu/~your_username/.
Insert!
Finally, we can feed these 17,770 insert statements (quick, how did we know this?) into PostgreSQL. Open a new browser tab or window and go to http://my.cs.lmu.edu/~username/movie.sql.txt (remember to substitute username with your Keck lab ssh/PuTTY login. You should see your fresh sed product in the browser. From here, Select All and Copy the commands.
Switch to the pgAdmin III SQL window, empty out the SQL Editor tab, then Paste your insert statements into the tab. Finally, execute the query (the green play button, remember?) and let it work. With 17,770 records, this takes a little bit longer than prior commands that you have run.
In the end, the Messages tab in the Output Pane should say something like:
Query returned successfully: one row affected, 2325 ms execution time.
(exact execution times will vary)
Once more, check your work; re-execute this:
select * from movie
This time, you should see a fully-populated Data Output tab, with...17,770 rows.
Time to play!
An Introduction to SQL
Of course, “true” database activities are triggered via SQL commands. The previous SQL PDF handout gives you more of a reference/nutshell view of SQL; this page walks you through some commands step-by-step, using the kerfuffle database that has been preloaded into the Keck lab for you. Then, you’ll practice making your own tables within that database.
Select Queries
The R in CRUD, “retrieve,” is, somewhat inconsistently, performed by the select command. The select command is the SQL “kitchen sink” for retrieving information from a database. It’s general, basic form is:
select <columns> from <tables> where <conditions>;
As you will see, select can do even more, but let’s start simple.
For this portion, make sure that you are connected to the kerfuffle database.
Basics
The simplest type of select queries involve getting records from an individual table based on relatively simple conditions. For example:
select year from movie where title = 'Metropolis';
...will retrieve the release years of movies whose titles are exactly “Metropolis.” If you try this query, you should see two rows: one for 2001 and another for 1927.
In addition to equality, like lends further flexibility. The like comparison allows for pattern matching, similar but not identical to grep. In SQL, the percent sign (“%”) is a “wildcard” that can represent any number of letters and symbols. like and % can be combined for broader queries, such as this one, which retrieves all movie titles that have the word “Vampire” in them:
select title from movie where title like '%Vampire%';
psql’s data display works a lot like more: if there’s more data than can fit in the command window, psql will pause and recognize the following keys:
- Enter moves through the data one line at a time
- The space bar moves through the data one window at a time
- q exits the data display
If you want select to display all columns of a database record, use the asterisk (*):
select * from movie where year like '196%';
This will display every column/field of movie records whose year starts with “196” — effectively, all movies released in the 1960s.
Conditions can be combined via and and or; for example, retrieving movies whose title contains either “DNA” or “Bio” can be done with:
select * from movie where title like '%DNA%' or title like '%Bio%';
When using more than two conditions, watch out for how conditions are grouped together; this query, for example, may yield unexpected results:
select * from movie where title like '%DNA%' or title like '%Bio%' and year like '200%';
At face value, the above query may read like “movies whose titles have either DNA or Bio which were released on or after 2000;” in reality, the year criterion is only and-ed with movies that have “Bio” in the title. Thus, the database actually interprets this query as “movies whose titles have DNA, or whose titles have Bio which were released on or after 2000.”
To eliminate any ambiguities, use parentheses to group conditions together:
select * from movie where (title like '%DNA%' or title like '%Bio%') and year like '200%';
These parentheses force the database to pick out movies with either title first, and then check if these movies have years beginning with “200.”
The like comparator can do simple text matches, but it does not use regular expressions (i.e., the search patterns recognized by grep and sed). This area is a little shaky in SQL-land; there is an official similar to comparator which is the official way to make regular expression comparisons, but the format for those expressions is not the same as the format used by grep and sed.
Fortunately, PostgreSQL has a specific, PostgreSQL-only comparator that does match the same patterns used by grep and sed: the tilde (~). Comparing with ~ is equivalent to a grep- or sed-like comparison:
select * from movie where title ~ 'Vamp[iy]re'; select * from movie where title ~ 'End$'; select * from movie where title ~ 'Colou?r';
The caveat here is that ~ is a PostgreSQL-specific feature: if you move to other database systems (such as Microsoft Access), that feature may either be done differently or missing completely, since it is not part of the official SQL standard.
Sorting
As you play with various queries on the movie table, you’ve probably noticed that results are returned in no particular order; if you’d like to sort the results in some way, tack on an order by clause at the end of the select query:
select * from movie where year like '200%' order by year;
This will display the records/rows for movies released from 2000 to 2009, sorted by year. You can add more fields for a very specific sort order:
select * from movie where year like '200%' order by year, title;
The above query returns the same records, but this time sorted by year first, then by title within each year.
Sort order is ascending by default (e.g., A to Z, 0 to 9); for the reverse order, add desc to the field(s) that you’d like to see sorted in reverse:
select * from movie where year like '200%' order by year desc, title;
This will display records with years displayed most recent first; within each year, however, titles will still be sorted in ascending order.
Aggregate Queries
For large databases, SQL provides aggregate (a.k.a. “grouping”) queries that summarize multiple records in different ways.
The simplest form of summary is counting: how many records were retrieved? A simple overall count is done by using count(*) as the thing to select (or project, in formal relational algebraic terms):
select count(*) from movie where year like '200%';
This will display the number of movies whose release years start with “200.”
Creating Tables
Now that you’ve logged some time with retrieving data, you’re probably aware that someone must have put the data there in the first place. First, the table(s) holding the data must be set up, or created — this is the C in CRUD. Half of table creation actually occurs without a computer: you (or the database designer) must determine the schema for the table independently of the database. This schema consists of the fields in the table, the kind of data each field will hold (numbers, text, dates, etc.), and any primary or foreign keys in the table. Ideally, you should have a complete relational schema diagram, such as the one shown in the handout or worked on previously in class.
With this table schema, creating the table in SQL is a matter of converting the diagram into a create table command:
create table <tablename> (<columns, their data types, and the primary key, separated by commas>);
If you’re still connected to the kerfuffle database, disconnect via \q, then run psql again, this time all by itself. This should connect you to your individual database.
Suppose you’ve designed a person table that looks like this:
id firstname lastname dob
...where id should be an integer, firstname and lastname can be any text, and dob is some date. In SQL, an “integer” is indicated by int, text is indicated by varchar, and a date is indicated by (for once) date. Further, based on the shading, the primary key of this table is id. Putting this all together yields the following commands:
create table person (id int primary key, firstname varchar, lastname varchar, dob date);
Once done, you will now have an empty table called person, showing the columns stated above.
Adding Data
Now that a table exists to receive data, the data itself has to be “handed” to the table. The SQL insert command accomplishes this:
insert into <table>(<columns>) values(<specific column values, typically enclosed in apostrophes>);
For example, in order to add a person named John Smith to the new person table, with numeric ID 1000 and a birthdate of June 30, 1980, you would perform:
insert into person(id, firstname, lastname, dob) values(1000, 'John', 'Smith', '6/30/1980');
Note how this insert command adds one record at a time; adding records en masse in this manner is a matter of forming a bunch of insert commands and having psql perform them all, one at a time.
As much as you like, use the up arrow to retrieve, then edit the previous insert command so that it adds a new record.
Note that listing the columns of the table is not redundant; this listing allows you to change the order used for the succeeding values (e.g., lastname first, then firstname, id, and dob). Make sure to change the value of id to something that has not been added to the record yet — it has to be unique to the entire table since it has been designated as the primary key.
Deleting Data
Use this last SQL command with caution, because it results in loss of data (which, in this one case, is what you may want to happen—but just this time):
delete from <table name>;
This command empties the table of all records, allowing you to start fresh. To delete records selectively, you may include a where
clause just like the one used in select
.
The Notorious Apostrophe
You might have noticed that, because the apostrophe or single quote is used to indicate specific values in SQL (e.g., 'The Godfather', 'Smith', '6/30/1980', etc.), we run into a potential problem when the value itself should contain an apostrophe. This is not as uncommon as one might think; for example, a good number of movie titles have apostrophes (By Dawn's Early Light, Zatoichi's Conspiracy, Logan's Run, and Dead Men Don't Wear Plaid, to name a few), as do many names (“O'Malley,” “M'Benga,” “D'Angelo”). An insert command such as the one below will result in an error, since the apostrophe will be misinterpreted as ending a piece of text rather than as part of the text itself:
insert into person(id, firstname, lastname, dob) values(2000, 'Beverly', 'D'Angelo', '8/21/1960');
Fortunately, SQL has a solution: apostrophes inside text should be indicated via two consecutive apostrophes, or ''. When encountered, SQL converts this pair of apostrophes into a single one, and does not interpret these apostrophes as ending a piece of text. Thus, the above command will work if rewritten in this way:
insert into person(id, firstname, lastname, dob) values(2000, 'Beverly', 'D''Angelo', '8/21/1960');
While the solution does exist, it isn’t automatic: you need to be aware that apostrophes have to be written as “double apostrophes” before passing any text values on to SQL. Keep this in mind when trying to load data from a text file into a database table.