Difference between revisions of "Nanguiano Week 6"

From LMU BioDB 2015
Jump to: navigation, search
(The FDA Drug Database: added more steps. beginning processing of the files.)
(The FDA Drug Database: added text analysis)
Line 14: Line 14:
 
  unzip UCM054599.zip
 
  unzip UCM054599.zip
 
* Next, I needed to move the application.txt and Product.txt files into my public_html folder. However, I first wanted to create a folder for the files. I entered my public_html folder and created a folder for this class, biodb.  
 
* Next, I needed to move the application.txt and Product.txt files into my public_html folder. However, I first wanted to create a folder for the files. I entered my public_html folder and created a folder for this class, biodb.  
  cd ../../public_html
+
  cd ~/public_html
 
  mkdir biodb
 
  mkdir biodb
 
*I noticed after making the folder that the "movie.sql.txt" file from a few classes ago was still there. I moved that file into the biodb folder, then returned to my week6 folder to begin the transfer of the files I needed to move.
 
*I noticed after making the folder that the "movie.sql.txt" file from a few classes ago was still there. I moved that file into the biodb folder, then returned to my week6 folder to begin the transfer of the files I needed to move.
 
  mv movie.sql.txt biodb
 
  mv movie.sql.txt biodb
  cd ../biodb/week6
+
  cd ~/biodb/week6
* Before moving the files, I needed to process them to prepare them for entry into PostgreSQL.
+
* I moved the files into the prepared folder in the public_html directory.
<!--
+
mv application.txt ~/public_html/biodb
 
+
mv Product.txt ~/public_html/biodb
* Upon successful unzipping, you should see the following files:
+
* Next, I needed to process the files to prepare them for entry into PostgreSQL. The files needed to be prepared with the insert commands so that they could be added into my database. Before doing this, i needed to ensure I knew the structure of the tables that I was going to enter the files into. To begin this process, I downloaded the two files and opened them in Excel so I could see what the header of each column was, and what type of data it contained.
 
+
** Opening the "application.txt" file showed me that it contained 9 columns: ApplNo (int), ApplType (char), SponsorApplicant (varchar), MostRecentLabelAvailableFlag (bool), CurrentPatentFlag (bool), ActionType (varchar), ChemicalType (int), Ther_Potential (char), and Orphan_Code (char).  
    $ ls
+
** Opening the "Product.txt" file showed me that it also contained 9 columns: ApplNo (int), ProductNo (int), Form (varchar), Dosage (varchar), ProductMktStatus (int), TECode (varchar), ReferenceDrug (int), drugname (varchar), and activeingred (varchar).
    AppDoc.txt            application.txt    DocType_lookup.txt  Product.txt        ReviewClass_Lookup.txt
+
    AppDocType_Lookup.txt  ChemTypeLookup.txt  Product_tecode.txt  RegActionDate.txt  UCM054599.zip
+
 
+
: As stated in the tasks above, you will only need to work with the ''application.txt'' and ''Product.txt'' files.
+
 
+
Part of the assignment, of course, is to determine how to rework these files so that their data can be loaded into PostgreSQL via the pgAdmin III desktop application on the Seaver 120 workstations. Just as with the ''movie_titles.txt'' file, you do this by “redirecting” the processed data to your ''~/public_html'' folder:
+
 
+
    cat application.txt | ''your command sequence here'' > ~/public_html/application.sql.txt
+
    cat Product.txt | ''your command sequence here'' > ~/public_html/Product.sql.txt
+
 
+
You can then download these files to the Seaver 120 desktop environment by visiting these sites with a web browser:
+
 
+
* <nowiki>http://my.cs.lmu.edu/~</nowiki>''username''/application.sql.txt
+
* <nowiki>http://my.cs.lmu.edu/~</nowiki>''username''/Product.sql.txt
+
 
+
Again, note that these latter steps exactly mirror the tutorial for the ''movie_titles.txt'' file, so please refer to [[PostgreSQL_Tutorial|that wiki page]] for additional details.-->
+
 
+
=== Supplementary Information ===
+
 
+
<!--Just as with [[Week 4]], real-world data have their share of variations and exceptions. For these files:
+
* To visualize the raw data more easily, you can load the files into Microsoft Excel. ''But do '''not''' use Excel to process them into SQL statements.''
+
* A <code>bit</code> column can either be a <code>boolean</code> or <code>int</code> data type in PostgreSQL—examine the data to see what values are in there in order to make the right choice.
+
* If you see “nulls” in the website schema, that simply means that the value can be empty.
+
* For our purposes, you don’t need to define a primary key for the ''Product'' table. It does, however, have a foreign key.
+
* The first lines of all of these files contain the column names. You don’t need them because the SQL <code>insert</code> statement, as shown in class and in the [[PostgreSQL Tutorial]], specifies these already.
+
* Instead of commas as in the ''movie_titles.txt'' file, the columns in these files are separated by ''tabs''. You can indicate a ''tab'' in <code>sed</code> using the characters <code>\t</code>.
+
* Due to the way these files are formatted, the “end of the line” should be designated using the pattern <code>\r$</code> and not the dollar sign by itself.
+
* PostgreSQL can handle numbers and booleans with or without single quotes, so both <code>5</code> and <code>'5'</code> are valid. In addition, boolean (true/false) values are case-insensitive, so <code>'True'</code> and <code>'False'</code> will be acceptable to the system.-->
+
  
 
=== Questions to Answer ===
 
=== Questions to Answer ===

Revision as of 22:35, 13 October 2015

The FDA Drug Database

Direct Download/Unzipping Commands

  • First, to begin the assignment, I ssh'd into my server.
ssh nanguiano@lion.lmu.edu
  • Next, I entered my folder for this class and created a directory for this week's assignment, then entered the directory.
cd biodb
mkdir week6
cd week6
  • In the directory, I downloaded the file from the FDA.
curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip
  • After the download completed, I unzipped the file.
unzip UCM054599.zip
  • Next, I needed to move the application.txt and Product.txt files into my public_html folder. However, I first wanted to create a folder for the files. I entered my public_html folder and created a folder for this class, biodb.
cd ~/public_html
mkdir biodb
  • I noticed after making the folder that the "movie.sql.txt" file from a few classes ago was still there. I moved that file into the biodb folder, then returned to my week6 folder to begin the transfer of the files I needed to move.
mv movie.sql.txt biodb
cd ~/biodb/week6
  • I moved the files into the prepared folder in the public_html directory.
mv application.txt ~/public_html/biodb
mv Product.txt ~/public_html/biodb
  • Next, I needed to process the files to prepare them for entry into PostgreSQL. The files needed to be prepared with the insert commands so that they could be added into my database. Before doing this, i needed to ensure I knew the structure of the tables that I was going to enter the files into. To begin this process, I downloaded the two files and opened them in Excel so I could see what the header of each column was, and what type of data it contained.
    • Opening the "application.txt" file showed me that it contained 9 columns: ApplNo (int), ApplType (char), SponsorApplicant (varchar), MostRecentLabelAvailableFlag (bool), CurrentPatentFlag (bool), ActionType (varchar), ChemicalType (int), Ther_Potential (char), and Orphan_Code (char).
    • Opening the "Product.txt" file showed me that it also contained 9 columns: ApplNo (int), ProductNo (int), Form (varchar), Dosage (varchar), ProductMktStatus (int), TECode (varchar), ReferenceDrug (int), drugname (varchar), and activeingred (varchar).

Questions to Answer

  1. Provide the DDL (create table) statements that you used for your application and product tables.
  2. Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.
  3. Using the command line, how can you determine the number of records in each file? Provide the command.
  4. Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.
  5. In your database, are these numbers the same or different? Explain why you think so.
    For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.
  6. What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?
  7. What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?
  8. In what forms and dosages can the drug product named BENADRYL be administered?
  9. Which drug products have a name ending in ESTROL?
  10. Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)
  11. Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)
  12. What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?
  13. Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE?

Links

Nicole Anguiano
BIOL 367, Fall 2015

Assignment Links
Individual Journals
Shared Journals