Imacarae Week 10

From LMU BioDB 2019
Jump to navigation Jump to search

Imacarae's User Page

Assignment Shared Entries Individual Entries
Week 1 Class Journal Week 1 ----
Week 2 Class Journal Week 2 Imacarae Week 2
Week 3 Class Journal Week 3 HSF1/YGL073W Week 3
Week 4 Class Journal Week 4 Imacarae Week 4
Week 5 Class Journal Week 5 CancerSEA Week 5
Week 6 Class Journal Week 6 Imacarae Week 6
Week 7 Class Journal Week 7 Imacarae Week 7
Week 8 Class Journal Week 8 Imacarae Week 8
Week 9 Class Journal Week 9 Imacarae Week 9
Week 10 Class Journal Week 10 Imacarae Week 10
Week 11 Sulfiknights Imacarae Week 11
Week 12/13 Sulfiknights Sulfiknights DA Week 12/13
---- Sulfiknights Sulfiknights DA Week 14

Purpose

  • To gain experience with database queries and modeling, in preparation for the final project.
  • To contribute to creating the team's home page and getting organized for the final project.

Methods and Results

See Week 7, Week 8, and Week 9 for previous methods

Week 10

Creating the GRNmap Input Workbook

  • I generated the input Excel workbook that I will run in the GRNmap modeling software.
  • The sample workbook was downloaded from here as an example of what our workbook should look like.

Creating the "production_rates" sheet

  • This sheet contains initial guesses for the production rate parameters, P, for all genes in the network.
  • Assuming that the system is in steady state with the relative expression of all genes equal to 1, (P/2) - lambda = 0, where lambda is the degradation rate, is a reasonable initial guess.
  • The sheet contained two columns (from left to right) entitled, "id", "production_rate".
    • The id is an identifier that the user will use to identify a particular gene. In thiscase, we are using the "StandardName", for example, GLN3.
    • The "production_rate" column should then contain the initial guesses for the P parameter as described above, rounded to four decimal places.
      • The production rates are provided in a Microsoft Access database, which you can download from here.
      • I performed a query to get the list of production rates for each gene as a group.
      • To perform the query, I followed these steps:
        1. Imported the list of genes to a new table in the database. Clicked on the "External Data" tab and selected the Excel icon with the "up" arrow on it.
        2. Clicked the "Browse" button and selected my Excel file containing my network that I used to upload to GRNsight.
        3. I made sure the button next to "Import the source data into a new table in the current database" and clicked "OK".
        4. In the next window, selected the "network" worksheet, if it hasn't already been automatically selected for me. Clicked "Next".
        5. In the next window, I made sure the "First Row Contains Column Headings" is checked. Clicked "Next".
        6. In the next window, the left-most column will be highlighted. I changed the "Field Name" to "id" if it didn't say that already. Clicked "Next".
        7. In the next window, selected the button for "Choose my own primary key." and chose the "id" field from the drop down next to it. Clicked "Next".
        8. In the next field, make sure it says "Import to Table: network". Clicked Finish.
        9. In the next window I did not need to save the import steps, so I just clicked "Close".
        10. A table called "network" appeared in the list of tables at the left of the window.
        11. Go to the "Create" tab. I clicked on the icon for "Query Design".
        12. In the window that appears, I clicked on the "network" table and clicked "Add". I clicked on the "production_rates" table and clicked "Add". Clicked "Close".
        13. The two tables appeared in the main part of the window. To tell Access which fields in the two tables correspond to each other, clicked on the word "id" in the network table and dragged the mouse to the "standard_name" field in the "production_rates" table, and release. I saw a line appear between those two words.
        14. Right-clicked on the line between those words and selected "Join Properties" from the menu that appeared. Selected Option "2: Include ALL records from 'network' and only those records from 'production_rates' where the joined fields are equal." Clicked "OK".
        15. Clicked on the "id" word in the "network" table and dragged it to the bottom of the screen to the first column next to the word "Field" and release.
        16. Clicked on the "production_rate" field in the "production_rates" table and dragged it to the bottom of the screen to the second column next to the word "Field" and release.
        17. Right-clicked anywhere in the gray area near the two tables. In the menu that appears, I selected "Query Type > Make Table Query...".
        18. In the window that appears, I named the table "production_rates_1" because I can't have two tables with the same name in the database. I made sure that "Current Database" is selected and clicked "OK".
        19. Go to the "Query Tools: Menus" tab. Click on the exclamation point icon. A window will appear that tells you how many rows you are pasting into a new table. Click "Yes".
        20. The new "production_rates_1" table will appear in the list at the left. Double-clicked on that table name to open it.
        21. I copied the data in this table and pasted it back into your Excel workbook. I made sure that when I pasted that, I used "Paste Special > Paste values" so that the Access formatting doesn't get carried along. I also chose to export this table to Excel going to the "External Data" tab and selecting the Excel icon with the arrow pointing to the right. Select the workbook you want to export the table to, making sure that "Preserve Access formatting" is not checked. Click "OK", click "Close".
  • If there are missing values, substitute the value 0.1980 for the missing production rates.
  • The genes should be listed in the same order in all the sheets in the Excel workbook.

Creating the "degradation_rates" sheet

  • This sheet contained degradation rates for all genes in the network, which are provided by the user.
  • Currently, the Dahlquist Lab is using data based on published mRNA half-life data from Neymotin et al. (2006).
    • They converted the half-life data values to the degradation rates by taking the natural log of the half-life and dividing by 2.
  • The sheet contained two columns (from left to right) entitled "id" and "degradation_rate".
    • The id is an identifier that the user will use to identify a particular gene.
    • The "degradation_rate" column contained the absolute value of the degradation rate for the corresponding gene as described above, rounded to four decimal places.
      • To obtain these values, I used the same file, Microsoft Access database that I used to obtain the production rates in the first worksheet. Again, I followed the instructions to execute a query, substituting the appropriate "degradation_rates" table in the query. I didn't need to re-import my "network" table, I just created and executed the query.
  • Again, the genes were listed in the same order in all the sheets in the Excel workbook.
  • If there are missing values, I substituted the value 0.0990 for the missing degradation rates.

Expression Data Sheets for Individual Yeast Strains

  • Expression data was provided for either a single strain or multiple strains of yeast (for example, the wild type strain and a transcription factor deletion strain).
    • Each strain had its own sheet in the workbook.
    • Each sheet was given a unique name that follows the convention "STRAIN_log2_expression", where the word "STRAIN" was replaced by the strain designation, which appeared in the optimization_diagnostics sheet.
      • Everyone in the class had at least one expression worksheet called "wt_log2_expression".
      • I included the transcription factors GLN3, HAP4, and CIN5 in your network. I used the expression data from the dGLN3, dHAP4, dCIN5 deletion strains in my workbooks as well, naming the worksheets "dgln3_log2_expression", "dhap4_log2_expression", and "dcin5_log2_expression".
  • The sheet had the following columns in this order:
    1. "id": list of all genes. The genes should be listed in the same order in all the sheets in the Excel workbook.
    2. The next series of columns contained the expression data for each gene at a given timepoint given as log2 ratios (log2 fold changes). The column header was the time at which the data were collected, without any units. For example, the 15 minute timepoint would have a column header "15" and the 30 minute timepoint would have the column header "30". GRNmap supports replicate data for each of the timepoints. Replicate data for the same timepoint were in columns immediately next to each other and had the same column headers. For example, three replicates of the 15 minute timepoint would have "15", "15", "15" as the column headers.
    3. If data are provided for multiple strains, each strain had data for the same timepoints, although the number of replicates can vary.
  • I included the data for the 15, 30, and 60 minute timepoints, but not the 90 or 120 minute timepoints.
  • The data I used is contained in the Expression-and-Degradation-rate-database_2019.accdb file that I used to obtain the production and degradation rates.
  • I executed a query in Microsoft Access. I followed the steps listed for the "production_rates" sheet for each strains expression data. After I imported the data into Excel, I changed the column headers to "15", "15", etc., as described above.
  • Missing values in the expression data sheets are OK.

Creating the "network" sheet

  • The network you derived from the YEASTRACT database for the Week 9 assignment was copied and pasted into this sheet directly. The description below just explains what is already in this worksheet.
    • This sheet contained an adjacency matrix representation of the gene regulatory network.
    • The columns corresponded to the transcription factors and the rows corresponded to the target genes controlled by those transcription factors.
    • A “1” means there is an edge connecting them and a “0” means that there is no edge connecting them.
    • The upper-left cell (A1) contained the text “cols regulators/rows targets”. This text is there as a reminder of the direction of the regulatory relationships specified by the adjacency matrix.
    • The rest of row 1 should contain the names of the transcription factors that are controlling the other genes in the network, one transcription factor name per column.
    • The rest of column A should contain the names of the target genes that are being controlled by the transcription factors heading each of the columns in the matrix, one target gene name per row.
    • The transcription factor names corresponded to the "id" in the other sheets in the workbook. They should be capitalized the same way and occur in the same order along the top and side of the matrix. The matrix needs to be symmetric, i.e., the same transcription factors should appear along the top and left side of the matrix. The genes should be listed in the same order in all the sheets in the Excel workbook.
    • Each cell in the matrix should then contain a zero (0) if there is no regulatory relationship between those two transcription factors, or a one (1) if there is a regulatory relationship between them. Again, the columns correspond to the transcription factors and the rows correspond to the target genes controlled by those transcription factors.

Creating the "network_weights" sheet

  • These were the initial guesses for the estimation of the weight parameters, w.
  • Since these weights were initial guesses which will be optimized by GRNmap, the content of this sheet was identical to the "network" sheet.

Creating the "optimization_parameters" sheet

  • The optimization_parameters sheet has two columns (from left to right) entitled, "optimization_parameter" and "value".
  • I copied this worksheet from the sample workbook provided. The only row that I modified is row 15, "Strain". I included just the dgln3, dhap4, and dcin5 strain designations.
  • What follows below is an explanation of what the optimization_parameters mean.
    • alpha: Penalty term weighting (from the L-curve analysis)
    • kk_max: Number of times to re-run the optimization loop. In some cases re-starting the optimization loop can improve performance of the estimation.
    • MaxIter: Number of times MATLAB iterates through the optimization scheme. If this is set too low, MATLAB will stop before the parameters are optimized.
    • TolFun: How different two least squares evaluations should be before the program determines that it is not making any improvement
    • MaxFunEval: maximum number of times the program will evaluate the least squares cost
    • TolX: How close successive least squares cost evaluations should be before the program determines that it is not making any improvement.
    • production_function: = Sigmoid (case-insensitive) if sigmoidal model, =MM (case-insensitive) if Michaelis-Menten model
    • L_curve: =0 if an L-curve analysis should NOT be run or =1 if an L-curve analysis SHOULD be run. The L-curve analysis will automatically run sequential rounds of estimation for an array of fixed alpha values (0.8, 0.5, 0.2, 0.1,0.08, 0.05,0.02,0.01, 0.008, 0.005, 0.002, 0.001, 0.0008, 0.0005, 0.0002, and 0.0001). GRNmap makes a copy of the user's selected input workbook and changes alpha to the first alpha in the list. The estimation runs and the resulting parameter values are used as the initial guesses for the next round of estimation with the next alpha value. This process repeats until all alpha values have been run. New input and output workbooks are generated for each alpha value, although currently, the graphs are only saved for the last run.
    • estimate_params =1 if want to estimate parameters and =0 if the user wants to do just one forward run
    • make_graphs =1 to output graphs; =0 to not output graphs
    • fix_P =1 if the user does not want to estimate the production rate, P, parameter, just use the initial guess and never change; =0 to estimate
    • fix_b =1 if the user does not want to estimate the b parameter, just use the initial guess and never change; =0 to estimate
    • expression_timepoints: A row containing a list of the time points when the data was collected experimentally. Should correspond to the timepoint column headers in the STRAIN_log2_expression sheets.
    • Strain: A row containing a list of all of the strains for which there is expression data in the workbook. Should correspond to the "STRAIN" portion of the names of the STRAIN_log2_expression sheets for each strain. Note that GRNmap will run the model for the wild type network (all genes present in the network) and for networks where the gene deleted from the designated STRAIN has been deleted from the network.
    • simulation_timepoints: A row containing a list of the time points at which to evaluate the differential equations to generate the simulated data. This does not need to correspond to the actual measurement times, but should be in the same units (e.g. minutes).

Creating the "threshold_b" sheet

  • These are the initial guesses for the estimation of the threshold_b parameters.
  • There should be two columns.
    • The left-most column should contain the header "id" and list the standard names for the genes in the model in the same order as in the other sheets.
    • The second column should have the header "threshold_b" and should contain the initial guesses, we are going to use all 0.

Dynamical Systems Modeling of your Gene Regulatory Network

The worksheet is ready to run the model and analyze the results. The software I used is called GRNmap, which stands for Gene Regulatory Network Modeling and Parameter Estimation. It is written in MATLAB and can be run from code or run as a stand-alone executable if I didn't have MATLAB installed. However, it can only be run in Windows, not on Macs.

  • To run GRNmap from code, I ran MATLAB R2014b.
    1. I downloaded the GRNmap v1.10 code from the GRNmap Downloads page.
    2. I unzipped the file. (Right-click, 7-zip > Extract here)
    3. I launched MATLAB R2014b.
    4. I opened GRNmodel.m, which will be in the directory that I unzipped GRNmap-1.10 > matlab
    5. I clicked the Run button (green "play" arrow).
    6. I selected the input workbook.
    7. I saw an optimization diagnostics graphic that shows the progress of the estimation.
    8. Expression plots will display when the run was over.
    9. Output .xlsx and .mat files was saved in the same folder as my input folder, along with .jpg files containing the optimization diagnostic and individual expression plots. I saved these files.
    10. Note that if I needed to run GRNmap again, I did not use the same directory for the input file. Currently, GRNmap will overwrite previous output.
  • I uploaded my output .xlsx file into GRNsight to visualize the results!

Data and Files

GRNmap Excel Worksheet

Zipped Microarray Analysis

Conclusion

For this week, there was a lot of data organization. Not only did we learn how inputing can be user friendly, but we also learned how it can be user hostile. All details must exactly fit what's in the directions or else the MATLAB will not run. The microarray data for Profile 45 of dCIN5 was analyzed through MATLAB. We can now organize the transcription factors in relation to each other and then determine how they are involved in the cold response.

Acknowledgments

  • To Dr. Dahlquist for providing us with background information needed for organizing the data with Excel and Access.
    -Procedural steps were copied from the Week 10 assignment page and modified to fit the specific experiment.
  • To my group members, DeLisa, Mihir, and Emma. We met during class time to walk through the procedure and to delegate what profile we were going to do individually.
  • Except for what is noted above, this individual journal entry was completed by me and not copied from another source.

Imacarae (talk) 19:06, 6 November 2019 (PST)

References