Cdomin12 Week 10
Contents
- 1 Purpose
- 2 Methods & Results
- 2.1 Creating the GRNmap Input Workbook
- 2.2 production_rates sheet
- 2.3 degradation_rates sheet
- 2.4 Expression Data Sheets for Individual Yeast Strains
- 2.5 network sheet
- 2.6 network_weights sheet
- 2.7 optimization_parameters sheet
- 2.8 threshold_b sheet
- 2.9 Dynamical Systems Modeling of your Gene Regulatory Network
- 3 Data and Files
- 4 Conclusion
- 5 Acknowledgments
- 6 References
Purpose
To assemble an Excel workbook of input data sheets, formatted so that GRNmap (run in MATLAB) can load it and model the gene regulatory network.
Methods & Results
Creating the GRNmap Input Workbook
Downloaded the sample workbook linked from the Week 10 assignment and used it as the basis for a workbook specific to my network and microarray data.
production_rates sheet
- This sheet contained initial guesses for the production rate parameters, P, for all genes in the network.
- Assuming that the system is in steady state with the relative expression of all genes equal to 1, (P/2) - lambda = 0, where lambda is the degradation rate. A reasonable initial guess is therefore P = 2 * lambda.
- The sheet contained two columns (from left to right) entitled "id" and "production_rate".
- The id is an identifier that the user will use to identify a particular gene. In our case, we used the "StandardName", for example, GLN3.
- The "production_rate" column contained the initial guesses for the P parameter as described above, rounded to four decimal places.
- The production rates were provided in a Microsoft Access database (Expression-and-Degradation-rate-database_2019.accdb), which was downloaded for this assignment.
- Performed a query to get the list of production rates for each gene as a group.
- Imported the list of genes into a new table in the database: clicked on the "External Data" tab and selected the Excel icon with the "up" arrow on it.
- Clicked the "Browse" button and selected the Excel file containing the network that was uploaded to GRNsight.
- Made sure the button next to "Import the source data into a new table in the current database" was selected and clicked "OK".
- In the next window, selected the "network" worksheet if it had not already been selected automatically. Clicked "Next".
- In the next window, made sure "First Row Contains Column Headings" was checked. Clicked "Next".
- In the next window, the left-most column was highlighted. Changed the "Field Name" to "id" if it did not say that already. Clicked "Next".
- In the next window, selected the button for "Choose my own primary key" and chose the "id" field from the drop-down next to it. Clicked "Next".
- In the next window, made sure it said "Import to Table: network". Clicked "Finish".
- In the next window, did not need to save the import steps, so just clicked "Close".
- A table called "network" appeared in the list of tables at the left of the window.
- Went to the "Create" tab. Clicked on the icon for "Query Design".
- In the window that appeared, clicked on the "network" table and clicked "Add". Clicked on the "production_rates" table and clicked "Add". Clicked "Close".
- The two tables appeared in the main part of the window. Clicked on the word "id" in the "network" table, dragged the mouse to the "standard_name" field in the "production_rates" table, and released. A line appeared between those two words.
- Right-clicked on the line between those words and selected "Join Properties" from the menu that appeared. Selected option 2: "Include ALL records from 'network' and only those records from 'production_rates' where the joined fields are equal." Clicked "OK".
- Clicked on the "id" word in the "network" table and dragged it to the bottom of the screen to the first column next to the word "Field" and released.
- Clicked on the "production_rate" field in the "production_rates" table and dragged it to the bottom of the screen to the second column next to the word "Field" and released.
- Right-clicked anywhere in the gray area near the two tables. In the menu that appeared, selected "Query Type > Make Table Query...".
- In the window that appeared, named the table "production_rates_1" because two tables in the database cannot have the same name. Made sure that "Current Database" was selected and clicked "OK".
- Went to the "Query Tools: Menus" tab. Clicked on the exclamation point icon. A window appeared stating how many rows would be pasted into the new table. Clicked "Yes".
- A new "production_rates_1" table appeared in the list at the left. Double-clicked on that table name to open it.
- Copied the data in this table and pasted it back into the Excel workbook, using "Paste Special > Paste values" so that the Access formatting was not carried along.
- Substituted the value 0.1980 for the missing production rates.
- Note that the genes were listed in the same order in all the sheets in the Excel workbook.
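The join-and-fill logic of the Access query above can be sketched in a few lines of Python. This is only an illustration, not the Access query itself: the function name `left_join_rates` and the two non-default rates are invented for the example; only the gene names and the 0.1980 default come from this entry.

```python
# Value substituted for missing production rates in this entry.
DEFAULT_PRODUCTION_RATE = 0.1980


def left_join_rates(network_ids, rates_by_name, default):
    """Return (id, rate) rows for every network id, in network order.

    Mirrors join option 2: keep ALL ids from "network"; an id with no
    match in the rates table receives the default instead of a blank.
    """
    rows = []
    for gene_id in network_ids:
        rate = rates_by_name.get(gene_id)
        if rate is None:
            rate = default
        rows.append((gene_id, round(rate, 4)))
    return rows


network_ids = ["CIN5", "GLN3", "HAP4"]
# Hypothetical rates, for illustration only (CIN5 is deliberately missing).
production_rates = {"GLN3": 0.2222, "HAP4": 0.3333}

rows = left_join_rates(network_ids, production_rates, DEFAULT_PRODUCTION_RATE)
for gene_id, rate in rows:
    print(f"{gene_id}\t{rate:.4f}")
```

Because the loop walks `network_ids`, the output keeps the gene order of the network sheet, which is the order every other sheet in the workbook has to follow.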
degradation_rates sheet
- The sheet contained two columns (from left to right) entitled "id" and "degradation_rate".
- The id is an identifier that the user will use to identify a particular gene.
- The "degradation_rate" column contained the absolute value of the degradation rate for the corresponding gene as described above, rounded to four decimal places.
- To obtain these values, used the same Microsoft Access database that was used to obtain the production rates for the first worksheet. Again, copied and pasted the values, substituting the appropriate "degradation_rates" table in the query. Did not re-import the "network" table; just created and executed the query.
- Again, the genes were listed in the same order in all the sheets in the Excel workbook.
- Substituted the value 0.0990 for the missing degradation rates.
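The two substituted defaults are linked by the steady-state relation (P/2) - lambda = 0 given for the production_rates sheet. A minimal Python sketch of that arithmetic follows; the function name `initial_production_rate` is made up for this example.

```python
def initial_production_rate(degradation_rate):
    """Initial guess for P from the steady-state relation (P/2) - lambda = 0.

    Solving for P gives P = 2 * lambda; the result is rounded to four
    decimal places, as in the workbook.
    """
    return round(2 * degradation_rate, 4)


default_degradation = 0.0990
default_production = initial_production_rate(default_degradation)
print(f"{default_production:.4f}")  # prints 0.1980
```

So the default production rate of 0.1980 is exactly twice the default degradation rate of 0.0990.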
Expression Data Sheets for Individual Yeast Strains
- Expression data can be provided for either a single strain or multiple strains of yeast (for example, the wild type strain and a transcription factor deletion strain).
- Each strain had its own sheet in the workbook.
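- Each sheet was given a unique name that follows the convention "STRAIN_log2_expression", where STRAIN is the strain designation (for example, "wt_log2_expression" for the wild type).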
- Everyone in the class will have at least one expression worksheet called "wt_log2_expression".
- Included the transcription factors GLN3, HAP4, and CIN5 in the network, so named the deletion strain worksheets "dgln3_log2_expression", "dhap4_log2_expression", and "dcin5_log2_expression".
- The sheet had the following columns in this order:
- "id": list of all genes. The genes were listed in the same order in all the sheets in the Excel workbook.
- The next series of columns contained the expression data for each gene at a given timepoint, given as log2 ratios (log2 fold changes). The column header was the time at which the data were collected, without any units. For example, the 15 minute timepoint had the column header "15". Replicate data for the same timepoint were in columns immediately next to each other and had the same column header.
- If data are provided for multiple strains, each strain should have data for the same timepoints, although the number of replicates can vary.
- Included the data for the 15, 30, and 60 minute timepoints, but not the 90 or 120 minute timepoints.
- The data used are contained in the Expression-and-Degradation-rate-database_2019.accdb file that was used to obtain the production and degradation rates.
- Executed a query in Microsoft Access to extract the data, following the steps listed for the "production_rates" sheet for each strain's expression data. Needed to change the column headers to "15", "15", etc., as described above.
- Missing values in the expression data sheets were acceptable.
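The naming convention and the "same timepoints, any number of replicates" rule can be checked with a short Python sketch. The helper names and the replicate counts in `headers_by_sheet` are assumptions made up for illustration; they are not taken from the actual workbook.

```python
def expression_sheet_name(strain):
    """Build a sheet name following the STRAIN_log2_expression convention."""
    return f"{strain}_log2_expression"


def distinct_timepoints(headers):
    """Collapse replicate headers, e.g. ["id", "15", "15", "30"] -> ["15", "30"]."""
    seen = []
    for header in headers[1:]:  # skip the leading "id" column
        if header not in seen:
            seen.append(header)
    return seen


strains = ["wt", "dgln3", "dhap4"]
sheet_names = [expression_sheet_name(s) for s in strains]

# Hypothetical header rows; the number of replicates differs between strains.
headers_by_sheet = {
    "wt_log2_expression": ["id", "15", "15", "15", "30", "30", "60", "60"],
    "dgln3_log2_expression": ["id", "15", "15", "30", "30", "60", "60"],
}

timepoints = {name: distinct_timepoints(h) for name, h in headers_by_sheet.items()}
# Every strain must cover the same timepoints, whatever its replicate count.
same_timepoints = len({tuple(t) for t in timepoints.values()}) == 1
print(sheet_names)
print(same_timepoints)
```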
network sheet
- The network derived from the YEASTRACT database for the Week 9 assignment was copied and pasted directly into this sheet.
- This sheet contained an adjacency matrix representation of the gene regulatory network.
- The columns corresponded to the transcription factors and the rows corresponded to the target genes controlled by those transcription factors.
- A “1” means there is an edge connecting them and a “0” means that there is no edge connecting them.
- The upper-left cell (A1) contained the text “cols regulators/rows targets”. This text was there as a reminder of the direction of the regulatory relationships specified by the adjacency matrix.
- The rest of row 1 contained the names of the transcription factors that are controlling the other genes in the network, one transcription factor name per column.
- The rest of column A contained the names of the target genes that are being controlled by the transcription factors heading each of the columns in the matrix, one target gene name per row.
- The transcription factor names corresponded to the "id" in the other sheets in the workbook. They were capitalized the same way and occurred in the same order along the top and side of the matrix. The matrix was square, i.e., the same transcription factors appeared along the top and left side of the matrix (the entries themselves did not need to be symmetric). The genes were listed in the same order in all the sheets in the Excel workbook.
- Each cell in the matrix contained a zero (0) if there is no regulatory relationship between those two transcription factors, or a one (1) if there is a regulatory relationship between them. Again, the columns corresponded to the transcription factors and the rows corresponded to the target genes controlled by those transcription factors.
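As an illustration of the layout described above, the following Python sketch builds a "cols regulators/rows targets" matrix from a list of (regulator, target) edges. The three edges are hypothetical and are not the network from Week 9; `adjacency_matrix` is a name invented for this example.

```python
def adjacency_matrix(genes, edges):
    """Return the sheet as a list of rows, header row first.

    Columns are regulators and rows are targets, so the cell in row
    `target`, column `regulator` is 1 when regulator -> target is an edge.
    """
    edge_set = set(edges)
    rows = [["cols regulators/rows targets"] + list(genes)]
    for target in genes:
        cells = [1 if (regulator, target) in edge_set else 0 for regulator in genes]
        rows.append([target] + cells)
    return rows


genes = ["CIN5", "GLN3", "HAP4"]
# Hypothetical (regulator, target) pairs, for illustration only.
edges = [("GLN3", "CIN5"), ("CIN5", "HAP4"), ("HAP4", "HAP4")]

matrix = adjacency_matrix(genes, edges)
for row in matrix:
    print("\t".join(str(cell) for cell in row))
```

Note that GLN3 -> CIN5 sets the cell in the CIN5 row and GLN3 column, while the mirrored cell (GLN3 row, CIN5 column) stays 0, which is why the direction reminder in cell A1 matters.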
network_weights sheet
- These were the initial guesses for the estimation of the weight parameters, w.
- Since these weights are initial guesses which will be optimized by GRNmap, the content of this sheet was identical to the "network" sheet.
optimization_parameters sheet
- The optimization_parameters sheet had two columns (from left to right) entitled "optimization_parameter" and "value".
- Copied this worksheet from the sample workbook provided. Included just the strain designations for which there was a corresponding STRAIN_log2_expression sheet. Deleted dzap.
- What follows below is an explanation of what the optimization_parameters mean.
- alpha: Penalty term weighting (from the L-curve analysis)
- kk_max: Number of times to re-run the optimization loop. In some cases re-starting the optimization loop can improve performance of the estimation.
- MaxIter: Number of times MATLAB iterates through the optimization scheme. If this is set too low, MATLAB will stop before the parameters are optimized.
- TolFun: How close two successive least squares cost evaluations should be before the program determines that it is not making any improvement.
- MaxFunEval: Maximum number of times the program will evaluate the least squares cost.
- TolX: How close successive parameter estimates should be before the program determines that it is not making any improvement.
- production_function: =Sigmoid (case-insensitive) for the sigmoidal model; =MM (case-insensitive) for the Michaelis-Menten model.
- L_curve: =0 if an L-curve analysis should NOT be run or =1 if an L-curve analysis SHOULD be run. The L-curve analysis will automatically run sequential rounds of estimation for an array of fixed alpha values (0.8, 0.5, 0.2, 0.1, 0.08, 0.05, 0.02, 0.01, 0.008, 0.005, 0.002, 0.001, 0.0008, 0.0005, 0.0002, and 0.0001). GRNmap makes a copy of the user's selected input workbook and changes alpha to the first alpha in the list. The estimation runs and the resulting parameter values are used as the initial guesses for the next round of estimation with the next alpha value. This process repeats until all alpha values have been run. New input and output workbooks are generated for each alpha value, although currently, the graphs are only saved for the last run.
- estimate_params: =1 if the user wants to estimate parameters; =0 if the user wants to do just one forward run.
- make_graphs: =1 to output graphs; =0 to not output graphs.
- fix_P: =1 if the user does not want to estimate the production rate parameter, P, and instead keeps the initial guess unchanged; =0 to estimate.
- fix_b: =1 if the user does not want to estimate the b parameter and instead keeps the initial guess unchanged; =0 to estimate.
- expression_timepoints: A row containing a list of the time points when the data was collected experimentally. Should correspond to the timepoint column headers in the STRAIN_log2_expression sheets.
- Strain: A row containing a list of all of the strains for which there is expression data in the workbook. Should correspond to the "STRAIN" portion of the names of the STRAIN_log2_expression sheets for each strain. Note that GRNmap will run the model for the wild type network (all genes present in the network) and for networks where the gene deleted from the designated STRAIN has been deleted from the network.
- simulation_timepoints: A row containing a list of the time points at which to evaluate the differential equations to generate the simulated data. This does not need to correspond to the actual measurement times, but should be in the same units (e.g. minutes).
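The sequential rounds described for L_curve, where each round starts from the previous round's result, can be sketched as follows. This is a toy illustration in Python, not GRNmap's MATLAB code: `estimate` is a hypothetical stand-in that fits a single number by gradient descent on a penalized least squares cost, and `l_curve`, `theta0`, and `data` are names invented here. Only the list of alpha values comes from the description above.

```python
ALPHAS = [0.8, 0.5, 0.2, 0.1, 0.08, 0.05, 0.02, 0.01,
          0.008, 0.005, 0.002, 0.001, 0.0008, 0.0005, 0.0002, 0.0001]


def estimate(theta0, data, alpha, step=0.1, max_iter=200):
    """Toy estimator: minimize (theta - data)^2 + alpha * theta^2.

    Plain gradient descent starting from the initial guess theta0.
    """
    theta = theta0
    for _ in range(max_iter):
        gradient = 2 * (theta - data) + 2 * alpha * theta
        theta -= step * gradient
    return theta


def l_curve(theta0, data, alphas):
    """Run one estimation per alpha, warm-starting from the previous result."""
    results = []
    theta = theta0
    for alpha in alphas:
        theta = estimate(theta, data, alpha)  # previous result is the new guess
        fit_error = (theta - data) ** 2       # least squares part of the cost
        penalty = theta ** 2                  # penalty part of the cost
        results.append((alpha, theta, fit_error, penalty))
    return results


results = l_curve(theta0=0.0, data=1.0, alphas=ALPHAS)
for alpha, theta, fit_error, penalty in results:
    print(f"alpha={alpha:<7} theta={theta:.4f} fit={fit_error:.6f} penalty={penalty:.4f}")
```

Plotting the fit error against the penalty for each alpha gives the L-shaped curve from which a single alpha is usually chosen; the toy cost here only shows the trade-off, with the fit error shrinking as alpha decreases.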
threshold_b sheet
- These were the initial guesses for the estimation of the threshold_b parameters.
- There were two columns.
- The left-most column contained the header "id" and listed the standard names for the genes in the model in the same order as in the other sheets.
- The second column had the header "threshold_b" and contained the initial guesses; we used 0.
Dynamical Systems Modeling of your Gene Regulatory Network
- To run GRNmap from code, MATLAB must be installed on the computer.
- Downloaded the GRNmap v1.10 code from the GRNmap Downloads page.
- Unzipped the file. (Right-click, 7-zip > Extract here)
- Launched MATLAB R2014b.
- Opened GRNmodel.m, which is in the unzipped directory under GRNmap-1.10 > matlab.
- Clicked the Run button (green "play" arrow).
- Was prompted to select the input workbook.
- Saw an optimization diagnostics graphic that showed the progress of the estimation.
- When the run was over, expression plots were displayed.
- The output .xlsx and .mat files were saved in the same folder as the input workbook, along with .jpg files containing the optimization diagnostic and individual expression plots. Saved these files.
- Uploaded output .xlsx file into GRNsight to visualize the results!
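To give a rough idea of what a single forward run computes, here is a minimal Python sketch. It assumes a sigmoidal model of the form dx_i/dt = P_i * sigmoid(sum_j w_ij * x_j - b_i) - lambda_i * x_i, which is consistent with the steady-state relation (P/2) - lambda = 0 used for the initial guesses, but it is not GRNmap's MATLAB code or solver. The two-gene network, the weight of 1.0, and the forward Euler integration are assumptions made only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(P, lam, W, b, x0, t_end, dt=0.01):
    """Integrate the assumed model with forward Euler up to time t_end.

    W[i][j] is the weight of regulator j on target i (rows are targets,
    columns are regulators, as in the network_weights sheet).
    """
    n = len(x0)
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        dx = []
        for i in range(n):
            z = sum(W[i][j] * x[j] for j in range(n)) - b[i]
            dx.append(P[i] * sigmoid(z) - lam[i] * x[i])
        x = [x[i] + dt * dx[i] for i in range(n)]
    return x


# Default initial guesses from this entry, applied to two hypothetical genes.
P = [0.1980, 0.1980]
lam = [0.0990, 0.0990]
b = [0.0, 0.0]

# With no regulation, relative expression stays at the steady state of 1.
x_flat = simulate(P, lam, [[0.0, 0.0], [0.0, 0.0]], b, [1.0, 1.0], t_end=60)

# Hypothetical edge: gene 0 activates gene 1 with weight 1.0.
x_act = simulate(P, lam, [[0.0, 0.0], [1.0, 0.0]], b, [1.0, 1.0], t_end=60)
log2_act = [math.log2(v) for v in x_act]  # comparable to log2 expression data
print(x_flat, x_act, log2_act)
```

With zero weights and thresholds the run stays flat at 1 (log2 value 0), which is why those initial guesses are a neutral starting point; a positive weight pushes the target gene's expression above 1.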
Data and Files
- After consulting with Dr. Dahlquist, this file was updated during Week 10 to fix previous errors.
Conclusion
A workbook was successfully created so that dynamical systems modeling of my gene regulatory network could be performed. GRNmap ran successfully on the workbook in MATLAB and produced the modeling output. This output will be used to further analyze how cold shock and cold shock recovery work, specifically with the genes and transcription factors that were chosen.
Acknowledgments
1. I worked with User:Knguye66, User:Jcowan4, and User:Mavila9 for this assignment.
2. "Except for what is noted above, this individual journal entry was completed by me and not copied from another source." Cdomin12 (talk) 15:56, 5 November 2019 (PST)
References
- Week 10. Retrieved November 6, 2019, from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_10