Introduction
In this article we introduce a complete workflow for a typical (Affymetrix) microarray analysis. Data import, preprocessing, differential expression and enrichment analysis are discussed. We also introduce some necessary mathematical background on linear models along the way.
The data set used 1 is from a paper studying the differences between patients suffering from ulcerative colitis (UC) or Crohn’s disease (CD). This is a typical clinical data set consisting of 14 UC and 15 CD patients, from each of whom inflamed and non–inflamed colonic mucosa tissue was obtained via biopsy. Our aim is to analyze differential expression (DE) between the tissues in the two diseases.
Required packages and other preparations
Download the raw data from ArrayExpress
The first step of the analysis is to download the raw data CEL files. These files are produced by the array scanner software and contain the probe intensities measured. The data have been deposited at ArrayExpress and have the accession code E-MTAB-2967.
Each ArrayExpress data set has a landing page summarizing the data set, and we use the ArrayExpress Bioconductor package to obtain the ftp links to the raw data files (Data from Palmieri et al. on ArrayExpress).
Information stored in ArrayExpress
Each dataset at ArrayExpress is stored according to the MAGE–TAB (MicroArray Gene Expression Tabular) specifications as a collection of tables bundled with the raw data. The MAGE–TAB format specifies up to five different types of files, namely the Investigation Description Format (IDF), the Array Design Format (ADF), the Sample and Data Relationship Format (SDRF), the raw data files and the processed data files.
For our purposes, the IDF and the SDRF files are the important ones. The IDF file contains top level information about the experiment including title, description, submitter contact details and protocols. The SDRF file contains essential information on the experimental samples, e.g. the experimental group(s) they belong to.
Download the raw data and the annotation data
With the code below, we download the raw data from ArrayExpress 2. It is saved in a local directory, referred to as raw_data_dir in the sketch below.
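A minimal sketch of the download, assuming the getAE function from the ArrayExpress package and a temporary download directory:

```r
# Download the raw data ("raw" = CEL files plus annotation) for E-MTAB-2967
library(ArrayExpress)

raw_data_dir <- tempdir()  # local directory for the downloaded files
anno_AE <- getAE("E-MTAB-2967", path = raw_data_dir, type = "raw")
```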
We now import the SDRF file directly from ArrayExpress in order to obtain the sample annotation.
The raw data consists of one CEL file per sample (see below) and we use the CEL file names as row names for the imported data. These names are given in the Array Data File column of the SDRF table.
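A sketch of the SDRF import; the file name follows the usual ArrayExpress naming convention (an assumption), and the AnnotatedDataFrame class comes from Biobase:

```r
library(Biobase)

# Read the SDRF table and use the CEL file names as row names
sdrf_location <- file.path(raw_data_dir, "E-MTAB-2967.sdrf.txt")
SDRF <- read.delim(sdrf_location)
rownames(SDRF) <- SDRF$Array.Data.File
SDRF <- AnnotatedDataFrame(SDRF)  # format expected as phenoData later on
```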
Before we move on to the actual raw data import, we will briefly introduce the ExpressionSet class used to store the data within Bioconductor.
Figure 1.
Structure of Bioconductor’s ExpressionSet class.
Bioconductor ExpressionSets
Genomic data can be very complex, usually consisting of a number of different bits and pieces, e.g. information on the experimental samples, annotation of the genomic features measured, as well as the experimental data itself. In Bioconductor, the approach taken is that these pieces should be stored in a single structure to easily manage the data.
The package Biobase contains standardized data structures to represent genomic data. The ExpressionSet class is designed to combine several different sources of information into a single convenient structure (see Figure 1). The data in an ExpressionSet consist of:
• assayData: expression data from microarray experiments.
• metaData: a description of the samples in the experiment (phenoData), metadata about the features on the chip or technology used for the experiment (featureData), and further annotations for the features, for example gene annotations from biomedical databases (annotation).
• experimentData: a flexible structure to describe the experiment.
The ExpressionSet class coordinates all of these data, so that one does not have to worry about the details. However, some constraints have to be met. In particular, the row names of the phenoData (the sample annotation) have to match the column names of the assay data, and the row names of the featureData have to match the row names of the assay data. You can use the functions pData and fData to extract the sample and feature annotation, respectively, from an ExpressionSet.
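As a quick illustration, a sketch using the small example ExpressionSet shipped with Biobase:

```r
library(Biobase)
data(sample.ExpressionSet)  # toy data set included in Biobase

head(pData(sample.ExpressionSet))       # sample annotation
head(fData(sample.ExpressionSet))       # feature annotation
exprs(sample.ExpressionSet)[1:3, 1:3]   # the expression matrix itself
```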
Import of the raw microarray data
The analysis of Affymetrix arrays starts with CEL files. These are the result of the processing of the raw image files using the Affymetrix software and contain estimated probe intensity values. Each CEL file additionally contains some metadata, such as a chip identifier.
We collect the information about the CEL files and import them into a single variable (raw_data in the sketch below). The function read.celfiles from the oligo package is used for the import. We specify the SDRF table as the phenoData argument, so that the sample annotation is attached to the resulting object. Finally, we check whether the created object is valid (e.g. whether the sample names match between the different tables).
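A sketch of the import, building on the variables created above:

```r
library(oligo)

# Import the CEL files; attach the SDRF table as phenoData
raw_data <- oligo::read.celfiles(
  filenames = file.path(raw_data_dir, SDRF$Array.Data.File),
  verbose = FALSE, phenoData = SDRF)

stopifnot(validObject(raw_data))  # e.g. sample names match between tables
```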
We now inspect the raw data a bit and retain only those columns that are related to the experimental factors of interest (identifiers of the individuals, disease of the individual and the mucosa type).
Quality control of the raw data
The first step after the initial data import is the quality control of the data. Here we check for outliers and try to see whether the data clusters as expected, e.g. by the experimental conditions. We use the identifiers of the individuals as plotting symbols.
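A minimal sketch of the two quality control plots on the log2 scale (plotting symbols and colors omitted for brevity):

```r
# PCA on the log2-transformed raw intensities
exp_raw <- log2(Biobase::exprs(raw_data))
PCA_raw <- prcomp(t(exp_raw), scale. = FALSE)
plot(PCA_raw$x[, 1], PCA_raw$x[, 2], pch = 16,
     xlab = "PC1", ylab = "PC2", main = "PCA of the raw data")

# Boxplots of the per-array intensity distributions
oligo::boxplot(raw_data, target = "core",
               main = "Boxplot of log2-intensities for the raw data")
```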
The PCA (performed on the log–intensity scale) plot of the raw data shows that the first principal component differentiates between the diseases. However, the intensity boxplots show that the intensity distributions of the individual arrays are quite different, indicating the need for an appropriate normalization, which we will discuss next.
A wide range of quality control plots can be created using the package arrayQualityMetrics 5. The package produces an HTML report containing the quality control plots together with a description of their aims and an identification of possible outliers. We don’t discuss this tool in detail here, but the code below can be used to create a report for our raw data.
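A sketch of the report generation (the output directory name is an arbitrary choice):

```r
library(arrayQualityMetrics)

# Writes an HTML quality report to the given directory
arrayQualityMetrics(expressionset = raw_data,
                    outdir = "raw_data_qc_report",
                    force = TRUE, do.logtransform = TRUE)
```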
Figure 2.
PCA plot of the log–transformed raw data.
Figure 3.
Intensity boxplots of the log2–transformed raw data.
Background adjustment, calibration, summarization and annotation
Background adjustment
After the initial import and quality assessment, the next step in processing of microarray data is background adjustment. This is essential because a part of the measured probe intensity is due to non-specific hybridization and noise in the optical detection system. Therefore, observed intensities need to be adjusted to give accurate measurements of specific hybridization.
Across–array normalization (calibration)
Without proper normalization across arrays, it is impossible to compare measurements from different array hybridizations due to many obscuring sources of variation. These include different efficiencies of reverse transcription, labeling or hybridization reactions, physical problems with the arrays, reagent batch effects, and laboratory conditions.
Summarization
After normalization, summarization is needed because on the Affymetrix platform transcripts are represented by multiple probes. For each gene, the background adjusted and normalized intensities need to be summarized into one quantity that estimates an amount proportional to the amount of RNA transcript.
After the summarization step, the summarized data can be annotated with various information, e.g. gene symbols and ENSEMBL gene identifiers. There is an annotation database available from Bioconductor for our platform, namely the package hugene10sttranscriptcluster.db.
You can view its content like this
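A sketch of how to inspect the annotation package:

```r
library(hugene10sttranscriptcluster.db)

hugene10sttranscriptcluster.db            # print a summary of the package
keytypes(hugene10sttranscriptcluster.db)  # available annotation key types
```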
Additional information is available from the reference manual of the package. Essentially, the package provides a mapping from the transcript cluster identifiers to the various annotation data.
Old and new “probesets” of Affymetrix microarrays
Traditionally, Affymetrix arrays (the so–called 3’ IVT arrays) were probeset based: a certain fixed group of probes constituted a probeset representing a certain gene or transcript (note, however, that a gene can be represented by multiple probesets).
The more recent “Gene” and “Exon” Affymetrix arrays are exon based and hence there are two levels of summarization. The exon level summarization leads to “probeset” summaries. However, these probesets are not the same as the probesets of the previous chips, which usually represented a gene/transcript. Furthermore, there are also no longer designated match/mismatch probes present on “Gene” type chips.
For the newer Affymetrix chips a gene/transcript level summary is given by “transcript clusters”. Hence the appropriate annotation package is called hugene10sttranscriptcluster.db.
To complicate things even a bit more, note that the “Gene” arrays were created as affordable versions of the “Exon” arrays by taking the “good” probes from the Exon array. So the notion of a probeset is based on the original construction of the probesets on the Exon array, which usually contain at least four probes.
But since Affymetrix selected only a subset of “good” probes for the Gene arrays, a lot of the probesets on the “Gene” arrays are made up of three or fewer probes. Thus, a summarization at the probeset/exon level is not recommended for “Gene” arrays, but it is nonetheless possible by using the hugene10stprobeset.db annotation package.
One–go preprocessing in oligo
The package oligo allows us to perform background correction, normalization and summarization in one single step using a deconvolution method for background correction, quantile normalization and the RMA (robust multichip average) algorithm for summarization.
This series of steps as a whole is commonly referred to as RMA algorithm, although strictly speaking RMA is merely a summarization method 6– 8.
The parameter target of oligo’s rma function controls the degree of summarization; the default option “core” performs the summarization at the transcript–cluster level.
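A sketch of the one–step preprocessing (the result variable name is an arbitrary choice):

```r
# Background correction, quantile normalization and RMA summarization in one go
palmieri_eset <- oligo::rma(raw_data, target = "core")
```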
Although other methods for background correction and normalization exist, RMA is usually a good default choice. RMA shares information across arrays and uses the versatile quantile normalization method that will make the array intensity distributions match. However, it is preferable to apply it only after outliers have been removed. The quantile normalization algorithm used by RMA works by replacing values by the average of identically ranked (within a single chip) values across arrays. A more detailed description can be found on the Wikipedia page about it.
An alternative to quantile normalization is the vsn algorithm, that performs background correction and normalization by robustly shifting and scaling log–scale intensity values within arrays 9. This is less “severe” than quantile normalization.
Some mathematical background on normalization (calibration) and background correction
A generic model for the intensity value Y of a single probe on a microarray is given by

Y = B + α · S

where B is a random quantity due to background noise, α is a gain factor, and S is the amount of measured specific binding. Background adjustment aims to remove B, while calibration adjusts for differences in the gain and general intensity level between arrays.
Figure 4.
PCA plot of the calibrated data.
Quality assessment of the calibrated data
We now produce a clustering and another PCA plot using the calibrated data. In order to display a heatmap of the sample–to–sample distances, we first compute the pairwise distances between the samples, e.g. with the dist function applied to the transposed expression matrix.
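A minimal sketch of the distance heatmap, using manhattan distances and the pheatmap package:

```r
library(pheatmap)

# Sample-to-sample distances on the calibrated data
dists <- as.matrix(dist(t(Biobase::exprs(palmieri_eset)), method = "manhattan"))
pheatmap(dists, main = "Sample-to-sample distances")
```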
The second PC roughly separates Crohn’s disease from ulcerative colitis, while the first separates the tissues. This is what we expect: the samples cluster by their experimental conditions. On the heatmap plot we also see that the samples do not cluster strongly by tissue, confirming the impression from the PCA plot that the separation between the tissues is not perfect. The stripe in the heatmap might correspond to outliers that one could potentially remove. The arrayQualityMetrics package produces reports that compute several metrics that can be used for outlier removal.
Figure 5.
Heatmap of the sample to sample distances for the calibrated data.
Filtering based on intensity
We now filter out lowly expressed genes. Microarray data commonly show a large number of probes in the background intensity range. They also do not change much across arrays. Hence they combine a low variance with a low intensity. Thus, they could end up being detected as differentially expressed although they are barely above the “detection” limit and are not very informative in general. We will perform a “soft” intensity based filtering here, since this is recommended by limma’s 10, 11 user guide (a package we will use below for the differential expression analysis). However, note that a variance based filter might exclude a similar set of probes in practice. In the histogram of the gene–wise medians, we can clearly see an enrichment of low medians on the left hand side. These represent the genes we want to filter.
In order to infer a cutoff from the data, we inspect the histogram of the median–intensities. We visually fit a central normal distribution given by 0.5 · N(5.1, 1.18) to the probe–wise medians, which represents their typical behavior in the data set at hand.
Then we use the 5% quantile of this distribution as a threshold. We keep only those genes that show an expression higher than the threshold in at least as many arrays as there are samples in the smallest experimental group.
In our case this would be 14.
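A minimal sketch of this filtering step; the threshold is the 5% quantile of the visually fitted N(5.1, 1.18) described above:

```r
# Histogram of the per-gene median intensities
palmieri_medians <- Biobase::rowMedians(Biobase::exprs(palmieri_eset))
hist(palmieri_medians, breaks = 100, freq = FALSE,
     main = "Histogram of the median intensities per gene")

man_threshold <- qnorm(0.05, mean = 5.1, sd = 1.18)  # 5% quantile of the fit
samples_cutoff <- 14  # size of the smallest experimental group

# Keep genes above the threshold in at least samples_cutoff arrays
idx_keep <- rowSums(Biobase::exprs(palmieri_eset) > man_threshold) >= samples_cutoff
palmieri_filtered <- palmieri_eset[idx_keep, ]
```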
Figure 6.
Histogram of the median intensities per gene.
Annotation of the transcript clusters
Before we continue with the linear models for microarrays and differential expression, we describe how to add “feature data”, i.e. annotation information, to the transcript cluster identifiers stored in the featureData of our ExpressionSet. We use the function select from the AnnotationDbi package to query the annotation database of our chip.
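A sketch of the annotation query; the chosen columns are a typical selection:

```r
library(hugene10sttranscriptcluster.db)

# Map transcript-cluster IDs to gene symbols and gene names
anno_palmieri <- AnnotationDbi::select(hugene10sttranscriptcluster.db,
                                       keys = featureNames(palmieri_filtered),
                                       columns = c("SYMBOL", "GENENAME"),
                                       keytype = "PROBEID")
```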
Removing multiple mappings and building custom annotations
Many transcript–cluster identifiers will map to multiple gene symbols. We compute a summary table in the code below to see how many there are.
We have over 2000 transcript–clusters that map to multiple gene symbols. It is difficult to decide which mapping is “correct”. Therefore, we exclude these transcript–clusters. Additionally, we also exclude transcript–clusters that do not map to gene symbols.
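A sketch of the filtering of ambiguous and unmapped transcript clusters, using dplyr for the grouping:

```r
library(dplyr)

# Count the distinct gene symbols per transcript cluster
anno_summarized <- anno_palmieri %>%
  group_by(PROBEID) %>%
  summarize(no_of_matches = n_distinct(SYMBOL))

# Clusters mapping to multiple symbols, or to no symbol at all
probes_multi <- filter(anno_summarized, no_of_matches > 1)$PROBEID
probes_no_symbol <- anno_palmieri$PROBEID[is.na(anno_palmieri$SYMBOL)]

ids_to_exclude <- featureNames(palmieri_filtered) %in%
  c(probes_multi, probes_no_symbol)
palmieri_final <- palmieri_filtered[!ids_to_exclude, ]
```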
Alternatively, one can re–map the probes of the array to a current annotation, a workflow to do this for Illumina arrays is given in 12. Essentially, the individual probe sequences are re–aligned to an in–silico “exome” that consists of all annotated transcript exons.
In any case, the package pdInfoBuilder can be used to build custom annotation packages for use with oligo. In order to do this, PGF/CLF files (distributed as “library files” on the Affymetrix website) and the probeset annotation .csv file are required.
The CLF file contains information about the location of individual probes on the array. The PGF file contains the individual probe sequences and the probeset they belong to. Finally, the probeset annotation .csv contains information about which probesets are used in which transcript cluster. Commonly, multiple probesets are used in one transcript cluster, and some probesets are contained in multiple transcript clusters.
A short overview of linear models
I am afraid this section is rather technical. However, general experience shows that most questions on the Bioconductor support site about packages using linear models, like limma 10, DESeq2 13 and edgeR 14, are actually not so much about the packages themselves but rather about the underlying linear models. It might also be helpful to learn a bit of linear algebra to understand the concepts better. The Khan Academy offers nice (and free) online courses. Mike Love’s and Rafael Irizarry’s genomics class is also a very good resource, especially its section on interactions and contrasts.
Regression models
In regression models we use one variable to explain or predict the other. It is customary to plot the predictor variable on the x–axis and the predicted variable on the y–axis. The predictor is also called the independent variable, the explanatory variable, the covariate, or simply x. The predicted variable is called the dependent variable, or simply y.
In a regression problem the data are pairs (x_i, y_i) for i = 1, . . . , n. For each i, y_i is a random variable whose distribution depends on x_i. We write:

y_i = g(x_i) + ε_i

This expresses y_i as a sum of a systematic or explainable part g(x_i) and an unexplained part ε_i. Or, more informally: response = signal + noise.
g is called the regression function. Once we have an estimate ĝ of g, we can compute the residuals r_i = y_i − ĝ(x_i). Residuals are used to evaluate and assess the fit of models for g. Usually one makes distributional assumptions about them, e.g. that they are independent and identically normally distributed with variance σ² and mean zero:

ε_i ~ N(0, σ²)
Linear regression models
Linear regression is a special case of the general regression model. Here, we combine the predictors linearly to produce a prediction. If we have only a single predictor x, the simple linear regression model is:

y_i = β_0 + β_1 · x_i + ε_i,  i = 1, . . . , n

In matrix notation this reads y = Xβ + ε, with X being the so–called design matrix: its i-th row is (1, x_i), i.e. a column of ones for the intercept β_0 followed by the column of predictor values.
Creating design matrices in R
To get an idea of what design matrices look like, we consider several examples. It is important to know some fundamentals about design matrices in order to be able to correctly transfer a design of a particular study to an appropriate linear model.
We will use base R functions, most importantly model.matrix, in order to produce design matrices for a variety of linear models. R uses the formula interface to create design matrices automatically. In the first example, we have two groups of two samples each. Using model.matrix, we create the corresponding design matrix, as sketched below.
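A minimal sketch; the group labels are arbitrary:

```r
# Two groups of two samples each; default treatment contrast parameterization
groups <- factor(c("one", "one", "two", "two"))
design_tc <- model.matrix(~ groups)
design_tc
##   (Intercept) groupstwo
## 1           1         0
## 2           1         0
## 3           1         1
## 4           1         1
```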
This design is called treatment contrast parameterization for an obvious reason: the first column of the design matrix represents a “base level”, i.e the mean β 0 for group one and the second column, corresponding to β 1, represents the difference between the group means since all group two samples have means represented by β 0 + β 1. As β 0 is the mean of group 1, β 1 corresponds to the difference of the means of group two and group one and thus shows the effect of a “treatment”.
However, this design is not orthogonal, i.e. the columns of the design matrix are not independent. We can construct an equivalent orthogonal design as follows:
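A sketch of the orthogonal parameterization, continuing with the groups factor from above:

```r
# Drop the intercept: each column now indicates membership in one group
design_orth <- model.matrix(~ 0 + groups)
design_orth
##   groupsone groupstwo
## 1         1         0
## 2         1         0
## 3         0         1
## 4         0         1

crossprod(design_orth)  # zero off-diagonals: the columns are orthogonal
```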
Here, we lose the nice direct interpretability of the coefficients. Now β 1 is simply the mean of the second group. We will discuss the extraction of interesting contrasts (i.e. linear combinations of coefficients) from a model like this below.
We explicitly excluded the intercept by specifying it as zero. Commonly it makes sense to include an intercept in the model, especially in more complex models. We can specify a more complex design pretty easily: if we have two independent factors, the base mean now corresponds to the first levels of the two factors.
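A sketch of a two–factor design; the factor names and levels are arbitrary:

```r
# Two independent factors; the intercept is the mean at both base levels
f1 <- factor(c("A", "A", "B", "B"))
f2 <- factor(c("c", "d", "c", "d"))
model.matrix(~ f1 + f2)
```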
The “drop one” strategy is the default method for creating regression coefficients from factors in R. If a factor has d levels, adding it to the model will give you d − 1 regression coefficients, corresponding to d − 1 columns in the design matrix. Apart from excluding the intercept, you can also use the I function to include a transformed covariate “as is”: arithmetic inside I() is evaluated literally instead of being interpreted as formula syntax. The code below includes z² as a covariate.
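A sketch, reusing f1 from above with an arbitrary numeric covariate z:

```r
# I() protects z^2 from interpretation as formula syntax
z <- c(0.5, 1.2, 1.7, 2.1)
model.matrix(~ f1 + I(z^2))
```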
Singular model matrices
Whatever your model matrix looks like, you should make sure that it is non–singular. Singularity means that the measured variables are linearly dependent and leads to regression coefficients that are not uniquely defined. In linear algebra terms, we say that the matrix does not have full rank, which for design matrices means that the actual dimension of the space spanned by the column vectors is lower than the apparent one. This leads to a redundancy in the model matrix, since some columns can be represented by linear combinations of other columns.
For design matrices, which contain factors, this happens if two conditions are confounded, e.g. in one experimental group there are only females and in the other group there are only males. Then the effect of sex and experimental group cannot be disentangled.
Let’s look at an example. We set up three factors, of which the third one is nested within the first two. We can check the singularity of the model matrix by computing its so called singular value decomposition and checking its minimal singular value. If this is zero, the matrix is singular. As we can see, this is indeed the case here.
We have one column in the design matrix that can be represented by a linear combination of the other columns; thus the column space has a lower dimension than the apparent one. For example, we can represent the “zm” column by a linear combination of the first two columns, “intercept” and “x2”. In mathematical notation, this means that the “zm” column can be written as a weighted sum of the “intercept” and “x2” columns.
Thus, the corresponding regression coefficients are not uniquely determined and the model does not make much sense. Therefore, the non–singularity of the model matrix should always be checked beforehand.
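A simplified sketch of such a confounded design (the factors here are smaller than in the original example, but show the same phenomenon):

```r
# z is completely determined by x, so the design is singular
x <- factor(c(1, 1, 2, 2))
y <- factor(c("a", "b", "a", "b"))
z <- factor(c("m", "m", "f", "f"))

X <- model.matrix(~ x + y + z)
min(svd(X)$d)  # (numerically) zero: the matrix is singular

# the "zm" column equals "intercept" minus "x2"
all(X[, "zm"] == X[, "(Intercept)"] - X[, "x2"])
```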
Using contrasts to test hypotheses in linear regression
In differential expression analysis, our most important covariates will be factors that differentiate between two or more experimental groups, e.g. the covariate x_p = (x_p1, . . . , x_pn) is either zero or one depending on which group the sample belongs to.
We will illustrate this concept using a small data set called toycars (available, e.g., in the DAAG package). It contains measurements for toy cars of three different types that were released from a ramp, with three variables:
• angle: the angle of the ramp
• distance: the distance traveled by the car
• car: the type of car (1, 2 or 3)
We transform car into a factor so that R performs the necessary parameterization of the contrasts automatically.
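A sketch of the data preparation, assuming the DAAG package:

```r
library(DAAG)
data(toycars)

toycars$car <- factor(toycars$car)  # treat the car type as a factor
boxplot(distance ~ car, data = toycars,
        xlab = "car", ylab = "distance")  # cf. Figure 7
```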
Figure 7.
Boxplots showing the distance traveled by each car in the toycars data.
By looking at the box plots of distance by car, we can clearly see differences between the three types of cars. We can now fit a linear model with distance as the dependent variable and car and angle as the predictors. As we can see from the linear model output, the treatment contrast parameterization was used, with car 1 being the base level.
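A sketch of the model fit; with treatment contrasts (R's default), car 1 serves as the base level:

```r
# Linear model: distance explained by car type and ramp angle
fit_cars <- lm(distance ~ car + angle, data = toycars)
summary(fit_cars)
```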
Testing general linear hypotheses
The estimated coefficients now give us the difference in distance traveled between car 1 and car 2 (0.11) and between car 1 and car 3 (-0.08), along with the associated t–tests of these coefficients. However, we cannot see a test of car 2 vs. car 3. This contrast would correspond to testing the difference between the car 2 and car 3 regression coefficients.
Thus, contrasts of interest to us may not readily correspond to coefficients in a fitted linear model. However, one can easily test general linear hypotheses about the coefficients of the form:

H_0: C · β = α

where C is a q × (p + 1) contrast matrix containing the between–group differences of interest (p being the number of predictors), q is the total number of comparisons to be performed, and α contains the differences to be tested; this is usually a vector of zeros. If one tests multiple coefficients at once (e.g. β_1 = 0 and β_2 = 0), the corresponding test statistic is F–distributed. If one just tests a single linear combination of coefficients, e.g. β_1 − β_2 = 0 or β_1 − β_2 − 2·β_3 = 0, the test statistic has a t–distribution. The function glht from the multcomp package implements such general linear hypothesis tests.
Note that the model does not actually have to be refitted in order to test the contrasts. This makes contrast–matrix based testing more efficient and convenient than reformulating the model with a new parameterization of the factors to obtain the desired tests. We can use the function glht from the multcomp package to test all pairwise comparisons between the car types.
There are three such comparisons, and we can print the results by using the summary function, as sketched below.
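A sketch of the pairwise comparisons, reusing the fit from above; mcp(car = "Tukey") creates the contrast matrix for all pairwise differences:

```r
library(multcomp)

# All three pairwise comparisons between the car types
pairwise_cars <- glht(fit_cars, linfct = mcp(car = "Tukey"))
summary(pairwise_cars)
```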
Note that the term “contrast” is used both in the context of (re)parameterization of the original model (as in “treatment contrasts”) and in the testing of linear hypotheses about the model coefficients. This can lead to some confusion; however, it is usually clear from the context whether a reparameterization or a test of linear hypotheses is intended.
We can also fit a linear model without an intercept to the toycars data set. Now, the coefficients derived from the “car” factor represent car–wise means. Thus, the contrasts we have to form change; however, the results for the group comparisons do not.
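A sketch of the cell–means parameterization; the pairwise comparison results should coincide with those above:

```r
# Without an intercept, the car coefficients are group-wise means
fit_cars_means <- lm(distance ~ 0 + car + angle, data = toycars)
coef(fit_cars_means)

summary(glht(fit_cars_means, linfct = mcp(car = "Tukey")))
```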
Linear models for microarrays
We now apply linear models to microarrays. Specifically, we discuss how to use the limma package for differential expression analysis. The package is designed to analyze complex experiments involving comparisons between many experimental groups simultaneously, while remaining reasonably easy to use for simple experiments. The main idea is to fit a linear model to the expression data for each gene. Empirical Bayes and other shrinkage methods are used to borrow information across genes for the residual variance estimation, leading to “moderated” t–statistics and stabilizing the analysis for experiments with just a small number of arrays 11.
In the following, we use appropriate design and contrast matrices for our linear models and fit a linear model to each gene separately.
A linear model for the data
The original paper is interested in changes in transcription that occur between inflamed and adjacent non–inflamed mucosal areas of the colon. This is studied in both inflammatory bowel disease types.
Since we have two arrays per individual, the first factor we need is a blocking factor for the individuals that will absorb differences between them. Then we create factors that give us the grouping for the diseases and the tissue types. We furthermore simplify the names of the diseases to UC and CD, respectively. Then, we create two design matrices, one for each of the two diseases, as we will analyze them separately in order to follow the analysis strategy of the original paper closely (one could also fit a joint model to the complete data set; however, the two diseases might behave very differently, so a joint fit might not be appropriate).
We can inspect the design matrices and test their rank.
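A hedged sketch of the design construction; the annotation column names (Individual, Tissue) and factor levels are assumptions, and only one of the two per-disease designs is shown:

```r
# Blocking factor for individuals plus a tissue factor (column names assumed)
individual <- factor(Biobase::pData(palmieri_final)$Individual)
tissue <- factor(Biobase::pData(palmieri_final)$Tissue,
                 levels = c("non_inflamed", "inflamed"))

design <- model.matrix(~ 0 + tissue + individual)

# The design should have full column rank (non-singular)
qr(design)$rank == ncol(design)
```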
Contrasts and hypotheses tests
We now fit the linear models and define appropriate contrasts to test hypotheses of interest. We want to compare the inflamed to the non–inflamed tissue. Thus, we create a contrast matrix consisting of one row; limma’s function makeContrasts can be used to construct it.
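A sketch of the limma fit; the design column names follow from the factor levels assumed above:

```r
library(limma)

fit <- lmFit(palmieri_final, design)

# One-row contrast: inflamed minus non-inflamed tissue
contrast_matrix <- makeContrasts(
  inflamed_vs_non = tissueinflamed - tissuenon_inflamed,
  levels = design)

palmieri_fit <- eBayes(contrasts.fit(fit, contrast_matrix))
```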
Figure 8.
Histogram of the p–values for Crohn’s disease.
Figure 9.
Histogram of the p–values for ulcerative colitis.
Extracting results
Results can be extracted by use of the topTable function.
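A sketch of the result extraction, continuing from the fit above:

```r
# Full results table, sorted by p-value
table_CD <- topTable(palmieri_fit, number = Inf)
head(table_CD)

# p-value histogram as in Figures 8 and 9
hist(table_CD$P.Value, breaks = 100, main = "Histogram of p-values")
```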
We call around 500/1000 genes in the two conditions at the same cutoff. This higher number of identified DE genes is probably due to the increased power from blocking by individual and the moderated variance estimation that limma performs.
Comparison to the paper results
We now compare our list of differentially expressed genes to the results obtained in the paper. The paper results can be downloaded as Excel files from http://links.lww.com/IBD/A795. We save them in an .xlsx file.
Figure 10.
Selecting a background set of genes for the gene ontology analysis.
We see that we get a moderate overlap of 0.678 for CD and 0.692 for UC. Note that it is recommended to always choose an FDR cutoff instead of a p–value cutoff, since this way you control an explicitly defined error rate, and the results are easier to interpret and to compare. In what follows, we choose an FDR cutoff of 10%.
Gene ontology (GO) based enrichment analysis
We can now try to characterize the identified differentially expressed genes a bit better by performing a GO enrichment analysis. Essentially, the Gene Ontology (http://www.geneontology.org/) is a hierarchically organized collection of functional gene sets 16– 18.
Matching the background set of genes
The function genefinder from the genefilter package can be used to find background genes whose expression is similar to that of the differentially expressed genes.
We do this in order not to select a biased background since the gene set testing is performed by a simple Fisher test on a 2×2 table. Note that this approach is very similar to commonly used web tools like GOrilla 20. Here we focus on the CD subset of the data.
For every differentially expressed gene, we try to find genes with similar expression.
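A hedged sketch of the matching step; DE_genes stands for a hypothetical character vector of differentially expressed transcript-cluster IDs:

```r
library(genefilter)

# For each DE gene, find background genes with similar expression profiles
back_genes_idx <- genefinder(palmieri_final, as.character(DE_genes),
                             method = "manhattan", scale = "none")

back_genes <- featureNames(palmieri_final)[
  sapply(back_genes_idx, function(x) x$indices)]
back_genes <- setdiff(back_genes, DE_genes)  # background must exclude DE genes
```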
We can see that the matching returned a sensible result, and we can now perform the actual testing. For this purpose we use the topGO package, which implements a nice interface to Fisher testing and also offers additional algorithms that take the GO structure into account, e.g. by only reporting the most specific significant gene set in the hierarchy 21.
The GO has three top ontologies: cellular component (CC), biological process (BP), and molecular function (MF). For illustrative purposes we limit ourselves to the BP category here.
Running topGO
We first create a named factor that indicates, for every gene in our universe (differentially expressed genes plus matched background), whether it is differentially expressed or not.
We now initialize the topGO data set, using the GO annotations contained in the annotation database for the chip we are using. The nodeSize parameter specifies a minimum size for a GO category to be included in the testing.
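A hedged sketch of the set-up; all_genes and DE_genes are hypothetical vectors of transcript-cluster IDs, and the nodeSize value is an assumption:

```r
library(topGO)

# Named 0/1 factor over the gene universe: 1 = differentially expressed
gene_universe <- factor(as.integer(all_genes %in% DE_genes))
names(gene_universe) <- all_genes

top_GO_data <- new("topGOdata",
                   ontology = "BP",
                   allGenes = gene_universe,
                   nodeSize = 10,  # drop GO terms with < 10 annotated genes
                   annot = annFUN.db,
                   affyLib = "hugene10sttranscriptcluster.db")
```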
Now the tests can be run. topGO offers a wide range of options, for details see the paper or the package vignette.
We run two common tests: an ordinary Fisher test for every GO category, and the “elim” algorithm, which tries to incorporate the hierarchical structure of the GO and tries to “decorrelate” it in order to report the most specific significant term in the hierarchy.
The algorithm starts processing the nodes/GO categories from the bottommost (most specific) level and then iteratively moves to more general nodes. If a node is scored as significant, all of its genes are marked as removed in all ancestor nodes. This way, the “elim” algorithm aims at finding the most specific node for every gene.
The “elim” test uses a 0.01 p–value cutoff by default.
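A sketch of the two test runs:

```r
# Ordinary Fisher test and the hierarchy-aware "elim" algorithm
result_top_GO_classic <- runTest(top_GO_data,
                                 algorithm = "classic", statistic = "fisher")
result_top_GO_elim <- runTest(top_GO_data,
                              algorithm = "elim", statistic = "fisher")
```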
We can now inspect the results. We look at the top 100 GO categories according to the “Fisher elim” algorithm. The function GenTable produces a summary table of the results.
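A sketch of the result table:

```r
# Tabulate the top 100 GO terms, ordered by the elim p-values
res_top_GO <- GenTable(top_GO_data,
                       Fisher.elim = result_top_GO_elim,
                       Fisher.classic = result_top_GO_classic,
                       orderBy = "Fisher.elim", topNodes = 100)
head(res_top_GO)
```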
Visualization of the GO–analysis results
A graph of the results can also be produced. Here we visualize the three most significant nodes according to the Fisher elim algorithm in the context of the GO hierarchy.
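A sketch of the graph visualization (this requires the Rgraphviz package):

```r
# Plot the three most significant elim nodes within the GO hierarchy
showSigOfNodes(top_GO_data, score(result_top_GO_elim),
               firstSigNodes = 3, useInfo = "def")
```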
We can see that indeed GO categories related to inflammation, signalling and immune response show up as significant. Gene set enrichment analysis has been a field of very extensive research in bioinformatics. For additional approaches see the topGO vignette and the references therein and also in the GeneSetEnrichment view.
A pathway enrichment analysis using reactome
The package ReactomePA offers the possibility to test enrichment of specific pathways using the free, open-source, curated and peer reviewed Reactome pathway database 22, 23. The package requires Entrez identifiers, so we convert our PROBEIDs (transcript cluster identifiers) to Entrez identifiers using the function mapIds from the AnnotationDbi package.
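A sketch of the identifier mapping, reusing the results table from above:

```r
library(hugene10sttranscriptcluster.db)  # attaches AnnotationDbi as well

# Map transcript-cluster IDs (PROBEID) to Entrez identifiers
entrez_ids <- mapIds(hugene10sttranscriptcluster.db,
                     keys = rownames(table_CD),
                     keytype = "PROBEID",
                     column = "ENTREZID")
```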
We can now run the enrichment analysis, which performs a statistical test based on the hypergeometric distribution that is the same as a one–sided Fisher test (which topGO calls “Fisher–classic”). Details can be found in the vignette of the DOSE package 24.
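A hedged sketch of the enrichment call; DE_entrez stands for a hypothetical vector of Entrez IDs of the differentially expressed genes, and the cutoff is an assumption:

```r
library(ReactomePA)

reactome_enrich <- enrichPathway(gene = DE_entrez,
                                 universe = entrez_ids[!is.na(entrez_ids)],
                                 organism = "human",
                                 pvalueCutoff = 0.05,
                                 readable = TRUE)
```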
Note that we trimmed pathway names to 20 characters.
Figure 11.
A graphical representation of the topGO results.
Visualizing the reactome based analysis results
The ReactomePA package offers nice visualization capabilities. The top pathways can be displayed as a bar chart showing all categories with a p–value below the specified cutoff.
The “enrichment map” displays the results of the enrichment analysis as a graph, where the color represents the p–value of the pathway and the edge–thickness is proportional to the number of overlapping genes between two pathways.
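A sketch of both plots:

```r
# Bar chart of the enriched pathways
barplot(reactome_enrich)

# Enrichment map; in current Bioconductor versions emapplot lives in the
# enrichplot package and may require pairwise_termsim(reactome_enrich) first
emapplot(reactome_enrich)
```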
Again, we see pathways related to signalling and immune response.
The package clusterProfiler 25 can also perform these analyses using downloaded KEGG data. Furthermore, the package EnrichmentBrowser 26 additionally offers network–based enrichment analysis of individual pathways. This allows the mapping of the expression data at hand to known regulatory interactions.
Figure 12.
Enriched Reactome pathways and their p–values as a bar chart.
Figure 13.
Enriched Reactome pathways displayed as a graph (enrichment map).
Session information
As the last part of this document, we call the function sessionInfo, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record, as it will help to track down what has happened in case an R script ceases to work or gives different results because the functions have been changed in a newer version of one of your packages. By including it at the bottom of a script, your reports become more reproducible.
The session information should also always be included in any emails to the Bioconductor support site along with all code used in the analysis.
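```r
# Record the R version and package versions used in this session
sessionInfo()
```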
Data and software availability
This article is based on an R markdown file (MA-Workflow.Rmd) which is available as Dataset 1 (Dataset 1. R markdown document to reproduce the results obtained in the article, 10.5256/f1000research.8967.d124759) 31 and will also become available as a Bioconductor workflow. This file allows the reader to reproduce the analysis results obtained in this article. All data analyzed are downloaded from ArrayExpress.
Copyright: © 2016 Klaus B. This work is published under a Creative Commons Attribution 4.0 license (http://creativecommons.org/licenses/by/4.0/).
Abstract
In this article, we walk through an end–to–end Affymetrix microarray differential expression workflow using Bioconductor packages. This workflow is directly applicable to current “Gene” type arrays, e.g. the HuGene or MoGene arrays, but can easily be adapted to similar platforms. The data re–analyzed is a typical clinical microarray data set that compares inflamed and non–inflamed colon tissue in two disease subtypes. We start from the raw data CEL files, show how to import them into a Bioconductor ExpressionSet, perform quality control and normalization, and finally carry out differential gene expression (DE) analysis, followed by some enrichment analysis. As experimental designs can be complex, a self–contained introduction to linear models is also part of the workflow.