EXPRESSION PROFILING — BEST PRACTICES FOR DATA GENERATION AND INTERPRETATION IN CLINICAL TRIALS
The Tumor Analysis Best Practices Working Group. Corresponding author: ehoffman@cnmcresearch.org. Nature Reviews Genetics 5, 229–237 (2004); doi:10.1038/nrg1297
Microarrays are routinely used to assess mRNA transcript levels on a genome-wide scale. As use and acceptance increases, there is intensified focus on appropriate methods of data generation and interpretation, with important questions being asked about the best data analysis methods. The development of such 'best practices' is needed, as microarrays — in particular, Affymetrix oligonucleotide arrays — are becoming increasingly important in human clinical trials, both for differential diagnosis and monitoring of pharmacological efficacy. Here, representatives from high-volume microarray core centres consider the current status of 'best practices', focusing on the broadly used Affymetrix oligonucleotide arrays.
Fig. 1 | Sample processing and microarray interpretation of Affymetrix GeneChips.
Fig. 2 | Dense time-series data with adequate replicates can provide robust visual interpretation of data.
Table 1 | Comparisons of probe-set analysis algorithms
Biological microchips (biochips) or, as they are more often called, DNA microarrays, are among the newest tools of twenty-first-century biology and medicine. Biochips were invented at the end of the 1990s in Russia and the United States. They are now actively manufactured by several American biotechnology companies. Biochips are also produced in Russia, at the Centre for Biological Microchips of the Institute of Molecular Biology of the Russian Academy of Sciences.
Biochips are used for a wide range of purposes. The US Army states that it possesses biochips that can very rapidly detect the presence of pathogenic microbes in the environment. In medicine, biochips make it possible to detect drug-resistant forms of tuberculosis in patients within a matter of hours. Another very important medical application of biochips is the diagnosis of leukaemias and other cancers. Biochips can distinguish outwardly indistinguishable types of leukaemia within days or even hours. In some leukaemias the patient can be treated quickly and effectively with modern drugs; in others it is not even worth trying, and a bone-marrow transplant should be performed straight away. Physicians cannot quickly tell these leukaemias apart, yet the treatment strategy must be chosen correctly from the very beginning. Biochips can also immediately distinguish two forms of breast cancer, one easily treatable and one that responds poorly to treatment, and they are used in the diagnosis of other cancers as well.
Researchers in universities and pharmaceutical companies use chips to analyse the activity of thousands or tens of thousands of genes simultaneously, comparing the expression of these genes in healthy and cancerous cells. Such studies help to create new drugs and to establish quickly which genes the new drugs act on, and how. Biochips are also an indispensable tool for biologists, who can see, in a single experiment, the influence of various factors (drugs, proteins, nutrition) on the activity of tens of thousands of genes.
So what are biochips? The English name DNA microarray describes them most precisely: an organized arrangement of DNA molecules on a special carrier. Professionals call this carrier a 'platform'. The platform is most often a plate of glass or plastic (other materials, such as silicon, are sometimes used). In this sense biological chips are close to electronic chips, which are themselves based on silicon wafers. The organized arrangement occupies a very small area of the platform, from the size of a postage stamp to that of a business card, which is why the word 'micro' appears in the name. The microscopic size of a biochip makes it possible to place an enormous number of different DNA molecules in a small area and to read the information from that area with a fluorescence microscope or a dedicated laser reading device.
Methods of manufacturing biochips also vary. One of the largest biochip producers, Affymetrix, makes its biochips in the same way that electronic chips are made (the company is located in Silicon Valley, California). Affymetrix chips are grown directly on a glass plate by photolithography using special micro-masks. The application of well-established methods from the electronics industry has yielded impressive results: a single chip carries tens (and sometimes hundreds) of thousands of spots a few microns in size, each spot being one unique DNA fragment tens of nucleotides long.
A biochip made in this way is then hybridized with dye-labelled DNA molecules. For example, DNA isolated from healthy cells is compared with DNA isolated from cancer cells; DNA from different patients is also often compared. After hybridization, intricate patterns appear on the biochip. These patterns differ between normal and cancerous cells, and differ strongly between different types of leukaemia: treatable leukaemias give one set of patterns, untreatable ones give quite different patterns. Figure 1 shows how labelled DNA from different patients forms different patterns on a biochip. The disease is the same, but the patterns differ. From the appearance of the patterns, the course of the disease can be predicted with high probability at its earliest stage. In this case, with a type 1 pattern the probability of metastases is zero; with a type 2 pattern it is already 29%; and with type 3 and type 4 patterns it is 75% and 77%, respectively.
Biochips are not made only by photolithography. In another approach, oligonucleotides (relatively short fragments of single-stranded DNA) are synthesized separately and then attached to the biochip. Chips of this type are made by various companies, including, in Moscow, the Institute of Molecular Biology. Biochips made at the IMB can distinguish, in tuberculosis patients, ordinary strains from strains that are resistant to antibiotics. The problem is that in some patients the tuberculosis bacteria are resistant to the antibiotic rifampicin, so the antibiotic does not help to treat the disease; in most patients the bacteria are ordinary (so-called wild-type strains) and the antibiotic works. The resistance of the bacteria to the antibiotic needs to be known at the very start of treatment: if physicians only establish resistance 2-3 months after treatment begins, the patient's lungs will already be badly damaged. Traditional methods for determining the drug resistance of tuberculosis bacteria can take several weeks; biochips solve this task in 1-2 days. Figure 2 shows the different hybridization patterns of two tuberculosis strains, wild-type and rifampicin-resistant, on a Russian biochip.
The figure presents the hybridization images (A, B) and the corresponding diagrams of fluorescence signal intensity (C, D). In A and C, a wild-type sequence was hybridized; in B and D, the sequence contains a mutation leading to the His526>Tyr substitution (indicated by the arrow).
Finally, an example of the newest specialized developments in the biochip field. Specialists at Northwestern University in the United States have developed for the US Army a biochip with quite unexpected properties. If DNA from pathogenic microbes lands on this biochip, probe DNA fragments carrying attached microscopic gold particles line up in a row; a current flows between the electrodes and the biochip signals the threat. A schematic of such a biochip is shown in the figure: the dedicated biochip reports the presence of a bacterial threat once the gold microparticles bridge the two electrodes.
Protein microarrays are also now being developed, but that already belongs to a large new science: proteomics.
Biochips today are a rapidly growing market in which dozens of companies operate. Biochips will form a basis of twenty-first-century biomedicine.
Alexander Krylov, Institute of Molecular Biology, Russian Academy of Sciences
Measuring relative gene expression by using DNA microarrays. Capillary printing is used to array DNA fragments onto a glass slide (upper right). RNA is prepared from the two samples to be compared, and labeled cDNA is prepared by reverse transcription, incorporating either Cy3 (green) or Cy5 (red) (upper left). The two labeled cDNA mixtures are mixed and hybridized to the microarray, and the slide is scanned. In the resulting pseudocolor image, the green Cy3 and red Cy5 signals are overlaid; yellow spots indicate equal intensity for the dyes. With the use of image analysis software, signal intensities are determined for each dye at each element of the array, and the logarithm of the ratio of Cy5 intensity to Cy3 intensity is calculated (center). Positive log(Cy5/Cy3) ratios indicate relative excess of the transcript in the Cy5-labeled sample, and negative log(Cy5/Cy3) ratios indicate relative excess of the transcript in the Cy3-labeled sample. Values near zero indicate equal abundance in the two samples. After several such experiments have been performed, the dataset can be analyzed by cluster analysis (bottom). In this display, red boxes indicate positive log(Cy5/Cy3) values, and green boxes indicate negative log(Cy5/Cy3) values, with intensity representing magnitude of the value. Black boxes indicate log(Cy5/Cy3) values near zero. Hierarchical clustering of genes (vertical axis) and experiments (horizontal axis) has identified a group of coregulated genes (some shown here) and has divided the experiments into distinct classes. (Illustration by J. Boldrick, Stanford University).
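The per-spot arithmetic described in this caption can be sketched in a few lines of Python. This is an illustrative calculation on made-up spot intensities, not part of any scanner or image-analysis software; the gene names and the intensity floor are assumptions.

```python
import math

def log2_ratio(cy5, cy3, floor=1.0):
    """Log2 ratio of Cy5 to Cy3 intensity for one array element.

    Intensities below `floor` are clipped to avoid taking the log of zero.
    Positive values: transcript in relative excess in the Cy5-labelled sample;
    negative values: excess in the Cy3-labelled sample; near zero: equal abundance.
    """
    return math.log2(max(cy5, floor) / max(cy3, floor))

# Hypothetical spot intensities (arbitrary units): (Cy5, Cy3) per gene.
spots = {"geneA": (4000.0, 1000.0),   # red spot: up in the Cy5 sample
         "geneB": (500.0, 2000.0),    # green spot: up in the Cy3 sample
         "geneC": (1500.0, 1500.0)}   # yellow spot: equal abundance

ratios = {gene: log2_ratio(r, g) for gene, (r, g) in spots.items()}
```

A matrix of such log ratios over several experiments is exactly the input that the hierarchical clustering described above operates on.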
Microarrays represent an enormous technological achievement in molecular biology. The introduction of any such innovation is usually followed by a period of optimization and standardization. The latter is a critical part of any maturing technology, as it enables an approach in which advances made in parallel by individual researchers and companies contribute to new knowledge by building on existing standards. Any such standards must be continually re-evaluated; frozen standards can hold back the development of a technology.
Microarray-based mRNA expression profiling can be considered the first mature genome-wide analysis technology, which is reflected in the increased interest in using microarrays as end points in clinical trials. The management of clinical trials, however, requires the development of clear standards for the use and interpretation of microarray data (usually referred to as quality control and standard operating procedures (QC/SOPs) and/or 'best practices'). The Microarray Gene Expression Data (MGED) Society (see online links box), with its MIAME (Minimum Information About A Microarray Experiment) standards (Box 1) and the MAGE-ML mark-up language1,2, provides guidance for the reporting and annotation of microarray data and represents an important step towards this goal. The efforts of this multinational partnership between academia and industry have enabled the development of a database structure that can accommodate many types of microarray data within a single data framework. The ArrayExpress microarray database3 (see online links box) is the first publicly available database that adheres to this universal data-reporting platform, and several prominent journals (such as Nature, Cell, EMBO Journal and The Lancet) require that published microarray data comply with the MIAME standards. In addition, microarray manufacturers such as Affymetrix are implementing MIAME-compliant data output in their new software releases.
The MGED Society is effectively developing guidelines for data reporting, but it is not intended to resolve questions of data generation and interpretation. The latter are much more intimately tied to specific experimental platforms. Of the three most widely used types of microarray (spotted cDNA, spotted oligonucleotide and Affymetrix arrays), each has its own distinct methodology, and the corresponding data-interpretation issues differ accordingly (Box 2). These differences make it difficult or impossible to develop cross-platform guidelines for data generation and interpretation. Best practices for spotted cDNA arrays are particularly problematic, as array production varies considerably from site to site. Moreover, all spotted methods use co-hybridization of a test RNA sample labelled with one FLUOROPHORE and a control RNA labelled with a different colour, compared against the test RNA at the same spot. The result is recorded as a ratio of hybridization signals, which is comparable with that from other experiments only if the same control RNA is used. Development of standards for spotted arrays would therefore require all laboratories to use the same control RNA preparation for data to be comparable.
The production of oligonucleotide arrays (both mechanically spotted and synthesized in situ) has the advantage of being centralized under controlled conditions. Affymetrix PHOTOLITHOGRAPHY-produced arrays have been in use for almost 10 years, whereas mechanically spotted oligonucleotide arrays have appeared on the market only recently. For example, Agilent Technologies (see online links box) has recently released 17,000 60-mer oligonucleotides, each printed five times on glass slides (85,000 FEATURES). Spotted oligonucleotide arrays typically have one spot per gene (a single-probe measurement), whereas Affymetrix arrays rely on multiple measurements: a series of independent or semi-independent oligonucleotides querying each RNA in solution (the probe set) (Box 2). An Affymetrix probe set is constructed from a series of perfect-match and paired-mismatch oligonucleotides, which allows some assessment of non-specific binding and probe performance. Overall, the Affymetrix probe set provides multiple measurements that enable accurate determination of gene expression. The use of multiple perfect-match and mismatch probes for each gene has allowed the development of different methods for interpreting probe-set hybridization patterns and for calculating a single 'expression level' or 'signal' that reflects the relative expression level of the gene. A number of probe-set interpretation algorithms are available for Affymetrix arrays.
The growing use of Affymetrix microarrays and the emergence of this technology as an end point in clinical trials have created a need, in both the pharmaceutical and the academic research communities, for best practices for data generation and analysis. Given the many differences between spotted cDNA, spotted oligonucleotide and Affymetrix arrays, best practices must be developed separately for each experimental platform; this runs counter to suggestions that a single standardization across all platforms is possible (Boxes 1 and 2). The Tumor Analysis Best Practices Working Group (see Box 3) was convened to discuss and develop best practices for Affymetrix microarrays, including QC and SOPs for both data generation and data analysis. The first meeting took place in Santa Clara in March 2003 and was followed by a series of conference calls focused on data-generation and analysis standards for Affymetrix oligonucleotide arrays. The Working Group deliberately concentrated on the platform that is in widespread use and most likely to be used in clinical trials owing to its already standardized manufacturing process. Here we discuss recommendations on experimental design, probe-set analysis algorithms, signal/noise assessment and biostatistical methods.
Experimental design
Appropriate experimental design is a key aspect of all science, and microarray studies are no exception. The relatively high cost of some commercial microarray platforms is often cited as an explanation for suboptimal experimental design, particularly with regard to the number of replicates. Data interpretation will certainly be compromised if the number of replicates is reduced.
Replication in cross-sectional studies. The appropriate number of microarray replicates for any given condition or time point depends on the sources of biological variability in the samples under study. Inter-individual variability is very high in outbred (genetically heterogeneous) humans but very low in inbred mouse strains. For example, expression profiles obtained from the muscles of different mice are no more variable than those from different muscles of the same mouse4. Defining the variance components that contribute to experimental variability, such as within-subject, between-subject, between-group and technical variation (microarray protocol), is necessary for the correct design and statistical power of a study and for determining the number of replicates required. In general, inbred mice require only 3 or 4 animals per group to be tested. We and others have found that 5 or 6 outbred rats per group provide statistically robust results5,6. By contrast, human samples require considerably more individuals per group. Key sources of variability in human samples include tissue heterogeneity, disease stage and inter-individual variability, each of which can be a major confounding variable7.
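The link between variance components and replicate numbers can be made concrete with a textbook normal-approximation sample-size formula for a two-group comparison of a single gene. This back-of-the-envelope sketch is our illustration, not a method from the Working Group; the alpha level, power and the normal approximation are all assumptions.

```python
from math import ceil, erf, sqrt

def z_quantile(p):
    """Standard-normal quantile, found by bisection on the CDF (stdlib only)."""
    def cdf(x):
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def replicates_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Replicates per group to detect a mean expression difference `delta`
    for one gene, given between-subject s.d. `sigma` (normal approximation):
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2, rounded up."""
    z = z_quantile(1.0 - alpha / 2.0) + z_quantile(power)
    return ceil(2.0 * (z * sigma / delta) ** 2)
```

With a between-subject standard deviation of half the effect size this gives 4 replicates per group, rising to 16 when the standard deviation equals the effect size, which is broadly consistent with the contrast between inbred mice and outbred humans described above.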
Replication in longitudinal studies. It has long been recognized that, in human clinical trials, LONGITUDINAL DESIGNS provide considerably greater power at lower numbers of replicates. They best control for inter-individual variation because each subject serves as their own control. Serial blood sampling from single subjects is the least invasive8 (see below for further discussion), and, for example, cancer patients are often longitudinally sampled9. Serial biopsies of other tissues are more invasive: however, a number of serial human muscle biopsy studies of healthy volunteers after different types of exercise training have begun to appear in the literature10,11.
Expression profiling of blood samples (longitudinal or CROSS-SECTIONAL DESIGN) is the protocol that is most likely to be used in human clinical trials. One of the Working Group's goals was to establish SOPs for blood sample collection and RNA isolation in clinical trials. A specific follow-up report of these recommendations will be published elsewhere. Such a protocol must be easily adaptable to multiple trial sites, with relatively little need for resident expertise to carry out the isolation protocol. So far, standard methods for isolating peripheral-blood mononuclear cells have shown the most reproducibility, although others are being tested (see Affymetrix Technical Note in online links box). Cells isolated soon after collection can be flash frozen for storage and subsequent RNA isolation or an RNA stabilizing compound can be added if the samples need to be transported.
Tissue/cell heterogeneity. Tissue heterogeneity is a major confounding variable in most microarray experiments. In inbred mice, tissue heterogeneity is typically normalized by using whole organs. This is rarely possible in human experiments, and particularly not in clinical trials; the limited amount of human tissue that is available exacerbates heterogeneity. The mixed cell populations of peripheral blood can be thought of as a tissue heterogeneity problem similar to that encountered in all solid tissue and tumour biopsies. Indeed, a recent study showed that variation as a result of tissue variability in human muscle biopsies often exceeded inter-individual variability12. One potential solution to the tissue heterogeneity problem lies in bioinformatic methods. If computer software can be trained to recognize the expression profiles of each individual cell type within a mixed tissue sample, then it should be possible to subtract them from each other and to renormalize to obtain a set of cell-specific expression profiles derived from a single mixed profile. This will be most easily done on tumour biopsies, in which the main cells of interest are tumour versus contaminating normal tissue. Although there are no published examples so far, such methods are maturing rapidly.
An experimental alternative to mitigate confounding tissue heterogeneity is to isolate pure cell populations for expression profiling. Many such methods are well developed in the research laboratory, including FLUORESCENCE-ACTIVATED CELL SORTING (FACS)13, NEGATIVE CELL ISOLATIONS from blood (for example, Stem Cell Technologies RosetteSep)14 and LASER CAPTURE MICRODISSECTION15. To research scientists, the profiles that are derived from isolated cell types are a more intuitive approach to define biologically relevant pathways. However, it should be noted that uses of array-based analysis of gene expression approved by the US Food and Drug Administration (FDA) will probably focus on reproducibility and robustness (as well as on predictive accuracy), rather than on biological interpretation or justification. The high-tech methods used to isolate specific cell types from clinical samples are unlikely to make their way into clinical trials unless tissues are procured in a highly centralized way.
Procedural variation. Beyond the usual issues of sampling and accrual, gene-expression data will be subject to many additional sources of error. For example, the surgical removal and processing of tumour tissue can vary considerably from site to site. Laboratory QC procedures in tissue handling, RNA extraction and processing, and variations in protocols for data management and processing will need to be addressed in any clinical trial design. In particular, prolonged tissue ISCHAEMIA prior to processing of surgically RESECTED tissue can significantly alter gene expression16. All tissue samples should be flash frozen within minutes of surgery and stored at -80°C or below. Samples should also be kept in small, airtight containers and kept from drying out during frozen storage by placing fragments of ice in with the sample.
Technical variability
The standard laboratory protocol for generating RNA profiles using Affymetrix microarrays involves a series of steps (Fig. 1).
RNA isolation. RNA quality and quantity are crucial to the success and reproducibility of the expression profiles. Both are generally checked by complementary methods: a UV 260/280 ratio >1.8, and agarose gel electrophoresis or an Agilent Bioanalyzer to visualize clear 18S and 28S ribosomal RNA bands. Total RNA (5–10 μg) is input into the cDNA/cRNA reaction, with an expected corrected yield of biotinylated cRNA of between 4- and 10-fold greater than the total RNA input (so 5 μg of total RNA must yield at least 20 μg of biotinylated cRNA, or the sample is discarded). The biotinylated cRNA should be 500–3,000 base pairs (bp) in size. After fragmentation, the cRNA should be 50–200 bp. The Working Group recommends that samples that do not meet these criteria be discarded.
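The acceptance thresholds above lend themselves to a simple programmatic check. The following is a hypothetical sketch of such a QC gate; the function name, argument layout and the decision to report failures as a list of messages are our own choices, not part of any published SOP.

```python
def rna_qc(ratio_260_280, total_rna_ug, crna_yield_ug,
           crna_size_bp, fragmented_size_bp):
    """Check one RNA sample against the thresholds described in the text.

    Returns a list of failure messages; an empty list means the sample passes.
    Criteria: UV 260/280 ratio > 1.8; biotinylated cRNA yield at least 4-fold
    the total RNA input (4- to 10-fold is the expected corrected range);
    cRNA of 500-3,000 bp; 50-200 bp after fragmentation.
    """
    failures = []
    if ratio_260_280 <= 1.8:
        failures.append("UV 260/280 ratio must exceed 1.8")
    fold = crna_yield_ug / total_rna_ug
    if fold < 4.0:  # e.g. 5 ug of total RNA must yield at least 20 ug of cRNA
        failures.append("cRNA yield below 4-fold of total RNA input")
    if not 500 <= crna_size_bp <= 3000:
        failures.append("biotinylated cRNA should be 500-3,000 bp")
    if not 50 <= fragmented_size_bp <= 200:
        failures.append("fragmented cRNA should be 50-200 bp")
    return failures
```

A sample failing any check would be discarded rather than hybridized, per the Working Group recommendation.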
If RNA amount is limiting — as is the case, for example, with laser capture microscopy samples, flow-sorted cell samples or small tissue samples — a two-round amplification protocol can be used. For example, 200 ng of total RNA is processed for in vitro transcription (IVT), with the same goal of 4–10-fold amplification (>800 ng of cRNA output). One hundred nanograms of this cRNA is then reverse transcribed into cDNA using random primers, after which a second IVT is done. The second round IVT should result in a 400-fold amplification.
Microarray controls. Hybridization controls include visualization of the image so that any abnormalities in hybridization patterns can be detected. ProbeProfiler from Corimbia Inc. is a program with extended capabilities for detecting defects in microarray manufacture. Affymetrix MAS 5.0 software adjusts the microarray-scanned image to a common target intensity by using a scaling factor. In addition, a general index of chip background and noise is represented by the percentage of 'present calls' (probe sets for which the hybridization to the perfect-match probes is significantly higher than mismatch hybridization). The Working Group believes that both the scaling factor and the percentage of present calls are important QC criteria. Considering MAS 5.0 chip analyses, the scaling factors to normalize chips within a project should lie within two standard deviations of the mean, with present calls being greater than 25% (Box 4). The percentage of present calls is often lower when B or C arrays that contain higher proportions of more poorly characterized transcript units (expressed sequence tags or computer-predicted open reading frames) are used. The percentage of present calls across a set of samples should be consistent, within a range of 10%. Some software packages allow the identification of statistical 'outlier' microarrays in a group of microarrays in a given project, which additionally enables the experimenter to flag and exclude specific microarrays that are not acceptable for an analysis. In addition to these criteria, acceptable hybridizations must have adequately intact input RNA as shown by 3' to 5' ratios of hybridization within probe sets. A typical control is the glyceraldehyde 3-phosphate dehydrogenase (GAPDH) gene, which should have 3' to 5' ratios of less than 3 (Box 4).
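The project-level criteria just described (scaling factor within two standard deviations of the project mean, present calls above 25%, GAPDH 3'/5' ratio below 3) could be applied as in this hypothetical sketch. The data layout, function name and failure messages are assumptions for illustration; the threshold values come from the text.

```python
from statistics import mean, stdev

def flag_arrays(chips, min_present=25.0, max_gapdh_ratio=3.0):
    """Flag microarrays failing the project-level QC criteria in the text.

    `chips` maps an array name to a tuple of (scaling factor,
    per cent present calls, GAPDH 3'/5' ratio). Requires at least
    two arrays so that a standard deviation can be computed.
    Returns a dict of flagged array names with the reasons.
    """
    sfs = [sf for sf, _, _ in chips.values()]
    m, sd = mean(sfs), stdev(sfs)
    flagged = {}
    for name, (sf, present, gapdh) in chips.items():
        reasons = []
        if abs(sf - m) > 2 * sd:
            reasons.append("scaling factor more than 2 s.d. from project mean")
        if present <= min_present:
            reasons.append("per cent present calls too low")
        if gapdh >= max_gapdh_ratio:
            reasons.append("GAPDH 3'/5' ratio too high")
        if reasons:
            flagged[name] = reasons
    return flagged
```

In practice such flags would prompt visual inspection of the chip image before a decision to exclude the array from analysis.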
The QC criteria provided above are based on MAS 5.0 probe-set algorithms and data analyses. The measures of present calls and scaling factors are useful and serve as initial summary measures of the performance of a particular microarray. However, more focused statistical methods, coupled with routine visual inspection of images, hold promise for the continuing improvement of data quality and screening abilities.
Large-scale analyses of microarray data across laboratories have not yet been reported. However, the Working Group feels that adherence to the above QC criteria, using standard RNA isolation and processing methods, should yield data that are consistent between laboratories and intrinsically comparable. The same set of criteria can also be used as best practices for data generation in the design and conduct of clinical trials.
Standard clinical laboratory practice is to develop programmes for submission of known samples to different laboratories and assessment of comparability of results. Such programmes are under development within larger collaborative efforts, such as the National Heart, Lung and Blood Institute (NHLBI) Programs in Genomic Applications (see the HOPGENE Program for Genomic Applications in online links box) and the National Institute of General Medical Sciences (NIGMS) Glue Grant (see online links box).
Data analysis and interpretation
Signal generation versus statistical analyses. Two relatively distinct steps underlie all data analyses of Affymetrix oligonucleotide microarrays: the development of a normalized 'signal' for each transcript on each microarray and the subsequent statistical analysis of differences in signals between different arrays. The first step involves probe-set algorithms that use all, or part, of the component signals within a probe set and then derive a single signal that is representative of the relative abundance of each mRNA queried in each array. The second step is the application of bioinformatic and statistical methods to identify interesting subsets of the assembled data of all arrays within a project. There is considerable debate about the best methods for both of these steps (see below for a discussion). Although the two steps are separable, it is clear that they have a marked influence on each other. It is in this realm that the bioinformatics of microarrays becomes avant-garde, and with the ground-breaking nature of research comes considerable debate as to what is appropriate in any specific situation.
Before discussing the different methods for probe-set analysis and data interpretation, it is important to point out that much of the debate in the field of bioinformatics about microarray interpretation revolves around signal/noise ratios. A common assumption is that signal/noise ratios across a microarray are homogeneous, or at least similar in magnitude. This might be true for general background hybridization, but not for the performance of probe sets. In any particular microarray, there are probe sets that give very strong and clear hybridization patterns and those that perform poorly. Many of the best performing probe sets (those with a highly significant probe-set detection p value) reflect highly expressed transcripts with no closely related sequences that might cross-hybridize. Low-level transcripts, or transcripts that belong to gene families with highly homologous sequences derived from distinct genes, often have corresponding probe sets that do not perform as well and might have a significant, if not overwhelming, noise component. The signal from such probe sets is difficult to interpret, and data interpretation can be limited to only the best performing probe sets, although arguably the most interesting data comes from the genes that are expressed at low levels but that still show significant differences between samples.
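Restricting an analysis to the best performing probe sets, as described above, might be implemented as a simple filter on detection p values and signal levels. The cut-offs below are illustrative assumptions, not recommendations from the Working Group, and, as noted, such filtering discards the low-level transcripts that are often of greatest biological interest.

```python
def filter_probe_sets(signals, detection_p, max_p=0.05, min_signal=100.0):
    """Keep probe sets whose detection p value and signal level suggest a
    well-performing measurement.

    `signals` and `detection_p` map probe-set IDs to a MAS 5.0-style signal
    and detection p value; probe sets missing a p value are treated as
    undetected. Returns the surviving subset of `signals`.
    """
    return {ps: sig for ps, sig in signals.items()
            if detection_p.get(ps, 1.0) <= max_p and sig >= min_signal}
```

A project-specific analysis would tune `max_p` and `min_signal` to the signal/noise optimum of that dataset, in the spirit of the per-project optimization discussed below.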
Determining adequate sensitivity of the signals and signal/noise responses relative to the absolute quantity of mRNA in clinical samples is crucial as microarrays become a component of clinical trials and diagnostic models. Affymetrix arrays provide a concentration of each mRNA queried relative to the genome-wide mRNA profile of the sample; it is assumed that the global mRNA content of a tissue as a whole does not change significantly, making relative mRNA quantification an accurate reflection of the response of the individual gene. This method differs from absolute quantification of specific mRNAs (such as S1 NUCLEASE PROTECTION and REAL-TIME PCR), or the isolated transcript ratio determined by co-hybridization of two samples to spotted cDNA or oligonucleotide arrays (Box 2). Affymetrix arrays achieve considerable sensitivity through the inherent redundancy of the probe set; however, the Working Group acknowledged that some genes, such as some cytokines that are functional at very low expression levels, are probably below the limit of detection.
The Working Group agreed that each project will have its own signal/noise optimum, and analysis methods that prove best for one project might prove unsuitable for another. Ideally, a signal/noise ratio should be optimized for each project or trial using different probe-set algorithms and data-filtering methods, and some systematic efforts towards this end are beginning to appear in the literature17.
After a signal is derived for each probe set, data is interpreted using statistical and visualization methods. All statistical methods run into two generic, inter-related problems when faced with microarray data. The first is the curse of dimensionality: each gene is potentially related to every other gene, so all permutations of all available data must be considered, leading to an exponentially increasing number of possible associations in multidimensional space. The problem arises when associations (samples) become lost as the dimensionality increases; associations lose their local value and become generically global in statistical terms. Statistical models attempt to circumvent this curse by requesting larger and larger sample sizes, but fulfilling the requests becomes functionally impossible for the experimentalist. There is no easy answer to these problems and they remain a challenge for future bioinformatics research that uses microarrays18.
Derivation of signal: probe-set algorithms and normalizations. One of the key advantages of the Affymetrix platform is the multiple measurements that are intrinsic to the probe set: most probe sets include 11 perfect-match and 11 paired-mismatch 25-bp oligonucleotides per gene (Fig. 1). Previous versions of GeneChip arrays used probe-set design methods that led to considerable overlap between probes, so that hybridizations to each feature/probe were not independent measurements; this led to considerable uncontrolled weighting of the contribution of any particular region of sequence to the resulting signal. All recent chips use a much more refined probe-set design with less overlap and considerably better performance of the probe set. Improvements in array and probe-set designs have been accompanied by an evolution in primary analysis algorithms and the supporting software provided by Affymetrix for data analysis and interpretation19. Affymetrix default algorithms are based on well-documented statistical methods, namely the robust TUKEY'S BI-WEIGHT ESTIMATOR and WILCOXON'S SIGNED RANK, to calculate the final probe-set signals and associated p values, respectively19,20.
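A simplified sketch of the Tukey biweight step may help make the probe-set signal concrete. Real MAS 5.0 computes the biweight over log2 of perfect-match minus an 'ideal mismatch' with several refinements; clipping non-positive PM minus MM differences to a floor, and the constants used, are simplifications for illustration.

```python
from math import log2
from statistics import median

def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight mean: a robust average that down-weights
    values far from the median, measured in units of the median absolute
    deviation (MAD). Values beyond c MADs receive zero weight."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    weights = []
    for v in values:
        u = (v - m) / (c * mad + epsilon)
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def probe_set_signal(pm, mm, floor=1.0):
    """Simplified MAS 5.0-style signal: Tukey biweight of log2(PM - MM)
    across the probe pairs of one probe set, clipping non-positive
    differences to `floor`. A sketch, not the production algorithm."""
    diffs = [log2(max(p - q, floor)) for p, q in zip(pm, mm)]
    return tukey_biweight(diffs)
```

The robustness matters because one or two deviant probe pairs (cross-hybridization, a scratch on the chip) would otherwise dominate the mean of only 11 measurements.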
Affymetrix has announced plans to continue to improve the software components of the GeneChip platform. The upcoming release of the GeneChip Operating System (GCOS) is expected to incorporate refinements in the user interface, data management and analysis algorithms. Software tools aside, the most significant development on the analysis front is arguably the decision by Affymetrix to release previously proprietary chip-design details, such as probe sequences, chip-design parameters and file APIs (applications programming interfaces). The goal is to encourage scientists to develop innovative analysis tools that can potentially derive more biological value from GeneChip expression data. The challenge of providing a constantly growing and evolving body of information associated with arrays has been solved in part with a web-based tool. The company's NETAFFX web site (see online links box) serves as the public portal for detailed information on chip design and has become a valuable resource for biological follow-up of GeneChip expression results. Third-party software developers can find additional support, including information on file APIs, through the Affymetrix Developers' Network (see online links box).
Encouraged in part by the openness of the platform and spawned by an increase in knowledge and experience in array data analysis, scientists are developing a number of alternative algorithms for probe-set analysis, with the goal of deriving the best signal that is representative of the mRNA level for each gene. As each signal is relative to other signals in the experiment (both between arrays for the same gene and relative to all other genes on the array), the process of normalization is intimately tied to the derivation of signal. The more commonly used alternative probe-set analysis algorithms include dCHIP20, RMA21 and ProbeProfiler (Table 1).
It is outside the scope of this article to discuss the nature of the different probe-set interpretation and normalization algorithms in depth, and the reader is referred elsewhere22. The algorithms differ in a number of important ways (Table 1). First, the PENALTY WEIGHT that is assigned to the mismatch probe varies — MAS 5.0 assigns a relatively heavy penalty for cross-hybridization to the mismatch probe, RMA assigns no weight and dCHIP gives the choice of providing weight or no weight. Second, the ability to discard outlier signals varies from package to package, with dCHIP and ProbeProfiler having refined methods to detect outliers at each level of analysis (probe, probe set and microarray). These packages are able to replace deviant probes with expected data based on the remainder of the probe set, and/or flag abnormal probe sets and arrays for possible exclusion from further data analysis. Third, the method of normalization varies from within a single array (MAS 5.0) to a project-based normalization (dCHIP, RMA and ProbeProfiler). Finally, MAS 5.0 provides a detection p value, in which a number is assigned to the confidence of the signal in question. This can be used to weight different probe-set signals in subsequent data interpretations.
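To illustrate the project-based normalization used by RMA-style pipelines, the following is a minimal sketch of quantile normalization, which forces every array onto a common reference distribution; it omits the background correction and tie handling of the real implementations:

```python
import numpy as np

def quantile_normalize(matrix):
    """Project-wide quantile normalization (one column per array).

    Every column is mapped onto the same empirical distribution: the
    mean of the per-array sorted intensities. A sketch of the idea, not
    the RMA implementation.
    """
    matrix = np.asarray(matrix, dtype=float)
    order = np.argsort(matrix, axis=0)        # per-array sort order
    ranks = np.argsort(order, axis=0)         # rank of each probe within its array
    # Reference distribution: average the sorted columns across arrays.
    mean_sorted = np.sort(matrix, axis=0).mean(axis=1)
    return mean_sorted[ranks]

# Four probes measured on three arrays (rows = probes, columns = arrays):
arrays = np.array([[5.0, 4.0, 3.0],
                   [2.0, 2.0, 1.0],
                   [3.0, 4.0, 6.0],
                   [4.0, 1.0, 2.0]])
normalized = quantile_normalize(arrays)
```

After normalization, each array's sorted intensities are identical, so between-array comparisons reflect rank changes rather than global intensity shifts.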
The output of all packages is a normalized signal (with or without an associated detection p value) for each probe set on each array. These signals are then fed into data interpretation packages for statistical analyses and data visualizations.
Different probe-set interpretation algorithms lead to different results. In their own work, members of the Working Group often encounter only ~50% concordance between the outputs of two different algorithms. However, it is crucial to note that the large majority of discordant data lies in regions of relatively poor signal/noise ratios, and concordance deteriorates in experiments with high levels of confounding noise. In general, the programmes that put less weight on the mismatch show better sensitivity (linearity) when signals are noisy (Table 1). However, this increased sensitivity can come at the cost of substantial contaminating noise17.
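The ~50% figure can be made precise with a simple set-overlap measure on the significant-gene lists produced by two algorithms; the gene lists below are hypothetical:

```python
def concordance(list_a, list_b):
    """Fraction of genes called by either algorithm that both call
    (Jaccard overlap of the two significant-gene lists)."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b)

# Hypothetical significant-gene calls from two probe-set algorithms:
mas5_hits = {"GOT1", "MYOD1", "ACTB", "IL6", "TP53", "FOS"}
dchip_hits = {"GOT1", "MYOD1", "IL6", "JUN", "EGR1", "FOS"}
agreement = concordance(mas5_hits, dchip_hits)
```

With these example lists the overlap is 0.5, matching the level of agreement the Working Group typically observes.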
The Working Group recommends using at least two probe-set algorithms for comparison and prioritization of gene selection (for example, MAS 5.0 and the dCHIP difference model).
Data interpretation. Most published microarray papers could be considered data-poor in terms of replicates and systematic statistical analyses, but data-rich in terms of both the amount of high-quality data generated and significant research findings. Below, we point out the most appropriate current bioinformatics methods, as well as additional methods that require further development so that data can be more fully mined for their information content.
A second general backdrop to the following discussion is that data visualization is one of the most powerful data-interpretation tools, yet it rarely obeys statistical principles. The resolution of the human eye, coupled with the abstract computational power of the human brain, lies behind the popularity of hierarchical clustering and other visualization methods that are not statistically principled. However, the eye and brain are poorly suited to spontaneously deriving statistical support.
There are two general types of experimental design that lend themselves to different types of statistical and visual analysis: the cross-sectional study and the TIME-SERIES STUDY. The cross-sectional study typically has gene or pattern selection as the goal: the identification of one or more genes or patterns of expression that are diagnostic of the condition or state under study. This 'gene selection' might be for truly diagnostic purposes (for example, differential diagnosis of leukaemia), or might be intended to identify relevant biochemical pathways. In both cases, the gene or pattern selection must be robust, usually implying a statistically principled approach, with subsequent validation by predictive computer modelling (internal cross-validation) or, preferably, prospective validation on new data.
Feature selection can be the main limiting factor in evaluating the predictive performance of an analysis method when there are many predictors to select from. This has been a 'mantra' for some of the senior statisticians involved in predictive modelling with gene-expression array data for several years, but only now are users and developers of predictive models from non-statistical backgrounds beginning to appreciate these issues. Proper validation of any model or algorithm that relies on explicit feature selection (such as choosing a subset of 70 genes from 20,000) must ensure that the analysis is tested by internal cross-validation that includes feature re-selection as part of the validation23,24. The Working Group acknowledged that prospective validation of any findings using new data is the acid test of predictive performance. The focus on feature or gene selection is vitally important when microarrays are used for differential diagnosis, and has been best studied in cancer biopsy/tissue studies.
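The key point, that feature selection must be repeated inside every cross-validation fold, can be sketched as follows, using synthetic data and a deliberately simple nearest-centroid classifier; the gene counts, effect sizes and selection score are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expression matrix: 30 samples x 500 genes, two classes,
# with only the first 10 genes truly differential.
n, p = 30, 500
labels = np.array([0] * 15 + [1] * 15)
data = rng.normal(size=(n, p))
data[labels == 1, :10] += 2.0

def select_genes(train_x, train_y, k=10):
    """Rank genes by a simple two-sample t-like score; keep the top k."""
    a, b = train_x[train_y == 0], train_x[train_y == 1]
    score = np.abs(a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0) + 1e-8)
    return np.argsort(score)[-k:]

def nearest_centroid(train_x, train_y, test_x):
    c0 = train_x[train_y == 0].mean(0)
    c1 = train_x[train_y == 1].mean(0)
    return 0 if np.linalg.norm(test_x - c0) < np.linalg.norm(test_x - c1) else 1

# Leave-one-out CV with gene selection REPEATED inside every fold, so the
# held-out sample never influences which genes the classifier sees.
correct = 0
for i in range(n):
    train = np.delete(np.arange(n), i)
    genes = select_genes(data[train], labels[train])  # re-selection per fold
    pred = nearest_centroid(data[train][:, genes], labels[train],
                            data[i, genes])
    correct += int(pred == labels[i])
accuracy = correct / n
```

Selecting the genes once on all 30 samples and then cross-validating only the classifier would leak information from the held-out sample and inflate the apparent accuracy, which is precisely the error that references 23,24 warn against.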
An increasing proportion of microarray studies focus on delineation of biochemical pathways that are modulated in response to some stimulus. In practice, these studies typically use feature selection to identify potential pathways that are involved in the response of the cells or tissues. Validation is then done on the identified biochemical pathways of interest, using mRNA (real-time PCR) or protein studies, often proving cause and effect in experimental models.
The Working Group notes that robust feature selection for the purpose of diagnosis and molecular markers in clinical trials requires robust statistical methods, as outlined below, and the burden of proof lies with statistical validation. For microarray experiments designed to delineate biochemical pathways, feature selection is used for generating a hypothesis and the burden of proof of the hypothesis lies with laboratory-based research, often at the protein level.
For feature selection, the Working Group recommends that users experiment with various statistical methods (such as standard parametric tests, nonparametric methods, false discovery rate and related methods25, global or local shrinkage of raw signal intensities and Stanford's 'nearest shrunken centroids'26). Developments related to SURVIVAL DATA ANALYSIS are receiving increased attention, because clinical trials will increasingly demand such analyses. As a corollary, analysis methods that focus on signatures of groups of genes (such as averages of clusters, Duke's metagenes27-29 and Stanford's eigengenes30) seem worth stressing in predictive contexts. Whatever the specific statistical model that is applied for prediction, using aggregate gene expression has important consequences: measures of aggregation of expression over a group of genes with related profiles can reduce dimension (thereby mitigating the curse of dimensionality). This can reduce multiplicities and, to some degree, ease the problems of gene selection, multiple testing and co-linearity, while improving signal estimation by averaging correlated noise components.
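As one example of these methods, the Benjamini-Hochberg false-discovery-rate procedure can be sketched as follows; the p values are illustrative:

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: boolean mask of the p values
    declared significant at false discovery rate `alpha`.

    A sketch of the standard procedure as applied to per-gene tests.
    """
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Rank-dependent thresholds: alpha * (rank / total tests).
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    # Reject every hypothesis up to the largest rank that clears its threshold.
    cutoff = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:cutoff]] = True
    return mask

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.5, 0.8, 0.9]
significant = benjamini_hochberg(pvals, alpha=0.05)
```

Unlike a Bonferroni correction, which controls the chance of even one false positive, this controls the expected fraction of false positives among the genes called significant, which is usually the more useful guarantee when screening thousands of genes.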
Data visualizations, time series and candidate genes. The above discussions of biostatistics all assume that the analysis is targeted towards a cross-sectional study, in which the primary goal is diagnostic gene discovery (gene or feature selection). In other words, a series of microarrays with a very large number of transcripts is used to define the very small minority of genes that are correlated with, and therefore predictive of, the biological variable of interest. There are alternatives to this standard experimental design that use entirely different types of analysis, and the statistical issues are also quite different, as explained below.
The time-series study, if done with enough time points, can provide an effective antidote to the curse of dimensionality — the behaviour of any gene during a time-series study should make biological sense, such that each signal is relatively easily discerned from noise. A visual query of a large time-series data set for a single gene's response to the controlled variable either meets expectations, and is therefore considered valid, or does not, and is discarded as uninteresting. As an example, we show a time-series study in which rats are given a bolus of methylprednisolone, after which their liver and muscle are studied as a function of time (Fig. 2). In this case, the same gene (got1) is queried using a web-based dynamic visualization tool, first in liver (Fig. 2a) and then in muscle (Fig. 2b). The data in the top panel are visually compelling; got1 in liver responds quickly and strongly to a bolus of methylprednisolone, with relatively consistent replicates (each data point comes from a different animal) and a time course that is visually reassuring, so that complex statistical tests of the transcriptional response as a function of time are not needed. By contrast, the same gene in muscle does not seem to respond to the drug6,31 (Fig. 2b). Through such gene queries, the variability in replicates and the plausibility of the gene's behaviour as a function of time can quickly be assessed. Another advantage of time-series data is that such profiles act as biomarkers that are amenable to analysis and interpretation using pharmacodynamic models that predict the underlying mechanisms of control of gene expression32.
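A crude version of this visual screen (does the gene's mean signal at any time point depart from baseline by more than the replicate scatter?) can be sketched as follows. The signals below are hypothetical stand-ins for liver-like responder and muscle-like non-responder profiles, not the published data of Fig. 2:

```python
import numpy as np

def responds(replicates, fold=2.0):
    """Flag a gene as responsive if its mean signal at any time point
    departs from baseline by more than `fold` x the replicate scatter.

    `replicates` is a (time point x animal) matrix. An illustrative
    screen, not the pharmacodynamic modelling cited in the text.
    """
    replicates = np.asarray(replicates, dtype=float)
    baseline = replicates[0].mean()                 # first time point = pre-dose
    spread = replicates.std(axis=1).mean() + 1e-8   # typical replicate scatter
    departure = np.abs(replicates.mean(axis=1) - baseline)
    return bool(np.any(departure > fold * spread))

# Five time points, three animals per point:
liver = [[1.0, 1.1, 0.9], [3.0, 3.2, 2.9], [4.1, 3.9, 4.0],
         [2.5, 2.4, 2.6], [1.2, 1.0, 1.1]]
muscle = [[1.0, 1.1, 0.9], [1.1, 0.9, 1.0], [1.0, 1.2, 0.8],
          [0.9, 1.0, 1.1], [1.0, 1.0, 1.1]]
```

The liver-like profile clears the threshold at several time points, while the flat muscle-like profile never does, mirroring the visual judgement described above.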
The Working Group agreed that data-rich, time-series experimental designs provide some latitude in reporting significant findings and that the query of individual genes within large data sets can circumvent complex issues of multidimensionality of data.
Future areas of development
The data-rich and highly dimensional nature of microarray data serves as a model for the future dissection and understanding of biological systems in general, including proteomics and the integration of mRNA profiling with proteomics. The Working Group discussed data analysis needs within the microarray community and agreed that, along with the incorporation of QC, SOPs and optimized or customized signal/noise analyses in initial project signal generation, the back-end statistics need to converge on a commonly accepted method of dealing with the curse of dimensionality before microarrays can be reliably used in clinical trials. Statisticians need to focus more on the representation of prediction results in terms of probabilities and associated measures of uncertainty, and to reach a consensus on what is acceptable. In the meantime, it is likely that specific marker or diagnostic genes will be extracted from pilot profiling studies, and that only this small subset of genes will then be used as a clinical-trial endpoint. This data-limitation approach removes much of the curse of dimensionality, but is liable to ignore the large majority of the data, thereby decreasing the potential power of the study and bringing into question the use of microarrays in clinical trials.
A move towards the standardization of reporting of prediction accuracy would be desirable when assessing predictive accuracy through within-sample cross-validation. The Working Group suggests that one or more validation techniques be used when reporting predictive genes: leave-one-out and 10% cross-validation summaries, or true validation data sets. Communicating uncertainty about predictive performance is also key and will help evaluate results based on varying sample sizes. The Working Group suggests that until this information is routinely presented in published papers, it will be difficult to reach an acceptable consensus for use in clinical trials.
Conclusions
There are four key areas of optimization and standardization that are largely independent: study design, technical variability (QC/SOP of data generation), analysis method variation (signal/noise optimization using probe-set algorithms and normalizations) and back-end statistical analyses. The statistics of clinical-trial design is crucial: gene-expression data do not obviate the need for sound and relevant design and analysis, nor do they challenge what we know about design. The field is quickly maturing from small-chip-number, hit-and-run projects to those with a more robust study design. However, study design ultimately depends on appropriate powering of a study, which is greatly affected by both the chip-analysis algorithms that are used and the biostatistical data analysis.
Development of back-end statistical methods for data representation/summary and for high-level analysis remains an active area of research for both academic and commercial users, and is likely to remain so in the near future. We are some way from defining standards of summary signal intensities alone and even further from considerations of standardization of analytical methods for inference and prediction in clinical contexts. In regulated clinical studies, such standards will be enforced partly by the US FDA as submissions of medical test/device protocols emerge and increase in number. Even then, however, many approaches to data analysis and modelling will be used and developed, which is, of course, to be supported. It is very difficult to influence the research community, especially when the variety of problems that are encountered promotes the need for refined and new approaches.
Boxes
Box 1 | The MIAME guidelines for data reporting
The Microarray Gene Expression Data Society (MGED) is an international discussion group of microarray experts, with the primary goal of developing methods for data sharing between experimental platforms. The main output of this group has been the Minimum Information About A Microarray Experiment (MIAME) guidelines for microarray data annotation and reporting. The guidelines have been adopted by a number of scientific journals and have recently been endorsed for use by the US Food and Drug Administration and the US Department of Agriculture for pharmacogenomics projects.
The MIAME guidelines include descriptions of experimental design (number of replicates, nature of biological variables), samples used, extract preparation and labelling, hybridization procedures and parameters, and measurement data and specifications. These guidelines have been most important for the spotted cDNA and oligonucleotide experimental platforms (see Box 2) in which the flexibility in microarray design and utilization also leads to considerable variation in array data generation and reporting between different laboratories. The guidelines do not attempt to dictate how experiments should be done, but rather provide adequate information associated with any published or publicly available experiment so that the experiment can be reproduced.
Box 2 | Microarray experimental platforms
There are three different types of microarray in common use: spotted cDNAs, spotted oligonucleotides and Affymetrix arrays.
Spotted cDNA arrays
Spotted cDNA arrays typically use sets of plasmids of specific cDNAs in gridded liquid aliquots. The inserts of each clone are amplified by PCR, and a few picolitres are physically spotted onto glass slides by liquid-handling robots. Robotic spotters can deposit 100,000 spots per slide, and duplicate sets of clones are often spotted. The advantages of spotted cDNA arrays are that the content of each microarray is determined by the researcher, with complete flexibility in the number and type of cDNA clones spotted. Also, the cost per array is relatively low, as the clone sets are a PCR-renewable resource and the glass slides are themselves inexpensive. The amount of RNA that corresponds to each spot is determined relative to a second control RNA solution that is hybridized to the same spot, and a ratio is obtained.
Disadvantages of spotted cDNA arrays include the variable amount of DNA spotted in each spot, the 10–20% 'drop out' rate of failed PCR reactions or failed spots and mis-identification of clones (that is, the spot is not what you think it is). Also, there is no control over the actual sequence of the clone. As many gene-coding sequences contain regions of sequence that are shared with other genes, there are questions of specificity of the hybridization to the relatively large cDNA inserts. Spotted cDNA arrays were embraced by most academic centres, owing to their flexibility and relatively low cost.
Spotted oligonucleotide arrays
These arrays are also built by liquid handling on glass slides; however, the input solution is a synthetic oligonucleotide (often 60–70-mers). The resulting spotted material is typically of known concentration, of known sequence and is single stranded (all advantages relative to spotted cDNAs). Most of the process can be automated, leading to less sample mix-up and less drop-out of samples.
Disadvantages of spotted oligonucleotides include the relatively high cost of synthesizing large numbers of large oligonucleotides and the non-renewable nature of the resource. Spotted oligonucleotide arrays are becoming increasingly available.
Affymetrix GeneChips
These microarrays are factory designed and synthesized. Design is done using software to choose a series of 11 25-mer probes from the 3' end of each transcript or predicted transcript in the genome; each of the 11 probes is then paired with a similar mismatch probe that is designed to contain a mutation in the centre. The latter serves as a form of control for hybridization specificity. Synthesis of arrays is done using light-activated chemistry and photolithography methods, and feature size can be reduced to approximately 8 μm2, with about 1 million probes in a 1.2 cm2 glass area. Probe-set algorithms interpret the signals from each 22-oligonucleotide probe set, and derive a single value (signal) from the patterns of hybridization to the 22 individual probes. This signal is then normalized to the entire microarray, or to the probe sets across an entire project.
For a more general discussion of normalization and analysis methods of different microarray platforms, the reader is referred to the excellent web information resource of the MGED group (see The MGED Data Transformation and Normalization Working Group in online links box).
Box 3 | The Tumor Analysis Best Practices Working Group
The Tumor Analysis Best Practices Working Group is a group of investigators who study best practices for tumour analysis in human clinical trials. The following authors are members of the Group:
Eric P. Hoffman is at the Research Center for Genetic Medicine, Children's National Medical Center, Washington DC 20010, USA. e-mail: ehoffman@cnmcresearch.org
Avrum Spira is at The Pulmonary Center, Boston University Medical Center and the Bioinformatics Program, Boston University, Boston, Massachusetts 02118, USA. e-mail: aspira@lung.bumc.bu.edu
George Wright is at the Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA. e-mail: wrightge@mail.nih.gov
Jonathan Buckley and Tim Triche are at the Children's Hospital Los Angeles, University of Southern California, Los Angeles, California 90089, USA. e-mail: buckley@hsc.usc.edu; triche@hsc.usc.edu
Wendell Jones is at Expression Analysis Inc., Durham, North Carolina 27713, USA. e-mail: wjones@expressionanalysis.com
Ron Tompkins is at Harvard University, Boston, Massachusetts 02115, USA. e-mail: rtompkins@partners.org
Mike West is at the Institute of Statistics and Decision Sciences, Duke University, Durham, North Carolina 27708, USA. e-mail: mw@stat.duke.edu
Box 4 | Quality Control metrics for Affymetrix microarrays
RNA quality
Optical density 260/280 of 1.8–2.1 | Agilent Bio-Analyzer | Gel electrophoresis
cDNA/cRNA efficiency
>4-fold amplification from total RNA | 500–3,000 bp prior to fragmentation | 50–200 bp after fragmentation
Chip hybridization
Image inspection for defects | Scaling factors within two standard deviations within a project | MAS 5.0 present calls >25% for the A-SERIES ARRAYS, including the HG-U133Plus 2.0 array | Percentage present calls for the B- AND C-SERIES ARRAYS are typically lower | 3'/5' GAPDH ratios <3
Project normalization
The detection of statistical outliers for chips, probe sets or individual probe pairs requires normalization and analysis across an entire project. This is afforded by dCHIP, ProbeProfiler and other software packages. Data-analysis packages that rely on intra-chip normalization and scaling typically do not enable the detection of statistical outliers.
Chip outliers
Probe-set outliers | Probe-pair outliers | Range in present calls <10%
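The chip-hybridization thresholds in this box can be collected into a simple pass/fail check; the field names and example metrics below are hypothetical, and a real pipeline would read these values from the Affymetrix report files:

```python
def chip_passes_qc(metrics, project_scaling_mean, project_scaling_sd):
    """Apply the Box 4 chip-hybridization thresholds to one array's
    summary metrics (illustrative field names)."""
    checks = {
        # Scaling factor within two standard deviations of the project mean:
        "scaling": abs(metrics["scaling_factor"] - project_scaling_mean)
                   <= 2 * project_scaling_sd,
        # MAS 5.0 present calls above 25% (A-series arrays):
        "present": metrics["percent_present"] > 25.0,
        # 3'/5' GAPDH ratio below 3 indicates intact, well-labelled RNA:
        "gapdh": metrics["gapdh_3p_5p_ratio"] < 3.0,
    }
    return all(checks.values()), checks

good_chip = {"scaling_factor": 1.8, "percent_present": 42.0,
             "gapdh_3p_5p_ratio": 1.4}
degraded_chip = {"scaling_factor": 6.5, "percent_present": 18.0,
                 "gapdh_3p_5p_ratio": 4.7}
```

Returning the per-check results alongside the overall verdict makes it easy to report which metric failed, which is more useful for troubleshooting a project than a bare pass/fail flag.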