Background:

Aggregating and analyzing whole genome alignment (WGA) data can be troublesome. This site provides utilities and services for manipulating EXISTING WGA datasets for human (Homo sapiens), mouse (Mus musculus), and rat (Rattus norvegicus). In particular, it provides:

  1. Perl data manipulation scripts
  2. A MySQL database
  3. A WGA Web service over Tomcat
  4. SOAP::Lite and Apache Axis web service examples

As of February 17th, 2004, we support both the Berkeley and UCSC Human-Mouse-Rat alignment datasets. We will attempt to add more species as data is made available. (The current schema's chromosome focus and the low coverage of the chimp data do not make adding that species very fruitful right now, though that is just my opinion.)

  1. Berkeley data
  2. UCSC data

If you find this resource useful, please feel free to acknowledge it and/or Stephen Montgomery.

Web service for human-mouse-rat alignments

To let you access the human-mouse-rat (hmr) data directly from Perl or Java, we have set up a web service on coast.bcgsc.bc.ca. NOTE: To run the Perl client, you must have the SOAP::Lite module installed. You can get this module here: Download SOAP::Lite. We have tested this with version 0.60 Beta 1 for UNIX.

Services

This document describes the services that are exported over this web service: MGA-Spec.pdf. The spec is written for Java but can be applied to Perl by following the examples below.

Perl Examples

  1. Download gap positions: testGaps.pl
  2. Download alignment positions: testAlignments.pl

Source

Download the web service Java code, version 1.0, here: mga.tar.gz Version 1. Examples are provided in the JUnit classes inside the archive; run them using ant test.

MySQL service:

To access our database, connect to db02.bcgsc.bc.ca using user "ensembl" and password "ensembl". (These are named ensembl because db02.bcgsc.bc.ca also acts as an EnsEMBL mirror for Sockeye.) The UCSC data is in the hmr_ucsc database and the Berkeley data is in the hmr_berkeley database. The connection information may change soon to require users to obtain individual accounts. (This would allow us to kill runaway queries for a specific user instead of killing everybody's.)

Login

  • User: ensembl
  • Pass: ensembl
  • Host: db02.bcgsc.bc.ca
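
As a rough illustration of the login details above, the following Python sketch only builds the argument list for the standard mysql command-line client; it does not connect to anything. The helper name mysql_command is made up for this example, and the SHOW TABLES query is used only because it assumes nothing about the schema.

```python
import shlex

# Connection details and database names as listed on this page.
HOST = "db02.bcgsc.bc.ca"
USER = "ensembl"
PASSWORD = "ensembl"
DATABASES = ("hmr_ucsc", "hmr_berkeley")


def mysql_command(database, query=None):
    """Build the argument list for the `mysql` command-line client.

    Hypothetical helper for illustration; it is not one of the site's scripts.
    """
    if database not in DATABASES:
        raise ValueError("unknown database: %s" % database)
    # -h host, -u user, -pPASSWORD (no space after -p), then the database name.
    args = ["mysql", "-h", HOST, "-u", USER, "-p" + PASSWORD, database]
    if query is not None:
        args += ["-e", query]  # -e executes one statement and exits
    return args


if __name__ == "__main__":
    # Print a shell-safe command line that can be pasted into a terminal.
    cmd = mysql_command("hmr_ucsc", "SHOW TABLES")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

If the connection information changes to individual accounts, only the USER and PASSWORD constants would need to change.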

Web-based

To view the contents of the hmr database on db02 and run read-only queries through a web browser, use the HMR Database viewer.

Schema Diagram

Building the Berkeley data

The Berkeley data comes in XMFA format. To build the Berkeley data, we manually downloaded it from the URL above and ran the following build steps.
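
For readers who have not met XMFA before, here is a minimal Python sketch of reading it, assuming the common layout: FASTA-like entries that start with a ">" header line, with each alignment block closed by a line beginning with "=". Header fields differ between tools, so the sketch keeps each header as a raw string. This is not the code in buildTables.pl, and parse_xmfa is a name invented for this example.

```python
def parse_xmfa(lines):
    """Yield alignment blocks as lists of (header, sequence) tuples.

    Assumes ">" starts an entry, "=" closes a block, and "#" marks comments.
    """
    block = []      # entries of the block being read
    header = None   # header of the entry being read
    chunks = []     # sequence lines of the entry being read
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith(">") or line.startswith("="):
            # A new entry or a block terminator finishes the current entry.
            if header is not None:
                block.append((header, "".join(chunks)))
            header, chunks = None, []
            if line.startswith(">"):
                header = line[1:].strip()
            else:
                if block:
                    yield block
                block = []
        else:
            chunks.append(line)  # sequences may span several lines
    # Tolerate a final block with no closing "=" line.
    if header is not None:
        block.append((header, "".join(chunks)))
    if block:
        yield block


if __name__ == "__main__":
    # The headers below are made up; real files carry tool-specific fields.
    sample = [">human example\n", "ACGT\n", "AC-T\n",
              ">mouse example\n", "ACGTAC-A\n", "=\n"]
    for entries in parse_xmfa(sample):
        for name, seq in entries:
            print(name, len(seq))
```

Passing an open file object works as well as a list, since the function only iterates over lines.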

Build steps

  1. Downloaded the Berkeley data from the Berkeley data link above
  2. The data has one file for each chromosome, so we wrote the cluster job file by hand
  3. Ran job on CMSGSC cluster using table creation script buildTables.pl
  4. Created database using command CREATE DATABASE hmr_berkeley in MySQL
  5. Found all the SQL table creation files find_sql_files.sh
  6. Created a MySQL SQL loader from script createMySQLDatabaseSHLoader.pl
  7. Ran the output script from the last step using sh
  8. Changed to the tables/ directory and ran mysqlimport -u smontgom -pMYPASS -h db02 hmr_berkeley *.txt.table

Note: Each script may require you to check the input parameters. I will create a tar.gz in the future to download these so they work out of the box. There are a few absolute paths right now to support running jobs on a cluster.

Building the UCSC data

UCSC data comes in axt format, while Berkeley data comes in XMFA format. To build the UCSC data, we first converted it to XMFA and then transferred the XMFA data to the table schema above. There are individual tables for each chromosome.
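
To make the axt side concrete, here is a small Python sketch of a reader, assuming the layout UCSC documents for axt: a nine-field summary line (alignment number, primary chromosome, start, end, aligning chromosome, start, end, strand, score) followed by the two aligned sequences, with blank lines between blocks. It is not a port of axt2xmfa.pl, and the names AxtBlock and parse_axt are invented for this example.

```python
from collections import namedtuple

# One axt alignment block; field names are our own labels for the nine
# summary fields plus the two sequence lines.
AxtBlock = namedtuple(
    "AxtBlock",
    "number primary_chrom primary_start primary_end "
    "aligning_chrom aligning_start aligning_end strand score "
    "primary_seq aligning_seq",
)


def parse_axt(lines):
    """Yield AxtBlock records from an iterable of axt lines."""
    useful = (ln.strip() for ln in lines)
    # Drop blank separator lines and "#" comment lines.
    useful = (ln for ln in useful if ln and not ln.startswith("#"))
    for summary in useful:
        fields = summary.split()
        if len(fields) != 9:
            raise ValueError("bad axt summary line: %r" % summary)
        # The next two non-blank lines are the aligned sequences.
        primary_seq = next(useful, None)
        aligning_seq = next(useful, None)
        if primary_seq is None or aligning_seq is None:
            raise ValueError("truncated axt block: %r" % summary)
        yield AxtBlock(
            int(fields[0]), fields[1], int(fields[2]), int(fields[3]),
            fields[4], int(fields[5]), int(fields[6]), fields[7],
            int(fields[8]), primary_seq, aligning_seq,
        )


if __name__ == "__main__":
    # Made-up block for illustration only; not taken from the UCSC files.
    sample = ["0 chr1 11 18 chr4 101 107 - 500\n", "ACGTACGT\n", "ACG-ACGT\n", "\n"]
    for blk in parse_axt(sample):
        print(blk.primary_chrom, blk.primary_start, blk.strand, blk.score)
```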

Build steps

  1. Obtained UCSC data using FTP script getUCSCAlignmentData.pl
  2. Found all axt files in UCSC data find_axt_files.sh
  3. Created Axt to XMFA cluster job list createAxt2XmfaJobList.pl
  4. Made human, mouse, and rat property files to allow for strand correction: Human, Mouse, and Rat.
  5. Ran job on CMSGSC cluster using converter script axt2xmfa.pl
  6. Found all xmfa files in converted data find_xmfa_files.sh
  7. Created XMFA to tables cluster job list createXmfa2TablesJobList.pl
  8. Ran job on CMSGSC cluster using table creation script buildTables.pl
  9. Created database using command CREATE DATABASE hmr_ucsc in MySQL
  10. Found all the SQL table creation files find_sql_files.sh
  11. Created a MySQL SQL loader from script createMySQLDatabaseSHLoader.pl
  12. Ran the output script from the last step using sh
  13. Changed to the tables/ directory and ran mysqlimport -u smontgom -pMYPASS -h db02 hmr_ucsc *.txt.table
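
The strand correction in step 4 can be sketched as follows. In axt, when the strand is "-", the aligning sequence's start and end are given relative to the reverse-complemented chromosome, so mapping them back to the forward strand needs the chromosome length. We assume here that the property files supply those lengths; the function name and the length used in the example are made up, and the actual logic in axt2xmfa.pl may differ.

```python
def to_forward_strand(start, end, strand, chrom_size):
    """Map 1-based inclusive coordinates onto the forward strand.

    For "+" the coordinates are returned unchanged. For "-" they are
    mirrored around the chromosome: position p on the reverse complement
    is position chrom_size - p + 1 on the forward strand.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-', got %r" % strand)
    if not 1 <= start <= end <= chrom_size:
        raise ValueError("coordinates outside chromosome: %d-%d" % (start, end))
    if strand == "+":
        return start, end
    # Mirroring swaps the roles of start and end.
    return chrom_size - end + 1, chrom_size - start + 1


if __name__ == "__main__":
    # Hypothetical 100-base chromosome, chosen only to keep the numbers small.
    print(to_forward_strand(1, 10, "-", 100))
```

Applying the function twice with "-" returns the original coordinates, which is a handy sanity check when converting a whole file.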

Note: Each script may require you to check the input parameters. I will create a tar.gz in the future to download these so they work out of the box. There are a few absolute paths right now to support running jobs on a cluster.

Contact:

Contact Stephen Montgomery (smontgom@bcgsc.bc.ca) for more information. All scripts on this page are licensed under the Mozilla Public License (MPL). The MGA Service code is licensed under the Creative Commons Attribution-NonCommercial license. If they are useful for anything, please let me know.



(c) 2004 Stephen Montgomery, Canada's Michael Smith Genome Sciences Centre