ENCODE Virtual Machine and Cloud Resource
The ENCODE consortium have published an integrated analysis of ENCODE genome-wide data at http://www.nature.com/encode/. Every analysis presented in the paper depends on specific software processing: a set of source data files is transformed into output files, from which the final figure(s) and statements in the paper are derived. As part of the supplementary material for this paper, we have established a virtual machine instance of the software, using the code bundles from ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/, in which each analysis program has been tested and run. Where possible, the VM enables complete reproduction of the analysis as it was performed to generate the figures, tables or other information. However, in some cases the analysis involved highly parallelised processing within a specialised multiprocessor environment. In these cases, a partial example has been implemented, leaving it to the reader to decide whether and how to scale to a full analysis. We hope that this structure provides the opportunity to run the same analyses in the wild.
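For readers who want the underlying code bundles rather than the pre-built VM, the bundles can be listed and fetched directly from the FTP site above. The sketch below is illustrative only; the placeholder <bundle> must be replaced with an actual filename from the directory listing.
    # List the available code bundles (plain FTP directory listing)
    curl ftp://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/
    # Fetch one bundle by name; <bundle> is a placeholder, not a real filename
    wget ftp://ftp.ebi.ac.uk/pub/databases/ensembl/encode/supplementary/<bundle>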
Instructions for accessing the VM are provided further down on this page. Once the VM is started, you will have access to a Linux command-line interface. For instructions on how to run the analyses, read the README files (“more README”) within the individual directories; each gives detailed instructions, including line-by-line commands.
We have emphasized transparency in this process, meaning that we have exposed the large diversity of scripting languages and software components used by the various analysts in the project. This diversity in analysis methods should not be a surprise to any scientist working in large-scale genomics, but might be confusing or frustrating for people with less large-scale data handling experience. We apologize in advance for this diversity, but it is important to realize that our goal here is not to provide easy-to-use programs or robust engineering solutions (there are separately funded projects to create such things), but rather to provide scientific transparency of our analytical results. By having the input data sets, a text description of the method, functioning code implementing the method and, finally, the output, we hope to provide a highly transparent view of the analysis we have performed. During implementation of the code bundles to establish the VM, there have necessarily been tweaks to the code and installation of packages that had been omitted from the code bundles through oversight. We have trialled the code on the VM ourselves and, using only the VM, can recreate the expected output.
For inquiries about the content of the supplementary information VM, and specifically the content of the code, please email the joint author address encode_authors@ebi.ac.uk first, stating the inquiry and section; please do not email the analyst directly without making contact through this address, so that we can ensure that commonly asked questions do not become a burden to individual analysts. Although the main analytical programs can be run on many different datasets in different environments, please do not consider this collection of programs and scripts to be a portable analysis system.

Cloud Instance

Instantiate an Amazon EC2 instance to examine the figures in the cloud.

  1. If you do not already have an AWS account, register with Amazon.
  2. Follow the detailed instructions here to get your access key and secret key.
  3. Go to Cloud Launch and enter a name for your cluster, a password for the interface, and your keys, and submit the form. The request may take a moment to complete.
  4. After saving the private key, navigate to the address presented on the page. This is the public web address for your instance, and it may not respond immediately, depending on how fast the instance boots. It should initially say "No Galaxy Instance Running"; click the link there to go to the Cloud Console.
  5. When the CloudMan interface is accessible (using the password you defined previously), you should see the "Initial Cluster Configuration" dialog. Click "Show more startup options" and enter the share string:
    "cm-c81ed62096969d3bc996433b11d3cc0f/shared/2012-09-05--14-29/" (no quotes, but do include everything else) in the "Share an instance" cluster startup box, and initialize your cluster.
  6. CloudMan will now customize your instance as a clone of the ENCODE machine. This process may take a minute or two but is finished when the log shows 'Done running post_start_script' and all services are 'green' as indicated by the interface.
  7. Now, simply ssh into your instance as the 'figures' user and you're good to go. The command should be something like this: 'ssh -i cloudman_keypair.pem figures@{your_amazon_instance}' (a worked example follows this list).
  8. Remember to terminate your instance when finished to avoid incurring additional costs. You can always start a new instance with the same cluster name to get back to where you were, until you terminate and delete (via the cloud console).
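
As a concrete example of step 7, assuming the public address reported in step 4 is ec2-203-0-113-10.compute-1.amazonaws.com (a placeholder; use your own address) and the key saved earlier is cloudman_keypair.pem:
    # Connect as the 'figures' user with the CloudMan key pair
    ssh -i cloudman_keypair.pem figures@ec2-203-0-113-10.compute-1.amazonaws.com
    # Once logged in, the home directory should contain the layout described below
    ls ~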

Downloadable Virtual Machine

  1. Download VirtualBox
  2. Download the Virtual Machine ENCODE.OVA (18GB, MD5)
  3. Import the VM into VirtualBox
    1. First, make a backup copy of the downloaded .ova file(s). If something goes wrong you can always make a new copy.
    2. Import the VM image into VirtualBox by either starting the downloaded .ova file directly, or by launching VirtualBox and navigating to File → Import Appliance and opening the file.
    3. This will display the Appliance Import Settings window. Click the Import button.
    4. It may then take several minutes for VirtualBox to import the VM. Once it is done, a new VM will appear in the left pane in the 'powered off' state.
    5. The default settings should be mostly appropriate, but one setting that must be turned on is VT-x/AMD-V Hardware Virtualization. You can find this for your virtual machine under Settings -> System -> Acceleration (a command-line alternative is sketched after this list).
  4. Start your new ENCODE VM
  5. Log in as the user 'figures' with the password 'PY4G8GAr' (case sensitive)
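
For those who prefer the command line, the checksum, import, virtualization setting and start-up can also be done with VirtualBox's VBoxManage tool. A minimal sketch, assuming the appliance imports under the name "ENCODE" (check the actual name with 'VBoxManage list vms'):
    # Check the download against the published MD5 checksum
    md5sum ENCODE.OVA
    # Import the appliance (equivalent to File -> Import Appliance)
    VBoxManage import ENCODE.OVA
    # Turn on VT-x/AMD-V hardware virtualization for the imported VM
    VBoxManage modifyvm "ENCODE" --hwvirtex on
    # Start the VM
    VBoxManage startvm "ENCODE"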

At this point, you're ready to go (see the note below). From the figures user home directory, relevant directories are:

  • bin - some executables and scripts shared between different analyses.
  • commonData - data shared between some of the analyses. The structure is the same as the public FTP archive given in the supplementary info.
  • figures - runnable versions of the code for each of the 10 figures in the ENCODE integration paper. Directories are numbered as for the final figures.
  • lib - R libraries used in common
  • manuscript - copies of the submitted version of the manuscript, supplementary info and figures.
  • R - required R packages
  • supplementary - code for supplementary information and tables, plus some code dropped during the review process.
  • tables - code for the generation of Table 1 of the main paper. Other tables are in supplementary/

Overall, to run the code for a figure or supplementary information item, cd into the relevant directory; there should be a README file with step-by-step instructions for running the code. In some cases, because the analysis is the result of a long and complex pipeline, the code here may work from an intermediate result. Also, in some cases the full analysis requires large-scale processing of many datasets, typically implemented on a compute farm. In these cases we have provided a subset of the analysis as an example, which can then be scaled up if the user chooses to do so.
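
A minimal sketch of that workflow; the directory name used here is illustrative only (the figure directories are numbered, so substitute whichever figure or supplementary directory you are interested in):
    # Change into the directory for the analysis of interest (name is illustrative)
    cd ~/figures/1
    # Read the step-by-step instructions for this analysis
    more README
    # ...then run the commands listed in the README, in order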

Note: There is an issue with the currently available virtual machine that prevents one of the figures from running. The download available here will be fixed shortly, but to fix this on your copy of the VM you should be able to execute the following commands:

  • figures@figures-vm:~$ cd /mnt/galaxyData
  • figures@figures-vm:/mnt/galaxyData$ sudo rm -r encode_figs                     # remove the existing encode_figs directory
  • figures@figures-vm:/mnt/galaxyData$ sudo ln -s ~ /mnt/galaxyData/encode_figs   # point encode_figs at the figures home directory
  • figures@figures-vm:/mnt/galaxyData$ cpan Sort::Naturally                       # install the missing Perl module
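
A quick check that the fix has taken effect (a suggested verification, not part of the original fix):

  • figures@figures-vm:~$ ls -ld /mnt/galaxyData/encode_figs          # should show a symlink pointing at /home/figures
  • figures@figures-vm:~$ perl -MSort::Naturally -e 'print "ok\n"'    # prints "ok" if the Perl module installed correctly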

For more information about the ENCODE project, see: http://genome.ucsc.edu/ENCODE/