Genomics Platform

reNEW, CPR, CGEN, ICMM

Using reference genomes with refgenie

Reference genomes used in genomics pipelines can be accessed via refgenie.

Please refer to the DANGPU user guide for more details.

In short, you can find available reference genomes like this:

module load dangpu_libs python/3.7.13 refgenie/0.12.1
refgenie list

which will output a list of available genomes.

                                                                                Local refgenie assets                                                                                 
                                                                 Server subscriptions: http://refgenomes.databio.org                       
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ genome                       ┃ assets                                                                                                                                              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ hg38_refgenie, hg38          │ fasta, gencode_gtf, ensembl_gtf, bwa_index, bowtie2_index, star_index, hisat2_index, cellranger_reference, bismark_bt2_index, fasta_txome,          │
│                              │ salmon_sa_index                                                                                                                                     │
│ GRCm39_ensembl, GRCm39       │ fasta, ensembl_gtf, bowtie2_index, star_index, bismark_bt2_index, 10x_index, blacklist                                                              │
│ dm6_ensembl                  │ fasta, bowtie2_index                                                                                                                                │
│ GRCh38_dm6                   │ fasta, bowtie2_index                                                                                                                                │
│ GRCm39_dm6                   │ fasta, bowtie2_index                                                                                                                                │
│ GRCh38_legacy, hg, hg_legacy │ fasta, gencode_gtf, blacklist, gtf_TE, star_index, bowtie2_index, bismark_bt2_index, 10x_index                                                      │
│ mm10_legacy, mm, mm_legacy   │ fasta, gencode_gtf, blacklist, star_index, bowtie2_index, bismark_bt2_index, 10x_index                                                              │
│ dm6_FlyBase, dm6, fly        │ fasta, flybase_gtf, star_index, bowtie2_index, bismark_bt2_index                                                                                    │
│ puc19                        │ fasta, bismark_bt2_index                                                                                                                            │
│ lambda                       │ fasta, bismark_bt2_index                                                                                                                            │
│ Spombe_h90                   │ fasta, bowtie2_index                                                                                                                                │
│ Ecoli                        │ fasta, bowtie2_index                                                                                                                                │
│ sacCer3, yeast               │ fasta, ncbi_gff, star_index, bowtie2_index                                                                                                          │
│ GRCh38_ensembl, GRCh38       │ fasta, ensembl_gtf, gencode_gtf, blacklist, star_index, bowtie2_index, bismark_bt2_index, 10x_index                                                 │
└──────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

You can further list the contents of each reference genome using the list -g option:

refgenie list -g GRCh38

which gives you the information on available assets for the reference genome of interest:

                                             Local refgenie assets                                             
                              Server subscriptions: http://refgenomes.databio.org                              
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ genome                 ┃ asset (seek_keys)                                              ┃ tags              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ GRCh38_ensembl, GRCh38 │ fasta (fasta, fai, chrom_sizes, dir)                           │ release_111       │
│ GRCh38_ensembl, GRCh38 │ ensembl_gtf (ensembl_gtf, ensembl_tss, ensembl_gene_body, dir) │ release_111       │
│ GRCh38_ensembl, GRCh38 │ gencode_gtf (gencode_gtf, dir)                                 │ release_45        │
│ GRCh38_ensembl, GRCh38 │ blacklist (blacklist, dir)                                     │ ENCODE, CUTANDRUN │
│ GRCh38_ensembl, GRCh38 │ star_index (star_index, dir)                                   │ 2.7.11b           │
│ GRCh38_ensembl, GRCh38 │ bowtie2_index (bowtie2_index, dir)                             │ 2.5.3             │
│ GRCh38_ensembl, GRCh38 │ bismark_bt2_index (bismark_bt2_index, dir)                     │ 0.24.2            │
│ GRCh38_ensembl, GRCh38 │ 10x_index (10x_index, dir, filtered_gtf)                       │ 7.2.0, gex-2024-A │
└────────────────────────┴────────────────────────────────────────────────────────────────┴───────────────────┘

Finally, you can find the path to a specific asset by specifying it as <genome>/<asset>:<tag>:

refgenie seek GRCh38/blacklist:CUTANDRUN

which gives you a path to the asset file:

/maps/projects/dan1/data/RefGenomes_reNEW/alias/GRCh38/blacklist/CUTANDRUN/GRCh38_blacklist.bed.gz
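In a script, the asset identifier can be assembled from its parts and the resolved path captured in a variable. The following is a minimal sketch, not part of the DANGPU setup: registry_path and seek_asset are hypothetical helper names. It also uses the optional <genome>/<asset>.<seek_key>:<tag> form, which refgenie accepts for the seek_keys shown in the table above (for example, the fai index of the fasta asset).

```shell
# Hypothetical helper: build a refgenie registry path from its parts.
# Usage: registry_path <genome> <asset> <tag> [seek_key]
registry_path() {
    local genome="$1" asset="$2" tag="$3" seek_key="${4:-}"
    if [ -n "$seek_key" ]; then
        printf '%s/%s.%s:%s\n' "$genome" "$asset" "$seek_key" "$tag"
    else
        printf '%s/%s:%s\n' "$genome" "$asset" "$tag"
    fi
}

# Hypothetical helper: resolve the path with refgenie and warn if nothing
# exists there. Index assets may resolve to a file prefix rather than a file,
# so a missing path is reported as a warning, not an error.
seek_asset() {
    local path
    path=$(refgenie seek "$(registry_path "$@")") || return 1
    if [ ! -e "$path" ]; then
        echo "warning: no file or directory at $path" >&2
    fi
    printf '%s\n' "$path"
}

# Only call refgenie when it is available (i.e. after the module load above).
if command -v refgenie >/dev/null 2>&1; then
    BLACKLIST=$(seek_asset GRCh38 blacklist CUTANDRUN)
    FAI=$(seek_asset GRCh38 fasta release_111 fai)
    echo "blacklist:   $BLACKLIST"
    echo "fasta index: $FAI"
fi
```

Keeping the genome, asset and tag as separate arguments makes it easy to switch a whole script from one reference to another by changing a single variable.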

Reference genome options for human

For the human reference genome, there are several options:

GRCh38 - the most up-to-date reference genome, which we suggest for new projects starting from 2024 onwards. We currently use this as our default reference genome for human.

GRCh38_legacy - the human reference genome the Genomics Platform used from 2017 to 2023.

GRCh38_dm6 - a hybrid genome combining GRCh38_ensembl and dm6_ensembl. It can be used for alignments when you have used a spike-in from the fly. This reference contains only the fasta and bowtie2_index assets.

hg38_refgenie (alias hg38) - the reference genome pulled from the refgenie server. The Genomics Platform does not use this reference, but it is available for users. Please note that if you run nf-core pipelines and select GRCh38 as the reference genome, they will use GRCh38 from AWS_iGenomes instead of the local refgenie assets.
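To run an nf-core pipeline against the local assets rather than AWS_iGenomes, one option is to pass the reference files explicitly instead of a genome name. The sketch below assumes the pipeline accepts --fasta and --gtf parameters (nf-core/rnaseq does; check the documentation of the pipeline you run). The helper name nfcore_reference_args is hypothetical.

```shell
# Hypothetical helper: print nf-core reference arguments that point at
# local refgenie assets instead of an iGenomes genome key.
# Usage: nfcore_reference_args <genome> <tag>
nfcore_reference_args() {
    local genome="$1" tag="$2" fasta gtf
    fasta=$(refgenie seek "${genome}/fasta:${tag}") || return 1
    gtf=$(refgenie seek "${genome}/ensembl_gtf:${tag}") || return 1
    printf '%s %s %s %s\n' --fasta "$fasta" --gtf "$gtf"
}

if command -v refgenie >/dev/null 2>&1; then
    # Print the arguments so they can be reviewed before launching a run, e.g.
    #   nextflow run nf-core/rnaseq -profile <profile> --input samplesheet.csv \
    #       --outdir results $(nfcore_reference_args GRCh38 release_111)
    nfcore_reference_args GRCh38 release_111
fi
```

Because no genome name is passed, the pipeline has no reason to fall back to its built-in reference for these two files. Prebuilt indexes would need their own pipeline-specific parameters.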

Reference genome options for mouse

For the mouse reference genome, there are also several options:

GRCm39 - the newest reference genome, which we recommend for new projects starting from 2024 onwards. This genome is the default mouse reference genome for running Genomics Platform pipelines.

mm10_legacy - the mouse reference genome we used from 2017 to 2023.

GRCm39_dm6 - a hybrid genome combining GRCm39_ensembl and dm6_ensembl. It can be used for alignments when you have used a spike-in from the fly.

Other reference genomes

There are various spike-in genomes included in refgenie:
dm6_ensembl
dm6_FlyBase
puc19
lambda
Spombe_h90
Ecoli
sacCer3

Genomes in detail

Below you can find more detailed descriptions of some of the main genomes.

GRCh38

This is the default reference genome for human for projects starting in 2024 and onwards.

refgenie list -g GRCh38
                                             Local refgenie assets                                             
                              Server subscriptions: http://refgenomes.databio.org                              
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ genome                 ┃ asset (seek_keys)                                              ┃ tags              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ GRCh38_ensembl, GRCh38 │ fasta (fasta, fai, chrom_sizes, dir)                           │ release_111       │
│ GRCh38_ensembl, GRCh38 │ ensembl_gtf (ensembl_gtf, ensembl_tss, ensembl_gene_body, dir) │ release_111       │
│ GRCh38_ensembl, GRCh38 │ gencode_gtf (gencode_gtf, dir)                                 │ release_45        │
│ GRCh38_ensembl, GRCh38 │ blacklist (blacklist, dir)                                     │ ENCODE, CUTANDRUN │
│ GRCh38_ensembl, GRCh38 │ star_index (star_index, dir)                                   │ 2.7.11b           │
│ GRCh38_ensembl, GRCh38 │ bowtie2_index (bowtie2_index, dir)                             │ 2.5.3             │
│ GRCh38_ensembl, GRCh38 │ bismark_bt2_index (bismark_bt2_index, dir)                     │ 0.24.2            │
│ GRCh38_ensembl, GRCh38 │ 10x_index (10x_index, dir, filtered_gtf)                       │ 7.2.0, gex-2024-A │
└────────────────────────┴────────────────────────────────────────────────────────────────┴───────────────────┘

Characteristics

This genome uses fasta (GRCh38.p14 primary assembly from the soft-masked genome) and ensembl_gtf (Ensembl release 111) to generate star_index (using STAR v.2.7.11b), bowtie2_index (using bowtie2 v.2.5.3) and bismark_bt2_index (using Bismark v.0.24.2).

VERSION=111
wget -L http://ftp.ensembl.org/pub/release-${VERSION}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
wget -L http://ftp.ensembl.org/pub/release-${VERSION}/gtf/homo_sapiens/Homo_sapiens.GRCh38.${VERSION}.gtf.gz

Additional files

gencode_gtf: Gencode release 45
blacklist: regions prone to misalignment or bias in genomic analysis. blacklist:ENCODE is sourced from the ENCODE project for use with ATAC-seq and ChIP-seq, and blacklist:CUTANDRUN is sourced from Nordin et al. 2023 for use with CUT&RUN.

10x indexes

The default Cellranger index 10x_index for gene expression is gex-2024-A, which is a local copy of the precompiled Cellranger genome version 2024-A. However, for full compatibility with the bulk genomes you can also use the tag 7.2.0, a custom Cellranger reference compiled from the local refgenie assets using cellranger version 7.2.0:

refgenie seek GRCh38/10x_index:7.2.0
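The resolved directory can then be handed to Cell Ranger as the transcriptome. The following is a minimal sketch, assuming Cell Ranger 7.x; the helper name tenx_reference and the run id, FASTQ directory and sample name are placeholders, not platform conventions. The command is only printed so that it can be reviewed before it is run.

```shell
# Hypothetical helper: resolve a 10x_index by tag, defaulting to the
# precompiled 2024-A reference described above.
# Usage: tenx_reference <genome> [tag]
tenx_reference() {
    local genome="$1" tag="${2:-gex-2024-A}"
    refgenie seek "${genome}/10x_index:${tag}"
}

if command -v refgenie >/dev/null 2>&1; then
    TRANSCRIPTOME=$(tenx_reference GRCh38 7.2.0)
    # run_01, fastqs/ and sample_A are placeholders for your own data.
    echo "cellranger count --id=run_01 --transcriptome=$TRANSCRIPTOME --fastqs=fastqs/ --sample=sample_A"
fi
```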

Additional info

Chromosome names: 1, 2, 3, … , MT, X, Y
Effective genome sizes can be found in deeptools documentation

GRCm39

This is the default reference genome for mouse for projects starting in 2024 and onwards.

refgenie list -g GRCm39
                                                Local refgenie assets                                                 
                                 Server subscriptions: http://refgenomes.databio.org                                  
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ genome                 ┃ asset (seek_keys)                                              ┃ tags                     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ GRCm39_ensembl, GRCm39 │ fasta (fasta, fai, chrom_sizes, dir)                           │ release_111              │
│ GRCm39_ensembl, GRCm39 │ ensembl_gtf (ensembl_gtf, ensembl_tss, ensembl_gene_body, dir) │ release_111              │
│ GRCm39_ensembl, GRCm39 │ bowtie2_index (bowtie2_index, dir)                             │ 2.5.3                    │
│ GRCm39_ensembl, GRCm39 │ star_index (star_index, dir)                                   │ 2.7.11b                  │
│ GRCm39_ensembl, GRCm39 │ bismark_bt2_index (bismark_bt2_index, dir)                     │ 0.24.2                   │
│ GRCm39_ensembl, GRCm39 │ 10x_index (10x_index, dir, filtered_gtf)                       │ 7.2.0, gex-2024-A        │
│ GRCm39_ensembl, GRCm39 │ blacklist (blacklist, dir)                                     │ CUTANDRUN, EXCLUDERANGES │
└────────────────────────┴────────────────────────────────────────────────────────────────┴──────────────────────────┘

Characteristics

This genome reference uses fasta (GRCm39 primary assembly from the soft-masked genome) and ensembl_gtf (Ensembl release 111) to build star_index for STAR v.2.7.11b, bowtie2_index for bowtie2 v.2.5.3 and bismark_bt2_index for Bismark v.0.24.2.

VERSION=111
wget -L http://ftp.ensembl.org/pub/release-${VERSION}/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz
wget -L http://ftp.ensembl.org/pub/release-${VERSION}/gtf/mus_musculus/Mus_musculus.GRCm39.${VERSION}.gtf.gz

Additional files

blacklist files are included:
blacklist:EXCLUDERANGES from Ogata et al. 2023 for use with ChIPseq and ATACseq assays and
blacklist:CUTANDRUN from Nordin et al. 2023 for use with CUT&RUN and CUT&TAG assays.
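Choosing the blacklist tag by assay can be scripted. The sketch below maps assay names to the two GRCm39 tags described above and, if bedtools is installed, removes peaks that overlap the blacklist. The helper name grcm39_blacklist_tag, the assay keywords and the file peaks.bed are all assumptions made for illustration.

```shell
# Hypothetical helper: map an assay name to the GRCm39 blacklist tag.
# Usage: grcm39_blacklist_tag <assay>
grcm39_blacklist_tag() {
    case "$1" in
        chipseq|atacseq)     echo EXCLUDERANGES ;;
        cutandrun|cutandtag) echo CUTANDRUN ;;
        *) echo "unknown assay: $1" >&2; return 1 ;;
    esac
}

if command -v refgenie >/dev/null 2>&1 && command -v bedtools >/dev/null 2>&1; then
    TAG=$(grcm39_blacklist_tag cutandrun)
    BLACKLIST=$(refgenie seek "GRCm39/blacklist:${TAG}")
    # peaks.bed is a placeholder; -v keeps only peaks with no blacklist overlap.
    if [ -f peaks.bed ]; then
        bedtools intersect -v -a peaks.bed -b "$BLACKLIST" > peaks.filtered.bed
    fi
fi
```

The peak file must use the same chromosome names as the reference (1, 2, … for GRCm39), otherwise no overlaps will be found and nothing will be filtered.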

10x indexes

The default Cellranger index 10x_index for gene expression is gex-2024-A, which is a local copy of the precompiled Cellranger genome version 2024-A. However, for full compatibility with the bulk genomes you can also use the tag 7.2.0, a custom Cellranger reference compiled from the local refgenie assets using cellranger version 7.2.0:

refgenie seek GRCm39/10x_index:7.2.0

Additional info

Chromosome names: 1, 2, 3, … , MT, X, Y
Effective genome sizes can be found in deeptools documentation

GRCh38_legacy

The Genomics Platform used this genome reference until 2023. We do not recommend using this genome unless you have good reasons to do so.

refgenie list -g GRCh38_legacy
                                               Local refgenie assets                                               
                                Server subscriptions: http://refgenomes.databio.org                                
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ genome                       ┃ asset (seek_keys)                                      ┃ tags                    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ GRCh38_legacy, hg, hg_legacy │ fasta (fasta, fai, chrom_sizes, hg38_chrom_sizes, dir) │ default                 │
│ GRCh38_legacy, hg, hg_legacy │ gencode_gtf (gencode_gtf, dir)                         │ default                 │
│ GRCh38_legacy, hg, hg_legacy │ blacklist (blacklist, dir)                             │ ENCODE, CUTANDRUN, ATAC │
│ GRCh38_legacy, hg, hg_legacy │ gtf_TE (gtf_TE, dir)                                   │ default                 │
│ GRCh38_legacy, hg, hg_legacy │ star_index (star_index, dir)                           │ 2.7.2d                  │
│ GRCh38_legacy, hg, hg_legacy │ bowtie2_index (bowtie2_index, dir)                     │ default                 │
│ GRCh38_legacy, hg, hg_legacy │ bismark_bt2_index (bismark_bt2_index, dir)             │ 0.22.3                  │
│ GRCh38_legacy, hg, hg_legacy │ 10x_index (10x_index, dir)                             │ default, gex-2020-A     │
└──────────────────────────────┴────────────────────────────────────────────────────────┴─────────────────────────┘

Characteristics

For the GRCh38_legacy reference, fasta (ENCODE fasta GRCh38.p13) and gencode_gtf (from gencode.v32) were used to generate star_index using STAR v.2.7.2d, bowtie2_index using bowtie2 v.2.3.4.1 and bismark_bt2_index using Bismark v.0.22.3. gtf_TE is provided for use with TEtranscripts.

Additional files

10x_index: 10x index for GEX, issued by 10x in 2020 as refdata-gex-GRCh38-2020-A
blacklist: ENCODE blacklist sourced from nf-core/atac assets

Additional info

Chromosome naming: chr1, chr2, … chrM, chrX, chrY.
Effective genome sizes can be found in deeptools documentation
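Because the legacy references use chr1, chr2, … chrM while GRCh38 and GRCm39 use 1, 2, … MT, interval files made against one naming scheme will not match the other. The sketch below renames the first column of a tab-separated file such as a BED file. The function name chr_to_ensembl_names is hypothetical, and only the primary chromosomes are handled; unplaced scaffolds are named differently between sources and are left as they are apart from losing a leading chr. Renaming is only meaningful when the coordinates are the same: patch releases such as GRCh38.p13 and GRCh38.p14 keep the primary chromosome coordinates, whereas mm10 and GRCm39 are different assemblies and need a coordinate liftover instead.

```shell
# Hypothetical helper: rewrite the first column of a tab-separated file
# (e.g. BED) from chr1/chrM naming to 1/MT naming. Reads stdin, writes stdout.
chr_to_ensembl_names() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }            # pass comment lines through
         {
             if ($1 == "chrM") $1 = "MT"
             else sub(/^chr/, "", $1)
             print
         }'
}

# Small demonstration on two made-up intervals.
printf 'chr1\t100\t200\nchrM\t5\t50\n' | chr_to_ensembl_names
```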

mm10_legacy

The Genomics Platform used this genome reference until 2023. We do not recommend using this genome unless you need to finish old projects that already use it.

refgenie list -g mm10_legacy
                                           Local refgenie assets                                            
                            Server subscriptions: http://refgenomes.databio.org                             
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ genome                     ┃ asset (seek_keys)                          ┃ tags                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ mm10_legacy, mm, mm_legacy │ fasta (fasta, fai, chrom_sizes, dir)       │ default                        │
│ mm10_legacy, mm, mm_legacy │ gencode_gtf (gencode_gtf, dir)             │ default                        │
│ mm10_legacy, mm, mm_legacy │ blacklist (blacklist, dir)                 │ ENCODE, GUAVA, CUTANDRUN, ATAC │
│ mm10_legacy, mm, mm_legacy │ star_index (star_index, dir)               │ 2.7.2d                         │
│ mm10_legacy, mm, mm_legacy │ bowtie2_index (bowtie2_index, dir)         │ default                        │
│ mm10_legacy, mm, mm_legacy │ bismark_bt2_index (bismark_bt2_index, dir) │ 0.22.3                         │
│ mm10_legacy, mm, mm_legacy │ 10x_index (10x_index, dir)                 │ GEX, GEX_GFP, ARC, gex-2020-A  │
└────────────────────────────┴────────────────────────────────────────────┴────────────────────────────────┘

Characteristics

The mm10_legacy genome reference uses fasta (GRCm38.p5 primary_assembly) and gencode_gtf (gencode vM15) to generate star_index for STAR v.2.7.2d, bowtie2_index, and bismark_bt2_index for Bismark v.0.22.3.

Additional files

blacklist files are included:
blacklist:ENCODE from ENCODE and
blacklist:GUAVA from GUAVA
10x_index files (issued by 10x in 2020): 10x_index:GEX for gene expression profiling and 10x_index:ARC for single cell ATAC. For internal use we have also generated 10x_index:GEX_GFP for gene expression profiling with a GFP reference included.

Additional info

Chromosome naming: chr1, chr2, … chrM, chrX, chrY.
Effective genome sizes can be found in deeptools documentation
