pipelines.chromake
pipelines.chromake
Chromake pipeline
Chromake is used to align sequencing data of ChIP-seq and ATAC-seq experiments, identify peaks, and realize a read count. All those steps are realized using a snakemake pipeline included in this subpackage.
Notes
The ressources used can be modified. In particular, the number of cores can be specified in the config file while other ressources (memory, runtime) will necessite to copy and modify the snakefile included in this subpackage.
Chromake contains submodules (genomake.pipelines.chromake.scripts.config and genomake.pipelines.chromake.scripts.paths). Among them, paths is used by the snakefile to track the files it generate, while config was created to help the user create the YAML config file needed by snakemake.
Examples
To run the pipeline:
- you need at least 60 Go of RAM.
- To install the necessary programs. It can be done with conda using a file like:
name: genomake
channels:
- bioconda
- conda-forge
- anaconda
dependencies:
- bedtools
- bowtie2
- fastqc
- homer
- openssl
- picard
- python
- ucsc-bedgraphtobigwig
- ucsc-bedtobigbed
- samtools
- pip:
- git+https://github.com/Pierre9344/genomake
- MACS3 # can be modified to MACS2 and the pipeline will detect which version to call in the rule
- cutadapt
- deeptools
- multiqc
- snakemake-executor-plugin-slurm- you need a configuration file in a yaml format. Check
genomake.pipelines.chromake.scripts.config.create_example_configfor an example. The config submodule contains functions to generate such a config file from a samplesheet. - The configuration is composed of 3 main fields:
- SEQUENCINGS
- PROJECTS
- JOBS
SEQUENCINGS is used to define the sample of a sequencing run that share similar adapter for trimming the reads, genome reference, and input samples (for ChIP-seq, more than one input can be indicated but only the first one will be used when identifying peak).
Each sequencing must have a parameters field to indicates at least: - the genome reference for bowtie2 (BOWTIE2_REF) - a bed of blacklisted regions (BLACKLIST_BED), see https://github.com/Boyle-Lab/Blacklist to download the file corresponding to the genome you use - a file indicating the chromosome size (CHROM_SIZE), see https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/ to download the one for GRCh38/hg38. UCSC also host the files for other genomes such as mm10. - the genome used, either the .fa file of the reference used to build bowtie2 reference (toplevel.fa for ensembl) or a string such as hg38 or mm10 if the genome was configured in homer with configureHomer.pl (GENOME) - Adaptor trimming will be realized if the R1_ADAPTOR and the R2_ADAPTORfields are present
SEQUENCING:
MO211:
SAMPLES:
PSMD6_F1_Ci:
R1: <PATH_TO_R1_fastq>
TYPE: H3K27AC
R2: <PATH_TO_R2_fastq>
PSMD6_F1_Fa:
R1: <PATH_TO_R1_fastq>
TYPE: ATAC
R2: <PATH_TO_R2_fastq>
INPUT:
Input_Batch1:
R1: <PATH_TO_R1_fastq>
R2: <PATH_TO_R2_fastq>
PATH: <PATH_TO_R1_fastq>
R1_ADAPTOR: CTGTCTCTTATACACATCT (modify to the sequence corresponding to your samples or remove this field to not trim)
R2_ADAPTOR: CTGTCTCTTATACACATCT (modify to the sequence corresponding to your samples or remove this field to not trim)
PARAMETERS:
CUTADAPT: -q 20 --pair-filter=any (cutadapt paramters by default, modify or leave empty)
BOWTIE2_REF: <PATH>/index-bowtie-2.3.0/Homo_sapiens.GRCh38.dna.toplevel (modify to point to your reference for bowtie2)
BLACKLIST_BED: <PATH>/hg38-blacklist.v2.bed
GENOME: <PATH>/ensembl/release-99/Homo_sapiens.GRCh38.dna.toplevel.fa
CHROM_SIZE: <PATH>/hg38.chrom.sizesPROJECTS is used to regroup samples of different sequencing when realising peak calling and the minimal number of samples in which a peak is identified to be keep in the final count matrix. To be regrouped, the samples need to shares the same mark. The path of each project indicate where the files resulting of the peak calling will be created. It is recommanded to use different folders for each projects.
PROJECTS:
ATAC:
PROJECT_PATH: <PATH>
TYPE: ATAC
MIN_SAMPLES_FOR_PEAKS: 2
SEQUENCING:
- MO211
ChIP_H3K27ME3:
PROJECT_PATH: <PATH>
TYPE: H3K27ME3
MIN_SAMPLES_FOR_PEAKS: 2
SEQUENCING:
- MO203
ChIP_H3K27AC:
PROJECT_PATH: <PATH>
TYPE: H3K27AC
MIN_SAMPLES_FOR_PEAKS: 2
SEQUENCING:
- MO203
ChIP_H2AUB:
PROJECT_PATH: <PATH>
TYPE: H2AUB
MIN_SAMPLES_FOR_PEAKS: 2
SEQUENCING:
- MO203The JOBS field is used to indicate the number of cpu to use for multithreadings and the QOS when running jobs on a clusters with an executor like slurm.
JOBS:
CORES_PER_JOBS:
FASTQC: 10
CUTADAPT: 10
BOWTIE2: 30
QOS_INFOS:
short:
MaxWall: 1440
medium:
MaxWall: 4320
long:
MaxWall: 11520You can use the genomake cli API to run snakemake:
genomake chromake -c ./config.yaml --cores 250 --local-cores 1 --jobs 8All options can be checked with a:
genomake chromake -hThe “–others-snakemake” option allow to specify additionnal arguments in the final command. It can be useful to unlock the working directory, try a dry run or specify an executor (slurm in my example).
genomake chromake -c ./config.yaml --cores 250 --local-cores 1 --jobs 8 --others-snakemake "--executor slurm --default-resources slurm_account=singlecell clusters=nautilus --slurm-logdir ./logs --slurm-keep-successful-logs --slurm-delete-logfiles-older-than 0 --rerun-incomplete -np --unlock" -pThis command will print the command next as the “-p” argument was used, without it, the command will be executed):
snakemake --snakefile <PATH to snakefile in installation> --configfile ./config.yaml --retries 4 --cores 250 --jobs 8 --local-cores 1 --executor slurm --default-resources slurm_account=singlecell clusters=nautilus --slurm-logdir ./logs --slurm-keep-successful-logs --slurm-delete-logfiles-older-than 0 --rerun-incompleteFor more informations, check the Command Line Interface and the Get started pages of the documentation.