Get started

The genomake package is a collection of scripts and pipelines written in Python to analyze genomic data.

This package is a WIP.

Installation

Genomake, which includes the chromake pipeline, can be installed using pip:

pip install git+https://github.com/Pierre9344/genomake

If you wish to install it in a conda environment, you can use a YAML file (for example genomake.yml) containing the following lines:

name: genomake
channels:
  - bioconda
  - conda-forge
  - anaconda
dependencies:
  - bedtools
  - bowtie2
  - fastqc
  - homer
  - openssl
  - picard
  - python
  - ucsc-bedgraphtobigwig
  - ucsc-bedtobigbed
  - samtools
  - pip:
      - git+https://github.com/Pierre9344/genomake
      - MACS3 # can be modified to MACS2 and the pipeline will detect which version to call in the rule
      - cutadapt
      - deeptools
      - multiqc
      - snakemake-executor-plugin-slurm

conda env create -f genomake.yml
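
Once the environment is activated, one quick way to confirm that the command-line tools are visible is to look them up on the PATH. The sketch below is only an illustration: `missing_tools` is a hypothetical helper that is not part of genomake, and the executable names are assumed from the usual names shipped by some of the packages listed above.

```python
import shutil

# Executable names assumed from the packages listed in the environment file;
# adjust the tuple if your installation exposes different names.
EXPECTED_TOOLS = ("bedtools", "bowtie2", "fastqc", "samtools", "cutadapt", "multiqc")


def missing_tools(tools=EXPECTED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All expected tools were found.")
```

An empty result only shows that the executables can be located; it says nothing about their versions.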

For a simple installation on a local computer, for example to access only the functions that create and check the configuration file, you can use pip directly:

pip install git+https://github.com/Pierre9344/genomake

Chromake

Objectives

The chromake pipeline analyzes ChIP-seq (H3K27ac, H3K27me3, and H2Aub marks) and ATAC-seq samples. It uses fastq files to:

Perform an optional adaptor trimming (cutadapt)
Align samples to a reference genome using bowtie2
Identify enriched peaks (macs3)
Run quality controls (fastqc, samtools stats, samtools flagstat, multiqc)

This pipeline is a WIP. It currently aligns the samples and generates UCSC-compatible tracks, but it does not yet perform the peak calling.

Configuration file

To use the pipeline, you need a YAML configuration file. This file must contain three fields:

SEQUENCING: information about one or more sequencing runs (global path, sample fastq files, adaptors for trimming, genome for the alignment, …). The PATH indicated for a sequencing will be used to output the QC and bam files.
PROJECTS: used to group samples from multiple sequencings into a single project for the peak calling. It is recommended to indicate different paths for different projects.
JOBS: information about the number of cores to use for multithreading, and the QOS (when using an executor like slurm).
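
As a rough illustration of the three-field layout described above, the following sketch reports which top-level fields are absent from a configuration that has already been loaded into a dictionary. `missing_top_level_fields` is a hypothetical helper written for this example; it is not the package's own validation, which is more complete.

```python
# Top-level fields named in the description above.
REQUIRED_FIELDS = ("SEQUENCING", "PROJECTS", "JOBS")


def missing_top_level_fields(config):
    """Return the required top-level fields absent from a loaded config dict."""
    return [field for field in REQUIRED_FIELDS if field not in config]


# A deliberately incomplete config, as it could look after loading the YAML file.
example = {"SEQUENCING": {}, "JOBS": {}}
print(missing_top_level_fields(example))  # ['PROJECTS']
```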

You can generate an example configuration file with:

from genomake.pipelines.chromake.scripts import config as chr_config
chr_config.create_example_config("test_config.yaml")

Other functions can be used to convert the configuration file into a samplesheet (Excel or CSV) or to convert a samplesheet back into a configuration file:

# convert to a samplesheet
chr_config.create_samplesheet_from_config("test_config.yaml", "test_samplesheet.xlsx", False, True)

# Convert back to a config file. Sample R1, R2, and marks can be obtained from the samplesheet.
# Other information needs to be provided through the other arguments of the function or added manually to the configuration file.
chr_config.create_config_from_table(
    "test_samplesheet.xlsx",
    "test_config2.yaml",
    proj_paths = {
        "ChIP_H3K27AC": "/scratch/.../ChIP_H3K27AC",
        "ChIP_H3K27ME3": "/scratch/.../ChIP_H3K27ME3",
        "ChIP_H2AUB": "/scratch/.../ChIP_H2AUB",
        "ChIP_ATAC": "/scratch/.../ChIP_ATAC"
    },
    jobs = {
        "CORES_PER_JOBS": {
            "FASTQC": 10,
            "CUTADAPT": 10,
            "BOWTIE2": 30,
            "SAMTOOLS_QC": 5,
            "MULTIBAMSUMMARY": 5,
            "BEDTOOLS": 5
        },
        "QOS_INFOS": {
            "short": {"MaxWall": 24 * 60}, # 1 day
            "medium": {"MaxWall": 3 * 24 * 60}, # 3 days
            "long": {"MaxWall": 8 * 24 * 60}, # 8 days in minutes
        },
    },
    sequencings = {
        "MO203": {
            "PATH": "/scratch/.../MO203/",
            "R1_ADAPTOR": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
            "R2_ADAPTOR": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
            "PARAMETERS": {
                "CUTADAPT": "-q 20 --pair-filter=any",
                "BOWTIE2_REF": "<path to genome reference build for bowtie2>",
                "BLACKLIST_BED": "<Path to blacklist file in bed format, see https://github.com/Boyle-Lab/Blacklist>",
                "GENOME": "<path to the sequencing genome for bowtie2>",
                "CHROM_SIZE": "<path to file with the size of chromosome (can be found on UCSC)>"
            }
        },
        "MO208": {
            "PATH": "/scratch/.../Pierre_Solomon/MO208/",
            "R1_ADAPTOR": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
            "R2_ADAPTOR": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
            "PARAMETERS": {
                "CUTADAPT": "-q 20 --pair-filter=any",
                "BOWTIE2_REF": "<path to genome reference build for bowtie2>",
                "BLACKLIST_BED": "<Path to blacklist file in bed format, see https://github.com/Boyle-Lab/Blacklist>",
                "GENOME": "<path to the sequencing genome for bowtie2>",
                "CHROM_SIZE": "<path to file with the size of chromosome (can be found on UCSC)>"
            }
        },
        "MO211": {
            "PATH": "/scratch/.../MO211/",
            "R1_ADAPTOR": "CTGTCTCTTATACACATCT",
            "R2_ADAPTOR": "CTGTCTCTTATACACATCT",
            "PARAMETERS": {
                "CUTADAPT": "-q 20 --pair-filter=any",
                "BOWTIE2_REF": "<path to genome reference build for bowtie2>",
                "BLACKLIST_BED": "<Path to blacklist file in bed format, see https://github.com/Boyle-Lab/Blacklist>",
                "GENOME": "<path to the sequencing genome for bowtie2>",
                "CHROM_SIZE": "<path to file with the size of chromosome (can be found on UCSC)>"
            }
        }
    }
)

# Check the configuration file (requires PyYAML)
import yaml

with open("test_config2.yaml") as stream:
    config = yaml.safe_load(stream)
chr_config.check_config_format(config, raise_error=False)
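
Because the example above uses `<...>` placeholders for the reference paths, it can be handy to list the entries that still need to be filled in before launching anything. The following is a minimal sketch under the assumption that an unfilled value is a string wrapped in angle brackets; `find_placeholders` is a hypothetical helper and not a genomake function.

```python
def find_placeholders(node, prefix=""):
    """Recursively collect the key paths whose value still looks like '<...>'."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found.extend(find_placeholders(value, f"{prefix}/{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(find_placeholders(value, f"{prefix}/{index}"))
    elif isinstance(node, str):
        text = node.strip()
        if text.startswith("<") and text.endswith(">"):
            found.append(prefix)
    return found


# Reduced version of one sequencing entry from the example above.
sequencings = {
    "MO203": {
        "PATH": "/scratch/.../MO203/",
        "PARAMETERS": {
            "CUTADAPT": "-q 20 --pair-filter=any",
            "GENOME": "<path to the sequencing genome for bowtie2>",
        },
    }
}
print(find_placeholders(sequencings))  # ['/MO203/PARAMETERS/GENOME']
```

The same function can be applied to the dictionary returned by `yaml.safe_load`, since it only walks nested dictionaries, lists, and strings.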

Launching the pipeline

Once the configuration file is created, chromake can be launched using the CLI included in genomake:

genomake chromake -c ./config.yaml --cores 250 --jobs 10

If you need to:

use an executor
do a dry run
unlock the directory

then you can use the --others-snakemake argument. Here is an example for the slurm executor:

genomake chromake -c ./config.yaml --cores 250 --local-cores 1 --jobs 8 \
   --others-snakemake "--executor slurm --default-resources \
   slurm_account=<your_slurm_account> clusters=<name_cluster_for_computation> --slurm-logdir \
   ./logs --slurm-keep-successful-logs --slurm-delete-logfiles-older-than 0 \
   --rerun-incomplete"

You can check the final command without launching it using the -p argument of genomake chromake.
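
When the pipeline is started from a Python script rather than typed in a shell, the same call can be assembled as an argument list. This is only a sketch: `build_chromake_command` is a hypothetical wrapper, it uses only the options shown in this section, and it assumes that -p is a flag that takes no value.

```python
import shlex


def build_chromake_command(config, cores, jobs, others_snakemake=None, print_only=False):
    """Assemble the genomake chromake call as an argument list."""
    cmd = ["genomake", "chromake", "-c", config,
           "--cores", str(cores), "--jobs", str(jobs)]
    if others_snakemake:
        # The extra Snakemake options travel as one single argument.
        cmd += ["--others-snakemake", others_snakemake]
    if print_only:
        # Assumed to be a value-less flag that only shows the final command.
        cmd.append("-p")
    return cmd


cmd = build_chromake_command(
    "./config.yaml", 250, 10,
    others_snakemake="--executor slurm --rerun-incomplete",
    print_only=True,
)
# Shell-quoted form, convenient for logging or copy-pasting.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids a second layer of shell quoting around the --others-snakemake string.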

Outputs

The outputs of chromake will be placed in the directories of the sequencings (trimmed fastq, qc, bam, bedGraph track) or the directories of the projects (bed resulting from the peak calling).

Sequencing:

BAM: bam files resulting from the alignment and their index. Two versions are kept: the one output by bowtie2, and a filtered (the pipeline only keeps the standard chromosomes) and sorted version.
BED: bed files generated by bedtools genomecov using the sorted bam files.
BEDGRAPH: folder containing the bedGraph files for UCSC visualization
HOMER: folder generated by homer makeTagDirectory
QC: results of various qc (fastqc and multiqc, cutadapt logs, bowtie logs, samtools flagstat, samtools stats, size of the fragments)
TRIMMED: folder containing the trimmed fastq files
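
After a run, the presence of these folders can be checked with a few lines of Python. The sketch below assumes that the names in the list above are directory names located directly under the PATH of a sequencing, which may not match the exact layout; `output_status` is a hypothetical helper, and TRIMMED may legitimately be absent since trimming is optional.

```python
from pathlib import Path

# Folder names taken from the list above; assumed to sit directly under PATH.
SEQUENCING_OUTPUTS = ("BAM", "BED", "BEDGRAPH", "HOMER", "QC", "TRIMMED")


def output_status(sequencing_path):
    """Map each expected output folder to True if it exists under sequencing_path."""
    root = Path(sequencing_path)
    return {name: (root / name).is_dir() for name in SEQUENCING_OUTPUTS}


if __name__ == "__main__":
    # Hypothetical location; replace with the PATH of one of your sequencings.
    for name, present in output_status("./MO203").items():
        print(f"{name}: {'present' if present else 'missing'}")
```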