Release notes on pipeline


  • beta 0.6.2 (December 2019)
    • Minor changes:
      • [improvement] download_genome_files_from_ncbi.pl : added option -SKIP_GENBANK_IF_REFSEQ_AVAILABLE to avoid downloading the same organism twice, once from RefSeq and once from GenBank
      • [bug] fixed parsing of GenBank and EMBL files : the “comment” field was not parsed correctly
      • [bug] new script generate_organisms_annotations_as_sql__one_orga_per_file.pl (default in pipeline) to solve the problem of grouping molecules into organisms. It assumes one new organism entry per GenBank or EMBL file (the user is responsible for this part). Multiple elements such as chromosomes and plasmids must be back to back in the same file.
      • [improvement] splitting genome files into individual accession number files in was not clean; files in are not modified after pipeline runs
  • beta 0.6.1 (December 2018)
    • Major changes:
      • [improvement] : faster algorithm for finding syntenies (from 4-6 min to 25 seconds, 12X improvement) : alignment matrices are no longer rebuilt each time a scoring path (from highest to lowest) is processed
      • [improvement] : use of ncbi “assembly info” data to organize molecules (chromosomes, plasmids, etc.) into organisms
      • [improvement] : creation of a master script that controls the substeps of the pipeline
      • [improvement] : generation of data on fusion genes and on CDSs with multiple alignments that are within 1% of the best bit score.
        Remark : header for fusion genes = prot_fusion_id, q_organims_id, q_element_id, q_gene_id, s_organims_id, s_element_id, s_gene_id, isMirrorData, s_rank_among_all_s_fusions, lengthTotalCoverageSOnQ, lenghtNewContribCoverageSOnQ, pid, hsp_len, mismatches, gaps, qstart, qend, qlength, hstart, hend, hlength, evalue, bits, bdbh, rank, geneIdBestSingleLongMatchHidingProtFusion, numberSingleLongMatchHidingProtFusion
        Remark : header for gene families where multiple matches are within 1% of the best bit score = close_best_match_id, q_organims_id, q_element_id, q_gene_id, s_organims_id, is_mirror_data, number_of_close_matchs, contain_bdbh, pid_best_match, evalue_best_match, bits_best_match, list_close_match_gene_ids
      • [improvement] : mirror data are no longer generated : blastp is run on only half of the matrix of comparisons instead of each organism against all others. As a result, the number of BLAST jobs is reduced and the size of the database is halved.
      • [improvement] faster post-processing steps (after generation of ortholog and synteny data but before database insertion) through parallel processing and overlapping of the different steps (scripts can run on partial output from previous steps) : 60 days instead of 400 days for 5500 genomes (about 7X faster)
      • [improvement] faster presorting of the list of organisms based on ortholog or synteny scores through parallel processing (up to 10 in parallel, so up to 10X faster). Remark : generate_sorted_list_comp_elements.pl, integrate_sorted_list_comp_orga_whole.pl
      • [improvement] lower memory usage for the step that deals with the taxonomic tree : RAM usage from 5.7-6.3 GB down to 655 MB-1.2 GB (~5X improvement). Remark : generate_NCBITaxonomy
      • [improvement] pipeline more robust to failure : if a genome insertion run fails, the pipeline offers either to delete the partially inserted data or to try processing them again.
    • Minor changes:
      • [feature] : script to automatically download annotated genome files of interest from ncbi
      • [improvement] : do not allow very long gaps within syntenies; option -max_gap_size_for_creation_penalty (-mgsc) defines the gap length up to which the gap_opening penalty is applied, before the gap_extension penalty is applied
      • [improvement] : run_Insyght_pipeline.pl: check that the child processes launched are still alive while waiting for them, else die with error → useful to detect kernel kills due to lack of resources.
      • [improvement] : generate_sorted_list_comp_elements.pl compatible with no mirror data
      • [improvement] : Task_blast_all_generator_for_IDRIS.pl : option to use environment variables in bash script (${QSUB_LOG_DIR}, ${PLAST_EXEC_PATH}, ${FASTA_MAKEBLASTDB_INPUT_DIR}, ${BLAST_OUTPUT_DIR}, ${MARKER_DONE_BLAST_DIR})
      • [bug] : file names generated by scripts when setting the to CLUSTER_ORGANISM could be very long and problematic for most Unix file systems (the filename length limit is 255, ‘getconf NAME_MAX /’, and the path length limit is 4096, ‘getconf PATH_MAX /’) : keep only a representative organism id for the name of the file /list_cluster/list_cluster.txt
      • [improvement] : make align able to handle a BLAST output file that includes many different organisms
      • [bug] : for the software that finds syntenies (ComProMix), turn on the flags -Wall -Wshadow -Wextra -std=c++11 and fix the associated warnings, so that no warnings remain at compile time
      • [bug] : for the software that finds syntenies (ComProMix), if several syntenies share the highest score, take the longest one
      • [improvement] : for the software that finds syntenies (ComProMix), option to print a tandem duplication if it involves a bdbh (successive homologs that are not on the diagonal but next to each other or one above the other in the matrix). Remark : _tandem_dups_table.tsv
      • [improvement] : for the software that finds syntenies (ComProMix), print syntenies that branch to other, bigger syntenies in a special file. Remark : _isBranchedToAnotherSynteny_table
      • [improvement] : replace atoi with std::stoi and handle the related exceptions accordingly in HomologyMatrix.cc
      • [improvement] : modify MakeFile so that compiling main_align no longer depends on the psql libraries
      • [improvement] : [Pipeline] update to blast 2.6
      • [bug] : remove special characters that cause problems while parsing genome files (e.g. a gene name containing a backslash)
      • [improvement] : update scripts to new database schema 3
      • [improvement] : modified script Task_add_alignment_integrator_for_IDRIS.pl with option -TRIM_UNUSED_ALIGNMENT_PARAMS_DATA. When turned ON, no row is inserted in the table alignment_params for a pairwise comparison of elements that does not lead to any synteny or ortholog (tables alignments and alignment_pairs are empty) ; this saves 5-40% of disk usage depending on the average number of elements per organism (the gain is largest for organisms with many contigs) ; the script is 3-4% slower with this option
      • [improvement] : modified Perl script Task_add_alignment_integrator_for_IDRIS.pl with option -FILE_BLACKLISTED_ORGANISM_IDS to not insert data for the organism_ids listed in the blacklist file (1 organism id per line)
      • [improvement] : better handling of concurrent access to file for script process_Task_add_alignment_parse_tsv_output_Replacing_Forward_Reverse.pl
      • [improvement] : added RESTRICT_TO_LIST_ASSEMBLY_ACCESSION option to download_genome_files_from_ncbi.pl
      • [improvement] : added options -OUTPUT_DIR and -INPUT_DIR to generate_sorted_list_comp_elements.pl and integrate_sorted_list_comp_orga_whole.pl
      • [improvement] : added a database check so that the computation for a given element id is not redone if the flag CONTINUE_MULTI_RUNS_COMPUTATION is activated
      • [bug] : fixed an error in generate_sorted_list_comp_elements.pl : wrong computation of the alignment score when there is no mirror data
      • [improvement] : database integrity is not affected when an error occurs and data were already inserted in previous runs
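
Illustration for the atoi to std::stoi item above (HomologyMatrix.cc). The sketch below is not taken from the ComProMix sources; the helper name parse_int_strict and its signature are hypothetical. It only shows why the change matters: atoi returns 0 on unparsable input and gives no way to detect an error, whereas std::stoi throws std::invalid_argument or std::out_of_range, which the caller can handle.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (assumption, not the actual ComProMix code):
// parse the whole string as a base-10 int and report failure explicitly,
// instead of silently yielding 0 as atoi would on bad input.
bool parse_int_strict(const std::string& text, int& value) {
    try {
        std::size_t consumed = 0;
        const int parsed = std::stoi(text, &consumed);
        if (consumed != text.size()) {
            return false;  // trailing characters, e.g. "12abc"
        }
        value = parsed;
        return true;
    } catch (const std::invalid_argument&) {
        return false;  // no leading digits at all, e.g. "" or "abc"
    } catch (const std::out_of_range&) {
        return false;  // value does not fit in an int
    }
}
```

A caller would typically test the return value and abort with a clear message (for instance naming the offending column of the input file) rather than continue with a bogus 0. Whether the real code rejects trailing characters is not stated in these notes; that check is a design choice of this sketch.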
release_notes_on_pipeline.txt · Last modified: 2020/01/08 15:09 by thomas.lacroix@inrae.fr