We are looking for passionate Ph.D. and M.S. students who want to study data science and bioinformatics. If interested, please contact Dr. Gangman Yi.
Research Area
Machine Learning & Bioinformatics
Research
Gene Annotation
Mar 29, 2018
Genome Search Plotter
Sep 13, 2017
geneCo
Jan 27, 2019
C-Hunter
May 1, 2007
SClassify
Aug 9, 2012
cPlot
Sep 21, 2022
PROFESSOR
Division of AI Software Convergence
Dr. Gangman Yi, Ph. D.
Dongguk University, 30 Pildong-ro 1-gil
Seoul, 04620, Korea
Office : #9114, New Engineering Building
E-mail:gangman@dongguk.edu
Phone :+82-2-2260-3340
Dr. Yi received the master’s and Ph. D. degrees in Computer Science from Texas A&M University, USA, in 2007 and 2011, respectively. In 2011, he joined the System Software Group, Samsung Electronics, Suwon, South Korea. He was with the Department of Computer Science and Engineering, Gangneung-Wonju National University, South Korea, in 2012. Since 2016, he has been with the Department of Multimedia Engineering, Dongguk University, Seoul, South Korea. His research is interdisciplinary and focuses especially on the development of computational methods to improve understanding of biological systems and their big data. He actively serves as a Managing Editor and a Reviewer of international journals, and as the Chair of international conferences and workshops.
My research focuses on the development of computational methods to improve understanding of big data from biological systems.
2023.9 - current : Professor at Division of AI Software Convergence, Dongguk University, Seoul, Korea
2018.9 - 2023.8 : Associate professor at Dept. of Multimedia Engineering, Dongguk University, Seoul, Korea
2016.9 - 2018.8 : Assistant professor at Dept. of Multimedia Engineering, Dongguk University, Seoul, Korea
2012.3 - 2016.8 : Associate professor at Dept. of Computer Science & Engineering at Gangneung-Wonju National University, Korea
2011.5 - 2012.2 : Samsung Electronics, Suwon, South Korea
Visiting associate professor at The University of British Columbia, BC, Canada, 2022.08-2023.08
Director of the graduate school, Dongguk University, 2021.02-2022.07
Head, Department of Multimedia Engineering, Dongguk University, 2021.02-2022.07
Chief of the youth entrepreneurship center at Dongguk University, 2020.2-2021.01
Head, Department of Computer Science & Engineering, Gangneung-Wonju National University, 2015-2016.8
PD, ABEEK, Department of Computer Science & Engineering, Gangneung-Wonju National University, 2015-2016.8
Steering committee member, Ministry of Education Specialization Project (Convergence New-Industry Software Talent Development), 2014-2016.8
- Associate Editor, International Journal of Cognitive Computing in Engineering, 2019.09-
- Associate Editor, Journal of Ambient Intelligence and Humanized Computing, 2018.09-
- Member of Editorial board, Interdisciplinary Sciences: Computational Life Sciences, 2018.09-
- Associate Editor, Human-centric Computing and Information Sciences (HCIS), 2014-2018.2
- Managing Editor, Journal of Information Processing Systems (JIPS), 2013-2018.2
- Managing Editor, Journal of Convergence, 2013-2018.2
Societal committees and leadership
- Korea Information Processing Society (KIPS), 2013-
- Task force member of IEEE CIS Task Force "Intelligent Agents", http://tfia.diem.unisa.it, 2017-
- IEEE member
Guest Editors
- "Novel Machine Learning Approaches for Intelligent Big Data", Symmetry, 2017
- "Advanced Computer Science and Applications for Soft Computing of Converged IT environments", Soft Computing, 2016
- "Advanced algorithms for Humanized Information Technologies and Applications", Journal of Ambient Intelligence and Humanized Computing, 2016
- "Advances in Next Era Cloud-Empowered Computing and Techniques", Journal of Supercomputing, 2015
General Chairs
- The 2018 World Congress on Information Technology Applications and Services (World IT Congress 2018)
- The 2018 Global Conference on Information Technology, Computing, and Applications (Global IT 2018)
- The 2017 Global Conference on Information Technology, Computing, and Applications (Global IT 2017)
- The 2017 World Congress on Information Technology Applications and Services (World IT Congress 2017)
- The 12th International Conference on Future Information Technology (FutureTech2017)
- The 2017 International Conference on Big data, IoT, and Cloud Computing (BIC 2017)
- The 12th KIPS International Conference on Ubiquitous Information Technologies and Applications (CUTE 2017)
- The 2016 Global Conference on Information Technology, Computing, and Applications (Global IT 2016)
- The 10th International Conference on Multimedia and Ubiquitous Engineering (MUE2016)
- The 5th International Conference on Ubiquitous Computing Application and Wireless Sensor Network (UCAWSN-16)
- The 11th KIPS International Conference on Ubiquitous Information Technologies and Applications (CUTE 2016)
- The 2015 International Conference on Advanced Computing and Services (ACS 2015)
- Global IT 2015, Las Vegas, USA, January 13~15, 2015
Program Chairs
- 2018 IEEE International Symposium on Intelligent Agents (IA 2018)
- The 9th International Conference on Multimedia and Ubiquitous Engineering (MUE-15)
- The 4th FTRA International Conference on Ubiquitous Computing Applications and Wireless Sensor Network (UCAWSN-15)
- The 5th International Conference on Advanced Intelligent Mobile Computing (AIM 2015)
- The 10th FTRA International Conference on Future Information Technology (FutureTech 2014)
- The 9th FTRA International Conference on Multimedia and Ubiquitous Engineering (MUE 2014)
- The 11th FTRA International Conference on Information Technology Convergence and Services (ITCS-14)
- The 11th FTRA International Conference on Secure and Trust Computing, data management, and Applications (STA 2014)
- The 2nd FTRA International Conference on Ubiquitous Computing Application and Wireless Sensor Network (UCAWSN-14)
- The 2014 International Workshop on Advanced Multimedia Computing (AMC-14)
- The 2014 FTRA International Conference on Advanced Computing and Services (ACS-14)
- The 2014 International Symposium on Frontier and Innovation in Future Computing and Communications (FCC-14)
- The 5th FTRA International Conference on Computer Science and its Applications (CSA-13)
- The 2013 International Congress on 3D IT, Communications, and Convergence (3DITCom 2013)
- The 2nd International Conference on Ubiquitous Context-Awareness and Wireless Sensor Network (UCZWSN-14)
Workshop Co-Chairs
- The 29th IEEE International Conference on Advanced Information Networking and Applications (AINA 2015)
Gene Annotation
AGORA: Annotator for Genome of Organelle from Referenced sequence Analysis
Author: Gangman Yi Date: Dec 15, 2019
Next-generation sequencing (NGS) technologies have led to the accumulation of high-throughput sequence data from various organisms in biology. To apply gene annotation of organellar genomes for various organisms, more optimized tools for functional gene annotation are required. Almost all gene annotation tools are mainly focused on the chloroplast genome of land plants or the mitochondrial genome of animals. We have developed a web application AGORA for the fast, user-friendly and improved annotations of organellar genomes. Annotator for Genes of Organelle from the Reference sequence Analysis (AGORA) annotates genes based on a basic local alignment search tool (BLAST)-based homology search and clustering with selected reference sequences from the NCBI database or user-defined uploaded data. AGORA can annotate the functional genes in almost all mitochondrion and plastid genomes of eukaryotes. The gene annotation of a genome with an exon–intron structure within a gene or inverted repeat region is also available. It provides information of start and end positions of each gene, BLAST results compared with the reference sequence and visualization of gene map by OGDRAW.
- In order to run BLAST with the query and references, you need to provide reference sequences. If you have user-defined amino acid or nucleotide sequences, you can upload them to the system. Otherwise, you can enter an accession number such as NC_000000. For more information, please refer to the NCBI site
Input File Format :
- The input file is an assembled contig in FASTA format. AGORA allows only one assembled query contig. For more information about FASTA, please refer to the NCBI site
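The single-contig requirement can be checked before uploading. The following is a minimal sketch, not part of AGORA itself; the function names and the example record are hypothetical.

```python
from typing import List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_single_contig(text: str) -> Tuple[str, str]:
    """Return the only record, or raise if the file does not hold exactly one."""
    records = read_fasta(text)
    if len(records) != 1:
        raise ValueError(f"expected exactly one contig, found {len(records)}")
    header, sequence = records[0]
    if not sequence:
        raise ValueError("the contig has no sequence data")
    return header, sequence


# Hypothetical example input: one contig split over two sequence lines.
example = ">contig_1 hypothetical example\nATGGCGTTA\nGGCTAA\n"
header, sequence = check_single_contig(example)
```

A file with two or more ">" headers raises a ValueError, which mirrors the one-contig limit described above.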
Type :
- Select Chloroplast or Mitochondrion for your organellar genome
Genetic Code
- This code is used for running tBLASTn
Standard
Vertebrate Mitochondrial
Yeast Mitochondrial
Mold Mitochondrial
Invertebrate Mitochondrial
Ciliate Nuclear
Echinoderm Mitochondrial
Euplotid Nuclear
Bacteria and Archaea
Alternative Yeast Nuclear
Ascidian Mitochondrial
Flatworm Mitochondrial
Blepharisma Macronuclear
For more information about genetic code please see NCBI
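For scripted use, the names above can be mapped to NCBI translation table numbers. This is a hedged sketch: the numbers follow NCBI's genetic code table numbering, while the dictionary, the function name, and the idea of passing the value to tblastn through its `-db_gencode` option are assumptions about usage, not a description of how AGORA works internally.

```python
from typing import Dict, List

# Names as listed on this page, mapped to NCBI translation table numbers.
GENETIC_CODES: Dict[str, int] = {
    "Standard": 1,
    "Vertebrate Mitochondrial": 2,
    "Yeast Mitochondrial": 3,
    "Mold Mitochondrial": 4,
    "Invertebrate Mitochondrial": 5,
    "Ciliate Nuclear": 6,
    "Echinoderm Mitochondrial": 9,
    "Euplotid Nuclear": 10,
    "Bacteria and Archaea": 11,
    "Alternative Yeast Nuclear": 12,
    "Ascidian Mitochondrial": 13,
    "Flatworm Mitochondrial": 14,
    "Blepharisma Macronuclear": 15,
}


def tblastn_gencode_args(name: str) -> List[str]:
    """Build the tblastn arguments that select a genetic code for the database."""
    if name not in GENETIC_CODES:
        raise KeyError(f"unknown genetic code name: {name!r}")
    return ["-db_gencode", str(GENETIC_CODES[name])]


args = tblastn_gencode_args("Vertebrate Mitochondrial")
```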
Output
Output :
- As the examples below show, the output file is the BLAST result, which includes amino acid and nucleotide matches. The references are used as the query and the query contig is used as the database. The number of matched positions is determined by the "Maximum matched sub gene's count" setting.
The blast result of amino acid
The blast result of nucleotide
Amino acid db sequences :
- This file includes the amino acid database sequences. If the user uploaded user-defined sequences, this file is the same as the uploaded file. Otherwise, the system generates it automatically from NCBI.
Amino acid sequences :
- This file contains the CDS translations matched by BLAST
Output CSV file :
- This file provides the start and end position, direction and gene product for each gene.
Nucleotide db sequences :
- This file contains the nucleotide database sequences.
Nucleotide sequences :
- This FASTA-formatted file includes the sequences matched by BLAST.
GenBank File format :
- This is a GenBank-formatted file. It is used to draw the circular gene map with OGDRAW
OGDRAW :
- If all genes are matched correctly, the gene map figure is displayed.
Genome Search Plotter
A Robust Method for Finding the Automated Best Matched Genes Based on Grouping Similar Fragments of Large-Scale References for Genome Assembly
Author: Gangman Yi Date: Sep 13, 2017
Big data research on genomic sequence analysis has accelerated considerably with the development of next-generation sequencing. Currently, research on genomic sequencing has been conducted using various methods, ranging from the assembly of reads consisting of fragments to the annotation of genetic information using a database that contains known genome information. As a result of this development, most tools for analyzing the genetic information of new organelles require different input formats such as FASTA, GenBank (GB), and tab-separated files. These data formats must be modified to satisfy the requirements of the gene annotation system after genome assembly. In addition, the currently available tools for organelle analysis are usually developed only for specific organisms, so the need for gene prediction tools that work for any organism has increased. The proposed method, termed the genome_search_plotter, is designed for the easy analysis of genome information from the related references without any file format modification. Anyone who is interested in intracellular organelles such as the nucleus, chloroplast, and mitochondria can analyze the genetic information using the assembled contig of an unknown genome and a reference model without any modification of the data from the assembled contig.
Reference accession number :
- In order to run BLAST with the query contigs and references, reference sequences are required. For more information, please refer to the NCBI site
Input File Format :
- The input file contains the assembled contigs and should be in FASTA format. The number of contigs is not limited, but processing may take several hours. We recommend copying the result page URL for your reference.
Maximum number of matched BLAST hit group :
- This number sets the maximum number of groups, which are based on the BLAST results.
Minimum number of matched sub gene's count per each contig :
- Please refer to Figure 1-B below.
OUTPUT
The provided files are "sorted sequences" and "PDF" files
Sorted by the number of subgene :
- BLAST result file which is sorted by BLAST e-value. In the sorted query sequences file, sequences are sorted by the number of matched sub-genes and the sequences that do not meet the minimum value of k are filtered out
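The sorting and filtering step can be pictured with a short sketch. It assumes that "meeting the minimum value of k" means a count of at least k and that sequences are listed from most to fewest matched sub-genes; the function name and the example counts are hypothetical, and this is not the tool's actual code.

```python
from typing import Dict, List, Tuple


def sort_and_filter(subgene_counts: Dict[str, int], k: int) -> List[Tuple[str, int]]:
    """Keep contigs with at least k matched sub-genes, most matches first."""
    kept = [(name, count) for name, count in subgene_counts.items() if count >= k]
    # Sort by descending count; ties are broken by contig name for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


# Hypothetical matched sub-gene counts per query contig.
counts = {"contig_a": 3, "contig_b": 1, "contig_c": 5}
result = sort_and_filter(counts, k=2)
```

Here contig_b is dropped because it has fewer than two matched sub-genes.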
PDF file:
- The results are shown as a graph depicting matches, with the reference genomes on the X-axis and the query sequences uploaded by the user on the Y-axis. Each line on the plot indicates that a query contig is matched with the reference sequences.
geneCo
geneCo: A visualized comparative genomic method to analyze multiple genome structures
Author: Gangman Yi Date: Jan 27, 2019
In comparative and evolutionary genomics, a detailed comparison of common features between organisms is essential to evaluate genetic distance. However, identifying differences in matched and mismatched genes among multiple genomes is difficult using current comparative genomic approaches due to complicated methodologies or the generation of meager information from obtained results. This study describes a visualized software tool, geneCo (gene Comparison), for comparing genome structure and gene arrangements between various organisms. User data are aligned, gene information is recognized, and genome structures are compared based on user-defined GenBank files. Information regarding inversion, gain, loss, duplication, and gene rearrangement among multiple organisms being compared is provided by geneCo, which uses a web-based interface that users can easily access without any need to consider the computational environment.
Identifying clusters of functionally related genes in genomes
Author: Gangman Yi Date: May 1, 2007
C-Hunter is a new clustering algorithm that incorporates knowledge of gene function derived from Gene Ontology together with the organization of genes on chromosomes. To use the C-Hunter program, basic data sets are needed. All data sets for eight species (AT, CE, DM, DR, EC, HS, MM, and SC), the Data/Map file, GO, gene2accession, and gene2go are supplied with the C-Hunter program. If you want to use new data sets, you can download them from each website and rebuild them. The C-Hunter program can be compiled under a Unix/Linux/Windows (Cygwin) environment if the compiler supports the STL.
Installation
tar -zxvf C_Hunter_v.1.2.tar.gz
cd C_Hunter_v.1.2
./install
Procedure for preparing data sets
If you are going to use the ready-made data sets, you can skip this procedure.
1. Download go.obo text file (includes Molecular Function, Biological Process, and Cellular Component) at http://www.geneontology.org/page/download-ontology
2. Download gene2accession at NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
3. Download gene2go at NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
4. Make Chromosome list file (text file by tab-separated format)
1st_Column = Chromosome Number
2nd_Column = Accession
3rd_Column = GI
5. Run "obo2scheme.py" with the Gene Ontology data from step 1
Usage : python obo2scheme.py go.obo (file from #1)
The output file name will be "scheme.GO.data"
6. Run "Make_GO_Map_Data.php" with gene2accession, gene2go and Chr_list
Usage : php Make_GO_Map_Data.php gene2accession, gene2go, Chr_list, Output file name
This conversion program generates the map files and GO data files.
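The chromosome list file from step 4 could be produced with a few lines of Python. This is only a sketch of the three-column, tab-separated layout described above; the function name and the rows are placeholders, and it assumes the file has no header row.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_chr_list(rows: Iterable[Tuple[str, str, str]], handle: TextIO) -> None:
    """Write (chromosome number, accession, GI) rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for chromosome, accession, gi in rows:
        writer.writerow([chromosome, accession, gi])


# Placeholder values only; replace them with real accessions and GI numbers.
rows = [
    ("1", "NC_000000", "000000001"),
    ("2", "NC_000001", "000000002"),
]
buffer = io.StringIO()
write_chr_list(rows, buffer)
chr_list_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open("Chr_list", "w", newline="")` as the handle.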
Supervised Protein Family Classification and New Family Construction
Author: Gangman Yi Date: Aug 9, 2012
SClassify is a supervised protein family classification algorithm that overcomes the problems of existing supervised and unsupervised algorithms and achieves much improved accuracy. It can assign proteins to existing families in databases, and by taking into account similarities between the unclassified proteins, can assign them to new families.
Installation
The SClassify source code, including sample input and output files, can be compiled under the Unix/Linux/Windows(Cygwin) environment. The following steps will create a directory called sclassify. Detailed usage of SClassify is provided in a README file.
The program assumes that e-values between each unclassified protein and each
protein in existing families and e-values between each pair of unclassified
proteins have already been obtained by other software such as BLAST or SSEARCH.
The following files are needed:
1. A file that lists the name of each protein in existing families along with
the name of its family in a two-column tab-separated format (example file:
pfam.list).
2. A file that lists the name of each unclassified protein in a one-column
format (example file: test.list).
3. A file that lists the e-values between each unclassified protein and each
protein in existing families in a three-column tab-separated format that
gives the name of an unclassified protein, the name of a protein in an
existing family, and the e-value between them. There is no need to have
an e-value for each pair if some of them are missing. The file is optional
(example files: blast/test_pfam.score, ssearch/test_pfam.score).
4. A file that lists the e-values between each pair of unclassified proteins
in a three-column tab-separated format that gives the names of two
unclassified proteins and the e-value between them. There is no need to
have an e-value for each pair if some of them are missing. The file is
optional (example files: blast/test_test.score, ssearch/test_test.score).
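To make the four file layouts concrete, here is a minimal parsing sketch. It is not taken from the SClassify source; the function names and the sample protein and family names are hypothetical, and missing pairs are simply absent from the parsed scores.

```python
from typing import Dict, List, Tuple


def parse_family_list(text: str) -> Dict[str, str]:
    """File 1: protein name <TAB> family name."""
    families: Dict[str, str] = {}
    for line in text.splitlines():
        if line.strip():
            protein, family = line.split("\t")
            families[protein] = family
    return families


def parse_protein_list(text: str) -> List[str]:
    """File 2: one unclassified protein name per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_scores(text: str) -> Dict[Tuple[str, str], float]:
    """Files 3 and 4: name <TAB> name <TAB> e-value."""
    scores: Dict[Tuple[str, str], float] = {}
    for line in text.splitlines():
        if line.strip():
            first, second, evalue = line.split("\t")
            scores[(first, second)] = float(evalue)
    return scores


# Hypothetical sample content for each file.
families = parse_family_list("P1\tfamA\nP2\tfamA\nP3\tfamB\n")
unclassified = parse_protein_list("U1\nU2\n")
family_scores = parse_scores("U1\tP1\t1e-30\nU2\tP3\t0.002\n")
pair_scores = parse_scores("U1\tU2\t5e-10\n")
```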
USAGE
./sclassify -c infile1 -u infile2 -p infile3 -n infile4 -e cutoff -o outfile
where infile1 to infile4 are the input files described above, cutoff is the e-value cutoff, and outfile is the output file.
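When SClassify is driven from a script, the command line above can be assembled as an argument list. The sketch assumes that infile1 to infile4 correspond to the four input files in the order they are described; the helper name, the cutoff value, and the output file name are hypothetical.

```python
from typing import List


def build_sclassify_command(
    family_list: str,
    unclassified_list: str,
    family_scores: str,
    pairwise_scores: str,
    cutoff: str,
    outfile: str,
) -> List[str]:
    """Assemble the sclassify argument list in the order shown in USAGE."""
    return [
        "./sclassify",
        "-c", family_list,        # file 1: proteins in existing families
        "-u", unclassified_list,  # file 2: unclassified proteins
        "-p", family_scores,      # file 3: unclassified vs. family e-values
        "-n", pairwise_scores,    # file 4: unclassified vs. unclassified e-values
        "-e", cutoff,             # e-value cutoff, passed through as text
        "-o", outfile,
    ]


command = build_sclassify_command(
    "pfam.list",
    "test.list",
    "blast/test_pfam.score",
    "blast/test_test.score",
    "1e-5",
    "result.txt",
)
```

The resulting list can be handed to `subprocess.run(command, check=True)` from the directory that contains the compiled program.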
OUTPUT
The output file is in a two-column tab-separated format that lists the name
of each protein that is classified and the name of its assigned family. A
distinct name is generated for each new family, and the same name is used for
all proteins that are classified to the same family.
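Because the output is a plain two-column file, grouping proteins by their assigned family takes only a few lines. This is an illustrative sketch; the function name, the sample rows, and the new-family name "new_1" are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_family(text: str) -> Dict[str, List[str]]:
    """Group classified proteins by assigned family (protein <TAB> family)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            protein, family = line.split("\t")
            groups[family].append(protein)
    return dict(groups)


# Hypothetical output: U1 joins an existing family, U2 and U3 form a new one.
sample_output = "U1\tfamA\nU2\tnew_1\nU3\tnew_1\n"
groups = group_by_family(sample_output)
```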
SCRIPTS
Two scripts are provided to convert the results from BLAST and from SSEARCH to
a three-column tab-separated format.
1. BLAST converter
Usage: python convert_blast.py infile outfile
where infile contains the results from BLAST, and outfile is the output
file.
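The idea behind the conversion can be sketched as follows. This is not the bundled convert_blast.py: it assumes BLAST tabular output (`-outfmt 6`), where the query, subject, and e-value are the 1st, 2nd, and 11th columns, and the function name and sample row are hypothetical. The real script may expect a different BLAST report format.

```python
def blast_tabular_to_scores(text: str) -> str:
    """Reduce BLAST tabular rows to query <TAB> subject <TAB> e-value."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 11:
            raise ValueError(f"expected at least 11 columns, got {len(fields)}")
        rows.append("\t".join((fields[0], fields[1], fields[10])))
    return "\n".join(rows) + ("\n" if rows else "")


# Hypothetical tabular hit: query U1 against family protein P1.
sample = "U1\tP1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t350\n"
converted = blast_tabular_to_scores(sample)
```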
cPlot: Contig-Plotting Visualization for the Analysis of Short-Read Nucleotide Sequence Alignments
Author: Gangman Yi Date: Sep 21, 2022
Advances in next-generation sequencing technology have led to a dramatic decrease in read-generation cost and an increase in read output. Reconstruction of short DNA sequence reads generated by next-generation sequencing requires a read alignment method that reconstructs a reference genome. In addition, it is essential to analyze the results of read alignments for a biologically meaningful inference. However, read alignment from vast amounts of genomic data from various organisms is challenging in that it involves repeated automatic and manual analysis steps. Here, we devised the cPlot software for read alignment of nucleotide sequences, with automated read alignment and position analysis, which allows visual assessment of the analysis results by the user. cPlot compares sequence similarity of reads by performing multiple read alignments, with FASTA format files as the input. This application provides a web-based interface for the user for facile implementation, without the need for a dedicated computing environment. cPlot identifies the location and order of the sequencing reads by comparing the sequence to a genetically close reference sequence in a way that is effective for visualizing the assembly of short reads generated by NGS and rapid gene map construction.