ANNOTATION, MINING AND RETRIEVAL OF TCM PATENTS



Na Deng

Hubei University of Technology

Copyright © 2019 by Cayley Nielson Press, Inc.

ISBN: 978-1-5323-9663-2

Cayley Nielson Press Scholarly Monograph Series Book Code No.: 209-4-2

US$162.50

Preface


Traditional Chinese medicine (TCM) is becoming increasingly popular around the world because of its mild properties and strong therapeutic effects, and the number of TCM patent applications is growing year by year. As a special kind of information carrier, TCM patents contain a great deal of legal, technical, medical and other information. Using this information effectively will greatly promote the development of traditional Chinese medicine. An important step in the analysis of TCM patents is the annotation of key information, which serves both as the keywords for patent analysis, mining and retrieval and as a valuable training set for machine learning algorithms. This book focuses on the annotation, mining and retrieval of TCM patent texts, with the aim of providing ideas for other kinds of analysis and processing of TCM patents.


This book is supported by the National Key Research and Development Program of China under Grant 2017YFC1405403; the National Natural Science Foundation of China under Grant 61075059; the Philosophical and Social Sciences Research Project of Hubei Education Department under Grant 19Q054; the Green Industry Technology Leading Project (product development category) of Hubei University of Technology under Grant CPYF2017008; the Research Foundation for Advanced Talents of Hubei University of Technology under Grant BSQD12131; the Natural Science Foundation of Anhui Province under Grant 1708085MF161; and the Key Project of Natural Science Research of Universities in Anhui under Grant KJ2015A236.


As co-instructors, Professor Zhiwei Ye and Caiquan Xiong devoted a great deal of effort to this work. Because conditions and my own level of expertise are limited, the selection and evaluation of the methods may be inappropriate or even wrong in places. I would be grateful for readers' criticism and corrections.

Deng Na
Hubei University of Technology
Wuhan, China
Sept 10, 2019

 

 

Contents


 

PREFACE
1 PATENT
1.1 WHAT IS PATENT?
1.2 CHARACTERISTICS OF PATENTS
1.2.1 CHARACTERISTICS OF STRUCTURE
1.2.2 CHARACTERISTICS OF PATENT LANGUAGE
1.3 ABSTRACT TEXT OF PATENT
2 ANALYSIS AND MINING OF PATENTS
2.1 BACKGROUND
2.2 INFORMATION EXTRACTION IN PATENTS
2.3 PATENT PREDICTION
2.4 PATENT CLASSIFICATION
2.5 PATENT CLUSTERING
2.6 PROBLEMS IN MINING OF CHINESE PATENTS
3 TCM AND TCM PATENTS
3.1 TCM
3.2 TCM PATENTS
4 ANNOTATION OF FOUR CHARACTER MEDICINE EFFECT PHRASES
4.1 PROBLEM DESCRIPTION
4.2 FOUR CHARACTER MEDICINE EFFECT PHRASES
4.3 CHARACTERISTICS OF FOUR CHARACTER MEDICINE EFFECT PHRASES
4.3.1 POSITION CHARACTERISTICS
4.3.2 PART OF SPEECH CHARACTERISTICS
4.4 THE METHOD
4.4.1 DEFINITION
4.4.2 THE IDEA OF OUR METHOD
4.4.3 INTERFERENCE ITEMS
4.4.4 FLOW CHART
4.4.5 PSEUDO CODES OF OUR METHOD
4.5 EXPERIMENT AND ANALYSIS
4.5.1 DATA SOURCE
4.5.2 COLLECTION OF CHINESE HERBAL MEDICINE NAMES AND STOP WORDS
4.5.3 INITIAL SEED WORDS
4.5.4 ITERATIONS
4.6 CONCLUSIONS
5 ANNOTATION OF DISEASE NAMES
5.1 PROBLEM DESCRIPTION
5.2 CHARACTERISTICS OF DISEASE NAMES IN TCM PATENT TEXTS
5.3 CLUE WORDS
5.4 METHOD
5.4.1 PREPROCESSING
5.4.2 COLLECTION OF DISEASE NAME SEEDS
5.4.3 COLLECTION OF CLUE WORDS
5.4.4 CALCULATION OF THE WEIGHTS OF CLAUSES
5.4.5 STORAGE OF WEIGHTS OF CLAUSES
5.4.6 CREATION OF CANDIDATE CLAUSES
5.5 EXPERIMENTS
5.6 CONCLUSION
6 CLUSTERING OF FOUR CHARACTER MEDICINE EFFECT PHRASES
6.1 PROBLEM DESCRIPTION
6.2 FOUR CHARACTER MEDICINE EFFECT PHRASES
6.2.1 CHARACTERISTICS OF WORD-BUILDING
6.2.2 CHARACTERISTICS OF PART OF SPEECH COMBINATION
6.3 SIMILARITY CALCULATION OF PHRASES
6.4 K-CENTROID CLUSTERING ALGORITHM
6.5 EXPERIMENT
6.5.1 DATA SOURCE
6.5.2 SIMILARITY
6.5.3 CLUSTERING RESULT
6.6 CONCLUSION AND FUTURE WORK
7 THE COLLECTION OF STOP WORDS IN TCM PATENTS
7.1 STOP WORDS
7.2 METHOD
7.3 DATA SOURCE
7.4 EXPERIMENT
8 SEMANTIC SIMILARITY COMPUTATION OF TCM PATENTS
8.1 PROBLEM DESCRIPTION
8.2 METHOD
8.2.1 COLLECTION OF STOP WORDS IN TCM PATENTS
8.2.2 TRAINING OF WORD2VEC MODEL
8.2.3 SEMANTIC SIMILARITY CALCULATION
8.3 EXPERIMENTS
8.5 CONCLUSION
9 CLUSTERING ANALYSIS AND VISUALIZATION OF TCM PATENTS
9.1 PROBLEM DESCRIPTION
9.2 CLUSTERING OF TCM PATENTS
9.2.1 REPRESENTATION OF TCM PATENTS
9.2.2 FLOW CHART OF CLUSTERING ALGORITHM
9.2.3 PREPROCESSING
9.2.4 GENERATION OF DATA SET
9.2.5 TRAINING OF DOC2VEC NEURAL NETWORK
9.2.6 K-MEANS CLUSTERING
9.2.7 STORAGE OF CLUSTERING RESULTS
9.3 VISUALIZATION OF TCM PATENTS
9.4 CONCLUSION
REFERENCES


 


 
