A list of resources on scholarly data analysis, including datasets, papers, and code on bibliometrics, citation analysis, and other scholarly commons resources. Available online at https://shubhanshu.com/awesome-scholarly-data-analysis/
Open Academic Graph - MAG + AMiner
OpenAIRE Research Graph - More info here
Humanities and multilingual citation string parsing: Flux-CiM and ICONIP (see the Neural ParsCit paper for details)
Citation string parsing data for social sciences for English and German citations - comparison with Grobid and Cermine
Sherpa/Romeo (Publisher copyright policies & self-archiving)
Fatcat - versioned, publicly-editable catalog of research publications
Self-citation analysis data based on PubMed Central subset (2002-2005)
A dataset of publication records for Nobel laureates - paper
OpenAIRE Scholexplorer - 126+ Million literature-dataset and dataset-dataset links between 12+ Million objects - About the data
MEDLINE/PubMed Baseline Repository (MBR) - All Medline abstracts and paper metadata in XML
Semantic Scholar Graph of References in Context (GORC) dataset
SciMag - Microsoft Academic Linked to SciMago Journals - WebPage
Citations to scholarly data in Wikipedias of various languages - Code
S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers
Dataset Search: metadata for datasets - Datasets with DOIs and compact identifiers
PeerRead - paper drafts, reviews, and accept/reject decisions
CiteTracked: A Longitudinal Dataset of Peer Reviews and Citations - Contact Author
APE: Argument Pair Extraction - Annotated ICLR 2013-2020 review-rebuttal argument pairs
Publons review length dataset with 498K reviews - anonymized
Peer review analyze: A novel benchmark resource for computational analysis of peer reviews
Open Editors: data about scholarly journals’ editors and editorial board members - Github
SCIENTIFIC GENEALOGY MASTER LIST - Scientists Associated with Concepts in Chemistry & Physics
MENTORSHIP - A dataset of mentorship in science with semantic and demographic estimations - Code
MapAffil 2016 dataset – PubMed author affiliations mapped to cities and their geocodes worldwide
Data from the CVs of over 150 assistant professors in psychology at top-ranked research universities and small liberal arts colleges in the US - Used in this blog
Various career-long citation metrics for 100,000 top scientists
Base data for estimating precision and recall of Author-ity among NIH-funded scientists
ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale
S2AND - Semantic Scholar Author Name Disambiguation Tool and Dataset
The Networked Digital Library of Theses and Dissertations (NDLTD)
Peer-making: the interconnections between PhD Thesis Committee membership and co-publishing - Zenodo
DISAPERE: A Dataset for DIscourse Structure in Academic PEer REview
ACL RD TEC 2.0 also at @CLARIN
Colorado Richly Annotated Full-Text - PubMed abstract annotated with entities mapped to 10 biomedical ontology terms.
SciERC - scientific entities, their relations, and coreference clusters for 500 AI conf abstracts
PubMed200k_RCT - Label abstract sentences into Objective, Background, Method, Results, Conclusions
SemEval-2018 task 7 Semantic Relation Extraction and Classification in Scientific Papers
A Compendium of Free, Public Biomedical Text Mining Tools Available on the Web
Corpus of 40 scientific papers manually annotated with multiple scientific discourse facets
PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track - Train - Dev - Test - Background Test set
Bacteria Biotope (BB) Task - NER, NEL, Relation, KB Extraction
The Regulatory Network of Plant Seed Development (SeeDev) Task - NER, Relation
SeminalSurveyDBLP - Classification of seminal or survey papers
Supp.ai - PubMed supplement-drug interactions and supplement-supplement interactions
GENETAG - More recent versions Publication and Download 2005
Biomedical Abstract Meaning Representation corpus based on PubMed Fulltext - Also see other NLM curated biomedical resources
SciDTB: Discourse Dependency TreeBank for Scientific Abstracts
Dr. Inventor Multi-layer Scientific Corpus for multiple scientific discourse facets
ART corpus - 225 papers manually annotated with the CISP labels (e.g., “Goal”, “Method”, “Result”) - Browse files - Project details
Multi-CoreSC CRA corpus (MCCRA) - 50 papers annotated with multiple CoreSC labels per sentence. - Project details
NeuroQuery - 14,000 full-text publications and 400,000 peak activations - NeuroQuery website
Annotated Corpus of Scientific Conference’s Homepages for Information Extraction
A Fully Coreference-annotated Corpus of Scholarly Papers from the ACL Anthology
A manual corpus of annotated main findings of clinical case reports
Lots of biomedical entity linking and entity identification datasets
Materials Science Named Entity Recognition: train/development/test sets
Named Entity Recognition for Bacterial Type IV Secretion Systems - Paper
Annotating and detecting phenotypic information for chronic obstructive pulmonary disease
MiRoR11 - P2 - Annotated corpus for primary and reported outcomes extraction
Data from: PGxCorpus, a Manually Annotated Corpus for Pharmacogenomics
The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text - SPECIES Direct Download - ORGANISMS Direct Download
Named Entity Recognition: (17.3 MB), 8 datasets on biomedical named entity recognition
Relation Extraction: (2.5 MB), 2 datasets on biomedical relation extraction
Question Answering: (5.23 MB), 3 datasets on biomedical question answering task
SciREX : A Challenge Dataset for Document-Level Information Extraction
Papers with Code - Links between papers and repositories and extraction of SOTA results
S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers
NLPContributionGraph - Structuring Scholarly NLP Contributions in the Open Research Knowledge Graph
The General Index - Metadata, Ngrams, and Keyphrases in 107,233,728 journal articles
NLMChem - a new resource for chemical entity recognition in PubMed full-text literature
Annotated scientific findings with sentence-level and aspect-level certainty
SoftwareKG_Social and SoftwareKG_PubMed - Software mentions in articles
Bioinformatics Named Entity Recogniser for Databases and Software
The CodeMeta Project: preservation, discovery, reuse, and attribution of software
SCIERC: Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction - Code
multimodal_summ: Multimodal summarization of research papers
Entity Linking of Crossref Funding Orgs in Acknowledgements - paper
Microsoft Academic Knowledge Graph (MAKG) - Zenodo ComplEx entity embeddings (120 GB) for all 243 million authors, 239 million publications, 49,000 journals, and 16,000 conferences
I3 Open Innovation Dataset Index - Multiple datasets related to patent networks, inventor careers, etc.
Science4cast Competition - capture the evolution of scientific concepts and predict which research topics will emerge in the coming years
Medical Subject Headings maintained by the National Library of Medicine of the United States
Computer Science Ontology maintained by Scholarly Knowledge: Modeling, Mining and Sense Making
Physics Subject Headings (PhySH) maintained by the American Physical Society (APS) - GitHub
Open Biological and Biomedical Ontology (OBO) maintained by the OBO Foundry
ACM Computing Classification System maintained by the Association for Computing Machinery
Physics and Astronomy Classification Scheme (PACS) maintained by the American Institute of Physics (AIP); discontinued in 2010 and replaced by Physics Subject Headings
Mathematics Subject Classification (MSC) maintained by Mathematical Reviews and zbMATH
Journal of Economic Literature (JEL) maintained by the American Economic Association
STW Thesaurus for Economics maintained by ZBW - Leibniz Information Centre for Economics
Australian and New Zealand Standard Research Classification (ANZSRC) maintained by the Australian Bureau of Statistics; it consists of three sub-classification schemes:
Fields of Research (FoR) classification
Research Fields, Courses and Disciplines (RFCD) classification
Socio-Economic Objective (SEO) classification
Library of Congress Classification (LCC) maintained by the Library of Congress
Fields of Study (FoS) maintained by Microsoft Academic
Scientific Keyphrase Extraction Datasets - KP20k, NUS, MAG_KP
CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation
BibeR - a web-based tool for bibliometric analysis in scientific literature
PublicationHarvester - Download PubMed publications of an author
Publish or Perish - retrieves and analyzes academic citations from MS Academic and Scholar
Data Set Knowledge Graph (DSKG) - an RDF data set about data sets
Quantitative Science Studies (Open Access)
International Conference on Theory and Practice of Digital Libraries (TPDL)
European Semantic Web Conference (ESWC), Research of Research Track
STI Conference series (Science and Technology Indicators, e.g., 2018)
ISSI Conference series (International Conference on Scientometrics & Informetrics, e.g., 2019)
International Society for Scientometrics and Informetrics (ISSI)
SIG/MET - Special Interest Group for the measurement of information production and use
The following people have contributed to the items on this list.
Shubhanshu Mishra - Maintainer of the list.
http://shubhanshu.com/awesome-scholarly-data-analysis/