An increasing amount of research, particularly in medicine and applied science, is now based on meta-analysis and systematic review of the existing literature (example). In such reviews scientists frequently download thousands of articles and analyse them with Natural Language Processing (NLP) through Text and Data Mining (TDM) or Content Mining (ref). A common approach is to search bibliographic resources with keywords, download the hits, scan them manually, and reject papers that do not fit the criteria for the meta-analysis. The typical text-based searches on such sites are broad, produce many false positives, and are often based only on abstracts. We know of cases where systematic reviewers downloaded 30,000 articles and eventually used 30. Retrieval is often done by crawling / scraping sites such as journals, but is easier and faster when the articles are in Open Access repositories such as arXiv, EuropePMC, bioRxiv, and medRxiv. However, each repository has its own API and functionality, which makes it hard for individuals to (a) access it, (b) set flags, and (c) use generic queries.
In 2015 we reviewed tools for scraping websites and decided that none met our needs, so we developed getpapers, whose key advance was integrating query submission with bulk fulltext download of all the hits. getpapers was written in Node.js and has now been completely rewritten in Python 3 (pygetpapers) for easier distribution and integration. Typical use of getpapers is shown in a recent paper where the authors "analyzed key term frequency within 20,000 representative [Antimicrobial Resistance] articles".
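The search-then-bulk-download workflow can be sketched directly against the EuropePMC REST search service, one of the repositories mentioned above. This is a minimal illustration, not pygetpapers itself; the function names (`build_search_url`, `fetch_hits`) are our own, and the endpoint and parameters (`query`, `format`, `pageSize`, `resultType`) follow the public EuropePMC web-service interface as we understand it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUROPEPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_search_url(query, page_size=25, result_type="lite"):
    """Build a EuropePMC keyword-search URL returning JSON results."""
    params = {
        "query": query,
        "format": "json",
        "pageSize": page_size,
        "resultType": result_type,
    }
    return EUROPEPMC_SEARCH + "?" + urlencode(params)

def fetch_hits(query, page_size=25):
    """Download the search hits as a list of result records (needs network)."""
    with urlopen(build_search_url(query, page_size)) as resp:
        return json.load(resp)["resultList"]["result"]
```

A tool like pygetpapers wraps this kind of call per repository, which is exactly the per-API variation that makes a generic query layer valuable.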
Unsupervised entity extraction from sections of papers that have defined boilerplates. Examples of such sections include Ethics Statements, Funders, Acknowledgments, and so on.
Extracting Ethics Committees and other entities related to Ethics Statements from papers
Curating the extracted entities to public databases like Wikidata
Building a feedback loop: from unsupervised entity extraction, to curating the extracted information in public repositories, and then to supervised entity extraction.
The use case goes beyond Ethics Statements: docanalysis is a general package that can extract relevant entities from any section of interest.
Sections like Acknowledgements, Data Availability Statements, etc. all have a fairly generic sentence structure. All you have to do is create an ami dictionary that contains boilerplates for the section of interest; you can then use docanalysis to extract entities. Check this section, which outlines the steps for creating custom dictionaries. In the case of acknowledgements or funding, you might be interested in the players involved. Or you might have a use case we have never thought of!
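The core idea of boilerplate-driven extraction can be sketched in a few lines. This is a toy illustration, not the docanalysis implementation: the phrase list below stands in for a real ami dictionary, and the function name `extract_matching_sentences` is our own.

```python
import re

# Toy stand-in for an ami dictionary: phrases that typically
# appear in Ethics Statement sentences (illustrative only).
ETHICS_BOILERPLATES = [
    "approved by",
    "ethics committee",
    "institutional review board",
    "informed consent",
]

def extract_matching_sentences(text, boilerplates):
    """Return sentences containing any of the boilerplate phrases."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        if any(phrase in lowered for phrase in boilerplates):
            hits.append(sentence.strip())
    return hits
```

For example, given "Patients gave informed consent. The study was approved by the Ethics Committee of Example University. We thank our funders.", the first two sentences match and the acknowledgement does not; swapping in a different dictionary retargets the same code at Funders or Data Availability Statements.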
This project analyzes the effect of changing certain traffic conditions on the formation of congestion.
To simulate the traffic, we used Monte Carlo methods to introduce randomness on the road.
We added functionality such as the creation of junctions and the widening of roads to see how they would impact congestion.
This project placed 3rd in the research hackathon.
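The project's own code is not shown here, but the Monte Carlo idea of random behaviour on a road can be illustrated with a standard single-lane cellular-automaton model (Nagel-Schreckenberg), in which each car accelerates, respects the gap to the car ahead, and brakes at random with some probability. This sketch is our own and makes no claim about the project's actual model; `p_brake` and `v_max` are assumed parameters.

```python
import random

def step(positions, speeds, road_length, v_max=5, p_brake=0.3, rng=random):
    """One Nagel-Schreckenberg update on a circular single-lane road.

    positions: cell indices of the cars, in ring order.
    speeds:    current speed (cells per step) of each car.
    Returns the new (positions, speeds).
    """
    n = len(positions)
    new_speeds = []
    for i in range(n):
        # Gap to the next car ahead, wrapping around the ring.
        gap = (positions[(i + 1) % n] - positions[i] - 1) % road_length
        v = min(speeds[i] + 1, v_max)   # accelerate toward v_max
        v = min(v, gap)                 # never drive into the car ahead
        if v > 0 and rng.random() < p_brake:
            v -= 1                      # random braking: the Monte Carlo step
        new_speeds.append(v)
    new_positions = [(p + v) % road_length
                     for p, v in zip(positions, new_speeds)]
    return new_positions, new_speeds
```

Even this minimal model reproduces spontaneous jam formation at high densities, which is why random-braking Monte Carlo models are a common starting point for congestion studies.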