Named entity recognition

The named entity recognition (NER) module recognizes mention spans of a particular entity type (e.g., abnormal findings) in the reports. RadText provides two sub-modules for NER.

ner:regex

The rule-based method uses regular expressions that combine information from terminological resources with characteristics of the entities of interest. These expressions are manually constructed by domain experts.

Options

Option name    Default                            Description
--phrase       $resources/cxr14_phrases_v2.yml    Phrase patterns

Example Usage

$ radtext-ner regex --phrase /path/to/patterns.yml -i /path/to/input.xml -o /path/to/output.xml

from pathlib import Path
from radtext.models.ner.ner_regex import NerRegExExtractor, BioCNerRegex
from radtext.cmd.ner import load_yml

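# argv is the dictionary of parsed command-line options
patterns = load_yml(argv['--phrases'])
# Build the regex-based extractor from the phrase patterns
extractor = NerRegExExtractor(patterns)
# Wrap the extractor in a BioC processor named after the pattern file (without its extension)
processor = BioCNerRegex(extractor, name=Path(argv['--phrases']).stem)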

Phrase patterns

The pattern file is in YAML format. It contains a list of concepts, where each key serves as the concept's preferred name. Each concept should contain three attributes: concept_id, include, and exclude. include contains the regular expressions that the concept will match. exclude contains the regular expressions that the concept will not match, even if a substring of the excluded text matches a regular expression in include.

With the following example, RadText will recognize “emphysema” but reject “subcutaneous emphysema”, even though “emphysema” is part of “subcutaneous emphysema”.

Emphysema:
  concept_id: RID4799
  include:
    - emphysema
  exclude:
    - subcutaneous emphysema
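
The include/exclude behavior described above can be illustrated with a small standalone sketch. This is not RadText's implementation: the function `find_mentions`, the in-memory `patterns` dictionary, and the choice of case-insensitive matching are all assumptions made for illustration. The sketch keeps an include match only if it does not fall inside the span of any exclude match.

```python
import re

# Hypothetical in-memory equivalent of the YAML pattern file above.
patterns = {
    "Emphysema": {
        "concept_id": "RID4799",
        "include": ["emphysema"],
        "exclude": ["subcutaneous emphysema"],
    }
}


def find_mentions(text, patterns):
    """Return (name, concept_id, start, end, matched_text) tuples.

    An include match is dropped when it lies entirely inside an exclude match.
    Case-insensitive matching is an assumption of this sketch.
    """
    mentions = []
    for name, concept in patterns.items():
        # Collect the character spans covered by any exclude pattern.
        excluded = [
            m.span()
            for pattern in concept.get("exclude", [])
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)
        ]
        for pattern in concept.get("include", []):
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                start, end = m.span()
                # Skip include matches nested in an excluded span.
                if any(s <= start and end <= e for s, e in excluded):
                    continue
                mentions.append((name, concept["concept_id"], start, end, m.group()))
    return mentions


report = "Mild emphysema. No subcutaneous emphysema."
mentions = find_mentions(report, patterns)
```

On this sample sentence, only the first "emphysema" is kept; the second one is nested inside "subcutaneous emphysema" and is therefore rejected.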

ner:spacy

Options

Option name      Default                      Description
--radlex         $resources/Radlex4.1.xlsx    The RadLex ontology file
--spacy-model    en_core_web_sm               The spaCy model

Example Usage

$ radtext-ner spacy --radlex /path/to/Radlex4.1.xlsx -i /path/to/input.xml -o /path/to/output.xml

import spacy
from radtext.models.ner.ner_spacy import NerSpacyExtractor, BioCNerSpacy
from radtext.models.ner.radlex import RadLex4

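# argv is the dictionary of parsed command-line options
# Load the spaCy pipeline without the ner, parser, and senter components
nlp = spacy.load(argv['--spacy-model'], exclude=['ner', 'parser', 'senter'])
# Read the RadLex ontology file and build spaCy matchers from it
radlex = RadLex4(argv['--radlex'])
matchers = radlex.get_spacy_matchers(nlp)
# Wrap the matcher-based extractor in a BioC processor named 'RadLex'
extractor = NerSpacyExtractor(nlp, matchers)
processor = BioCNerSpacy(extractor, 'RadLex')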