Characterization of a corpus extracted from maternal electronic health records through natural language processing techniques
Abstract
The purpose of this article was to characterize the free text available in the electronic health records of an institution that provides care for pregnant patients. More than a data repository, the electronic health record (EHR) has become a clinical decision support system (CDSS). However, because of the high volume of information, and because some of the key information in the EHR is in free-text form, realizing the full potential of EHR information to improve clinical decision-making requires the support of text mining and natural language processing (NLP) methods. Particularly in gynecology and obstetrics, implementing NLP methods could help speed up the identification of factors associated with maternal risk. Despite this, no papers in the literature apply NLP techniques to EHRs associated with maternal follow-up in Spanish. To address this knowledge gap, in this work a corpus was generated and characterized from the EHRs of a gynecology and obstetrics service that treats high-risk maternal patients. NLP and text mining methods were applied to the data, yielding 659,789 tokens and a dictionary of 7,334 unique words. The data were characterized by identifying the most frequent words and n-grams, and the words were represented as embedding vectors in a 300-dimensional space using a CBOW (Continuous Bag of Words) neural network architecture. Using clustering algorithms, the word embeddings made it possible to verify that words assigned to the same group can represent associations referring to types of patients, or group similar words, including words written with spelling errors.
The corpus generated and the results found lay the foundations for future work in the detection of entities (symptoms, signs, diagnoses, treatments), correction of spelling errors and semantic relationships between words to generate summaries of medical records or assist the follow-up of mothers through the automated review of the electronic health record.
Key words: Natural language processing; electronic health record; machine learning; word embedding; artificial neural networks.
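The counting steps summarized in the abstract (tokens, unique words, frequent words and n-grams) can be illustrated with a minimal sketch. The abstract does not state which tools were used, so everything here is an assumption: the two sample notes are hypothetical, the regular-expression tokenizer is a simple stand-in, and `cbow_pairs` only shows how a CBOW model pairs context words with a target word; it does not train a neural network or produce 300-dimensional vectors.

```python
import re
from collections import Counter

# Hypothetical sample notes; the study's corpus is not reproduced here.
NOTES = [
    "Paciente con preeclampsia severa, control prenatal irregular.",
    "Paciente con diabetes gestacional, control prenatal adecuado.",
]

def tokenize(text):
    """Lowercase and split into word tokens, keeping Spanish accented letters."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cbow_pairs(tokens, window=2):
    """Build (context words, target word) pairs of the kind a CBOW model trains on."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

docs = [tokenize(note) for note in NOTES]
all_tokens = [tok for doc in docs for tok in doc]
word_counts = Counter(all_tokens)  # vocabulary with frequencies
bigram_counts = Counter(bg for doc in docs for bg in ngrams(doc, 2))

print("tokens:", len(all_tokens))
print("vocabulary size:", len(word_counts))
print("top words:", word_counts.most_common(3))
print("top bigrams:", bigram_counts.most_common(2))
print("first CBOW pair:", cbow_pairs(docs[0])[0])
```

N-grams are counted per note so that no n-gram spans the boundary between two records. In practice, a library implementation of CBOW and a clustering algorithm would replace the last helper.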
License
Authors with publications in this journal accept the following terms:
- Authors will retain their copyright and will grant the journal the right of first publication of their work, which will simultaneously be subject to the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), allowing third parties to share the work as long as its author and its first publication in this journal are stated.
- Authors may adopt other non-exclusive license agreements for distribution of the published version of the work (e.g., depositing it in an institutional repository or publishing it in a monographic volume), as long as initial publication in this journal is indicated.
- Authors are permitted and encouraged to disseminate their work on the Internet (e.g., in institutional repositories or on their web page) before and during the submission process, which may lead to interesting exchanges and increase citations of the published work (see The effect of open access). In that case, we ask that the manuscript header state: "This is a preprint version submitted to the Revista Cubana de Información en Ciencias de la Salud http://rcics.sld.cu/"