A data-code-compute resource for research and education in information visualization

National Institutes of Health (NIH) Grant Award Data



The National Institutes of Health (NIH), which comprises 27 institutes and centers, is an agency of the U.S. Department of Health and Human Services. It awards grants to support basic and clinical biomedical, behavioral, and bioengineering research.


CRISP (Computer Retrieval of Information on Scientific Projects) is a searchable database of federally funded biomedical research projects conducted at universities, hospitals, and other research institutions. The database, maintained by the Office of Extramural Research at the National Institutes of Health, includes projects funded by the National Institutes of Health (NIH), Substance Abuse and Mental Health Services Administration (SAMHSA), Health Resources and Services Administration (HRSA), Food and Drug Administration (FDA), Centers for Disease Control and Prevention (CDCP), Agency for Healthcare Research and Quality (AHRQ), and Office of the Assistant Secretary for Health (OASH).

Information on NIH Award amounts is available at the Award Data web site.

Data Format

Raw Data:
Please query the NIH awards database via CRISP to get familiar with this data set.

Data Fields:

  • Grant Number number
  • PI First Name varchar2(1000)
  • PI Middle Name varchar2(1000)
  • PI Last Name varchar2(1000)
  • PI Email varchar2(1000)
  • PI Title varchar2(4000)
  • Project Title varchar2(2550)
  • Abstract clob
  • Thesaurus Terms varchar2(4000)
  • Institution Name varchar2(2000)
  • Institution Address varchar2(2000)
  • Institution City varchar2(2000)
  • Institution State varchar2(500)
  • Institution Zipcode1 number
  • Institution Zipcode2 number
  • Institution Country varchar2(500)
  • Fiscal Year date
  • Department varchar2(400)
  • Project Start date
  • Project End date
  • Institutes Centers Divisions (ICD) varchar2(400)
  • Integrated Review Group (IRG) varchar2(4000)
  • Amount number
  • Keywords varchar2(255)
  • data_is_ok char

Years covered: 1972-2004, total 1,028,521 records (detailed statistics)

Storage Space Required:
Number of records per year = 70,000. Estimated number of total records in 2005 = 1,030,000.
Approximately 2.3 GB of raw data by 2005.
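
As a rough check on these estimates, the raw size can be divided by the record count to get an average size per record. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes); this page does not say which convention the 2.3 GB figure uses, and the names are illustrative only.

```python
# Figures quoted above. Decimal gigabytes (10**9 bytes) are an assumption.
ESTIMATED_RECORDS_2005 = 1_030_000
ESTIMATED_RAW_BYTES = 2.3e9


def average_record_bytes(total_bytes: float, records: int) -> float:
    """Return the mean raw size of one record in bytes."""
    return total_bytes / records


if __name__ == "__main__":
    avg = average_record_bytes(ESTIMATED_RAW_BYTES, ESTIMATED_RECORDS_2005)
    print(f"about {avg / 1000:.1f} KB per record")
```

Under that assumption the average comes to roughly 2.2 KB per record, most of which is presumably abstract text.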

Data Quality

Missing values or delimiters:

  • PI_Email
  • PI_Title
  • Institution
  • Department
  • Pipe delimiters, which leaves some rows with fewer than 14 columns

Other known issues:

  • Some records are "duplicates": they carry the same information but have a different Grant_Number (e.g. data00.txt#44899, data00.txt#44900).
  • Some "duplicates" are lesser versions of their counterparts (e.g. data00.txt#44818, data00.txt#44819 - data00.txt does NOT have abstract information).
  • Abstract: many records have the word "DESCRIPTION" in front of the abstract text. Missing abstracts are identified as "This abstract is not available" OR "There is no text on file for this abstract".
  • Nonexistent thesaurus terms are identified with "There are no thesaurus terms on file for this project".
  • Street address information is in TAB-delimited format.
  • Many records were missing the final ("IRG") field. In some records, row headers such as "Project Start," "Project End," "ICD," and "IRG" appeared as delimiters instead of pipe delimiters.
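
Several of the issues listed above can be detected mechanically. The following is a minimal Python sketch, not the procedure used to prepare this data set. It assumes a complete raw row has 14 pipe-delimited columns and that the grant number is the first column; the function names are hypothetical.

```python
import re
from collections import defaultdict

# Placeholder strings quoted in the list above.
MISSING_ABSTRACT = (
    "This abstract is not available",
    "There is no text on file for this abstract",
)
MISSING_THESAURUS = "There are no thesaurus terms on file for this project"

# Assumption: a complete raw row has 14 pipe-delimited columns.
EXPECTED_COLUMNS = 14


def is_short_row(line, expected=EXPECTED_COLUMNS):
    """True if the row has fewer pipe-delimited columns than expected."""
    return len(line.rstrip("\n").split("|")) < expected


def clean_abstract(text):
    """Return the abstract without a leading "DESCRIPTION", or None if missing."""
    stripped = text.strip()
    if not stripped or any(stripped.startswith(p) for p in MISSING_ABSTRACT):
        return None
    # Drop a leading "DESCRIPTION" and any colon, period, or dash after it.
    return re.sub(r"^DESCRIPTION\b[\s:.\-]*", "", stripped).strip() or None


def clean_thesaurus(text):
    """Return the thesaurus terms, or None if the placeholder is present."""
    stripped = text.strip()
    if not stripped or stripped.startswith(MISSING_THESAURUS):
        return None
    return stripped


def find_duplicates(rows, grant_col=0):
    """Group row indices whose fields match except for the grant number.

    Assumption: the grant number sits in column `grant_col`.
    """
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(v for j, v in enumerate(row) if j != grant_col)
        groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]
```

Exact matching on every other field finds only the first kind of "duplicate". Pairs in which one record is a lesser version of the other would need a looser comparison, for example ignoring the abstract column.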

Detailed statistics on missing data will be compiled when the data is uploaded into Oracle.

Data Cleaning

Records which were missing the final ("IRG") field had a blank field added to get the correct record length. Also, row header delimiters (such as "ICD" and "IRG" as mentioned above) were replaced by pipe delimiters (|) in as many cases as possible.
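
The two cleaning steps described above might be sketched as follows. This is an illustration under stated assumptions, not the script that was actually used: it assumes 14 columns per complete row and that a row header used as a delimiter appears as the header name followed by a colon (e.g. "ICD:"), which this page does not specify.

```python
import re

# Assumption: a complete raw row has 14 pipe-delimited columns.
EXPECTED_COLUMNS = 14

# Assumption: stray row headers look like "Project Start:" or "ICD:".
HEADER_DELIMITER = re.compile(r"\s*\b(?:Project Start|Project End|ICD|IRG):\s*")


def replace_header_delimiters(line):
    """Replace row headers that act as delimiters with a pipe."""
    return HEADER_DELIMITER.sub("|", line)


def pad_missing_irg(line, expected=EXPECTED_COLUMNS):
    """Append one blank final field when a row is exactly one field short."""
    fields = line.rstrip("\n").split("|")
    if len(fields) == expected - 1:
        fields.append("")
    return "|".join(fields)


def clean_row(line):
    """Apply both steps: fix header delimiters first, then pad the IRG field."""
    return pad_missing_irg(replace_header_delimiters(line))
```

Rows that are short by more than one field are left unchanged here, so they can be reviewed by hand rather than padded blindly.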


This data set description was compiled by Jay Askren, Saiful Bahari, Chris Friend, Katy Börner and Caroline Courtney.

Information Visualization CyberInfraStructure @ SLIS, Indiana University
Last Modified June 04, 2004