A data-code-compute resource for research and education in information visualization

Patent Data



Description

For over 200 years, the United States Patent and Trademark Office (USPTO) has been processing and disseminating patent and trademark applications and information to promote an understanding of intellectual property protection and to facilitate the development and sharing of new technologies worldwide. The office is a federal agency in the Department of Commerce and employs over 6,500 full-time staff.


Origins

Patent data prior to 1996 was generously made available by Steven A. Morris, Electrical and Computer Engineering, Oklahoma State University. Patent data from 1996 to the present can be downloaded from ftp://ftp.uspto.gov/pub/patdata/. Patent updates are released once a week, on Tuesdays.

Data Format

Raw Data:
Please query the USPTO databases and examine the US patent classification hierarchy to get familiar with this data set.

The patent bibliographic raw data is available as one zipped file for each weekly issue, beginning with week 36 of 1996. Within each zip file, the data appears in "PTO Green Book" format as concatenated 81-character, fixed-length, linefeed-terminated ASCII records. Each file is approximately 2 to 3 MB zipped, and unzips to a single 20 to 30 MB ASCII file.
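As a rough illustration of what "concatenated fixed-length records" means in practice, the following minimal sketch splits an unzipped file's bytes into individual records. It assumes that the 81 characters consist of 80 data characters followed by one linefeed; the function name `split_records` and the sample content are hypothetical, and the sketch does not interpret the Green Book fields inside each record.

```python
from typing import List

# Assumption: 80 data characters plus 1 terminating linefeed per record.
RECORD_LEN = 81


def split_records(data: bytes) -> List[str]:
    """Split concatenated fixed-length records into stripped ASCII strings."""
    if len(data) % RECORD_LEN != 0:
        raise ValueError("data length is not a multiple of the record length")
    records = []
    for start in range(0, len(data), RECORD_LEN):
        chunk = data[start:start + RECORD_LEN]
        if chunk[-1:] != b"\n":
            raise ValueError("record at offset %d is not linefeed-terminated" % start)
        # Drop the linefeed, then the trailing space padding.
        records.append(chunk[:-1].decode("ascii").rstrip())
    return records


# Hypothetical sample: two padded records, not real patent data.
sample = b"FIRST RECORD".ljust(80) + b"\n" + b"SECOND RECORD".ljust(80) + b"\n"
parsed = split_records(sample)
```

Checking the length and the terminator up front makes a truncated or re-encoded download fail loudly instead of producing shifted records.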

Data Fields:
type varchar2(4000)
ocl_thesaurus_class varchar2(4000)
ocl_thesaurus_subclass varchar2(4000)
data_is_ok char
xcl_thesaurus_class varchar2(4000)
xcl_thesaurus_subclass varchar2(4000)
doc_id number
name varchar2(2000)
address varchar2(2000)
city varchar2(2000)
state varchar2(500)
zipcode1 number
zipcode2 number
country varchar2(500)
last_name varchar2(1000)
middle_name varchar2(1000)
first_name varchar2(1000)
date_published date
title varchar2(2550)
full_text clob
type varchar2(100)
abstract clob
author_seq number

There are a total of 5,402,657 authors (non-unique). Of these, 1,757,094 authors *seem to be* unique, but many of the apparent differences are cases where the same person appears both with and without a middle initial. Hence, there should be no more than 1,200,000 unique authors.
There are a total of 22,650,056 citation links (for the 2,582,647 records from 1976 to Feb 2003).
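The "middle initial missing" effect can be illustrated with a small sketch that groups author names by last and first name only, ignoring the middle name. This is not the procedure used to produce the counts above; the function name `group_authors` and the sample names are hypothetical.

```python
from collections import defaultdict


def group_authors(authors):
    """Group (first, middle, last) tuples by a key that ignores the middle name."""
    groups = defaultdict(set)
    for first, middle, last in authors:
        key = (last.strip().lower(), first.strip().lower())
        groups[key].add((first, middle, last))
    return groups


# Hypothetical names, not taken from the dataset.
authors = [
    ("John", "A.", "Smith"),
    ("John", "", "Smith"),   # same person with the middle initial missing?
    ("Jane", "", "Doe"),
]
groups = group_authors(authors)
```

Here three distinct name strings collapse into two groups. The key also merges different people who happen to share a first and last name, so a count obtained this way is a lower bound on the number of unique authors rather than an exact figure.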

Please read the detailed statistics to learn more about the coverage and completeness of data.

Storage Space Required:
Number of entries: 2,582,647. For the years 1963-2005 we estimate a total of 2,640,000 entries and a size of 350 MB.
(We currently have 2,582,647 patent records for the years 1976 to Feb 2003. The years 2003-2005 should account for another 55,000 patent records.)
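The estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
current_records = 2_582_647   # patent records for 1976 to Feb 2003
expected_new = 55_000         # estimated additional records for 2003-2005
citation_links = 22_650_056   # citation links among the current records

estimated_total = current_records + expected_new
rounded_total = round(estimated_total, -4)           # nearest 10,000
links_per_record = citation_links / current_records  # average links per record
```

The sum is 2,637,647, which rounds to the quoted 2,640,000, and the citation count works out to roughly 8.77 links per record.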

See also Kevin Boyack's yearly statistics.

Data Quality
  • All patents have titles and patent IDs; all but 3 have a date of issue. So the absolute essentials are in place.
  • Around 2.3% do not have citation information. It is possible that these patents really did not cite any other patents, although they could have cited non-patent publications. The number is small enough to allow for that.
  • Around 16.2% do not have information about the Assignee group. This could mean one of two things: either the data is missing from the record, or the inventor is not affiliated with any organization.
  • Very few (0.0005%) do not have inventors, which suggests that most of the records with missing Assignee groups above have inventors who are not affiliated with any organization.
  • 3.35% do not have OCL information; 8.66% do not have XCL information. These are not disturbingly huge numbers, but it's still hard to imagine a patent not being classified into any category at all. Interestingly enough, there are quite a few records that do not have OCL but do have XCL information.
  • 99.6% of the current dataset is patents of type 'Utility'.
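Percentages like those above can be computed with a simple completeness check. The following is a minimal sketch, assuming each patent record has been loaded as a dictionary keyed by the field names listed under Data Fields; the function name `missing_rate` and the sample records are hypothetical.

```python
def missing_rate(records, field):
    """Return the percentage of records whose `field` is absent, None, or empty."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return 100.0 * missing / len(records)


# Hypothetical records, not taken from the dataset.
records = [
    {"title": "A", "date_published": "1999-01-05"},
    {"title": "B", "date_published": None},
    {"title": "C"},
    {"title": "D", "date_published": "2000-02-01"},
]
date_missing = missing_rate(records, "date_published")
title_missing = missing_rate(records, "title")
```

Treating absent keys, `None`, and empty strings alike keeps the check independent of how a particular loader represents a missing value.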

These issues will be documented further when the data is uploaded into the Oracle database.

Data Cleaning

The Patent Number field of our patent dataset has extra characters that are not part of the patent numbers issued by the USPTO. They vary by patent type and are as follows:

  • Utility patents have an extra character at the beginning and one at the end.
  • SIRs (Statutory Invention Registrations) have four 0s after the letter “H” and an extra character at the end.
  • Design patents have an extra character after letter “D” and one at the end.
  • Reissue patents have an extra character after “RE” and one at the end.
  • Defensive Publications have an extra character after “T” and one at the end.
  • Plant Patents have two 0’s after “PP” and an extra one at the end.
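The rules above can be sketched as a small normalization function. This is an illustration of the listed rules rather than the cleaning code actually used for the dataset. It assumes the "extra one at the end" of plant patents is a single trailing character like the others, it does not validate what the removed characters are, and the function name `clean_patent_number` and all sample values are hypothetical.

```python
# (prefix, number of extra characters that directly follow the prefix)
# "PP" is listed first so it is matched before any single-letter prefix.
PREFIX_RULES = [
    ("PP", 2),  # Plant patents: two 0s after "PP"
    ("RE", 1),  # Reissue patents: one extra character after "RE"
    ("H", 4),   # SIRs: four 0s after "H"
    ("D", 1),   # Design patents: one extra character after "D"
    ("T", 1),   # Defensive Publications: one extra character after "T"
]


def clean_patent_number(raw: str) -> str:
    """Strip the extra characters described above from a raw Patent Number field."""
    s = raw.strip().upper()
    for prefix, extra in PREFIX_RULES:
        if s.startswith(prefix):
            # Keep the prefix, skip the extra characters, drop the last character.
            return prefix + s[len(prefix) + extra:-1]
    # Utility patents: one extra character at the beginning and one at the end.
    return s[1:-1]


# Hypothetical raw values, chosen only to exercise each rule.
examples = {
    raw: clean_patent_number(raw)
    for raw in ["039305848", "H00001234", "D02345671",
                "RE0123456", "T09876543", "PP0012345"]
}
```

Keeping the rules in a table makes the prefix order explicit and lets a new patent type be added without touching the function body.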


Acknowledgments

We are grateful to Steven A. Morris, Electrical and Computer Engineering, Oklahoma State University, for making a larger patent data set available to us and for his guidance in parsing and analyzing the data.
This data set description was compiled by Ruchi Kapoor, Katy Börner, and Caroline Courtney.

Information Visualization CyberInfraStructure @ SLIS, Indiana University
Last Modified June 04, 2004