Internet Access to a Biomedical Text/Xray Image Databank

Thoma GR, Berman LE, Long LR.

Keywords: Digitized medical xrays, electronic xray archives, client server design, optical jukebox, RAID, image processing, multisocket transmission, Internet

Abstract: Medical radiographs and associated data collected as part of a nationwide health survey in the U.S. are digitized and stored in an electronic archive accessible over the Internet. This paper describes the prototype system developed for the archiving of the data and the client software to enable a broad range of end users to access the archive, retrieve text and image data, display the data and manipulate the images.

1. INTRODUCTION

In the course of the second National Health and Nutrition Examination Survey (NHANES II), a nationwide survey conducted by the U.S. National Center for Health Statistics (NCHS), radiographs and other "collateral" data (e.g., demographic, socioeconomic, physician's exam and medical lab data) were collected on a representative population sample of about 25,000 people in the United States. The xrays, numbering over 17,000, were digitized and archived in an electronic storage system. The xrays are of cervical and lumbar spine; when digitized, the cervical images are each about 5 Megabytes in size and the lumbar images around 10 Megabytes[1].

To date, the textual data from the survey has been extensively used by researchers, but the xrays have not, due to the logistics problems associated with the conventional physical distribution of film: shipping, receiving, and the need for security from loss, theft and environmental degradation. As a consequence of this inaccessibility, the NHANES films have been borrowed only nine times since 1974. In contrast, a MEDLINE search in 1994 returned over 800 citations corresponding to studies using the NHANES biomedical and demographic ("collateral") data, i.e., the data other than the xrays. This data has been extensively used for epidemiological research addressing questions related to arthritis and musculoskeletal diseases[2], the association of weight, race, and occupation with osteoarthritis of the knee[3], prevalence of scoliosis[4], breast cancer[5], heart disease[6], and other health questions[7]. The collateral data to conduct studies such as these are available from NCHS in the form of public use data tapes[8]. Providing easy access to the film images over the Internet, we believe, could promote many uses, such as: conducting epidemiological and biostatistical research; establishing population norms; developing radiographic atlases; training radiologists, rheumatologists and orthopedists in uniform reading of xray images with the use of a standardized radiographic atlas; conducting image-specific research in image processing, feature classification, and image database management; and educating medical students and xray technicians[9].

The challenge addressed in this paper is to provide efficient access via the Internet to these large images and the collateral (text) data for a broad user community, so as to enable the potential uses mentioned above. This is the motivation for digitizing the film and providing access to the corresponding digital images along with the text. This paper discusses the design considerations in the server, storage and client systems.

2. SYSTEM CONSIDERATIONS

Creating a suitable electronic archive and making the data available to a broad user community requires considering the following: digital capture of the xray films at a suitable spatial and contrast resolution; quality control of the digital images; storage system design and image organization to promote rapid access over the Internet; and transmission techniques for rapid image retrieval via the Internet.

2.1 Digitization and quality control

The NHANES II cervical and lumbar spine radiographs were digitized with Lumisys laser scanners with a spot size of 175 microns. The corresponding scan density, in the neighborhood of 2K x 2K pixels, delivers images of approximate size 5 MB (cervical) and 10 MB (lumbar). The scan density chosen has been strongly suggested by several studies[10],[11]. Image capture is followed by a three-tiered quality control procedure. The first two tiers are carried out by technicians who check for correct image orientation, adequate contrast, and whether subject identification has been removed from the images as a privacy measure. The third and final tier (QC3) consists of checks to determine that the specific medical content of interest is in fact perceptible in the digital images. QC3 is carried out by a physician with specialized knowledge of skeletal anatomy and radiological images. The images are displayed on a 2K x 2.5K high-resolution monitor attached to a Sun workstation, and are examined for the detectability of features that are important in assessing osteoarthritis in the spine, such as osteophytes, subluxation, sclerosis, and disc space narrowing. In over 95% of the images that have undergone QC3 to date, these features have been judged to be detectable.

2.2 Data storage

The system in which the images and collateral data are stored consists of a combination of a 144-platter Hewlett-Packard HP100 optical disk jukebox and a Sun SPARCStorage Array Model 101 RAID system. This hybrid storage system is controlled by two independent servers, a Sun 670MP and a Sun Sparc 20. The Sun 670MP is equipped with four SuperSPARC 40 MHz Cypress Ross cy605 chips, and has 128 MB of RAM and two 1 GB internal SCSI disks. The Sun Sparc 20 is equipped with two SuperSPARC Texas Instruments 60 MHz chips, 64 MB of RAM and 1 GB of disk space.

The storage required may be computed as follows:

Cervical spine: 1463 x 1755 pixels x 16 bits / (8 bits/byte) =  5,135,130 bytes/image
Lumbar spine:   2048 x 2487 pixels x 16 bits / (8 bits/byte) = 10,186,752 bytes/image
 5,100 cervical spine images x  5,135,130 bytes =  26,189 Megabytes
11,900 lumbar spine images   x 10,186,752 bytes = 121,222 Megabytes

For a total requirement of 147.4 Gigabytes.
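The arithmetic above can be checked with a short sketch. The image dimensions, bit depth and image counts are taken from the text; the function and variable names are illustrative only, and Megabytes and Gigabytes are assumed to be decimal (10^6 and 10^9 bytes), which is what reproduces the figures quoted.

```python
# Storage requirement for the uncompressed archive, using the figures in the text.
BITS_PER_PIXEL = 16
BITS_PER_BYTE = 8


def image_bytes(width, height):
    """Size in bytes of one uncompressed image at 16 bits per pixel."""
    return width * height * BITS_PER_PIXEL // BITS_PER_BYTE


cervical_bytes = image_bytes(1463, 1755)
lumbar_bytes = image_bytes(2048, 2487)

# Image counts from the text: 5,100 cervical and 11,900 lumbar images.
total_bytes = 5100 * cervical_bytes + 11900 * lumbar_bytes
total_gb = total_bytes / 1e9  # decimal Gigabytes (assumption)

print(f"cervical: {cervical_bytes:,} bytes/image")
print(f"lumbar:   {lumbar_bytes:,} bytes/image")
print(f"total:    {total_gb:.1f} GB")
```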

The optical disk jukebox was procured with this required capacity in mind. Considering the possibility that 2-3:1 lossless compression, or even lossy compression, might be used in the system, it was determined that a jukebox with a capacity of 70-80 GB would be acceptable. The jukebox acquired contains 144 erasable magneto-optical platters. Each platter is formatted at 512 bytes/sector and provides 283 MB of storage per platter side, for a total jukebox capacity of 144 x 2 x 283 MB = approximately 81.5 GB of storage.
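As a rough check of the sizing argument, the sketch below compares the jukebox capacity with the compressed size of the archive at several compression ratios. The platter figures and the 147.4 GB requirement come from the text; the names are illustrative, and the calculation ignores file system overhead, so it should be read as an estimate rather than a capacity guarantee.

```python
# Jukebox capacity versus compressed archive size (decimal units assumed).
PLATTERS = 144
SIDES_PER_PLATTER = 2
MB_PER_SIDE = 283
UNCOMPRESSED_GB = 147.4  # total requirement computed in the text

capacity_gb = PLATTERS * SIDES_PER_PLATTER * MB_PER_SIDE / 1000


def compressed_size_gb(ratio):
    """Archive size after compression at ratio:1."""
    return UNCOMPRESSED_GB / ratio


# Smallest compression ratio at which the whole archive fits in the jukebox.
min_ratio = UNCOMPRESSED_GB / capacity_gb

print(f"capacity: {capacity_gb:.1f} GB, minimum ratio: {min_ratio:.2f}:1")
for ratio in (2.0, 2.5, 3.0):
    size = compressed_size_gb(ratio)
    print(f"{ratio}:1 -> {size:.1f} GB, fits: {size <= capacity_gb}")
```

Under these assumptions, any ratio of about 1.8:1 or better lets the full-resolution images fit, which is consistent with the 2-3:1 range considered.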

In addition to capacity, other considerations in the jukebox acquisition included:

Multiple internal drives. To minimize bottlenecks and increase reliability, the acquired system has four drives. This enables the jukebox robotics to move platters to one or more drives while data is being simultaneously read or written on another drive. This also permits redundancy in the event of drive failure. Automatic detection of a failed drive and switchover to a working drive is a requirement of the system.

Transportability of the media. To minimize reliance on vendor-proprietary software, each platter side has its own complete Unix file system. This allows the platters to be read on any manufacturer's optical drive capable of reading standard Unix files, eliminating the need for vendor-specific optical disk management software, and also allowing the flexibility of physically removing a platter and transporting it to some other site, should that requirement arise.

A standalone optical disk drive. To allow access to stored data in the event of catastrophic jukebox failure, a single-disk standalone optical disk drive is part of the system.

Data within the jukebox may be considered to be "online" if the data is on a platter currently loaded into a drive, or "near-line" if it is not loaded into a drive, but resides in one of the platter slots internal to the jukebox. Since there are 144 platters and only four drives, most of the platters will be near-line at any given time. Even though there are multiple internal drives, there is only one data path in and out of the jukebox; hence reading data from multiple drives is necessarily sequential.

Accessing data files from the jukebox involves several steps:

If the drive to be used already contains a platter, that platter must be "spun down" to zero velocity.
That platter must be unloaded from the drive.
That platter must be returned to its slot.
The new platter must be retrieved from its slot.
It must be loaded into a drive.
It must be "spun up" to the angular velocity required by the drive.
The read/write heads of the drive mechanism must be moved to the track containing the data.
The data must be read.

Measured times for these operations add up to about 25 seconds to read a 5 MB file from the jukebox to memory, when the platter on which the file resides is not in a drive, and another platter must be removed to make a drive available. Part of the delay is due to the jukebox robotics operations, while some of it is due to the disk drive mechanism.
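The access sequence listed above can be sketched as a small model of the jukebox. The four drives, the 144 platters and the step names follow the text; the class, its method names, and in particular the least-recently-used choice of which platter to unload are assumptions made for illustration, since the text does not say how the drive to be vacated is chosen. No timings are modeled.

```python
from collections import OrderedDict


class JukeboxModel:
    """Toy model of online versus near-line platters (hypothetical names)."""

    def __init__(self, num_drives=4, num_platters=144):
        self.num_drives = num_drives
        self.num_platters = num_platters
        # Platters currently in drives, ordered least to most recently used.
        self.loaded = OrderedDict()

    def access(self, platter):
        """Return the list of steps needed to read a file on this platter."""
        if not 0 <= platter < self.num_platters:
            raise ValueError("no such platter")
        if platter in self.loaded:
            # Online: only the drive mechanism is involved.
            self.loaded.move_to_end(platter)
            return ["seek", "read"]
        steps = []
        if len(self.loaded) >= self.num_drives:
            # All drives busy: vacate one (LRU policy is an assumption).
            self.loaded.popitem(last=False)
            steps += ["spin down", "unload", "return to slot"]
        steps += ["retrieve from slot", "load", "spin up", "seek", "read"]
        self.loaded[platter] = None
        return steps


jukebox = JukeboxModel()
for p in (0, 1, 2, 3, 0, 10):
    print(p, jukebox.access(p))
```

The last access in the usage loop is the worst case described in the text: the platter is near-line and another platter must first be removed, so all eight steps are needed.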

To offset the relatively slow retrieval from the optical jukebox, the storage system is augmented with a RAID system consisting of 18 1.05 GB SCSI-2 hard drives and six independent fast buffered SCSI-2 buses, connected to an Sbus card hosted in a Sparc 20 model 612 via a 25 MB/s fiber channel connector (upgradable to 100 MB/s), under Solaris 2.4.

While tradeoff studies are proceeding to select the RAID configuration that balances the objectives of cost, speed and data reliability, the configuration tentatively selected is Level 5. Of the six RAID configuration levels agreed upon by the industry, designated as RAID 0 through RAID 5, only RAID 0, 1, 3 and 5 are generally accepted for practical applications; RAID 2 and 4 are theoretically possible but usually considered impractical due to lower performance levels. RAID 5 stripes user data across the disk array while implementing a scheme for storing parity data without creating an I/O bottleneck. This bottleneck is avoided by evenly spreading, or interleaving, parity data across all drives rather than specifying one drive as the parity drive. The use of parity protects the data and makes it possible to reconstruct lost data in the event of a drive failure. A disadvantage of RAID 5 is that when new user data are written to disk, the old user data and old parity must first be read so that new parity can be generated. The requirement to read, generate and rewrite the parity data can slow the write I/O rate. However, the overall advantages of RAID 5 are: it maintains a high I/O read rate; the data is secured against drive failure at a small cost in disk storage; and it spreads the I/O transactions across all drives. With the requirements of this archive in mind, where the data is relatively static and encounters few new entries (unlike a bank transaction system, for example), the advantages of RAID 5 outweigh its drawbacks. Hence the selection of RAID 5 in this system.
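The parity mechanism described above can be illustrated with a minimal sketch. It uses byte-wise XOR parity, which is the standard RAID 5 parity function; the function names and the simple rule for rotating the parity position from stripe to stripe are assumptions for illustration and do not describe the layout used by the SPARCStorage Array.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(stripe_index, data_blocks, num_drives):
    """Lay out one stripe: num_drives - 1 data blocks plus one parity block.

    The parity block moves to a different drive on each stripe, so no
    single drive holds all of the parity (simple rotation, an assumption).
    """
    assert len(data_blocks) == num_drives - 1
    parity_drive = stripe_index % num_drives
    stripe = list(data_blocks)
    stripe.insert(parity_drive, xor_blocks(data_blocks))
    return stripe


def reconstruct(stripe, failed_drive):
    """Rebuild the block on a failed drive from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_drive]
    return xor_blocks(survivors)


def updated_parity(old_parity, old_data, new_data):
    """Parity after a small write: needs the old data and old parity."""
    return xor_blocks([old_parity, old_data, new_data])


data = [b"abcd", b"efgh", b"ijkl"]
stripe = write_stripe(1, data, num_drives=4)
print(reconstruct(stripe, failed_drive=2) == stripe[2])
```

The `updated_parity` helper shows why writes are slower than reads: the old data block and the old parity block must both be read before the new parity can be written.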

2.3 Image datasets

The design of the storage system and image organization assumes that user needs broadly fall into two classes, one forgoing high image resolution in favor of rapid retrieval, and the other choosing the opposite. For instance, epidemiologists, electrical engineers, computer scientists, and statisticians might be interested in the raw uncompressed data, which implies longer transmission time and potentially large numbers of images. On the other hand, medical students, technicians, students interested in anatomy, and commercial businesses might prefer realtime access, which would suggest compressed, lossy images of reasonably good quality. To accommodate these different requirements, the images are stored in two different formats: full-resolution "raw" images, uncompressed at present, and GIF files that are decimated (by a factor of 16) and losslessly compressed by the Lempel-Ziv technique. Being small, the latter are stored in the RAID and quickly retrieved, while the large raw images are stored in the jukebox.
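A minimal sketch of the decimation step follows. The text does not say how the factor of 16 is applied, so the sketch assumes it refers to the pixel count, that is, keeping every fourth pixel in each direction; it also uses plain subsampling with no smoothing filter, and it does not perform the GIF encoding. The function name is illustrative.

```python
def decimate(image, step=4):
    """Keep every step-th pixel in each direction.

    With step=4 the pixel count drops by a factor of 16 (an assumed
    reading of the factor quoted in the text).
    """
    return [row[::step] for row in image[::step]]


# An 8 x 8 test image whose pixel values encode their own coordinates.
image = [[10 * r + c for c in range(8)] for r in range(8)]
small = decimate(image)
print(small)
```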

2.4 Selection and organization of the collateral data

The collateral data used with this archive serves three purposes: 1) to narrow the subset of images returned from a query, 2) to provide background information on each image, and 3) to support epidemiological research. The database design is based on an object-oriented, relational model. Illustra, a commercial database package, was chosen because it is robust, uses SQL, has features that provide for security and data protection, is well supported, and is relatively inexpensive when compared with other commercial database packages. The database schema for the entire collateral data set is based on fields such as age, sex, ethnicity, height, weight, and geography, though partitioning based on physical exam results or lab tests might be considered.

The collateral data records for each Survey Person are large and are grouped into about fifteen major data sets. Each data set repeats a subset of demographic data containing more than 100 fields. For example, the Physician's Examination data set contains over 350 fields describing examination results which include blood pressure, eye, ear, nose, and throat, thyroid, chest, heart, abdomen and kidney, joints, musculoskeletal, neurological, and skin test results. An initial problem is the reduction of the total data available to a subset which (1) is of a size tractable for development of a prototype system and (2) is useful in isolation from the rest of the NHANES data collected.

For prototype development our approach has been to focus on a few dozen fields of demographic data, plus fields judged to be most relevant to osteoarthritis of the cervical and lumbar spine. The total number of fields included in the prototype is 60-70. For the initial implementation, the data is being organized according to a simple hierarchical scheme: major heading (category), minor heading, and feature. For example, in the Physician's Examination data set, the triplet Back, Limitation of Motion, Thoracic spine represents an instance of the hierarchy. The field value for "Thoracic spine" indicates whether limitation of motion was found in that location (of the back).
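The three-level scheme can be pictured as a nested mapping. The sketch below uses the triplet given in the text; the helper names and the stored value are hypothetical, and it says nothing about how the fields are actually represented in the Illustra database.

```python
def set_feature(record, major, minor, feature, value):
    """Store a field value under major heading / minor heading / feature."""
    record.setdefault(major, {}).setdefault(minor, {})[feature] = value


def get_feature(record, major, minor, feature):
    """Return the field value, or None if the triplet is not present."""
    return record.get(major, {}).get(minor, {}).get(feature)


record = {}
# Hypothetical value: limitation of motion was found in the thoracic spine.
set_feature(record, "Back", "Limitation of Motion", "Thoracic spine", True)

print(get_feature(record, "Back", "Limitation of Motion", "Thoracic spine"))
print(get_feature(record, "Back", "Limitation of Motion", "Lumbar spine"))
```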

Using this approach to the data reduction and organization, a client suitable for accessing the text and images over the Internet has been designed. It is code-named MIRS, for Medical Information Retrieval System.

2.5 MIRS client interface

MIRS, implemented on a Sun Sparc 10 platform, is designed to optimize user search and retrieval capability for general mixed text and image databases. The MIRS design provides full SQL search capability, as well as search with graphical aids, and the capability to rapidly scroll returned text and images. The MIRS client interface operates on the Motif/X-windows systems. MIRS uses the Illustra database server; this has the advantage of a state-preserving server with powerful SQL-compliant database management, and the disadvantage of requiring a commercial Illustra license for each client. In order to expedite transmission of image data, MIRS initially provides the low-resolution versions of the images corresponding to a query, followed by the high-resolution images upon specific request by the user.

MIRS is a client/server database application whose objective is to provide the user with an easily assimilated paradigm. It can create and execute SQL queries on mixed text/image databases, it offers graphical aids to assist the user in building SQL queries, and it displays query results on a screen which integrates both images and text. All data resulting from a query is displayable, by scrolling if necessary. The query results screen displays multiple images at reduced resolution and provides graphical aids to enable the user to quickly scan through the returned data, and to readily view the text results associated with particular images. Any image may be displayed in full resolution at the user's option.

Data retrieval follows a staged approach. The user first queries the collateral database to focus in on a subset of the image database. In a typical scenario, a user is presented with a query form which allows the user to construct a query consisting of a logical combination of age, ethnicity and gender. When the user submits the query, the client transmits it to the server, which passes the query to the Illustra gateway interface. The gateway is responsible for formulating an SQL macro to submit to the Illustra database engine. The number of images matching the query depends on the narrowness of the question posed. For example, a query for Mexican-American females over age 60 will result in a smaller image set than a query for all females.
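The step from query form to SQL can be sketched as follows. This is not the Illustra gateway: Python's built-in sqlite3 module stands in for the database engine, the table and column names are hypothetical, the sample rows are invented for the demonstration, and only AND combinations of the form fields are handled.

```python
import sqlite3


def build_query(older_than=None, sex=None, ethnicity=None):
    """Combine the supplied form fields with AND into a parameterized query."""
    clauses, params = [], []
    if older_than is not None:
        clauses.append("age > ?")
        params.append(older_than)
    if sex is not None:
        clauses.append("sex = ?")
        params.append(sex)
    if ethnicity is not None:
        clauses.append("ethnicity = ?")
        params.append(ethnicity)
    sql = "SELECT image_id FROM survey_person"  # hypothetical table name
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY image_id", params


def run_query(conn, **criteria):
    """Return the image ids matching the form criteria."""
    sql, params = build_query(**criteria)
    return [row[0] for row in conn.execute(sql, params)]


# Invented sample rows, for demonstration only.
SAMPLE_ROWS = [
    (1, 65, "F", "Mexican-American"),
    (2, 45, "F", "Mexican-American"),
    (3, 70, "F", "White"),
    (4, 62, "M", "Mexican-American"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE survey_person "
    "(image_id INTEGER, age INTEGER, sex TEXT, ethnicity TEXT)"
)
conn.executemany("INSERT INTO survey_person VALUES (?, ?, ?, ?)", SAMPLE_ROWS)

all_females = run_query(conn, sex="F")
narrow = run_query(conn, older_than=60, sex="F", ethnicity="Mexican-American")
print(len(all_females), len(narrow))
```

As in the example in the text, the narrower query returns a subset of the images returned by the broader one.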

To prevent overwhelming the user with a very large number of hits, a user-defined group set size determines the number of images delivered with successive user requests. Every time the user requests more data, the server downloads a group of low resolution images and the corresponding collateral data. The user may cycle through this group, and may request the high resolution image by clicking on the particular low resolution image.
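The group-at-a-time delivery described above amounts to paging through the list of hits. A minimal sketch, with an illustrative function name and integer ids standing in for the low resolution images and their collateral data:

```python
def groups(hits, group_size):
    """Yield successive groups of at most group_size hits."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    for start in range(0, len(hits), group_size):
        yield hits[start:start + group_size]


hits = list(range(7))          # stand-in for the ids matching a query
pager = groups(hits, group_size=3)
print(next(pager))             # first group, delivered with the query results
print(next(pager))             # next group, delivered when the user asks for more
```

Because the generator is lazy, nothing beyond the current group is prepared until the user requests more data.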

2.6 Image transmission

In conventional Internet data transmissions, two communications end-points ("sockets") are connected by a single logical channel, and communication takes place between this single socket-pair, under control of the TCP/IP protocol. We have developed an experimental technique that achieves a two- to three-fold improvement in speed over conventional Internet file transfer methods, such as FTP. In our "multisocket" method, the image to be transmitted is divided into segments by the sender; an independent logical communication channel, defined by its own unique socket-pair, is established for each segment; and the data is sent down these channels by a multitasking operating system. At the receiving end, the segments are reassembled into their original order.

A plausible explanation for the speed improvement follows. In conventional data transmission, the data to be sent may be conceived as a single line of packets to be transmitted in sequence. When a data channel is opened on the Internet for TCP/IP transmission, the flow of the data is controlled by the TCP protocol layer. Data packets are not sent as fast as the sender can transmit; rather, TCP gives permission to the sending computer to transmit packets which are marked as being within the current TCP "window", a group of contiguous packets within the line of packets to be sent. As acknowledgments are received for previously transmitted packets, TCP moves this window to give permission to send additional packets. Packets beyond the boundary of this window must wait until the window moves to include them before they can be transmitted. For communications links with long delays between the end-points, the wait for acknowledgments for previously sent packets may be lengthy, resulting in idle time.
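The effect of the window on a long-delay link can be estimated with the usual bound that a single connection cannot deliver more than one window of data per round trip. The sketch below applies that bound to one connection and to several independent connections; the window size and round-trip time are illustrative values chosen for the example, not measurements from this system, and the function name is hypothetical.

```python
def window_limited_throughput(window_bytes, rtt_seconds, channels=1,
                              link_bytes_per_s=float("inf")):
    """Upper bound on throughput in bytes per second.

    Each connection can have at most one window of unacknowledged data
    in flight per round trip; independent connections add up until the
    capacity of the link itself is reached.
    """
    return min(channels * window_bytes / rtt_seconds, link_bytes_per_s)


# Illustrative values only: an 8 KB window and a 100 ms round trip.
single = window_limited_throughput(8192, 0.1)
triple = window_limited_throughput(8192, 0.1, channels=3)
print(f"one connection:    {single:,.0f} bytes/s")
print(f"three connections: {triple:,.0f} bytes/s")
```

The bound scales with the number of connections only while the link itself has spare capacity, which is one reason the gain from additional channels eventually levels off.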

The multisocket method exploits this idle time by using multiple data streams within a multitasking operating system. Each data channel is assigned to a separate "process" which receives control of the CPU on a time-shared basis. Each data channel has its own flow control, governed by its own TCP window. While packets in one channel are waiting for the TCP window to move, packets in other channels may be within their TCP windows, and will be sent promptly upon receiving control from the operating system. Since large images use up considerable transmission time, this method will be incorporated in the MIRS design.
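The segmentation and reassembly at the heart of the multisocket method can be sketched as below. This is not the authors' implementation: it uses threads rather than separate processes, and local socket pairs from `socket.socketpair()` rather than TCP connections across the Internet, so it demonstrates only the splitting, the parallel channels and the in-order reassembly, not the speed improvement. All function names are hypothetical.

```python
import socket
import threading


def split(data, n):
    """Divide data into n contiguous segments (the last may be shorter)."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]


def send_segment(sock, segment):
    """Send one segment; closing the socket marks the end of the segment."""
    with sock:
        sock.sendall(segment)


def recv_segment(sock, received, index):
    """Read one channel until it is closed and store the result by index."""
    chunks = []
    with sock:
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    received[index] = b"".join(chunks)


def multisocket_transfer(data, channels=4):
    """Send data over several channels at once and reassemble it in order."""
    received = [None] * channels
    threads = []
    for index, segment in enumerate(split(data, channels)):
        tx, rx = socket.socketpair()  # one socket-pair per segment
        threads.append(threading.Thread(target=send_segment, args=(tx, segment)))
        threads.append(threading.Thread(target=recv_segment, args=(rx, received, index)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b"".join(received)


image = bytes(range(256)) * 1000  # stand-in for an image file
print(multisocket_transfer(image) == image)
```

Each receiver writes into its own slot of `received`, so the segments return to their original order regardless of which channel finishes first.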

3. EXTENSIONS TO MIRS

In summary, MIRS is built on the concept of using industry-standard technology to provide access to mixed text/graphics databases in a powerful and integrated design. Industry-standard technology incorporated includes SQL search techniques, a Motif/X-windows based user interface, and the use of a relational database manager (Illustra) with object-oriented extensions. The MIRS client is intended to run on a conventional Sun workstation platform, such as a Sparc 10.

A basic consideration in the design of MIRS is the use of a graphical user interface (GUI) offering the type of controls (buttons, sliders, text entry fields, mouse interaction, etc.) which have become standard in human-computer interfaces. SQL queries, for example, may be formulated by simple interaction with screen buttons, list boxes and edit boxes.

Beyond this, it is desirable to enable the user to alter the image for better viewing in terms of more contrast, convenient viewing orientation and greater spatial detail. Image contrast may be enhanced, for instance, by histogram equalization, both global and region-limited. Also to be provided are functions which reorient the image on the screen, such as Flip and Rotate, and a zooming digital magnifying glass for viewing the processed image in more detail.
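The image manipulations mentioned above can be sketched on a small grayscale image held as a list of rows. The equalization shown is the standard global method based on the cumulative histogram; the function names, the 8-bit gray range and the list-of-rows representation are assumptions for illustration, and the region-limited variant is not shown.

```python
def equalize(image, levels=256):
    """Global histogram equalization of an image with values in 0..levels-1."""
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    if total == 0:
        return [list(row) for row in image]
    # Cumulative histogram: number of pixels at or below each gray level.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Only one gray level present: nothing to stretch.
        return [list(row) for row in image]
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[v] - cdf_min) * scale) for v in row] for row in image]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


low_contrast = [[100, 100, 101], [101, 102, 102]]
print(equalize(low_contrast))
print(flip_horizontal([[1, 2], [3, 4]]))
print(rotate_clockwise([[1, 2], [3, 4]]))
```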

Context-sensitive help would be useful, especially for those who would use MIRS occasionally. By placing the mouse pointer over any control and pressing the HELP key on the keyboard, the user should be able to pop up a window of information relevant to that control. This implementation keeps unnecessary and possibly confusing text off the main screen, while allowing access to help at any time. Also, as an aid to the user, text resources such as abstracts or documents describing the NHANES survey may be included in the database.

Also being considered are tools for statistical analysis to plot or classify the returned data; mensuration tools to measure distances (say, between vertebrae), areas or gray density; tools to convert units (say from the metric units used in the NHANES survey to more familiar English units in the returned data); and the capability of saving query results locally.
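A distance-measuring tool of the kind mentioned above could be as simple as the following sketch. It assumes that the pixel spacing of the full-resolution images equals the 175 micron scanner spot size quoted earlier, and it measures distance on the film only, with no correction for radiographic magnification; both are assumptions, and the names are illustrative.

```python
import math

PIXEL_MM = 0.175  # assumed pixel spacing: the 175 micron scanner spot size


def distance_mm(p, q, pixel_mm=PIXEL_MM):
    """Distance on the film, in millimeters, between two (x, y) pixel positions."""
    return math.hypot(q[0] - p[0], q[1] - p[1]) * pixel_mm


def mm_to_inches(mm):
    """Convert millimeters to inches (25.4 mm per inch)."""
    return mm / 25.4


d = distance_mm((0, 0), (300, 400))
print(f"{d:.1f} mm = {mm_to_inches(d):.2f} in")
```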

Contact:  George R. Thoma, Ph.D.
          National Library of Medicine 
          8600 Rockville Pike
          Bethesda, MD 20894 USA
          Phone: 301 496 4496      Fax: 301 402 0341
          Internet: thoma@nlm.nih.gov

4. REFERENCES

1. Thoma GR, Long LR, Berman LE. Access to a Digital Xray Archive over Internet. Proc. SPIE, Enabling Technologies for High-Bandwidth Applications. Vol. 1785, Sept 1992, pp. 79-86.

2. Lawrence RC, Everett DF, Hochberg MC. "Arthritis". Chapter in: Cornoni-Huntley JC, Huntley RR, Feldman JJ, eds. Health Status and Well-Being of the Elderly: National Health and Nutrition Examination Survey-I Epidemiologic Follow-up Study. Oxford University Press, New York, 1990; pp. 136-51.

3. Anderson JJ, Felson DT. Factors associated with osteoarthritis of the knee in the first national health and nutrition examination survey (NHANES I). Evidence for an association with overweight, race, and physical demands of work. Am. J. of Epidemiol. 1988. 128: 179-89.

4. Carter OD, Haynes SG. Prevalence rates for scoliosis in US adults: results from the first National Health and Nutrition Examination Survey. Int. J. Epidemiol. 1987. 16: 537-44.

5. Swanson CA, Jones DY, Schatzkin A, Brinton LA, Ziegler RG. Breast cancer risk assessed by anthropometry in the NHANES I epidemiological follow-up study. Cancer Res. 1988. 48: 5363-7.

6. Cooper RS, Ford E. Comparability of risk factors for coronary heart disease among blacks and whites in the NHANES-I epidemiologic follow-up study. Ann. Epidemiol. 1992. 2(5): 637-645.

7. Vital and Health Statistics: Data Systems of the National Center for Health Statistics, Series 1, No. 23, March 1989, U.S. Department of Health and Human Services, Public Health Service, Centers for Disease Control, pp. 14-16.

8. Vital and Health Statistics: Data Systems of the National Center for Health Statistics, Series 1, No. 10a. "Plan and Operation of the Health and Nutrition Examination Survey: United States-1971-1973", February 1973, U.S. Department of Health, Education, and Welfare, Public Health Service, National Center for Health Statistics, p. 8.

9. Lawrence RC. Getting the Message Out: Using Digitized Radiographs from NHANES II & III. Memorandum to Digitized Radiographic Images: Challenges and Opportunities Workshop. June 1993, Bethesda, MD.

10. Wegryn SA, et al. Comparison of digital and conventional musculoskeletal radiography: an observer performance study. Radiology. 1990; 175:225-8.

11. Seeley GW, et al. Total digital radiology department: spatial resolution requirements. AJR. Feb 1987; 148:421-6.

Communications Engineering Branch