EMF Measurements Database - Data Set Components

EMF Measurements Database, an EMF RAPID Program Engineering Project

Last updated: June 24, 1997

Data Set Components

Much of the work on this project to date has gone into developing the database design and specifying its components. This section briefly describes each of those components.

The EMF Measurements Database is organized as a collection of data sets, where a data set is the work product of a single project or study. Each data set is represented by one or more Data Products, as one would expect. However, the data products alone are not sufficient for a recipient to make good use of them. The recipient also requires information about what the data are, the manner in which the data were collected, and how the data are represented in the data products. This is the role of the Metadata. Further, it may be useful for the recipient to review previous results obtained from the data set. This is the role of the data set Reports.

Each data set stands within the Database as an independent entity. The degree of commonality of information across data sets is determined by the information that the corresponding studies collected. However, the Database will attempt to reduce variability in the format of data products and documentation, so that the difference or sameness of data sets is not unduly obscured by differences in the format of data products or documentation. Principally, this effort takes the form of standardized formats for distributed material.

Naturally, for each of the data set components, the Database will accept contributions of information in a more diverse range of formats than it will distribute. That is, the Database's input specifications are much less stringent than its output specifications, over which it exercises more control.

Data Products
Data products are provided by the EMF Measurements Database in two basic forms: binary data products and delimited ASCII data products. Data products are ordinarily distributed in a compressed format (typically in a ZIP archive), containing the data product file or files and a separate file containing the user license information. The Database will attempt to accommodate reasonable requests for alternate formats.
- Delimited ASCII data products
  The delimited ASCII data product is intended as a generic format that can be imported easily by analysis software. In a delimited ASCII data product, the data is presented entirely as text (in the ASCII character set). Numeric data is represented by character strings, rather than in binary encodings. The values within the data product files are separated by specified delimiter characters. Virtually all general purpose analysis software is capable of reading this format, so it provides a lowest common denominator solution.
  
  This format is organized into records and fields. Each record is made up of a consistent set of fields. Values within the record are not individually labeled; rather, the meaning of each value is inferred from its position within the record. Data in this format may have a variable number of records, but the number and order of fields within the records of a data product do not vary. This organizational model of data is pervasive. It is used for spreadsheets, relational database tables and most other general purpose analysis tools, so it maps well to these tools.
  
  Both field delimiters and record delimiters must be known in order for an analysis program to properly import the data. This information, along with descriptions of each field and other information, is provided in the metadata.
  
  A disadvantage associated with this kind of `flat' format is that information in some fields often doesn't change across large blocks of records and considerable redundancy can be introduced. Also, representation of numbers as text strings is less space-efficient and potentially less precise than in a binary format. Where these disadvantages are present, the Database provides the corresponding data as a binary data product.
- Binary data products
  A binary data product is specifically one that is not a delimited ASCII data product, but one whose format can be described in terms of a sequence of nested structures. The delimited ASCII data product has the advantage that it is easily imported by most analysis software. However, because of its flat structure the ASCII data product is often quite repetitive and bulky, and a more compact representation is sometimes desirable. That is where the binary data product comes in. The compactness of the binary data product makes it practical to represent the full detail available for the data, which might be impractical in the delimited ASCII format.
  
  Binary data products, in general, won't be directly importable by general purpose analysis software. Instead, the format will usually vary between data sets and require custom programming to extract the information. However, binary data products may contain more detailed information than the corresponding ASCII data products. Naturally, sufficient documentation for the binary data product format is provided in the metadata.
  
  The binary data product represents a different choice in the trade-off between ease of access on one hand, and compactness and completeness of detail on the other. The appropriate choice is dependent on the needs and abilities of the recipient.
  
  A binary data product format is defined in terms of named data segments and composites of data segments. Each named segment of the format is documented and described in terms of its size (which can be fixed or variable) and position (relative to the beginning of the file or some other named data segment). At the finest granularity, the segments contain the data values. For a more detailed description, see the binary data product example.
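
The trade-off between the two forms can be illustrated with a short Python sketch. Everything specific in it is an assumption made for illustration and not a format defined by the Database: the sample records, the comma delimiter, and a binary layout consisting of one `header` segment (site id, unit, value count) followed by one `values` segment of 32-bit floats. The delimited form repeats the site and unit in every record, while the binary form stores them once.

```python
import csv
import io
import struct

# Hypothetical records (site id, unit, reading); not real Database content.
RECORDS = [(7, "mG", 1.25), (7, "mG", 1.5), (7, "mG", 0.75)]

# Assumed binary layout: a "header" segment followed by a "values" segment.
# header: uint16 site id, 2-byte ASCII unit, uint32 value count (little-endian)
HEADER = struct.Struct("<H2sI")


def to_delimited(rows, field_delim=",", record_delim="\n"):
    """Flat form: every record repeats every field as text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=field_delim, lineterminator=record_delim)
    writer.writerows(rows)
    return buf.getvalue().encode("ascii")


def from_delimited(data, field_delim=","):
    """Field meaning comes from position: (site, unit, reading)."""
    text = io.StringIO(data.decode("ascii"), newline="")
    reader = csv.reader(text, delimiter=field_delim)
    return [(int(site), unit, float(value)) for site, unit, value in reader]


def to_binary(rows):
    """Nested form: fields shared by all records are stored once."""
    site, unit = rows[0][0], rows[0][1]
    if any((r[0], r[1]) != (site, unit) for r in rows):
        raise ValueError("this sketch assumes one site and unit per file")
    values = [r[2] for r in rows]
    header = HEADER.pack(site, unit.encode("ascii"), len(values))
    # float32 keeps the sketch compact; it is exact for these sample values
    # but would round many other decimal readings.
    return header + struct.pack(f"<{len(values)}f", *values)


def from_binary(data):
    """Read the header segment, then the values segment that follows it."""
    site, unit, count = HEADER.unpack_from(data, 0)
    values = struct.unpack_from(f"<{count}f", data, HEADER.size)
    return [(site, unit.decode("ascii"), value) for value in values]


if __name__ == "__main__":
    ascii_bytes = to_delimited(RECORDS)
    binary_bytes = to_binary(RECORDS)
    print("delimited ASCII bytes:", len(ascii_bytes))
    print("binary bytes:", len(binary_bytes))
```

Reading the binary form requires knowing the segment sizes and positions in advance, which is the custom programming cost noted above; the delimited form needs only the delimiters and the field order.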

Metadata
Metadata provides a description of the data set, overall as well as for each data product. The name `metadata' means `data about data.' It serves two primary purposes in the EMF Measurements Database: as a `catalog' and as a `manual'.

Metadata provides a description that will be useful in determining if a data set is appropriate for the database user. It does so by describing the nature of the measurements that were made, as well as the context and the manner in which they were collected.

Metadata also provides detailed documentation on each of the data products in a data set. This documentation allows the recipient to decipher and understand both the structure and the values within the data products.
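
As a rough illustration of the `manual' role, the Python sketch below uses a metadata description to interpret a delimited ASCII data product. The dictionary layout, the field names, and the `read_product` helper are all hypothetical; the Database's actual metadata is a document written for human readers, not a machine-readable structure like this one.

```python
# Hypothetical metadata for one delimited ASCII data product.
# The keys and field names are assumptions made for this sketch.
METADATA = {
    "field_delimiter": "\t",
    "record_delimiter": "\r\n",
    "fields": [
        {"name": "timestamp", "type": str},
        {"name": "b_field_mG", "type": float},
    ],
}


def read_product(text, metadata):
    """Split text into records and label each value by its position."""
    fields = metadata["fields"]
    records = []
    lines = text.split(metadata["record_delimiter"])
    for line_no, line in enumerate(lines, start=1):
        if not line:
            continue  # skip the empty piece after a trailing delimiter
        values = line.split(metadata["field_delimiter"])
        if len(values) != len(fields):
            raise ValueError(
                f"record {line_no}: expected {len(fields)} fields, "
                f"found {len(values)}"
            )
        records.append(
            {f["name"]: f["type"](v) for f, v in zip(fields, values)}
        )
    return records


if __name__ == "__main__":
    sample = "1997-06-24T10:00\t1.25\r\n1997-06-24T10:01\t1.5\r\n"
    for record in read_product(sample, METADATA):
        print(record)
```

Without the delimiters and the field order from the metadata, the same text would be an unlabeled run of characters, which is why the data products are not distributed alone.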

Metadata are currently available from the Database in two forms: online, as hypertext HTML (see the EPA metadata for an example); and offline, as a printed document. The Metadata Content Specification for the Database provides a detailed description of the information that the metadata will provide for each data set.

Reports
In order to provide a quick introduction to a data set and, in particular, to results derived from it, data set descriptions may be accompanied by reports submitted by those who have analyzed the data. A report will consist of text, tables and graphics summarizing the results of an analysis of the data set. The Database does not have the resources to generate the reports, but it will accept, edit, and make such reports available in association with the data sets it distributes.