ARM External Data Center (XDC) Requirements

Joyce L. Tichler
August 1996

1.0 Introduction

This specification establishes requirements for the Atmospheric Radiation Measurement (ARM) External Data Center (XDC).

It will be used to guide the transition of the management of a class of data known as "external data" from the ARM Experiment Center (EC) located at Pacific Northwest National Laboratory (PNNL) to the XDC which is to be located at Brookhaven National Laboratory (BNL). This document will discuss the requirements for the daily running of the XDC. It will also focus on the requirements to provide a development environment to generate new procedures to acquire and process external data.

1.1 Purpose of the ARM XDC

The purpose of the ARM External Data Center is to acquire, in a timely fashion, data from sources outside of the ARM Program which will augment the data collected at the ARM sites, and to deliver those data consistently to the ARM Science Team, Experiment Center and Archive. These data, known as "external data", are to be of known and reasonable quality.

1.2 Overview

The ARM XDC acquires data from external data sources for use by ARM Science Team members in their research. The ARM Project is instrumenting three Cloud and Radiation Testbed (CART) sites located in the Southern Great Plains (SGP), the Tropical Western Pacific (TWP) and the North Slope of Alaska (NSA). Guidance from the ARM Science Team and ARM infrastructure helps determine which external data, covering the area of the ARM CART sites, are available and should be acquired. These data are transformed from their original received form to data sets of greater value to the ARM Science Team. These transformations may involve subsetting, merging distinct sets, changing format, averaging or applying algorithms to produce new geophysical parameters.

Since the SGP site has been operational since 1992 and the TWP and NSA sites are scheduled to come on-line in the fall of 1996 and sometime in 1997, respectively, the majority of the external data collected to date support the research being done on the SGP.

The XDC is located at BNL and staffed by the members of the Scientific Information Systems Group. The output external data files, known as "platforms", are delivered to the ARM Experiment Center, the ARM Archive and the ARM Science Team. The flow of information through the system is shown in Figure 1.

1.3 General Requirements

The XDC operates 7 days a week, 24 hours a day. Staff of the XDC are available during BNL's normal working hours from 8 AM to 5:30 PM, Eastern Local Time. The XDC collections and ingests must be capable of being run in an automated mode. The XDC must be capable of operating unattended.

The XDC will keep online the last 3 months of all external data platforms except satellite data. The last 30 days of satellite data or 20 Gbytes (whichever is less) will be kept online.
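
The retention rule above can be sketched as follows. This is only an illustration under stated assumptions, not the XDC's actual procedure: "3 months" is approximated as 90 days, a Gbyte is taken as 10^9 bytes, "whichever is less" is read as keeping the newest satellite files within the 30 day window until the 20 Gbyte limit would be exceeded, and the DataFile record and the function name are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class DataFile(NamedTuple):
    """Hypothetical record describing one online external data file."""
    name: str
    timestamp: datetime
    size_bytes: int
    is_satellite: bool


GBYTE = 10**9                                # assumption: decimal Gbytes
NON_SATELLITE_WINDOW = timedelta(days=90)    # assumption: "3 months" ~ 90 days
SATELLITE_WINDOW = timedelta(days=30)
SATELLITE_CAP_BYTES = 20 * GBYTE


def files_to_keep_online(files, now):
    """Return the set of file names that stay online under the retention rule."""
    keep = set()
    recent_satellite = []
    for f in files:
        age = now - f.timestamp
        if f.is_satellite:
            if age <= SATELLITE_WINDOW:
                recent_satellite.append(f)
        elif age <= NON_SATELLITE_WINDOW:
            keep.add(f.name)

    # Satellite data: walk from newest to oldest and stop at the size cap.
    total = 0
    for f in sorted(recent_satellite, key=lambda f: f.timestamp, reverse=True):
        if total + f.size_bytes > SATELLITE_CAP_BYTES:
            break
        total += f.size_bytes
        keep.add(f.name)
    return keep
```

Files not in the returned set would be candidates for removal from online storage once copies exist at the Archive.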

2.0 The ARM XDC Customers

The ARM XDC customers are the entities to which the ARM XDC provides some type of output. They are discussed in greater detail below.

2.1 The ARM Experiment Center (EC)

The ARM Experiment Center is located at PNNL. Its role in the ARM program is to package and deliver data from all the ARM Cloud and Radiation Testbed (CART) sites and from the XDC to the ARM Science Team and the ARM Archive.

Data are delivered to the Science Team according to the Experiment Operations Plans (EOPs) that scientists submit. The EC also runs value-adding processes which apply algorithms to one or more data streams (either ARM generated or external data) and produce new "value-added" data streams. It is the responsibility of the EC to track all data streams. The XDC supplies the EC with data and/or metadata needed by the EC to fulfill its responsibilities.

2.2 The ARM Archive

The ARM Archive acts as a permanent repository for all ARM data and metadata and a source of ARM data for the general community as well as for ARM infrastructure and the ARM Science Team. The XDC provides the Archive with external data and accompanying metadata and also uses the Archive as a repository for XDC programs, data bases and other records.

2.3 The ARM Science Team

In performing its research, the ARM Science Team may come to the XDC to obtain external data files, metadata concerning the external data or records of what external data files exist. Members of the Science Team may also request that new types of external data be found to fulfill unmet measurement needs.

2.4 The ARM Data and Science Integration Team (DSIT) Leader

The ARM DSIT leader manages the Data and Science Integration Team. The DSIT Leader represents DSIT to both the DOE ARM Program Office and to the ARM Science Team. The DSIT Leader must know what external data sets are available over what time periods and must understand how the XDC operates. The DSIT Leader needs to be aware of any operational problems that occur at the XDC.

2.5 Value-added and ingest procedures (VIPs) developers

VIP developers will provide the XDC with production level scripts and software which produce value added data streams and use external data as input. The VIP developers will also provide methods and tools to assure data quality for the value added products generated by their software.

2.6 ARM Problem Review Board (PRB)

The ARM Problem Review Board reviews all data quality reports (DQRs). It will provide the XDC with DQRs that have been submitted concerning external data and products generated from external data. The PRB expects the XDC to deal with questions concerning the quality of the external data platforms.

3.0 The ARM XDC Suppliers

The ARM XDC suppliers, External Data Centers, provide external data to the ARM project. They also provide the XDC with metadata about their data sets and sometimes with quality assurance information. They provide the XDC with information on the availability of new data sets. The individual External Data Centers are discussed in greater detail below. Table 1 lists the current external data platforms, the number of files generated per day, file size and size per day.

Note:

a. * The file size varies; the maximum size is listed.

b. ** File is collected monthly.

3.1 Oklahoma Mesonet

The Oklahoma Mesonet provides surface meteorology data from 111 stations located within Oklahoma. These data are available to the XDC on Oklahoma Mesonet computers. The OK Mesonet has provided ARM with a login account. These data are only to be made available to ARM Science Team members. Members of the general public must contact the Oklahoma Mesonet directly to obtain these data. Files are retrieved in near real-time on a daily basis. Data are available in either 5 minute or 15 minute averages. These data are currently made available as ASCII files. The Oklahoma Mesonet plans to provide quality assurance information on their data at some future time.

3.2 Kansas State Climatologist

The Kansas state climatologist provides ARM with quality assured surface meteorology data from 15 stations within Kansas. The data are made available to the XDC for pickup from an anonymous ftp site on a University of Kansas computer. The files are made available approximately one month after they are generated. Data are available as hourly averages and daily averages, maxima, minima and sums. These data are made available as ASCII files.

3.3 National Centers for Environmental Prediction (NCEP)

The NCEP computers provide some of the data sets used as input to the weather prediction models as well as the outputs of the models themselves. These data are available via anonymous ftp. The data sets currently obtained from NCEP include analysis products from both the Eta [1] and the RUC (Rapid Update Cycle) models. These data are provided in GRIB format. The XDC picks up four Eta files per day and eight RUC files per day.

National Weather Service surface station data are retrieved from NCEP for operational use. These data are later replaced with quality assured versions (see 3.6).

In all cases, the files must be captured within one day of the time they are generated or they are no longer available.

3.4 SeaSpace

ARM has a contract with SeaSpace under which SeaSpace provides satellite data from the GOES, POES and GMS satellites. These data are provided in near real-time on the XDC computers via anonymous ftp. Backup data files are also made available on tape shipped on a monthly basis. These data are provided in TDF (Terascan Data Format). The GOES data are available at least hourly, but files may be available every 30 minutes. The AVHRR data from the POES satellites are usually available twice a day from the two currently orbiting satellites. These data sets are the largest in volume and provide the greatest storage challenge to the XDC.

The XDC will soon be adding POES data covering the TWP (Tropical Western Pacific) which will be provided via a contract with James Cook University.

3.5 Forecast Systems Laboratory (FSL)

FSL provides ARM with water vapor measurements over a subset of the Wind Profiler Demonstration Network (WPDN) stations. These data are the result of the analyses of data from GPS (Global Positioning System) receivers located at some WPDN sites. These data are put in the anonymous ftp area of the XDC computers approximately one day after collection. The data are 30 minute averages, provided as 48 netCDF files per day.

3.6 National Climatic Data Center (NCDC)

The NCDC makes National Weather Service (NWS) quality assured surface meteorology and upper air sounding data available to the general public via a Web site. These files are supplied in BUFR format. The files for a given month are available approximately three months after they are generated.

3.7 Arkansas Basin Red River Forecast Center (ABRFC)

The ABRFC generates hourly netCDF files which contain gridded, 4 km resolution precipitation estimates for the Arkansas Basin Red River Forecast area. These files are made available in near real-time by ABRFC via anonymous ftp. The XDC retrieves these files on a daily basis.

3.8 UCAR Office of Field Project Support (OFPS)

The OFPS provides the Global Energy and Water Cycle Experiment (GEWEX) with data management support. GEWEX and ARM have a cooperative data sharing arrangement. The XDC is acquiring high resolution (6 second), quality assured, upper air NWS soundings from OFPS. These data are provided as available; delays are normally of the order of 6 months. The files are provided in EBUFR format.

3.9 Other External Data Centers

The XDC must be prepared to deal with other External Data Centers which have data needed by the ARM Science Team. The Tropical Western Pacific (TWP) ARM site will be coming on-line later this year. Preparations to obtain data for that region are already underway. The North Slope of Alaska (NSA) site will follow and necessitate the identification of still other data sources.

3.10 Brookhaven National Laboratory (BNL)

The XDC is located at BNL. It must therefore conform to the work schedule, safety and security standards established by the laboratory. Since the XDC is a part of BNL, it benefits from the overhead and infrastructure support provided by the laboratory.

4.0 The ARM XDC Requirements

The requirements for the ARM XDC were determined in a facilitated design session which was attended by representatives of the XDC user community. The output of the meeting was an operational model of the running of the XDC. The requirements have been grouped into five categories and are discussed below.

4.1 Acquire new data, tools and processes

In order to satisfy the changing needs of the ARM Science Team, the XDC must acquire new data and develop the tools and processes required to ingest the data and quality assure them.

4.1.1 Acquire new external data

When the XDC acquires new data sets, either from External Data Centers or as the output of a new value-added product, it must notify the EC and the Archive of the existence and availability of these new data. This notification is made via a "notification form". The standard for notification forms has been established by the EC. The XDC must keep an up-to-date inventory of available and planned external data platforms. The inventory should include information on the potential size of files as well as detailed information on the content of the files.

4.1.2 Build value-added and ingest procedures (VIPs)

When new VIPs are made available by VIP developers, the XDC must install the new process into the XDC pseudo-production and then production environment. The procedure to be followed will be modeled on that currently in place at the EC.

4.1.3 Develop, acquire, and maintain "data use" tools

The XDC will put in place tools to assist the ARM Science Team and the members of the infrastructure performing quality assurance and using the external data in their research. These data tools may be developed within the XDC or may be provided by External Data Centers and VIP developers. The XDC must maintain the tools under configuration management and see that they are properly documented.

4.1.4 Explore future external data streams

XDC staff must be in contact with VIP developers and External Data Centers to learn of the possible existence of new external data streams. Information about these new data must be made available to the EC and the Archive so that they can plan for their storage and notify their users of the future availability of the new data.

4.1.5 Compile information and tools to regenerate external data streams

The information and tools needed to regenerate external data streams from the raw data files must be compiled and available for the XDC and for the ARM Reprocessing Center, should that reprocessing be required.

4.2 Build and provide XDC development and production environments

In order for the XDC to come into existence it is necessary that a production and development environment be created. These environments must be documented and that documentation maintained as the XDC evolves. The tasks involved in achieving this are grouped into four areas and discussed below.

4.2.1 Acquire necessary hardware

Before the necessary hardware can be acquired, it is necessary to research the options available. A detailed specification of each piece of equipment required must be produced as well as the justification for that choice.

The location of the new hardware must be determined and that location must be prepared to house the equipment to be purchased.

When the hardware arrives it must be assembled, interconnected and tested.

4.2.2 Provide necessary environment

The XDC production and development environments must have an agreed upon directory structure and environmental variables. Sets of users and groups must be defined for both production and development.

It is also necessary to define host names, create automount maps, define the local area network environment and arrange for backups.

4.2.3 Provide necessary software

The XDC production and development environments must have an agreed upon operating system installed. The public domain software to be used must be itemized and installed. The necessary proprietary software must be itemized and installed.

The software developed within the ARM infrastructure which is required by the XDC must be itemized and installed. The ingests and collections for the external data must be installed.

Automated transfer procedures to move files between the XDC and the EC and the XDC and the Archive must be in place.

4.2.4 Provide necessary operational environment

In order for the XDC to become operational, the manuals and plans listed below must be provided. These manuals and plans must be maintained under configuration control and should be available via a Web interface.

* Users Manual for XDC Development System

The Users Manual will establish the rules for the various software developers using the XDC Development System. The manual should contain the following information about the XDC Development System: the hardware and software configuration, user responsibilities, environmental variables, directory structure, list of users/groups and permissions, and list of all the currently collected external data streams.

* Operations Plan for XDC Production system

The operations plan documents the plans for the operations of the XDC Production system. The plan will include information on both hardware and software configuration. It should also include all operator procedures. The plan should list all currently collected external data streams.

* Operations Manual for XDC Production system

The Operations Manual for the XDC Production system will provide guidance to the XDC operations staff and will explain the operating environment of the XDC Production System to its customers. The manual should include information on the hardware and software configuration of the XDC Production System and should detail operator procedures. The manual should list all currently collected external data streams. It should also contain information on how to contact the various suppliers of the external data sets.

* Maintenance Plans for XDC Development and Production systems

The plans should include information on how both hardware and software will be maintained and an estimate of annual costs involved.

* Security Plans for XDC Development and Production systems

The security plan should discuss provisions for controlling access to the XDC Development and Production systems and the reporting mechanisms to be used to report attempted breaches of security.

4.3 Quality assure external data

The XDC must quality assure the external data it delivers to the Archive and the EC. The quality assurance is divided into the two functions shown below.

4.3.1 Determine reason for missing external data streams

When there is a discrepancy between the number and size of files expected and received, the XDC operator must determine the reason. This will involve verifying that the jobs which should have fetched the data streams ran as scheduled. The operator may also check to see if a back-up procedure fetched the missing files. If not, the operator may then rerun the jobs that fetch the files. If the files are not available, the operator will attempt to contact the source of the data to obtain an explanation of why the files are missing. The fact that files are missing or of unexpected size must be captured so that the various XDC customers will be aware of missing files and, if known, the reason that they are missing.

4.3.2 Perform quality assurance on available external data streams

Routine quality assurance tests should be applied to the external data and value added products provided to XDC customers. Such tests should identify outliers, too rapid changes between sequential data points and other such metrics.
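
The two kinds of test named above (outliers and too rapid changes between sequential data points) might look like the following minimal sketch. The function names, the use of None for missing values, and the limits chosen in any call are assumptions made for illustration; actual limits would depend on the parameter and the platform.

```python
def flag_outliers(values, lower, upper):
    """Return indices of values that fall outside the range [lower, upper].

    Missing values (None) are skipped rather than flagged.
    """
    return [
        i for i, v in enumerate(values)
        if v is not None and not (lower <= v <= upper)
    ]


def flag_rapid_changes(values, max_step):
    """Return indices where the change from the previous point exceeds max_step.

    A pair is skipped if either point is missing (None).
    """
    flagged = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if prev is None or cur is None:
            continue
        if abs(cur - prev) > max_step:
            flagged.append(i)
    return flagged
```

For example, applied to a series of surface temperatures in degrees Celsius, flag_outliers(temps, -40.0, 50.0) would mark physically implausible readings, and flag_rapid_changes(temps, 5.0) would mark jumps of more than 5 degrees between consecutive samples.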

4.4 Manage external data streams

The XDC production environment must manage external data streams. The requirements which are part of this management are discussed below.

4.4.1 Collect raw external data

The procedure for collecting raw external data streams varies. In some cases a script is run which performs an anonymous ftp to a server providing the external data. In other cases, it is necessary to login to a specified computer with a given account. In still other cases, the data files are placed into the anonymous ftp area of the XDC for retrieval.
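
The first case, a script performing an anonymous ftp to a supplier's server, could be sketched with Python's standard ftplib module as shown below. The host, remote directory and destination directory are hypothetical parameters, and the rule of fetching only files not already held locally is an assumption; the actual collection scripts may differ.

```python
from ftplib import FTP
from pathlib import Path


def select_new_files(remote_names, local_names):
    """Return, in sorted order, the remote file names not already held locally."""
    have = set(local_names)
    return sorted(name for name in remote_names if name not in have)


def collect_anonymous_ftp(host, remote_dir, dest_dir):
    """Fetch new files from remote_dir on host into dest_dir via anonymous ftp.

    Returns the list of file names that were retrieved.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    local_names = [p.name for p in dest.iterdir() if p.is_file()]

    with FTP(host) as ftp:
        ftp.login()                 # no arguments: anonymous login
        ftp.cwd(remote_dir)
        wanted = select_new_files(ftp.nlst(), local_names)
        for name in wanted:
            with open(dest / name, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
    return wanted
```

Because some suppliers remove files within a day of generation, a script of this kind would need to run at least daily for those sources.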

4.4.2 Ingest raw external data

The ingest converts the raw external data to a standard format, usually netCDF or HDF and creates platforms which cover a specified period of time. The platform size is chosen so as to be most efficient for use by ARM Science Team.
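
The step of creating platforms which cover a specified period of time can be illustrated by the sketch below, which partitions time-stamped records into one group per calendar day. The one day period, the (timestamp, value) record layout and the function name are assumptions for illustration only; the conversion to netCDF or HDF is not shown.

```python
from collections import defaultdict
from datetime import datetime


def group_by_day(records):
    """Partition (timestamp, value) records into one time-ordered list per day.

    Returns a dict mapping each calendar date to its records, with the dates
    and the records within each date in chronological order.
    """
    by_day = defaultdict(list)
    for timestamp, value in records:
        by_day[timestamp.date()].append((timestamp, value))
    return {
        day: sorted(rows, key=lambda row: row[0])
        for day, rows in sorted(by_day.items())
    }
```

Each resulting group would then be written out as a single platform file in the standard format.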

4.4.3 Generate value-added data streams

These procedures transform one or more external data streams, perhaps in combination with ARM data, into a value added data stream in a standard format, usually netCDF or HDF.

4.4.4 Build and maintain a Platform Data Base for external data platforms

The PDB maintained at the XDC will be consistent with the PDB maintained at the EC. However, the PDB at the XDC will only deal with external data platforms. Any changes to entries at the XDC will be forwarded to the EC by a mutually decided upon procedure. The XDC PDB may contain tables not available in the EC PDB. These additional tables will contain more dynamic information used to track the external data platforms while at the XDC.

4.4.5 Create and maintain lists of available and expected external data streams

Lists will be maintained of which external data platforms are currently being collected and ingested, what data files exist, and whether there are copies at the EC and/or Archive. Information must be available as to whether the files expected were actually received, and whether the size of the files was as expected. This information should be available for the entire time period over which the external data platform exists.

4.4.6 Generate file delivery status reports

The XDC must generate file delivery status reports for all deliveries of external data files to its customers. The status report reconciles the list of files requested by the customer, the list of files the XDC sent and the list of files the customer received.
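
The reconciliation of the three lists can be expressed with set operations, as in the following sketch. The function name and the four report categories are assumptions chosen for illustration; the actual report layout is not specified here.

```python
def reconcile_delivery(requested, sent, received):
    """Compare the requested, sent and received file lists for one delivery.

    Returns a dict of sorted file-name lists, one per category.
    """
    requested, sent, received = set(requested), set(sent), set(received)
    return {
        # Requested, sent by the XDC and confirmed received by the customer.
        "delivered": sorted(requested & sent & received),
        # Requested by the customer but never sent by the XDC.
        "not_sent": sorted(requested - sent),
        # Sent by the XDC but not reported as received by the customer.
        "sent_not_received": sorted(sent - received),
        # Sent by the XDC although the customer did not request them.
        "unrequested": sorted(sent - requested),
    }
```

A delivery is clean when every category other than "delivered" is empty.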

4.4.7 Produce general external data stream documentation

General external data stream documentation must be made available to customers. The type of information available may include information on the supplier of the data, information on the content of the data files, and references to existing papers or reports concerning these data.

4.4.8 Generate data packing slips

The data packing slip will be sent to the customer along with requested data. The packing slip will list the files and file sizes being sent in response to the request. It will also include an explanation of missing requested data streams, appropriate entries from the PDB and DQRs.

4.4.9 Generate list of delivered files

A list of all files delivered and to which customers must be generated as part of daily reports on XDC performance and statistics.

4.4.10 Log the delivery of data

It is necessary to maintain a record of the delivery of data to customers. The record should include date and time of delivery, method of delivery, list of files delivered and the customer.
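
One possible shape for such a record is sketched below, with one field for each item listed above. The class name, field names and the choice of a JSON line as the storage form are assumptions for illustration, not an established XDC format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass
class DeliveryRecord:
    """Hypothetical record of one delivery of data to a customer."""
    delivered_at: datetime   # date and time of delivery
    method: str              # method of delivery, e.g. "ftp" or "tape"
    files: list              # names of the files delivered
    customer: str            # e.g. "EC", "Archive" or a Science Team member


def to_log_line(record):
    """Serialize a delivery record as a single JSON line for a log file."""
    entry = asdict(record)
    entry["delivered_at"] = record.delivered_at.isoformat()
    return json.dumps(entry, sort_keys=True)
```

Appending one such line per delivery gives a log from which the daily list of delivered files could be generated.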

4.4.11 Execute the file check

This automated procedure checks the various directories in which new external data streams are stored against a list of expected files and file sizes and generates a report which is emailed to XDC staff. The report lists all the external data platforms expected and indicates whether or not they were received and if there was a variance from the expected file size.
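
A minimal sketch of such a file check is given below. The mapping of expected file names to expected sizes, the 10 percent size tolerance and the wording of the report lines are all assumptions for illustration; emailing the report to XDC staff is not shown.

```python
from pathlib import Path


def file_check(directory, expected, tolerance=0.10):
    """Compare files in a directory against expected names and sizes.

    expected maps file name -> expected size in bytes. A file is reported as
    a size variance when it differs from the expected size by more than the
    given fraction (tolerance). Returns one report line per expected file.
    """
    lines = []
    for name, expected_size in sorted(expected.items()):
        path = Path(directory) / name
        if not path.is_file():
            lines.append(f"{name}: MISSING")
            continue
        actual = path.stat().st_size
        if abs(actual - expected_size) > tolerance * expected_size:
            lines.append(
                f"{name}: SIZE VARIANCE (expected {expected_size}, got {actual})"
            )
        else:
            lines.append(f"{name}: OK")
    return lines
```

For platforms whose file size varies, a wider tolerance or a check against the maximum size alone may be more appropriate.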

4.5 Manage and report on XDC operations

The fifth area of requirements for the XDC is to manage and report on XDC operations. These requirements have been broken into five different areas and are discussed in greater detail below.

4.5.1 Report problems in XDC operations

The XDC must maintain a record of problems in operations. The record will allow management to plan improvements in operations and to alert customers about XDC operations problems which may impact their work. The problem reports should be entered into an XDC Operations Log data base.

4.5.2 Prepare XDC operational information

This requirement refers to the need to continuously maintain and update the XDC Operations Manual which was created under 4.2.4.

4.5.3 Collect operational problems

Information on XDC operational problems, problems at the External Data Centers which are suppliers to the XDC and explanations of missing data need to be gathered for reporting purposes.

4.5.4 Generate XDC performance and statistics reports

The XDC must generate reports and statistics on its performance. These reports and statistics will follow standards being established by the ARM DSIT.

4.5.5 Write monthly reports

The XDC must report on a monthly basis to the DSIT leader on the functioning of the XDC. The information gathered by 4.5.1 - 4.5.4 serves as input to these reports.

5.0 Abbreviations

ABRFC Arkansas Basin Red River Forecast Center
ARM Atmospheric Radiation Measurement
BNL Brookhaven National Laboratory
BUFR Binary Universal Form for the Representation of meteorological data
CART Cloud and Radiation Testbed
DSIT Data and Science Integration Team
DQR Data Quality Report
EC Experiment Center
EOP Experiment Operations Plan
GEWEX Global Energy and Water Cycle Experiment
GRIB Gridded Binary
HDF Hierarchical Data Format
NCEP National Centers for Environmental Prediction
NCDC National Climatic Data Center
netCDF Network Common Data Form
NSA North Slope of Alaska
OFPS Office of Field Project Support
PDB Platform Data Base
PRB Problem Review Board
PNNL Pacific Northwest National Laboratory
RUC Rapid Update Cycle
SGP Southern Great Plains
TDF Terascan Data Format
TWP Tropical Western Pacific
UCAR University Corporation for Atmospheric Research
VIPs Value-added and Ingest Procedures
WPDN Wind Profiler Demonstration Network
XDC External Data Center