YUE Xinyang, SHI Xiaoxiao, SONG Xiao, GENG Shanshan, ZHENG Bing, DONG Mingmei
National Marine Data and Information Service, Tianjin 300171, China
Abstract: International buoy data have become an important data source for operational updates. Acquiring them helps track changes in the marine environment of the surrounding sea areas and improves early warning of abnormal conditions. To address the many sources of international buoy data and the low degree of automation in their processing, this paper tackles the key technical bottleneck of collecting and processing massive multi-source international buoy data, and designs and develops an operational processing system for international buoy data. The system is divided into subsystems for data collection, data processing and standardization, quality control, ranking and integration, data output, data application, and management. It implements the operational download, processing, and product output of international buoy data, enables rapid and effective use of the data, and has achieved good application results.
Keywords: buoy data, operational, data processing system
As an important observation platform, buoys provide long-term, continuous, all-time, and automated observations[1], encompassing marine hydrology, meteorology, biology, chemistry, and other disciplines. Because a large number of countries and institutions operate and maintain buoys, the mounted probes, observation elements, and data formats differ, and the amount of information is huge. As a result, buoy data are shared and published in diverse forms, the coding methods of websites are not uniform, updates are frequent, and the available information is fragmented.
The construction of an operational processing system for marine environmental information integrates marine science, GIS, and computer science[2,3]. For the operational processing of international data, the National Marine Data and Information Service (NMDIS) has, after a long period of accumulation, built and operated operational processing systems for global temperature and salinity data, Argo real-time data, and global water level data[4]. International buoy data, although an indispensable part of the observation system, has lagged slightly in the construction of an operational system. In recent years, as buoy data processing work has progressed, NMDIS has developed quality control procedures for each type of buoy and has accumulated considerable experience in buoy data processing and quality control technology. For a long time, however, these processes have been driven manually, with discrete programs run step by step and quality control performed on files. Data downloading still requires manual intervention, so real-time data processing requirements cannot be met. The downloading, decoding, processing, and management of data therefore urgently need dedicated system support, together with data management, monitoring, and statistical analysis functions based on the GIS platform. The international buoy data operational system proposed in this paper is designed to integrate every link involved in buoy data processing, which guarantees the orderly operational processing of buoy data.
The international buoy data mainly come from drifting buoys and moored buoys. The observation elements include wind, temperature, pressure, humidity, waves, currents, sea surface temperature, salinity, and many others. The spatial scope covers the global sea area, and the time range is 1979 to present. Data are managed and distributed mainly through international organizations and programs such as the National Data Buoy Center (NDBC), the Global Drifter Program (GDP), the International Arctic Buoy Programme (IABP), GDAC-France, GDAC-Canada, and the Tropical Moored Buoy Implementation Panel[5]. The data are distributed through the web, FTP, mobile APP, satellite communication, email, and other forms of data exchange. By acquisition time, the data are classified as real-time, near-real-time, or delay-time.
The framework of the international buoy operational system mainly consists of six parts: the data source layer, basic environment layer, receiving and updating layer, data processing layer, database system, and data application layer[6]. The standard specification system provides technical standards and management rules for the whole system, and the system security component supports the software, hardware, and information processing. The system framework is shown in Fig.1.
Fig.1 System Framework
Basic environment layer: consists of servers, firewalls, storage devices, network devices, and other hardware and software.
Receiving and updating layer: receives metadata files; receives and preliminarily inspects data source information; classifies, collates, and selectively updates data; supports manual import of data; collects and exchanges processing results; and tracks the flow and logs of data source information and information products in each link.
Data processing layer: covers data processing, publication of parsing rules for delay-time and real-time data, generation of standard data files, parsing into the database, and status monitoring and query of the parsing process.
Database system: built on the Marine Environment Integrated Database, it supports data loading, quality control, duplicate removal, and backup operations.
Data application layer: generates data sets and data products, and provides data services, visualization, query, and statistical analysis of the various types of data, as well as data flow management and process monitoring.
The system runs on the Windows platform. The software adopts the Win32 architecture and is written in standard C/C++. Visual Studio 2018 is the project development tool, and the database class libraries are those of Oracle 12c and MySQL 8.0.
The international buoy operational process mainly includes the following steps: downloading and aggregating multi-source data; loading the buoy classification profile, then storing and managing the data by class; loading the buoy standardization profile, then standardizing the data; loading the standardized data into the National Marine Environment Integrated Database; loading the data quality control profile, then performing quality control and duplicate removal; generating standard data sets and data products; and conducting data monitoring and data services. The process is shown in Fig.2.
Fig.4 Schematic diagram of data ranking and integration configuration
According to the data update frequency, data characteristics, and release types such as web page, FTP, APP, and email, real-time automatic download, periodic automatic download, and semi-automatic download modules are designed to automatically download, parse, and store each existing international buoy data source. Data tracking, statistical analysis, and exception monitoring modules are established to display, monitor, and log the data download status.
The main functions implemented in the acquisition module subsystem are as follows:
(a) Real-time automatic download module: For real-time and near-real-time data, multi-threaded download technology is used to process multiple sources in parallel at minute-level collection intervals, improving data collection speed.
(b) Periodic automatic download module: For delay-time releases, other historical data, and different data source directories, the download module is started manually or on a schedule to download historical data. Its download function reuses the real-time download function.
(c) Semi-automatic download module: To meet business needs, a customizable data download template can be used to load some files through semi-automatic or manual download and import.
(d) Process tracking module: It monitors the download module process and automatically restarts the download module promptly after an abnormal exit. The system ensures that the working environment is automatically checked and restored after a disconnection or system restart, guaranteeing 7×24 operation.
(e) Log statistical analysis module: It displays information including download time, number of downloaded pages, and download status as a list. The logs are saved in daily log files.
(f) Exception monitoring module: It allows individual web download processes to be cancelled manually by modifying configuration information.
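The interplay of parallel download, automatic retry, and download logging described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than the C/C++ used by the system, and it is not the system's implementation: the function names, the retry count, and the simulated fetcher (which stands in for real web or FTP access) are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(sources, fetch, max_workers=4, max_retries=2):
    """Fetch all sources in parallel and return one log entry per source.

    `sources` maps a source name to its address; `fetch` is any callable
    that returns bytes or raises on failure (hypothetical interface).
    """

    def run_one(name, url):
        last_error = ""
        # One initial attempt plus up to `max_retries` retries.
        for attempt in range(1, max_retries + 2):
            try:
                payload = fetch(url)
            except Exception as exc:
                last_error = str(exc)
                continue
            return {"source": name, "status": "ok",
                    "attempts": attempt, "bytes": len(payload)}
        return {"source": name, "status": "failed",
                "attempts": max_retries + 1, "error": last_error}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_one, n, u) for n, u in sources.items()]
        # Collect results in submission order so the log is deterministic.
        return [future.result() for future in futures]


def make_flaky_fetch(fail_first):
    """Build a simulated fetcher that fails the first N calls per address."""
    calls = {}

    def fetch(url):
        calls[url] = calls.get(url, 0) + 1
        if calls[url] <= fail_first.get(url, 0):
            raise ConnectionError("simulated disconnect")
        return b"buoy-data:" + url.encode()

    return fetch


if __name__ == "__main__":
    demo_sources = {"ndbc": "u1", "gdp": "u2", "iabp": "u3"}
    demo_fetch = make_flaky_fetch({"u2": 1, "u3": 99})
    for entry in download_all(demo_sources, demo_fetch):
        print(entry)
```

Keeping the fetcher as a parameter separates the collection rule (what to fetch and how) from the running instance (threads, retries, logging), so the same driver could serve web, FTP, or other sources.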
The system parses data files from various sources and, following the standard record format of the information, the pre-processing test methods and parameters, the database standards, and other normative standards, converts them into unified-format files that meet the subsequent file classification standards. The main functions realized by the subsystem are as follows:
(a) Parsing rule maintenance module: Reads and displays profile information through the interface. It supports defining, adding, deleting, and modifying data parsing rules and policies for various web forms.
(c) Configuration information management module: Adds, deletes, and modifies configuration information such as the standard format and the preliminary quality inspection parameters in the configuration interface according to actual needs. The updated configuration information is reloaded into the process.
(d) Pre-processing inspection module: Performs a pre-processing quality inspection of the original data entries in the data collection library, including a format check and a preliminary data range check.
(e) Code conversion module: Codes are introduced to unify and simplify the standard format, replacing some descriptive textual public information with codes to make the format more concise and clear.
(f) Standard format conversion module: According to the standardized format configuration file, the data format and element processing methods are read dynamically and passed in as parameters, and the system automatically parses the parameters to complete the format conversion of the file.
(g) Standardized processing monitoring module: Displays the operation status of each link item by item as a list, saves the recorded contents in daily log files, raises alarms for and handles abnormal conditions, and monitors the usage of the network, memory, and local file system.
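As an illustration of the pre-processing inspection and code conversion steps listed above, the following Python sketch checks that configured elements are numeric and within range, and replaces a descriptive platform name with a short code. The element names, range limits, and code table are hypothetical placeholders, not the parameters used by the system.

```python
# Hypothetical preliminary range limits per element (illustrative values only).
RANGE_LIMITS = {"sst": (-2.5, 40.0), "pressure": (850.0, 1090.0)}

# Hypothetical code table replacing descriptive text with short codes.
PLATFORM_CODES = {"drifting buoy": "DB", "moored buoy": "MB"}


def precheck(record, limits=RANGE_LIMITS):
    """Return a list of problems found by the format and range checks."""
    problems = []
    for name, (lo, hi) in limits.items():
        raw = record.get(name)
        if raw is None or raw == "":
            continue  # element not observed by this buoy; nothing to check
        try:
            value = float(raw)
        except (TypeError, ValueError):
            problems.append(f"{name}: not numeric")
            continue
        if not lo <= value <= hi:
            problems.append(f"{name}: out of range")
    return problems


def to_standard(record, codes=PLATFORM_CODES):
    """Return a copy of the record with the platform text replaced by a code."""
    out = dict(record)
    text = str(record.get("platform", "")).strip().lower()
    out["platform"] = codes.get(text, "UNK")  # "UNK" marks an unmapped value
    return out


if __name__ == "__main__":
    raw = {"platform": "Moored Buoy", "sst": "99.9", "pressure": "abc"}
    print(precheck(raw))
    print(to_standard(raw))
```

Because the limits and code table are plain data, they could be edited in a configuration interface and reloaded without changing the checking logic.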
The database of this system is built based on the National Marine Environment Integrated Database to realize the data loading function, providing data ranking, database output interface rules and statistical analysis methods, standard file generation, and database synchronization.
(a) Quality control configuration management module: Provides functions for adding, deleting, and modifying the configuration of quality control elements, methods, and parameters in the management interface, allowing quality control methods to be selected freely.
(b) Automatic quality control module: Based on the configuration management module and the interface of each quality control method module, an automated quality control system is built with the following capabilities: freely selecting QC modules for quality control; easily adding new QC modules; automatically logging the QC process and recording it in the database; setting background parameter sets; setting QC parameters flexibly; and designating buoys or profiles for manual QC. Each quality control method is implemented as an independently developed, independently operable, and reusable module, and multiple quality control method modules can be combined for the quality control of different data and elements.
(c) Visual human-computer interaction audit module: A database-based visual human-computer interaction audit module is developed, combined with a visual display for browsing, auditing, and modifying data. Its functions include: linked graph, text, and table views, so that when data or quality control flags are modified the corresponding graph, text, and table are updated simultaneously; flexible retrieval of the data to be viewed and reviewed, with the latest data shown by default; automatic logging of change records and their preservation in the database; display of all profiles of the same buoy; display of observation profiles in the vicinity of a profile for comparison; saving or undoing changed records; drawing of station, track, profile, and waterfall charts; saving of graphics; and logging of audit operations.
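The idea of independent, reusable QC method modules that can be freely selected and combined through configuration can be sketched as a small registry. The sketch below is in Python and is only illustrative: the two checks, their parameters, and the flag values (1 for good, 4 for bad) are assumptions, not the methods or flag scheme used by NMDIS.

```python
QC_REGISTRY = {}  # maps a QC method name to its function

GOOD, BAD = 1, 4  # hypothetical flag values used only in this sketch


def qc_module(name):
    """Decorator that registers a QC method under a configurable name."""
    def register(func):
        QC_REGISTRY[name] = func
        return func
    return register


@qc_module("range")
def range_check(values, lo, hi):
    """Flag values outside the closed interval [lo, hi]."""
    return [GOOD if lo <= v <= hi else BAD for v in values]


@qc_module("spike")
def spike_check(values, max_jump):
    """Flag interior points that differ from both neighbours by > max_jump."""
    flags = [GOOD] * len(values)
    for i in range(1, len(values) - 1):
        if (abs(values[i] - values[i - 1]) > max_jump
                and abs(values[i] - values[i + 1]) > max_jump):
            flags[i] = BAD
    return flags


def run_qc(values, plan):
    """Apply the configured methods in order; keep the worst flag per point.

    `plan` is a list of (method name, parameter dict) pairs. Returns the
    combined flags and a log of (method name, number of flagged points).
    """
    flags = [GOOD] * len(values)
    log = []
    for name, params in plan:
        result = QC_REGISTRY[name](values, **params)
        flags = [max(a, b) for a, b in zip(flags, result)]
        log.append((name, sum(1 for f in result if f != GOOD)))
    return flags, log


if __name__ == "__main__":
    sst = [18.1, 18.2, 25.0, 18.3, 18.4, 99.0]
    plan = [("range", {"lo": -2.5, "hi": 40.0}), ("spike", {"max_jump": 3.0})]
    print(run_qc(sst, plan))
```

Adding a new QC method in this design only requires writing one decorated function; the plan that selects and parameterizes the methods stays in configuration.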
A database-based data deduplication subsystem is developed. Using both data and metadata information, the data are ranked according to the ranking rules, and single-source and multi-source integration is carried out on the ranked data. The data from each source are processed into unified variable names and a unified format to form the basic data set of that source. Once the basic data sets are complete, their use is no longer affected by differences in source format or elements, and users in all fields of hydrology and meteorology can easily retrieve multi-source information to produce multi-source integrated data, hydrological and meteorological element data sets, comprehensive data sets, statistical analyses, and other data products.
The module can display duplicated data for manual judgment, and the floating data can be selectively displayed so that the operator can decide which data to keep and update the ranking results in the base database. The data ranking configuration specifies the type of data to be ranked and the data ranking fields. The ranked data are marked with their source, so that they can be filtered manually at a later stage to determine which data items to retain.
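A minimal Python sketch of configurable ranking and multi-source integration follows. It assumes that "ranking" here means grouping records that share the configured ranking fields and keeping one record per group; the field names, the source names, and the source priority order are hypothetical and chosen only for illustration. The records that are not kept are returned separately so that they remain available for manual judgment.

```python
# Hypothetical source priority: a lower number is preferred when records clash.
SOURCE_PRIORITY = {"GDAC": 0, "NDBC": 1, "GTS": 2}


def rank_and_integrate(records, key_fields=("buoy_id", "time"),
                       priority=SOURCE_PRIORITY):
    """Group records by the ranking fields and keep one record per group.

    Returns (kept, duplicates). Every record keeps its "source" field, so
    the origin of both kept and set-aside records stays visible.
    """
    groups = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        groups.setdefault(key, []).append(rec)

    kept, duplicates = [], []
    for group in groups.values():
        # Unknown sources sort after all known ones.
        ordered = sorted(group,
                         key=lambda r: priority.get(r["source"], len(priority)))
        kept.append(ordered[0])
        duplicates.extend(ordered[1:])
    return kept, duplicates


if __name__ == "__main__":
    demo = [
        {"buoy_id": "41001", "time": "2023-01-01T00:00Z", "source": "GTS", "sst": 18.2},
        {"buoy_id": "41001", "time": "2023-01-01T00:00Z", "source": "NDBC", "sst": 18.21},
        {"buoy_id": "41001", "time": "2023-01-01T01:00Z", "source": "GTS", "sst": 18.3},
    ]
    kept, duplicates = rank_and_integrate(demo)
    print(kept)
    print(duplicates)
```

Changing `key_fields` corresponds to changing the data ranking fields in the configuration, for example adding position fields for drifting buoys.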
The ranked and integrated files pass through the data output subsystem, which generates two types of standard format. One is an easy-to-read ASCII file, a common buoy information storage format customized by NMDIS; its structure is simple and clear, which makes it suitable for sharing in national and international releases. The other is a full-element NetCDF file designed for NMDIS and stored in binary form. Its observation elements include global, variance, hydrological, meteorological, biochemical, and other types, totaling more than 200 elements and 140 kinds of metadata information. This format addresses the inconsistent data elements of multi-source, multi-type, and multi-sensor observations as well as the large storage space occupied by the data.
The system connects seamlessly with the National Marine Environment Integrated Database, which has already integrated domestic observation, monitoring, and special data. International buoy data is an inseparable and important part of operational observation, and the result data are loaded into the National Marine Environment Integrated Database to complete the connection with the platform and to operationalize the centralized application service of the data.
(a) Data query and display module: A data query and display module is built on the international buoy comprehensive database and the GIS platform. It makes it easy to query data, locate buoys, and display buoy trajectories in the main interface of the system, and includes data query, metadata statistics and extraction, buoy positioning, drifting buoy trajectory tracking, and other functions.
(b) Data statistics module: Based on data query and display, statistical analysis is conducted on the data products of international buoys. The module forms graphical analysis products and GIS-related analysis models, and can export the results on demand. According to element type, spatial and temporal distribution, and so on, it carries out annual and daily, annual and monthly, cumulative annual and daily, and cumulative annual and monthly statistical analyses, and outputs the statistical analysis data sets and graphic visualizations as required. It can also assist in analyzing statistical products along different dimensions through GIS overlay analysis, spatial analysis, and isosurface analysis.
(c) Data monitoring module: The module calls the international buoy table structures of the National Marine Environment Integrated Database. Based on the basic database and the GIS platform, it uses station distribution maps, time series maps, spatial distribution maps, and data tables to monitor and visualize the downloaded real-time data promptly, and uses configured ranges, extreme values, and other parameters to visualize warnings of abnormal data and to compare them with normal data.
The system tackles the collection and processing of multi-source heterogeneous data based on a cloud architecture. To handle the multi-source nature of international buoy data, a multi-source data collection engine is developed that overcomes the shortcomings of existing technologies and provides a multi-tasking, multi-threaded data collection method based on the cloud platform. It can precisely locate data sources such as web pages, FTP, email, and mobile APP terminals, and addresses real-time ocean data collection, quality control, integration, and standardization under complex conditions. By adopting a flexible configuration policy and a multi-tasking mode, it separates collection rules from running instances and automates data collection, which removes the need for personnel on duty and fits business scenarios flexibly. It also implements HTTP sniffing technology and breakpoint mining technology for dynamic and encrypted web pages without human involvement, laying a good foundation for machine crawling. In addition, the data collection engine supports distributed cloud deployment, load balancing, and data shunting to cope with the huge volume of real-time and delay-time ocean data, effectively reducing the bottleneck of server storage and computing. At the same time, because the output data sets come from multiple heterogeneous sources, the system supports a variety of data processing rules. It applies these rules to the collected data to perform data pre-processing and data quality control, achieving scalability and personalized configurability, and outputs multiple data formats on demand.
A machine learning-based method for fast data ranking and integration was further developed. The traditional full-volume cyclic data ranking model is time-consuming and computationally inefficient; precise ranking has difficulty detecting small spatiotemporal differences and data errors, while ranking with a wide threshold range has a high false detection rate. Based on Bloom Filter technology, we establish a full-ranking model, a micro-ranking model, and a small spatiotemporal model to perform full ranking and composite ranking on optional fields for massive data, removing redundancy and multi-level repetition to ensure data uniqueness and validity. Meanwhile, for metadata, an "entity-relationship-attribute" data chain is established through machine learning and sample training based on NLP knowledge mapping technology, and metadata elements can be extracted from structured, semi-structured, and unstructured data through knowledge extraction. Through knowledge fusion, an information tree is formed by eliminating the ambiguity between entities, relationships, attributes, and other denotative terms and the factual objects they refer to. The shortest path or full path between multiple entities is calculated through association analysis queries, association path queries, and index calculation queries, and the association rate of target entities, the touch black rate, and other indexes are calculated through appropriate reasoning and judgment methods to find the relationships between entities and then complete the integration of the data. In practice, when facing billions of records, big data technology efficiently improves data detection, integration, and analysis capabilities, promotes the establishment and improvement of the data model and the business model, and improves work efficiency and decision support capabilities.
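To make the role of the Bloom Filter concrete, the following Python sketch shows a generic Bloom filter used as a fast first pass over record keys. It is a textbook construction, not the full-ranking, micro-ranking, or small spatiotemporal models described above; the bit-array size, the number of hash functions, and the key fields are assumptions. A Bloom filter never misses a key that was added but can report false positives, so records it flags are treated here only as candidates that still need an exact comparison.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over string keys (illustrative parameters)."""

    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def split_duplicates(records, fields):
    """Split records into first-seen records and duplicate candidates.

    The key is built from the chosen fields, so changing `fields`
    corresponds to ranking on a different combination of fields.
    """
    seen = BloomFilter()
    unique, candidates = [], []
    for rec in records:
        key = "|".join(str(rec[f]) for f in fields)
        if key in seen:
            candidates.append(rec)  # possible duplicate; confirm exactly later
        else:
            seen.add(key)
            unique.append(rec)
    return unique, candidates


if __name__ == "__main__":
    demo = [
        {"buoy_id": "41001", "time": "t0", "source": "GTS"},
        {"buoy_id": "41001", "time": "t1", "source": "GTS"},
        {"buoy_id": "41001", "time": "t0", "source": "NDBC"},
    ]
    unique, candidates = split_duplicates(demo, ("buoy_id", "time"))
    print(len(unique), len(candidates))
```

The advantage over a full pairwise comparison is that each record is checked in constant time against a compact bit array, so only the small set of candidates needs the expensive exact comparison.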
An operational processing system for international buoy data is developed in this study. The system runs stably at NMDIS, realizes real-time collection and automatic processing of international buoy data, is complete and practical, and has achieved good results. The system adopts a modular structure. The data acquisition, quality control, and ranking modules are highly customizable and can be configured with templates according to data characteristics and user needs, which makes the system compatible with future data. At the same time, it makes effective use of the resources of the National Marine Environment Integrated Database, realizes the operational flow of international buoy data, and operationalizes the centralized application service of the data. The overall architecture, business model, and data architecture of the system are reasonably designed and easy to expand and improve, and the system is stable and scalable, with broad application prospects.
Marine Science Bulletin, 2023, Issue 1