This README file (version 1) was generated on 2026-05-19 by the dataset authors.

## Version history

- Version 1: Initial release of the dataset.


## Citation

If you use this dataset, please cite both the dataset and the associated scientific article as follows:

### Dataset citation

APA:  
Martín-Fraile, J. V., Basurto, N., Sierra-García, J. E., & Herrero, Á. (2026). *DAYPSCI v1: An Event-Based Dataset of Fault Injection Scenarios in PLC-Controlled Industrial Cyber-Physical Systems (Version 1)* [Dataset]. Universidad de Burgos. https://doi.org/XXXXXXXXX (to be assigned)

BibTeX:
@dataset{DAYPSCI_v1,
  author = {Martín-Fraile, Juan Vicente and Basurto, Nuño and Sierra-García, Jesús Enrique and Herrero, Álvaro},
  title = {DAYPSCI v1: An Event-Based Dataset of Fault Injection Scenarios in PLC-Controlled Industrial Cyber-Physical Systems},
  year = {2026},
  publisher = {Universidad de Burgos},
  version = {1},
  doi = {XXXXXXXXX (to be assigned)},
  note = {[Dataset]}
}


GENERAL INFORMATION
-------------------
1. Title of dataset: DAYPSCI v1: An Event-Based Dataset of Fault Injection Scenarios in PLC-Controlled Industrial Cyber-Physical Systems

2. Authorship

Name: Juan Vicente Martín-Fraile
Institution: Grupo de Investigación en Automatización, Robótica, Control y Optimización (ARCO), Departamento de Digitalización, Escuela Politécnica Superior, Universidad de Burgos, Av. Cantabria s/n, 09006 Burgos, Spain.
Email: jvmartin@ubu.es
ORCID: https://orcid.org/0009-0009-2099-7251

Name: Nuño Basurto
Institution: Grupo de Inteligencia Computacional Aplicada (GICAP), Departamento de Digitalización, Escuela Politécnica Superior, Universidad de Burgos, Av. Cantabria s/n, 09006 Burgos, Spain.
Email: nbasurto@ubu.es
ORCID: https://orcid.org/0000-0001-7289-4689

Name: Jesús Enrique Sierra-García
Institution: Grupo de Investigación en Automatización, Robótica, Control y Optimización (ARCO), Departamento de Digitalización, Escuela Politécnica Superior, Universidad de Burgos, Av. Cantabria s/n, 09006 Burgos, Spain.
Email: jesierra@ubu.es
ORCID: https://orcid.org/0000-0001-6088-9954

Name: Álvaro Herrero
Institution: Grupo de Inteligencia Computacional Aplicada (GICAP), Departamento de Digitalización, Escuela Politécnica Superior, Universidad de Burgos, Av. Cantabria s/n, 09006 Burgos, Spain.
Email: ahcosio@ubu.es
ORCID: https://orcid.org/0000-0002-2444-5384

DESCRIPTION
-----------
1. Dataset language
English

2. Abstract

The dataset contains time-series data collected from an industrial cyber-physical system (CPS) based on a PLC-controlled part marking station using Siemens S7-1200 and S7-1500 devices. Data acquisition follows an event-based logging approach, where changes in system variables are recorded together with their associated duration (Δt), enabling precise temporal characterization while reducing redundancy.
In addition to temporal information, the dataset explicitly represents scan-level execution by incorporating identifiers of PLC scan cycles (scan_id) and the relative order of events within each scan (event_order). This allows accurate representation of multiple events occurring within the same control cycle and preserves the logical execution order of the system.
The dataset includes both normal operation and fault conditions generated through controlled fault injection, specifically targeting sensors and actuators (e.g., solenoid valves). Ground truth labels are derived from the experimental configuration provided to the control system and embedded during data acquisition, ensuring consistency between system behavior and annotation.
Data are organized into independent experimental batches, each corresponding to a specific operating condition. Each batch includes processed event-based data (CSV), raw network traffic captures in PCAPNG format, including industrial PROFINET communication traffic, and supporting documentation, enabling traceability and reproducibility.
The dataset is designed to support the development, training, and evaluation of machine learning models for anomaly detection, fault classification, and industrial cybersecurity applications, while also enabling detailed temporal and logical analysis of discrete-event industrial processes.

3. Keywords
event-based logging, digital twin, ground truth labeling, discrete-event systems, PLC scan cycles, industrial cybersecurity, fault injection, sensor and actuator faults

4. Date of data collection
May 2026

5. Date of dataset publication
May 2026

6. Funding
This work was funded by the AI4SECIoT project ("Artificial Intelligence for Securing IoT Devices", C032.23), financed by the National Cybersecurity Institute (INCIBE) and derived from a collaboration agreement signed between INCIBE and the University of Burgos. This initiative is carried out within the framework of the Recovery, Transformation and Resilience Plan funds, financed by the European Union (Next Generation), the project of the Government of Spain that outlines the roadmap for the modernization of the Spanish economy, the recovery of economic growth and job creation, for solid, inclusive and resilient economic reconstruction after the COVID-19 crisis, and to respond to the challenges of the next decade.

7. Geographic location/s of data collection
Universidad de Burgos, Av. Cantabria s/n, 09006 Burgos, Spain


ACCESS INFORMATION
------------------
1. Dataset Creative Commons License
Creative Commons Attribution 4.0 International (CC BY 4.0)


2. Dataset DOI
(to be assigned)

3. Related publication
Martín-Fraile, J. V., Basurto, N., Sierra-García, J. E., & Herrero, Á. (2026). *DAYPSCI v1: Event-Based Dataset for Anomaly Detection through Fault Injection in PLC-Controlled Industrial Cyber-Physical Systems*. (under preparation)


METHODOLOGICAL INFORMATION
--------------------------
The dataset has been generated using an industrial cyber-physical system (CPS) based on a PLC-controlled part marking station implemented with Siemens S7-1200 and S7-1500 devices. The system integrates both real industrial hardware and virtual simulation components through digital twin concepts, including PCSIMU and other simulation tools, enabling the execution of controlled and repeatable experimental scenarios.

Data acquisition follows an event-based logging approach, where changes in system variables (sensor states and actuator control signals) are recorded instead of using fixed sampling rates. Each event is associated with its corresponding duration (Δt), enabling precise temporal characterization of the industrial process while reducing data redundancy and preserving the dynamic behavior of discrete-event systems.

In addition to temporal information, the dataset incorporates scan-level identifiers (scan_id) and the relative order of events within each scan (event_order). This allows an explicit representation of events that occur within the same PLC scan cycle, preserving the logical execution order of the control system and resolving ambiguities associated with simultaneous events (Δt = 0).
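
As a minimal sketch of how `scan_id` and `event_order` can be used together, the Python snippet below sorts a few invented records into logical execution order. It assumes that `scan_id` increases monotonically within a file (no counter wrap-around); this is not stated in this README and should be checked against the data.

```python
# Illustrative records only: field names follow this README, values are invented.
events = [
    {"scan_id": 120, "event_order": 2, "event_id": 7, "delta_time_ms": 0},
    {"scan_id": 119, "event_order": 1, "event_id": 3, "delta_time_ms": 48},
    {"scan_id": 120, "event_order": 1, "event_id": 5, "delta_time_ms": 12},
]


def logical_order(rows):
    """Sort rows by PLC scan cycle, then by position inside that cycle."""
    return sorted(rows, key=lambda r: (r["scan_id"], r["event_order"]))


ordered = logical_order(events)
print([r["event_id"] for r in ordered])  # [3, 5, 7]
```

The two events in scan 120 are separated by Δt = 0, so only `event_order` distinguishes them.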

The experimental campaign includes both normal operating conditions and anomaly scenarios generated through controlled fault injection. Specifically, two main types of anomalies are considered:
- Sensor anomalies (e.g., incorrect or missing signals)
- Solenoid valve anomalies (e.g., malfunctioning solenoid valves)

Data collection follows a structured methodology aligned with industrial operation modes. Baseline (normal operation) cycles are first recorded to establish reference behavior. Subsequently, anomalies are introduced while monitoring system transitions according to GEMMA states (e.g., transitions between automatic operation, fault states, and recovery states such as F1, D2, A5, A6, and A1), capturing fault occurrence, system recovery, and return to normal operation.

To ensure consistency and reliability of annotations, ground truth labels are derived from the experimental configuration provided to the industrial control system. The control system receives this configuration from a higher-level planning layer and embeds the corresponding contextual information during data acquisition.
This approach enables a clear separation between the CPS and the experimental labeling layer, while maintaining consistency between system behavior and annotation and avoiding interference with the industrial process.

The dataset is organized into sequences corresponding to individual processed parts. Each sequence consists of ordered records including binary states of sensors and solenoid valve control signals, system state (GEMMA mode), event-based timing (Δt), and scan-level execution information. This structure enables direct applicability to machine learning tasks such as anomaly detection, fault classification, and temporal pattern recognition in industrial environments.

This methodological design ensures reproducibility of experimental scenarios and provides a reliable benchmark for evaluating machine learning approaches in industrial anomaly detection and cybersecurity, while capturing both temporal and logical characteristics of industrial processes.


FILE INFORMATION
----------------
The dataset is organized as a collection of independent experimental files, each corresponding to a specific operating condition or fault scenario (see FILE NAMING AND ORGANIZATION CONVENTION).

Each file represents a complete data acquisition under controlled and consistent experimental conditions, ensuring comparability across different scenarios.

All processed data files are provided in CSV format (.csv).

Each sequence represents one processed part in the system and follows an event-based data structure.

Within each file, the dataset is structured as follows:

- Each row represents a system event (i.e., a change in one or more variables).
- Each sequence (part) is composed of an ordered set of events.
- Sequences are identified using the field `part_id`.
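
The part-level segmentation described above could be reproduced along the following lines. This is a sketch using only the Python standard library. It assumes the CSV files are comma-delimited and that the header names match the field names listed in this README. The sample content is invented and much narrower than a real batch file.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-in for a batch CSV; values are invented for illustration.
SAMPLE_CSV = """part_id,event_seq,event_id,delta_time_ms
1,1,3,0
1,2,5,40
2,1,3,0
2,2,5,42
2,3,8,15
"""


def split_by_part(csv_file):
    """Group event rows into per-part sequences, ordered by event_seq."""
    sequences = defaultdict(list)
    for row in csv.DictReader(csv_file):
        sequences[row["part_id"]].append(row)
    for rows in sequences.values():
        rows.sort(key=lambda r: int(r["event_seq"]))
    return dict(sequences)


parts = split_by_part(io.StringIO(SAMPLE_CSV))
print({part_id: len(rows) for part_id, rows in parts.items()})  # {'1': 2, '2': 3}
```

For a real file, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="")`.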

Each row includes the following fields:

Event-related variables:

- event_seq: Sequential identifier of the event within each part.
- ts_plc: Timestamp generated by the PLC.
- delta_time_ms: Time difference in milliseconds (Δt) since the previous event.
- scan_id: Identifier of the PLC scan cycle in which the event occurred.
- event_order: Relative order of the event within the same scan cycle.
- event_id: Identifier of the event type.
- event_type: Category of the event.
- edge: Type of signal transition (e.g., rising or falling edge).

Process variables:

- gemma: System state based on the GEMMA reference model, using a numerical encoding defined for this dataset (e.g., 1 = automatic operation, 5 = preparation, 7 = fault).
- part_id: Identifier of the processed part (sequence ID).

Labeling fields (high-level):

- anomaly:
    - 0 → no anomaly
    - 1 → sensor anomaly
    - 2 → solenoid valve anomaly
    - 3 → combined anomaly (sensor + solenoid valve anomaly)
    - 4 → propagated anomaly (effect of anomalies occurring in other parts or prior system states)

- fault:
    - 0 → no fault
    - 1 → fault occurred
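
For readability in reports or plots, the integer codes above can be mapped back to text. The sketch below copies the code tables from this README. The function name and the example row are hypothetical, and the values are read as strings, as a CSV reader would return them.

```python
# Code tables copied from the label definitions in this README.
ANOMALY_LABELS = {
    0: "no anomaly",
    1: "sensor anomaly",
    2: "solenoid valve anomaly",
    3: "combined anomaly",
    4: "propagated anomaly",
}
FAULT_LABELS = {0: "no fault", 1: "fault occurred"}


def describe_labels(row):
    """Translate the integer label codes of one row into readable text."""
    anomaly = ANOMALY_LABELS[int(row["anomaly"])]
    fault = FAULT_LABELS[int(row["fault"])]
    return f"{anomaly} / {fault}"


print(describe_labels({"anomaly": "2", "fault": "0"}))
# solenoid valve anomaly / no fault
```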

Labeling fields (detailed fault descriptors):
These variables provide fine-grained annotations describing the affected component and fault type, complementing the high-level anomaly labels.

- FSx, TFSx: Indicators of injected faults in sensor x and the corresponding fault type. FSx identifies the affected sensor, while TFSx specifies the type of fault applied.

- FCVx, TFCVx: Indicators of injected faults in actuator x (e.g., solenoid valves) and the corresponding fault type. FCVx identifies the affected actuator, while TFCVx specifies the type of fault applied.

Discrete state variables:

- a0, a1, b0, b1, c0, c1, B_1, B_2: Binary internal state indicators representing positions and logical states of the system components.

Actuator control signals:

- YA_p, YA_m, YB_p, YB_m, YC_p, YC_m: Control signals sent to the solenoid valves that govern the three actuators of the system (loading, marking, and ejection stages). These signals determine the extension and retraction of the corresponding cylinders.
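
Many learning pipelines need each event as a fixed-length numeric vector. The following sketch builds one from the discrete state variables and actuator control signals listed above. It assumes these columns are stored as 0/1 values; the example row is invented.

```python
# Column names as listed in this README.
STATE_COLUMNS = ["a0", "a1", "b0", "b1", "c0", "c1", "B_1", "B_2"]
VALVE_COLUMNS = ["YA_p", "YA_m", "YB_p", "YB_m", "YC_p", "YC_m"]


def binary_vector(row):
    """Return the 14 binary signals of one event row as a list of ints."""
    return [int(row[name]) for name in STATE_COLUMNS + VALVE_COLUMNS]


# Invented example row (all signals off except a0 and YA_p).
example_row = {name: "0" for name in STATE_COLUMNS + VALVE_COLUMNS}
example_row["a0"] = "1"
example_row["YA_p"] = "1"

vector = binary_vector(example_row)
print(len(vector), sum(vector))  # 14 2
```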

Experiment metadata:

- batch_id: Identifier of the experimental run (linked to file-level scenario).
- experiment_id: Identifier of the experimental condition or fault injection scenario.

Notes:

- The dataset follows an event-based logging strategy; therefore, the number of rows per part varies depending on system dynamics.
- Simultaneous events within the same PLC scan cycle may present Δt = 0; in such cases, `scan_id` and `event_order` allow reconstruction of the logical execution order.
- The "anomaly" field describes the type of anomaly intentionally injected into the system, while the "fault" field indicates whether the injected anomaly results in an actual system failure.
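
Because each row stores a duration rather than an absolute time, elapsed time within a part has to be accumulated from `delta_time_ms`. The sketch below does this for invented values. This README does not specify what the Δt of the first event of a part refers to, so the sketch ignores that first value and measures time from the first event onward; adjust this if the data show otherwise.

```python
def elapsed_since_first_event(deltas_ms):
    """Cumulative time (ms) of each event relative to the first event."""
    elapsed = []
    total = 0
    for index, delta in enumerate(deltas_ms):
        if index > 0:  # the first delta refers to something before this part
            total += delta
        elapsed.append(total)
    return elapsed


# Invented Δt values for one part, already in logical order.
deltas = [30, 40, 0, 15, 120]
print(elapsed_since_first_event(deltas))  # [0, 40, 40, 55, 175]
```

The second and third events share the same elapsed time (Δt = 0), which is the case where `scan_id` and `event_order` are needed to keep them in order.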

Experimental design considerations:

- Each file corresponds to a single controlled experiment (baseline or specific fault injection scenario).
- All experiments are conducted under equivalent system conditions (hardware, communication, and control logic), enabling direct comparison of temporal features such as Δt.
- Timestamps are generated by the PLC, while external data sources (e.g., network captures) may rely on independent time references.


This dataset structure enables:

- Sequence-based machine learning modeling using part-level segmentation
- Temporal analysis of industrial processes through Δt
- Logical reconstruction of control behavior using scan-level information
- Reliable benchmarking of anomaly detection and fault classification methods in industrial cyber-physical systems
- Reproducible experiments through explicit linkage between data, metadata, and controlled experimental scenarios

This structured representation enables both temporal and logical analysis of industrial processes, supporting advanced machine learning techniques that leverage event ordering and scan-level synchronization.


FILE NAMING AND ORGANIZATION CONVENTION
---------------------------------------
The dataset follows a structured file naming convention designed to ensure consistency, reproducibility, and ease of use in both manual inspection and automated processing.

Naming convention:

    pn_<condition>_<variant>_batchXX.<format>

Where:

- pn → dataset identifier associated with the part marking system and PROFINET-based industrial communication
- condition → general operating condition:
    - baseline → normal operation without injected anomalies
    - sensor → sensor anomaly scenarios
    - valve → solenoid valve anomaly scenarios
    - combined → combined anomaly scenarios (sensor + solenoid valve)

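- variant → identifier of the specific fault injection configuration (e.g., T1, T2); omitted for baseline files
- batchXX → unique identifier of the experimental run (two digits, e.g., batch01)
- <format> → file extension (csv, pcapng, or txt)

The per-batch description files (TXT) use the prefix readme instead of pn.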

Examples:

Baseline:

    pn_baseline_batch01.csv
    pn_baseline_batch01.pcapng
    readme_baseline_batch01.txt

Sensor anomaly scenarios:

    pn_sensor_T1_batch02.csv
    pn_sensor_T1_batch02.pcapng
    readme_sensor_T1_batch02.txt

    pn_sensor_T2_batch03.csv
    pn_sensor_T3_batch04.csv
    pn_sensor_T4_batch05.csv
    pn_sensor_T5_batch06.csv
    pn_sensor_T6_batch07.csv

Solenoid valve anomaly scenarios:

    pn_valve_T1_batch08.csv
    pn_valve_T1_batch08.pcapng
    readme_valve_T1_batch08.txt

    pn_valve_T2_batch09.csv
    pn_valve_T3_batch10.csv

Combined anomaly scenarios:

    pn_combined_T1_batch11.csv
    pn_combined_T1_batch11.pcapng
    readme_combined_T1_batch11.txt
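
For automated processing, the file names above can be parsed with a regular expression. The pattern below is inferred from the convention and examples in this README and is not an official specification. It treats the variant as optional (baseline files have none) and accepts the `readme` prefix used by the TXT files. The function name is hypothetical.

```python
import re

# Pattern inferred from the naming convention and examples in this README.
NAME_RE = re.compile(
    r"^(?P<prefix>pn|readme)_"
    r"(?P<condition>baseline|sensor|valve|combined)"
    r"(?:_(?P<variant>T\d+))?"       # optional: baseline files have no variant
    r"_batch(?P<batch>\d{2})"
    r"\.(?P<ext>csv|pcapng|txt)$"
)


def parse_name(filename):
    """Split a dataset file name into its components."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    info = match.groupdict()
    info["batch"] = int(info["batch"])
    return info


print(parse_name("pn_sensor_T1_batch02.csv"))
print(parse_name("pn_baseline_batch01.pcapng")["variant"])  # None
```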


EXPERIMENT ORGANIZATION
-----------------------
Each batch represents a complete and independent experimental scenario.

The dataset is organized as:

- Batch 01:
    - Baseline (normal operation without injected anomalies)

- Batches 02–07:
    - Sensor anomaly scenarios (T1–T6)

- Batches 08–10:
    - Solenoid valve anomaly scenarios (T1–T3)

- Batch 11:
    - Combined anomaly scenario (sensor + solenoid valve)


Each batch contains:

- CSV file → processed event-based dataset (features, labels, Δt, and scan-level information)
- PCAPNG file → raw network traffic capture
- TXT file → description of the experimental setup and anomaly configuration
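
The batch layout above can be written down as a small lookup table, for example to check that a download is complete. The sketch below encodes the batch-to-scenario mapping listed in this section and derives the three expected file names per batch from the naming convention. The helper name is hypothetical.

```python
# Batch layout as listed in this README: batch number -> (condition, variant).
BATCHES = {1: ("baseline", None)}
BATCHES.update({n: ("sensor", f"T{n - 1}") for n in range(2, 8)})   # 02-07
BATCHES.update({n: ("valve", f"T{n - 7}") for n in range(8, 11)})   # 08-10
BATCHES[11] = ("combined", "T1")


def expected_files(batch):
    """Return the CSV, PCAPNG, and TXT file names expected for one batch."""
    condition, variant = BATCHES[batch]
    stem = condition if variant is None else f"{condition}_{variant}"
    stem = f"{stem}_batch{batch:02d}"
    return [f"pn_{stem}.csv", f"pn_{stem}.pcapng", f"readme_{stem}.txt"]


print(expected_files(1))
print(expected_files(9))
```

Comparing the output of `expected_files` with a directory listing gives a quick completeness check before any analysis is run.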


DESIGN PRINCIPLES
-----------------
The dataset design follows the principle:

    One file = one experiment = one controlled fault injection scenario


METHODOLOGICAL ADVANTAGES
-------------------------
This organization provides several key advantages:

1. Logical consistency

All file names follow a consistent semantic structure:

    domain → condition → variant → batch

This facilitates:
- automated dataset parsing
- efficient filtering and selection of experimental scenarios
- reproducible experimental workflows

2. Scalability

The naming convention is extensible to future scenarios without introducing structural inconsistencies, allowing the dataset to grow while maintaining a coherent structure.

3. Metadata decoupling

File names provide a human-readable description of the experiment, while the dataset itself includes explicit metadata fields:

- batch_id
- experiment_id

This ensures that data processing and analysis pipelines do not depend on file names, improving robustness and reproducibility.
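
A pipeline that relies on the embedded metadata rather than on file names might start with a check such as the one below. It is a sketch that verifies all rows loaded from one file carry the same `batch_id` and `experiment_id`. The value formats shown are invented, since this README does not specify how these identifiers are encoded.

```python
def file_metadata(rows):
    """Return the single (batch_id, experiment_id) pair shared by all rows."""
    pairs = {(row["batch_id"], row["experiment_id"]) for row in rows}
    if len(pairs) != 1:
        raise ValueError(f"expected one experiment per file, found {sorted(pairs)}")
    return pairs.pop()


# Invented rows; the identifier formats are assumptions for illustration only.
rows = [
    {"batch_id": "2", "experiment_id": "sensor_T1", "event_id": "3"},
    {"batch_id": "2", "experiment_id": "sensor_T1", "event_id": "5"},
]
print(file_metadata(rows))  # ('2', 'sensor_T1')
```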

4. Temporal comparability

All batches are generated under consistent experimental conditions:

- same industrial setup
- same communication configuration
- same GEMMA-based control logic

Therefore, temporal features such as Δt are directly comparable across files.

Any variations in:

- event timing
- sequence structure
- system behavior

can be attributed to controlled fault injections rather than experimental variability.
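
One simple way to exploit this comparability is to score timing values from a fault batch against statistics of the baseline batch. The sketch below uses a z-score on per-part durations. The numbers and the threshold of 3 are invented for illustration, and this is not presented as the method used by the dataset authors.

```python
from statistics import mean, stdev

# Invented per-part durations in milliseconds (e.g., sums of delta_time_ms).
baseline_durations_ms = [1510, 1495, 1502, 1508, 1499]
candidate_durations_ms = [1504, 2310, 1497]

baseline_mean = mean(baseline_durations_ms)
baseline_std = stdev(baseline_durations_ms)


def z_score(duration_ms):
    """Distance of a duration from the baseline mean, in standard deviations."""
    return (duration_ms - baseline_mean) / baseline_std


# Flag parts whose duration deviates strongly from baseline behavior.
flags = [abs(z_score(d)) > 3 for d in candidate_durations_ms]
print(flags)  # [False, True, False]
```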

5. Benchmark suitability

The dataset structure supports rigorous benchmarking of machine learning models under controlled industrial anomaly scenarios, including:

- baseline vs anomaly comparison
- comparison across anomaly types (sensor vs solenoid valve)
- temporal modeling and sequence-based learning