
What is Energy Data, and how do we handle it?

Introduction 

Being part of a research project on the Decentralisation and Digitalisation of Energy necessarily brings you into contact with various forms of data. Smart meters produce data about the energy consumption of individual households, digital meters at transformer stations track electricity load flows, and prediction models use past data to forecast future energy production, which is then fed into algorithms that steer central grid control systems. We encounter data that is handled and interpreted by humans, for example in a distribution system operator’s (DSO) control room, as well as data processed entirely automatically by algorithms and machines, particularly in energy trading.

All of these different data types are connected through what is commonly referred to as the “smart grid.” Many in industry and politics imagine the smart grid as the solution to the coordination problem of the twin transition of decarbonization and digitalization (e.g. Fraunhofer ESK/accenture consulting 2017). The consensus holds that for the smart grid to work, it needs consistent data flows. Yet this raises important questions: What kinds of data are we actually referring to? Do different disciplines mean the same thing when they speak of “energy data”? And even if we use the same term, do we interpret and value these data in the same way?

In this blog post, we are not trying to define ontologically what data is; rather, we ask empirically: How do we analyze, perceive, and assess energy data within our respective research projects? For the purpose of this text, we define energy data broadly as data that is produced and used within the energy system by producers, transmission and distribution system operators, traders, and consumers. In the following sections, we look at data from different perspectives (Engineering, Sociology, Computer Science) and then discuss where our perspectives converge or diverge, reflecting on what this means for an interdisciplinary research approach to digitalised energy systems.

 

Engineering 

The engineering group focuses on implementing a probabilistic prediction model that can be used to forecast energy generation and consumption on the side of the DSO. It will allow DSOs to make risk-aware operational decisions. This is important for the twin transition because the probabilistic model enables DSOs to digitally manage the uncertainty and complexity of heterogeneous Distributed Energy Resources (DERs), making large-scale integration of renewable energy possible. Several types of DERs, such as wind, biogas, and battery storage systems, are represented in our model. This heterogeneity makes the data handling complex. For the prediction model we use deep learning methods, mainly variants of the Recurrent Neural Network (RNN) such as the Long Short-Term Memory (LSTM) network. That means we must consider the requirements these variants place on the data.
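
To make this more concrete, a minimal sketch of such a probabilistic LSTM forecaster could look as follows, assuming PyTorch; the class name, layer sizes, and the Gaussian output head are illustrative choices, not the project’s actual implementation.

import torch
import torch.nn as nn

class DERForecaster(nn.Module):
    """Predicts mean and variance of the next value from a window of past observations."""

    def __init__(self, n_features: int, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size, batch_first=True)
        # Two outputs per window: mean and log-variance, i.e. a simple Gaussian forecast.
        self.head = nn.Linear(hidden_size, 2)

    def forward(self, x: torch.Tensor):
        # x has shape (batch, time, n_features) with equally spaced time steps.
        out, _ = self.lstm(x)
        params = self.head(out[:, -1, :])  # use the hidden state of the last step
        mean, log_var = params[:, 0], params[:, 1]
        return mean, log_var

def gaussian_nll(mean, log_var, target):
    """Negative log-likelihood of a Gaussian, a common loss for probabilistic forecasts."""
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()

Training such a model against the negative log-likelihood yields not only a point forecast but also an uncertainty estimate, which is what makes risk-aware operational decisions possible.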

In this context, the challenge is not really acquiring data; it is handling it with respect to formats, quality, sampling, etc., across the entire grid ecosystem. We are interested in wind power, photovoltaics, battery storage systems, EVs, and biogas systems; hence, data handling for these DERs is important.

Since different DER measurements are recorded at unequal time intervals (e.g., seconds, minutes, or quarter-hourly), an early challenge is the irregular sampling frequency across the different types of DERs. This issue persists even when we collect data from a reliable database. It matters because the methods we are working with require equal sampling. For irregular datasets, a solution might be upsampling or downsampling; however, accuracy might suffer because resampling requires interpolation and statistical estimation. Besides energy generation, the prediction model also aims to forecast energy consumption. The consumer load-profile dataset should therefore have the same sampling as the generation dataset, because the LSTM algorithm assumes a fixed temporal spacing between consecutive observations: each time step is treated as equally spaced, and the model implicitly learns temporal dependencies based on this assumption.
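
As an illustration, such a resampling step could look as follows in pandas; the file names, column names, and the 15-minute target grid are assumptions made for the sketch rather than the project’s actual preprocessing.

import pandas as pd

# Two DER feeds recorded at different rates (file and column names are illustrative).
pv = pd.read_csv("pv_seconds.csv", parse_dates=["timestamp"], index_col="timestamp")
load = pd.read_csv("load_quarter_hourly.csv", parse_dates=["timestamp"], index_col="timestamp")

# Downsample the high-frequency PV series by averaging within each 15-minute window.
pv_15min = pv["power_kw"].resample("15min").mean()

# Bring the load profile onto the same grid; time-based interpolation fills any
# gaps introduced by the resampling, at the cost of statistical estimation.
load_15min = load["load_kw"].resample("15min").mean().interpolate(method="time")

# Join on the shared time base so the LSTM sees equally spaced observations.
features = pd.concat([pv_15min, load_15min], axis=1).dropna()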

The next question is where these data come from: are they real or synthetic? Anomalies in real data are nothing new; data privacy, accuracy, and timeliness are the most common issues to consider. Some open-source databases are available with sufficient accuracy, each with its own limitations. The National Renewable Energy Laboratory (NREL) openly distributes data with high spatial and temporal resolution for several types of DERs, but it is mostly US-based. An alternative source is Open Power System Data (OPSD); its well-structured time series cover only Europe and are limited when it comes to battery storage systems and EVs. The Sandia National Labs PV data can be considered a well-accepted source, one limitation again being its US focus. The EU Open Data Portal is a reliable source covering all EU countries, though many of its datasets have low temporal resolution. Beyond these, PANGAEA, an open-access repository that archives and publishes data in Earth and environmental science (Felden et al. 2023), might be another option. Its datasets generally contain both measurements (i.e., numerical data) and observations (i.e., qualitative data). However, most of the archived data has extremely high resolution (1 Hz), which in many cases requires downsampling, and datasets may contain unfamiliar or unwanted columns, leading to large pre-processing tasks (“data cleaning”).

Regardless of the database, a dataset might also contain negative values, which must be handled carefully. For example, the meteorological dataset of Kalisch et al. (2015) contains negative irradiation values, which have to be fixed. The alternative is to use synthetic data. If we simulate models to generate synthetic data, the question arises which tool to use; whether it is open source is one of several factors to consider. The simulation also has to be accurate enough to yield high-quality data, which we intend to use as features in the prediction model.
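
For the negative-irradiation case mentioned above, a cautious cleaning step might look like this; the column name, the noise threshold, and the one-minute target resolution are assumptions made for the sketch.

import pandas as pd

met = pd.read_csv("meteo_1hz.csv", parse_dates=["timestamp"], index_col="timestamp")

# Small negative irradiance values are typically sensor noise (e.g. at night);
# clip those to zero, but flag strongly negative readings for manual inspection
# instead of silently "fixing" them.
noise_floor = -5.0  # W/m^2, an assumed threshold
suspicious = met["ghi_w_m2"] < noise_floor
cleaned = met.loc[~suspicious, "ghi_w_m2"].clip(lower=0.0)

# Downsample the 1 Hz series to one-minute means before using it as a feature.
ghi_1min = cleaned.resample("1min").mean()
print(f"{int(suspicious.sum())} rows flagged for inspection")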

 

Sociology 

For someone looking at the social and cultural side of the grid, understanding the role of data means differentiating not only what energy data is for the stakeholders involved, but also what kinds of work it enables in the energy system and who works the data (electrical engineers, data scientists, programmers, entrepreneurs, ...). As Susan Leigh Star showed, infrastructure is never neutral and must be analyzed through the routines and classifications that underpin it (Star 1999). This applies equally to energy data, which sustains the everyday functioning of the modern electricity grid while simultaneously enabling new economic and political arrangements.

We can therefore say that a first category consists of operational data, which is indispensable for real-time grid operations. This includes frequency measurements, fault signals, and transformer parameters, that is, data that allows DSOs to fulfill their traditional role of ensuring stability and security of supply. Through the management and analysis of these data flows, grid operators observe electricity flows and keep them at stable levels. However, as Di Silvestre et al. (2018: 486) point out, the vastly increasing amount of data produced leads to higher complexity in infrastructural management, which can be observed empirically in grid control: grid operators talk about the tension between the need for more (real-time) data and the difficulty of keeping an overview of the grid on the basis of this data.

A second category is market-oriented data, generated for economic coordination in liberalized electricity markets. We see evidence that load and consumer profiles, production forecasts, flexibility values, and price signals may exist primarily so that traders, aggregators, and retailers can participate in competitive markets. These data do not stabilize the grid directly; rather, they enable market actors to project and optimize financial decisions. “Information technologies are thus not only supposed to make grids smart but also to create ‘smart markets’” (Folkers 2019: 5). Data becomes both a commodity and a condition for the commodification of electricity in the form of futures (in the economic sense). 

DSOs sit at the intersection where these types of data converge. They hold and process both stability-critical and market-relevant data, making them central actors in the emerging “smart” reconfiguration of the electricity system. Contemporary policy frames often imagine DSOs as neutral “platform operators” who must deliver transparency and enable competition. Yet recent scholarship shows that DSOs are not merely technical intermediaries but politically consequential actors. Trahan and Hess (2021) argue that control over data and digital infrastructures increasingly shapes who directs the pace and pathway of energy transitions, and DSOs often possess exactly this kind of infrastructural authority. Decisions about what DSOs measure, how granular the data is, and who gains access are therefore not only technical choices but expressions of particular governance models.

Electrical grids are socio-technical systems (Geels 2004) in which data becomes an integral category of analysis. Sadowski and Levenda (2020) describe smart energy systems as operating through an “anti-politics” logic that frames complex social decisions as matters of technical or economic optimization. This helps illuminate how smart-meter data, while useful for detecting local overloads, for example, can simultaneously advance new commercial models or raise concerns about household surveillance. Such developments are then often presented as apolitical. We must therefore ask not just what data enables, but who benefits, who is governed, and which forms of agency are foreclosed when smart grids are designed as technocratic systems of control.

 

Computer Science 

From a computer-science perspective, the energy sector is an application context in which general data-system challenges become particularly visible. The focus here is not the substance of “energy data”, but the methods required to integrate heterogeneous sources, formalize the meaning of data and transformations, and provide verifiable guarantees about processing and use. In energy – where data increasingly drives automated operational and market decisions – interoperability, correctness under failure, auditability, and end-to-end traceability of data lineage are prerequisites for trustworthy automation.

Energy data is produced across a heterogeneous ecosystem – from smart meters and DER telemetry to grid sensors, market systems, and exogenous inputs such as weather feeds. It arrives with different formats and protocols, meanings (units, identifiers, time zones), sampling regimes, and guarantees regarding ordering, completeness, and latency. The friction of integrating these sources becomes most visible once operational and market-facing workflows are automated: temporal misalignment, inconsistent identifiers, unclear provenance, duplicates, delays, and missing intervals. As the Engineering section shows, the challenge is often not access to data but making it temporally and semantically comparable – by enforcing a consistent time base and recording any resampling steps or assumptions used when imputing missing intervals. 

We turn some domain expectations into explicit, testable rules for the data-ingestion pipeline. In practice, this means that we describe the pipeline as a sequence of clearly defined steps (for example: “data arrives”, “it is checked”, “it is put into the right order”, “it is released to downstream systems”, or “it is isolated for manual inspection”). Thinking in steps helps because it lets us state requirements as simple “always/never” statements about what may happen – and then verify that the pipeline design and its actual operation follow those rules (cf. Pnueli 1977; Baier and Katoen 2008).

A concrete example is ingesting exchange market data in energy trading (day-ahead/intraday prices, trades, etc.). Here, correctness is not only about the values themselves, but also about timeliness, ordering, and mapping each record to the correct instrument. A typical pipeline therefore includes stages such as the following (a minimal code sketch follows the list):

 

Received: a message arrived from the exchange feed 

Validated: basic checks succeeded (format, required fields, plausible ranges, signature/source, etc.) 

Sequenced: the message was placed in the correct order within its instrument stream (or a gap was detected) 

Published: the message is released for downstream use (analytics, trading decisions, storage) 

Quarantined: the message is withheld because something is wrong or uncertain; it is retained for later inspection or recovery 
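
To make these stages concrete, a minimal sketch in Python could look as follows; the message fields, the plausibility range, and the per-instrument sequencing are illustrative assumptions, not a description of an actual exchange feed or trading system.

from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Stage(Enum):
    RECEIVED = auto()
    VALIDATED = auto()
    SEQUENCED = auto()
    PUBLISHED = auto()
    QUARANTINED = auto()

@dataclass
class MarketMessage:
    instrument: str            # e.g. a day-ahead or intraday product identifier
    sequence: int              # per-instrument sequence number from the feed
    price: Optional[float]     # None models a malformed message
    stage: Stage = Stage.RECEIVED
    note: str = ""

class IngestionPipeline:
    def __init__(self) -> None:
        self.last_seq: dict[str, int] = {}        # last published sequence per instrument
        self.published: list[MarketMessage] = []
        self.quarantined: list[MarketMessage] = []

    def process(self, msg: MarketMessage) -> MarketMessage:
        # Validated: required fields present and values in an assumed plausible range.
        if msg.price is None or not (-500.0 <= msg.price <= 5000.0):
            return self._quarantine(msg, "failed validation")
        msg.stage = Stage.VALIDATED

        # Sequenced: in-order within the instrument stream, otherwise a gap is detected.
        expected = self.last_seq.get(msg.instrument, 0) + 1
        if msg.sequence != expected:
            return self._quarantine(msg, f"gap: expected {expected}, got {msg.sequence}")
        msg.stage = Stage.SEQUENCED

        # Published: only validated, correctly ordered messages reach downstream systems.
        self.last_seq[msg.instrument] = msg.sequence
        msg.stage = Stage.PUBLISHED
        self.published.append(msg)
        return msg

    def _quarantine(self, msg: MarketMessage, reason: str) -> MarketMessage:
        # Quarantined: withheld but retained, so nothing is lost without a trace.
        msg.stage, msg.note = Stage.QUARANTINED, reason
        self.quarantined.append(msg)
        return msg

Feeding a batch of messages through process() and inspecting published and quarantined then shows directly which records were released, which were withheld, and why.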

 

The purpose of distinguishing Published vs. Quarantined is to ensure that downstream systems never “silently” act on questionable inputs, while still keeping a trace of what arrived and why it was not used. 

From these steps, we can express a few core assurance rules in simple terms: 

 

No silent loss (accountability): Everything that arrives is eventually either used or explicitly set aside. In other words, a message must not disappear without a trace: it ends up either Published (usable) or Quarantined (kept but blocked). 

No unverified release (safety): Only validated (and correctly ordered) data may be published. This prevents downstream decisions from being driven by malformed, mis-mapped, or out-of-order records. 

Defined handling of gaps: If a gap is detected in a stream (e.g., missing timestamps or sequence numbers), the system must follow a defined path: either Recover (e.g., fetch missing data / re-sync) or Quarantine (isolate the stream) instead of continuing as if nothing happened. 

 

These rules can be checked in two complementary ways. First, we can review the pipeline design against them (to ensure the process structure makes violations impossible or unlikely). Second, we can continuously check the running system by observing the pipeline’s own status events (e.g., “received”, “validated”, “published”, “quarantined”) and raising alerts when the actual behavior violates the rules. In a high-stakes domain like energy, this helps catch subtle issues – such as stale timestamps, wrong unit conversions, or missing intervals – before they propagate into costly operational or market outcomes.
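
As a sketch of the second, runtime-oriented check, the following monitor consumes such status events and reports violations of the “no unverified release” and “no silent loss” rules; the event format is an assumption, and the batch-style check simplifies the “eventually” in the no-silent-loss rule, which a live system would handle with a timeout.

from collections import defaultdict

# Each status event is assumed to be a (message_id, status) pair emitted by the
# pipeline; the statuses mirror the stages above.
TERMINAL = {"published", "quarantined"}

def check_rules(events: list[tuple[str, str]]) -> list[str]:
    alerts = []
    history = defaultdict(list)
    for msg_id, status in events:
        history[msg_id].append(status)
        # No unverified release: a message must be validated before it is published.
        if status == "published" and "validated" not in history[msg_id]:
            alerts.append(f"{msg_id}: published without validation")
    # No silent loss: every received message must end up published or quarantined.
    for msg_id, statuses in history.items():
        if "received" in statuses and not TERMINAL.intersection(statuses):
            alerts.append(f"{msg_id}: received but neither published nor quarantined")
    return alerts

# m2 violates "no unverified release", m3 violates "no silent loss".
events = [
    ("m1", "received"), ("m1", "validated"), ("m1", "published"),
    ("m2", "received"), ("m2", "published"),
    ("m3", "received"),
]
for alert in check_rules(events):
    print(alert)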

Returning to the introduction: from a computer science perspective, “energy data” is mainly a label for heterogeneous streams in a high-stakes domain. Our goal is to make the reliable use of energy data explicit by specifying what each record means (units, identifiers, timestamps), how trustworthy it is (provenance and integrity), and which guarantees hold as it moves through software systems. Verifiable data exchange is an end-to-end assurance property: data is accounted for, its origin and integrity can be verified, its interpretation remains consistent across transformations, and its lineage is traceable so that automated decisions remain auditable. This matters in the energy sector because automated decisions increasingly depend on near-real-time data, and even small, hard-to-detect errors, such as stale timestamps, wrong unit conversions, or missing intervals, can propagate into costly operational actions and market outcomes. Verifiable end-to-end data lineage makes such errors detectable and auditable, thereby enabling trustworthy automation and regulatory compliance.
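
One minimal way to make these record-level guarantees explicit is to attach semantic and provenance metadata plus an integrity hash to every value entering the system; the field names and hashing scheme below are assumptions made for the sketch, not a standard.

import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EnergyRecord:
    meter_id: str      # identifier of the measuring device
    timestamp: str     # ISO 8601 with explicit time zone
    value: float
    unit: str          # stated explicitly, e.g. "kW" or "kWh"
    source: str        # provenance: which feed or system produced the record

def integrity_hash(record: EnergyRecord) -> str:
    """Hash of a canonical serialization, so later stages can verify the record is unchanged."""
    canonical = json.dumps(asdict(record), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

rec = EnergyRecord("meter-042", "2024-03-01T12:00:00+01:00", 3.2, "kW", "dso-smart-meter-feed")
print(integrity_hash(rec))  # downstream systems recompute and compare this digest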

 

Conclusion 

Across engineering, sociology, and computer science, “handling energy data” converges on a shared concern: reliability of decisions derived from data. Engineering emphasizes fitness for modeling (sampling regularity, preprocessing, forecast calibration). Computer science emphasizes fitness for operation (integration architectures, formal contracts, runtime verification, end-to-end assurance, and observability). Sociology emphasizes fitness for governance (who gains access, how classifications embed power, and how “optimization” can depoliticize consequential choices). These are not competing goals; they are different validity criteria applied to the same data flows. 

The main divergence lies in what counts as a “data problem”. For the engineers in our research group, irregular sampling or biased datasets are primary data problems; for computer scientists, the issues of schema/semantic mismatch, traceability, and enforceable guarantees dominate; for sociologists, the key questions are institutional: who benefits, who is exposed to surveillance or exclusion, and who can contest decisions. 

A central insight emerging across disciplines is that data quality, neutrality, and usefulness cannot be assessed in isolation from purpose. The same data stream may be “good” for prediction, “correct” for automated processing and yet problematic from a governance perspective. Interdisciplinary work on energy data therefore requires not necessarily shared data sets, but shared reflexivity about assumptions, validation criteria, and normative implications. Viewed through this interdisciplinary lens, energy data emerges less as a stable object than as a socio-technical construct whose properties depend on the practices and purposes through which it is handled.

 

Bibliography

Kalisch, John; Schmidt, Thomas; Heinemann, Detlev; Lorenz, Elke (2015): Continuous meteorological observations in high-resolution (1 Hz) at University of Oldenburg in 2014 [dataset publication series]. Carl von Ossietzky Universität Oldenburg, Germany, PANGAEA, https://doi.org/10.1594/PANGAEA.847830 

Felden, Janine; Möller, Lars; Schindler, Uwe; et al. (2023): PANGAEA – Data Publisher for Earth & Environmental Science. Scientific Data 10(1): 347. https://doi.org/10.1038/s41597-023-02269-x

Star, S. L. (1999). The ethnography of infrastructure. American Behavioral Scientist 43(3): 377–391.

Di Silvestre ML, Favuzza S, Riva Sanseverino E, Zizzo G (2018) How Decarbonization, Digitalization and Decentralization are changing key power infrastructures. Renewable and Sustainable Energy Reviews 93:483–498. https://doi.org/10.1016/j.rser.2018.05.068 

Folkers A (2019) Smart Grids and Smart Markets: the Promises and Politics of Intelligent Infrastructures. In: Kornberger M, Bowker GC, Elyachar J, et al (eds) Thinking Infrastructures. pp 255–272

Fraunhofer ESK and accenture consulting. 2017. “Smart Grid = Connected Grid (White Paper).” https://www.iks.fraunhofer.de/de/publikationen/whitepaper-studien/whitepaper-smart-grid-connected-grid.html

Geels, Frank W. 2004. “From Sectoral Systems of Innovation to Socio-Technical Systems: Insights about Dynamics and Change from Sociology and Institutional Theory.” Research Policy 33(6):897–920. doi:10.1016/j.respol.2004.01.015

Trahan RT, Hess DJ (2021) Who controls electricity transitions? Digitization, decarbonization, and local power organizations. Energy Research & Social Science 80:102219. https://doi.org/10.1016/j.erss.2021.102219 

Sadowski J, Levenda AM (2020) The anti-politics of smart energy regimes. Political Geography 81:102202. https://doi.org/10.1016/j.polgeo.2020.102202 

Pnueli, A. (1977). The Temporal Logic of Programs. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science (FOCS 1977). IEEE. 

Baier, C., & Katoen, J.-P. (2008). Principles of Model Checking. MIT Press. 

 
