
On-Policy vs. Off-Policy HVAC Control: Comparing PPO and SAC-Gumbel in EnergyPlus

Conference: paper with proceedings at an international conference

We compare two reinforcement learning methods for HVAC control in a university amphitheater simulated in EnergyPlus: Proximal Policy Optimization (PPO, on-policy) and SAC-Gumbel (off-policy). We run two experiments. First, a weekly adaptation test trains each agent for 50 episodes on the first week of January in Luxembourg. Second, a year-long generalization test trains on a full year of data (about 3.5 passes over the dataset) and then evaluates on two other climates (San Diego and Brest) with realistic occupancy. We assess learning speed, daily average cumulative score, comfort violations during occupied periods (counted per 15-minute interval), and energy use (AHU electricity plus district heating).
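
The SAC-Gumbel variant named above typically adapts SAC, which was designed for continuous actions, to discrete HVAC commands via the Gumbel-softmax relaxation. The following is a minimal sketch of that action-sampling step, assuming PyTorch; the `policy_net` argument and the toy dimensions are illustrative stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def sample_discrete_action(policy_net, obs, tau=1.0):
    """Draw a differentiable discrete action via the Gumbel-softmax trick.

    With hard=True the forward pass returns a one-hot action, while the
    backward pass uses the soft relaxation, so the SAC actor loss can
    still backpropagate through the sampling step.
    """
    logits = policy_net(obs)                       # (batch, n_actions)
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    return one_hot, one_hot.argmax(dim=-1)         # relaxed sample + action index

# Toy usage: 8 observation dims, 4 discrete HVAC actions (hypothetical sizes).
policy = torch.nn.Linear(8, 4)
a_onehot, a_idx = sample_discrete_action(policy, torch.randn(2, 8))
```

Lower values of the temperature `tau` sharpen the relaxation toward true one-hot sampling; a common design choice is to anneal `tau` over training.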

In the weekly test, SAC-Gumbel achieved a better cumulative score (-449.17 vs. -501.33) and used less energy (7.16 GJ vs. 9.05 GJ) than PPO. In the year-long tests, SAC-Gumbel lowered cumulative penalty scores by 31.85% and reduced occupied-period comfort violations by 30.12% relative to PPO, but consumed 83.04% more energy. Overall, the off-policy method learned faster and provided stronger comfort control, at the cost of higher energy use in cross-climate deployment.