Linux Thermal Management: Fix Laptop Overheating

Laptop overheating on Linux is rarely one single bug. In practice, it is a stack problem: firmware behavior, kernel power-state policy, CPU governor choices, discrete GPU runtime power, and plain old dust in the heatsink all interact. The good news is that Linux gives you excellent visibility into every layer of that stack. If you approach thermal issues in a structured order, you can usually cut sustained temperatures by 8 to 20 C, reduce fan noise, and improve battery runtime without making your laptop feel slow.
This guide is written as a full workflow, not a random list of tweaks. You will start with prerequisites and baseline measurements, classify the severity of the problem, and then apply fixes in the right order: software first, firmware/kernel next, and hardware maintenance last when needed.
Prerequisites and Baseline
Before changing anything, make sure your platform is capable of sane thermal control under Linux. If prerequisites are not met, you can spend hours tuning userspace tools while the real issue is an unsupported firmware path.
Minimum baseline for this guide:
- Kernel: 6.8+ recommended, 6.10+ preferred for newer Intel/AMD mobile platforms
- Userspace: lm-sensors, powertop, turbostat (Intel), nvtop or radeontop (GPU visibility)
- Firmware: latest BIOS/UEFI from vendor
- Power profile service: choose one primary CPU policy tool (tlp or auto-cpufreq) to avoid conflicts
Install core diagnostics:
# Debian/Ubuntu
sudo apt update
sudo apt install -y lm-sensors powertop linux-tools-common linux-tools-generic \
stress-ng htop iotop nvtop
# Arch
sudo pacman -Syu --needed lm_sensors powertop linux-tools stress-ng htop iotop nvtop
Run sensor detection once:
sudo sensors-detect --auto
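sensors
Capture a baseline so you can prove improvements later:
# Refresh every 5 seconds; let it run for about 2 minutes (~24 samples), then press Ctrl+C
# Thermal zone values are reported in millidegrees Celsius (45000 = 45.0 C)
watch -n 5 'sensors; echo; cat /sys/class/thermal/thermal_zone*/temp 2>/dev/null'
Record: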
| Metric | Idle (plugged in) | Load (10 min compile/stress) | Notes |
|---|---|---|---|
| CPU package temp | |||
| GPU temp | |||
| Fan RPM | |||
| CPU package power (W) | |||
| Throttling observed | Yes/No | Yes/No | |
If you skip this, you will not know whether a tweak helped or just changed fan acoustics.
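Because `watch` only refreshes the screen, a small logger makes the baseline easier to keep and compare. The following is a minimal sketch, assuming the standard `/sys/class/thermal` layout; the function names (`mc_to_c`, `sample_zones`, `log_baseline`) and the CSV columns are illustrative choices, not part of any tool mentioned above.

```shell
#!/bin/sh
# Log thermal-zone temperatures as CSV for a fixed number of samples.
# Kernel thermal zones report millidegrees Celsius, so 45000 means 45.0 C.

# Convert a millidegree reading to degrees Celsius with one decimal place.
mc_to_c() {
    awk -v mc="$1" 'BEGIN { printf "%.1f", mc / 1000 }'
}

# Print one CSV row (epoch,zone,type,temp_c) per readable zone under $1.
sample_zones() {
    base=$1
    now=$(date +%s)
    for z in "$base"/thermal_zone*; do
        [ -r "$z/temp" ] || continue
        ztype=$(cat "$z/type" 2>/dev/null || echo unknown)
        printf '%s,%s,%s,%s\n' "$now" "$(basename "$z")" "$ztype" \
            "$(mc_to_c "$(cat "$z/temp")")"
    done
}

# Take $2 samples from directory $1, sleeping $3 seconds between samples.
log_baseline() {
    i=0
    echo "epoch,zone,type,temp_c"
    while [ "$i" -lt "$2" ]; do
        sample_zones "$1"
        i=$((i + 1))
        if [ "$i" -lt "$2" ]; then
            sleep "$3"
        fi
    done
    return 0
}

# Usage (24 samples, 5 seconds apart, about 2 minutes):
#   log_baseline /sys/class/thermal 24 5 > baseline-idle.csv
```

Running it once at idle and once under load gives two files you can diff against later measurements, which fills the table above with numbers rather than impressions.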
Severity Triage: Mild, Moderate, Severe
Treat thermal work like incident triage. Not every machine needs undervolting or repasting.
| Severity | Typical symptoms | Likely class of fix | Start here |
|---|---|---|---|
| Mild | Warm chassis, occasional fan ramps, no throttle | Software policy tuning | TLP or auto-cpufreq, governor/power profile checks |
| Moderate | Frequent loud fans, sustained 90 C+ CPU under routine workloads | Mixed software + firmware | CPU boost limits, kernel params, GPU runtime PM, fan curve tuning |
| Severe | Immediate thermal throttle, shutdowns, hot at idle | Hardware intervention required | Dust cleanup, paste/pad replacement, fan integrity, heatsink seating |
A common mistake is applying severe-level hardware changes to a mild-level issue, or the opposite: endlessly tweaking kernel parameters on a laptop with a nearly blocked heatsink.
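The triage table can be turned into a rough first-pass check. This is only a sketch: the 90 C load threshold comes from the table, the 70 C idle threshold is an assumed reading of "hot at idle", and the `classify_severity` function and its `early` flag (1 if throttling starts almost immediately under load) are hypothetical names introduced here.

```shell
#!/bin/sh
# Classify thermal severity from three measurements:
#   $1 = idle CPU temperature in C
#   $2 = sustained load CPU temperature in C
#   $3 = 1 if throttling or shutdown happens almost immediately under load, else 0
# Thresholds are assumptions for illustration; adjust them to your hardware.
classify_severity() {
    awk -v idle="$1" -v load="$2" -v early="$3" 'BEGIN {
        if (early == 1 || idle >= 70) print "severe"
        else if (load >= 90)          print "moderate"
        else                          print "mild"
    }'
}

classify_severity 45 82 0
classify_severity 50 95 0
classify_severity 75 95 0
```

Treat the result as a starting point for the "Start here" column, not as a diagnosis.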
Why Linux Laptops Overheat (The Root Causes)
Some laptop models run hotter under Linux than under Windows because OEMs often ship vendor-specific thermal logic that is tightly integrated with their Windows control stack. On Linux, generic kernel interfaces usually work well, but edge cases appear when ACPI tables, WMI hooks, or fan controllers are only partially exposed.
The most common root causes are:
- ACPI/WMI feature gaps: thermal zones exist, but vendor fan and boost controls are incomplete
- Poor C-state residency: CPU fails to spend enough time in deep idle states
- Aggressive turbo behavior on AC power: sustained high clocks beyond the cooling system’s realistic envelope
- dGPU staying active: NVIDIA/AMD discrete GPU never fully enters low-power runtime state
- Suspend/resume regressions: post-resume power policy drifts into high-heat behavior
Use turbostat to confirm whether your CPU is idling correctly. If package C-states are shallow and package power remains elevated at idle, your thermal issue begins before you launch a single app.
sudo turbostat --Summary --show Core,CPU,Avg_MHz,Bzy_MHz,PkgWatt,PkgTmp,POLL,Busy%,C1%,C6%,C10 -i 2
On healthy Intel mobile systems, idle C10 residency should be significant during true idle. If C10 is near zero and package power stays high, investigate background wakeups, governor policy, and kernel power parameters.
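Where turbostat is unavailable (for example on non-Intel hardware), the kernel's cpuidle counters offer a rough cross-check. The sketch below assumes the usual `cpuidle/stateN/name` and `cpuidle/stateN/time` files (time in microseconds since boot); `cstate_share` is an illustrative helper name, and the figures are shares of accumulated idle time rather than of wall-clock time.

```shell
#!/bin/sh
# Print each idle state's share of the idle time accumulated since boot.
# $1 is a cpuidle directory such as /sys/devices/system/cpu/cpu0/cpuidle.
cstate_share() {
    dir=$1
    total=0
    # First pass: total idle time across all states (microseconds).
    for s in "$dir"/state*; do
        [ -r "$s/time" ] || continue
        total=$((total + $(cat "$s/time")))
    done
    if [ "$total" -le 0 ]; then
        echo "no cpuidle data in $dir" >&2
        return 1
    fi
    # Second pass: print "<state name> <percent of idle time>".
    for s in "$dir"/state*; do
        [ -r "$s/time" ] || continue
        awk -v n="$(cat "$s/name")" -v t="$(cat "$s/time")" -v tot="$total" \
            'BEGIN { printf "%s %.1f%%\n", n, 100 * t / tot }'
    done
}

# Usage:
#   cstate_share /sys/devices/system/cpu/cpu0/cpuidle
```

If the deepest listed state holds only a small share after a long idle period, that points to the same wakeup and policy problems the turbostat check is meant to reveal.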
Diagnosing Your Thermal Problem
Diagnosis is the highest-leverage phase. You are trying to answer three questions:
- Which component is generating most heat (CPU, GPU, storage, VRM)?
- Is the heat caused by load, policy, or hardware transfer inefficiency?
- Does the issue appear all the time, or only after events like suspend/resume?
Start with sensor and process visibility. Useful tools here include htop and iotop:
# CPU, board, and fan sensors
sensors
# CPU load and runaway processes
htop
# Per-process disk activity (can cause CPU package wakeups)
sudo iotop -oPa
# GPU utilization and memory clocks
nvtop
For thermal zones:
for z in /sys/class/thermal/thermal_zone*; do
echo "$(basename "$z") type=$(cat "$z/type" 2>/dev/null) temp_mC=$(cat "$z/temp" 2>/dev/null)"
done
Run powertop calibration once on AC and once on battery. It helps identify devices that refuse low-power states.
sudo powertop --calibrate
sudo powertop
Then perform a controlled load test with stress-ng:
# 5-minute CPU stress for thermal response curve
stress-ng --cpu "$(nproc)" --cpu-method matrixprod -t 300
Watch clock and throttle behavior during the run:
watch -n 2 'grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq | head -n 12; echo; sensors'
If clocks collapse early while package temperature spikes rapidly, you likely have either over-aggressive boost settings or insufficient cooling transfer (dust/paste/fan curve).
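Watching clocks by eye can miss short throttle events. On Intel systems the kernel usually keeps per-CPU counters under `thermal_throttle/core_throttle_count`, and a before/after difference around the load test gives a concrete number. This is a sketch under that assumption; `throttle_events` is an illustrative helper name, and the counters may be absent on AMD or ARM hardware, in which case it prints 0.

```shell
#!/bin/sh
# Sum the per-CPU thermal throttle event counters found under $1
# (normally /sys/devices/system/cpu). Prints 0 if no counters exist.
throttle_events() {
    total=0
    for f in "$1"/cpu[0-9]*/thermal_throttle/core_throttle_count; do
        [ -r "$f" ] || continue
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}

# Usage around a load test:
#   before=$(throttle_events /sys/devices/system/cpu)
#   stress-ng --cpu "$(nproc)" --cpu-method matrixprod -t 300
#   after=$(throttle_events /sys/devices/system/cpu)
#   echo "throttle events during run: $((after - before))"
```

A nonzero difference confirms that the clock collapse you saw was thermal throttling rather than a governor or power-limit decision.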
Example Terminal Output: Before and After
Before tuning (common pattern):
CPU Package: 98.0 C
PkgWatt: 39.5 W
Fan: 5300 RPM
scaling_cur_freq (avg): 1.2 GHz
Thermal throttle: active
After policy tuning and cleanup:
CPU Package: 84.0 C
PkgWatt: 26.0 W
Fan: 3900 RPM
scaling_cur_freq (avg): 2.4 GHz
Thermal throttle: inactive
Lower temperature with higher sustained frequency is exactly what you want.
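To make the comparison explicit, the change can be computed rather than eyeballed. The sketch below uses the example figures above; `delta` is an illustrative helper that prints the absolute change and the percentage change relative to the "before" value.

```shell
#!/bin/sh
# Print the change from a "before" value ($2) to an "after" value ($3)
# as an absolute delta and a percentage of the "before" value, labelled $1.
delta() {
    awk -v label="$1" -v b="$2" -v a="$3" \
        'BEGIN { printf "%s: %+.1f (%+.0f%%)\n", label, a - b, 100 * (a - b) / b }'
}

delta "temp_C"   98.0 84.0   # prints: temp_C: -14.0 (-14%)
delta "pkg_W"    39.5 26.0   # prints: pkg_W: -13.5 (-34%)
delta "freq_GHz" 1.2  2.4    # prints: freq_GHz: +1.2 (+100%)
```

In this example the package draws about a third less power while sustaining twice the clock, which is the signature of a machine that has stopped throttling.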
TLP: The Comprehensive Power Management Suite
TLP remains the most complete laptop power-management toolkit for Linux because it configures more than CPU governors. It also handles PCIe ASPM, runtime power management for PCI devices, radio devices, and several battery-related behaviors.
Install and enable:
# Debian/Ubuntu
sudo apt install -y tlp tlp-rdw
sudo systemctl enable --now tlp
# Arch
sudo pacman -S --needed tlp
sudo systemctl enable --now tlp
Key settings in /etc/tlp.conf for thermal control:
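CPU_SCALING_GOVERNOR_ON_AC=powersave
CPU_BOOST_ON_AC=1
# Percentage caps use CPU_MAX_PERF_*; CPU_SCALING_MAX_FREQ_* expects a frequency in kHz
CPU_MAX_PERF_ON_AC=85
CPU_SCALING_GOVERNOR_ON_BAT=powersave
CPU_BOOST_ON_BAT=0
CPU_MAX_PERF_ON_BAT=60
PCIE_ASPM_ON_AC=default
PCIE_ASPM_ON_BAT=powersupersave
RUNTIME_PM_ON_BAT=auto
WIFI_PWR_ON_BAT=on
Notes:
- CPU_BOOST_ON_BAT=0 usually gives the biggest immediate heat and battery improvement on mobile CPUs.
- Capping AC maximum performance at 80-90% often removes thermal spikes with small real-world performance impact. CPU_MAX_PERF_* needs the intel_pstate driver (or amd-pstate in active mode on recent TLP versions); with other drivers, set CPU_SCALING_MAX_FREQ_* to a frequency in kHz instead.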
- PCIe ASPM and runtime PM can reduce idle platform power by several watts, which directly lowers baseline temperatures.
Check status and applied policy:
sudo tlp-stat -s
sudo tlp-stat -p
sudo tlp-stat -t
If your laptop includes vendor daemons that also manipulate power states, review for overlaps before blaming TLP.
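It can also help to confirm what the kernel actually applied, independent of any one tool's report. The following sketch reads the governor and boost state directly from sysfs; `cpu_policy` is an illustrative helper, and it assumes either the `intel_pstate/no_turbo` flag or the generic `cpufreq/boost` flag is present, printing `unknown` otherwise.

```shell
#!/bin/sh
# Report the governor of cpu0 and whether boost/turbo is enabled.
# $1 is the CPU sysfs root, normally /sys/devices/system/cpu.
cpu_policy() {
    root=$1
    gov=$(cat "$root/cpu0/cpufreq/scaling_governor" 2>/dev/null || echo unknown)
    if [ -r "$root/intel_pstate/no_turbo" ]; then
        # intel_pstate exposes an inverted flag: 1 means turbo is disabled.
        if [ "$(cat "$root/intel_pstate/no_turbo")" = "0" ]; then
            boost=on
        else
            boost=off
        fi
    elif [ -r "$root/cpufreq/boost" ]; then
        # Drivers such as acpi-cpufreq expose a direct flag: 1 means enabled.
        if [ "$(cat "$root/cpufreq/boost")" = "1" ]; then
            boost=on
        else
            boost=off
        fi
    else
        boost=unknown
    fi
    echo "governor=$gov boost=$boost"
}

# Usage:
#   cpu_policy /sys/devices/system/cpu
```

Run it on AC and again on battery: if the output does not change the way your configuration says it should, another daemon is probably overriding the policy.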
auto-cpufreq: Intelligent CPU Governor Switching
auto-cpufreq focuses on CPU behavior and can be excellent when your workload shifts frequently between light and heavy bursts. It monitors temperature, battery status, and load, then adjusts governor and boost policy dynamically.
Install and test in monitor mode first:
# Snap path
sudo snap install auto-cpufreq
# Dry-run monitoring
sudo auto-cpufreq --monitor
Enable daemon mode once behavior looks sensible:
sudo auto-cpufreq --install
Example /etc/auto-cpufreq.conf:
[charger]
governor = powersave
turbo = auto
[battery]
governor = powersave
turbo = never
Important conflict rule:
- Do not run TLP CPU scaling logic and auto-cpufreq CPU logic simultaneously without a clear split.
- Common stable setup: auto-cpufreq for CPU policy, TLP for non-CPU subsystems.
If you combine them, document exactly which tool owns which parameters.
Kernel Parameters That Matter for Thermals
Kernel boot parameters are not magic fixes, but they can stabilize problematic platforms when defaults are suboptimal.
Relevant parameters to evaluate:
- processor.max_cstate=<n>: limits the deepest C-state; useful for diagnosing unstable deep-idle behavior, but can increase idle power if set too low
- intel_pstate=passive: hands frequency selection to schedutil/governor logic instead of full active intel_pstate; can improve predictability on some Intel systems
- amd_pstate=guided: gives the kernel guided control of AMD P-state behavior; often improves the balance between responsiveness and heat on newer Ryzen mobile CPUs
Temporary test via GRUB edit at boot, then persist only if beneficial:
# Example persistent change in /etc/default/grub
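GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_pstate=guided"
# Then regenerate the GRUB config and reboot (Debian/Ubuntu; on Arch, run grub-mkconfig -o /boot/grub/grub.cfg instead)
sudo update-grub
sudo reboot
Validation is mandatory after each change. Compare baseline temperature, package power, and fan behavior before keeping any kernel parameter long-term.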
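Part of that validation is confirming the parameter actually took effect after reboot. A minimal sketch, assuming the standard `/proc/cmdline` and cpufreq sysfs paths; `has_param` and `scaling_driver` are illustrative helper names.

```shell
#!/bin/sh
# Succeed if the kernel command line stored in file $1 contains
# parameter $2 as a whole word (for example amd_pstate=guided).
has_param() {
    tr ' ' '\n' < "$1" | grep -qxF -- "$2"
}

# Print the active frequency-scaling driver for cpu0 under sysfs root $1,
# or "unknown" if the file is not exposed.
scaling_driver() {
    cat "$1/cpu0/cpufreq/scaling_driver" 2>/dev/null || echo unknown
}

# Usage:
#   has_param /proc/cmdline amd_pstate=guided && echo "parameter active"
#   scaling_driver /sys/devices/system/cpu
```

As a rough guide, intel_pstate in passive mode typically reports `intel_cpufreq`, while amd_pstate in guided or passive mode reports `amd-pstate` and active mode reports `amd-pstate-epp`. If the driver name does not change, the parameter was ignored or is unsupported on that kernel.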
Undervolting Intel and AMD CPUs
Undervolting lowers power draw at a given clock by reducing voltage headroom. It can be highly effective, but stability testing is non-negotiable.
Intel path
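Use intel-undervolt where firmware allows it. Package availability varies by distribution; if it is not in your repositories, build it from the project source.
sudo apt install -y intel-undervolt
sudo nano /etc/intel-undervolt.conf
Typical starting points (conservative, offsets in mV):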
undervolt 0 'CPU' -70
undervolt 1 'GPU' -40
undervolt 2 'CACHE' -70
undervolt 3 'SYSTEM_AGENT' 0
undervolt 4 'ANALOG_IO' 0
Apply and test:
sudo systemctl enable --now intel-undervolt
stress-ng --cpu "$(nproc)" --cpu-method all -t 600
AMD path
For Ryzen mobile APUs, power-limit tuning with ryzenadj is often more practical than classic undervolt sliders.
# Example conservative cap values (hardware-specific)
sudo ryzenadj --stapm-limit=18000 --fast-limit=22000 --slow-limit=20000
These values (mW) reduce sustained package power and therefore temperature.
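Since the limits are given in milliwatts, it is easy to be off by a factor of 1000 when thinking in watts. The sketch below converts watts and prints the resulting command for review instead of running it; `w_to_mw` and `ryzenadj_cmd` are illustrative helpers, and the wattages are the example values above, not recommendations for any specific APU.

```shell
#!/bin/sh
# Convert whole or fractional watts to the milliwatt integers ryzenadj expects.
w_to_mw() {
    awk -v w="$1" 'BEGIN { printf "%d", w * 1000 + 0.5 }'
}

# Print (do not run) a ryzenadj command for sustained ($1), fast ($2),
# and slow ($3) power limits given in watts.
ryzenadj_cmd() {
    echo "ryzenadj --stapm-limit=$(w_to_mw "$1") --fast-limit=$(w_to_mw "$2") --slow-limit=$(w_to_mw "$3")"
}

ryzenadj_cmd 18 22 20
# prints: ryzenadj --stapm-limit=18000 --fast-limit=22000 --slow-limit=20000
```

Printing first and running second keeps a record of each value you tried, which matters because firmware may reset these limits on resume or AC state changes.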
BIOS lock caveat
Many laptops block undervolting because of Plundervolt-era mitigations. Check logs:
dmesg | grep -Ei 'undervolt|voltage|plundervolt'
If blocked at firmware level, do not chase software-only workarounds that compromise security.
GPU Thermal Management (NVIDIA and AMD)
On hybrid laptops, the discrete GPU is frequently the hidden thermal source. You can have low CPU load and still run hot because the dGPU is stuck in a high-power state.
NVIDIA
Set power limits where supported with nvidia-smi:
nvidia-smi --query-gpu=name,power.draw,power.limit,temperature.gpu --format=csv
sudo nvidia-smi -pl 55
For hybrid graphics, use profile switching tools (envycontrol, optimus-manager) to force integrated mode when dGPU is unnecessary. Confirm runtime state after switching.
AMD dGPU
Inspect and set power performance mode:
cat /sys/class/drm/card0/device/power_dpm_force_performance_level
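echo auto | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
Use low or auto depending on workload and responsiveness needs. On hybrid systems the discrete GPU may be card1 rather than card0, so check which card you are addressing first.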
Hybrid mode sanity check
- Integrated-only for office/dev battery workflows
- Hybrid for occasional acceleration
- Discrete-only for sustained gaming/rendering with cooling headroom
If your dGPU temperature remains elevated at idle, check for compositor, browser, or media pipeline accidentally pinning it active.
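One way to check whether the dGPU really sleeps is to read its PCI runtime power status. This sketch assumes the standard `/sys/bus/pci/devices/*/class` and `power/runtime_status` files, where a class beginning with `0x03` marks a display controller; `gpu_runtime_status` is an illustrative helper name.

```shell
#!/bin/sh
# List PCI display-class devices under $1 (normally /sys/bus/pci/devices)
# with their runtime PM status. "suspended" means the device is powered down.
gpu_runtime_status() {
    for d in "$1"/*; do
        [ -r "$d/class" ] || continue
        case "$(cat "$d/class")" in
            0x03*) ;;          # display controller class (VGA, 3D, ...)
            *) continue ;;
        esac
        status=$(cat "$d/power/runtime_status" 2>/dev/null || echo unknown)
        echo "$(basename "$d") $status"
    done
}

# Usage:
#   gpu_runtime_status /sys/bus/pci/devices
```

On a hybrid laptop at idle you would expect the integrated GPU to show `active` and the discrete GPU to show `suspended`. A discrete GPU that stays `active` with nothing rendering is the "pinned" case described above.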
Vendor-Specific Tooling: ThinkPad, MSI, ASUS
Generic Linux power tools are not always enough. Vendor-specific controllers can unlock fan curve and platform profile controls that ACPI generic paths miss.
ThinkPad: thinkfan
thinkfan allows custom fan policy based on thermal sensors. It is useful on systems where default fan hysteresis is too conservative.
sudo apt install -y thinkfan
sudo systemctl enable --now thinkfan
Tune /etc/thinkfan.conf carefully and test incrementally to avoid excessive fan wear.
MSI: isw (and related MSI tooling)
MSI gaming laptops often expose extra fan/performance controls through vendor interfaces. isw can help manage model-specific capabilities where stock Linux profiles are insufficient.
ASUS: asus-wmi and ecosystem tools
The asus-wmi kernel module plus ASUS-focused userspace tooling can expose platform profiles and fan behavior controls on ROG/Zephyrus devices.
Verify module presence:
lsmod | grep asus_wmi
If missing functionality is vendor-specific, no amount of generic tlp tuning can fully replace it.
Sleep/Resume Thermal Issues
Some laptops behave normally after cold boot but overheat after suspend/resume. This is a distinct class of issue and should be debugged separately.
Common signs:
- Fans spin aggressively after wake with low visible workload
- Idle package power doubles post-resume
- dGPU fails to return to runtime-suspended state
Check journal around resume:
journalctl -b 0 | grep -Ei 'suspend|resume|acpi|thermal|nvidia|amdgpu|iwlwifi'
This searches the current boot; use -b -1 instead if you have rebooted since the problematic resume.
Practical fixes:
- Update kernel/firmware first (many resume regressions are version-specific)
- Reapply power policy hooks on resume with systemd sleep scripts
- Restart specific services after wake if they hold high-power state
- Test s2idle vs deep sleep modes where firmware supports both
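Before comparing sleep modes, it helps to know which one is currently selected. The kernel lists the supported modes in `/sys/power/mem_sleep` and marks the active one with square brackets, for example `s2idle [deep]`. A minimal sketch for reading it, with `active_sleep_mode` as an illustrative helper name:

```shell
#!/bin/sh
# Print the currently selected suspend mode from a mem_sleep-style file.
# The kernel marks the active mode with square brackets, e.g. "s2idle [deep]".
active_sleep_mode() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Usage:
#   active_sleep_mode /sys/power/mem_sleep
# Switch for the current boot only (reverts on reboot):
#   echo s2idle | sudo tee /sys/power/mem_sleep
```

To make a choice persistent, the `mem_sleep_default=` kernel parameter can be set the same way as the other boot parameters in this guide. Record post-resume package power for each mode before deciding.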
Example sleep hook skeleton:
#!/bin/sh
# Save as /lib/systemd/system-sleep/99-thermal-resume
# systemd passes "pre" or "post" as $1 and the sleep type as $2
case "$1/$2" in
post/*)
/usr/bin/tlp start >/dev/null 2>&1
;;
esac
Make executable:
sudo chmod +x /lib/systemd/system-sleep/99-thermal-resume
Physical and BIOS-Level Fixes
When software is tuned and temperatures are still excessive, you are likely constrained by heat transfer or airflow.
High-impact physical actions:
- Heatsink and fan cleaning: dust blockage can raise sustained temps dramatically
- Thermal paste replacement: especially valuable on laptops older than two years
- Fan integrity check: worn bearings reduce airflow even if RPM appears nonzero
- Intake/exhaust path validation: avoid soft surfaces that choke intake
BIOS/UEFI actions:
- Update firmware first
- Use balanced/cool platform profile if available
- Review fan curve controls when exposed
- Reset accidental extreme performance presets
A laptop stand that opens underside airflow can lower steady-state surface and internal temperatures with almost no cost.
Troubleshooting Matrix: Symptoms to Root Cause and Fix
| Symptom | Most likely root cause | Verify with | First fix |
|---|---|---|---|
| Hot at idle (70 C+) | Shallow C-states, background wakeups | turbostat, powertop | Governor/pstate tuning, disable noisy wake sources |
| Fans loud after resume | Resume power-policy drift | journalctl, turbostat post-wake | Resume hook to reapply TLP/CPU policy |
| CPU throttles in short bursts | Boost overshoot vs cooling limit | stress-ng + live sensors | Reduce boost/freq cap, improve fan curve |
| GPU hot while browsing | dGPU pinned active in hybrid setup | nvtop, nvidia-smi, DRM runtime state | Integrated mode, fix app forcing dGPU |
| Still throttling after tuning | Poor thermal transfer | Load test plus physical inspection | Clean heatsink, repaste, check fan health |
This table helps avoid random tweaking. Always move from measured symptom to targeted fix.
Special Cases: ARM Laptops and Data Center Hosts
Thermal guidance on x86 laptops does not map cleanly to all systems.
ARM laptops (including Asahi Linux on Apple Silicon)
Asahi uses a different thermal and power-management stack than conventional ACPI x86 laptops. Tools and interfaces differ, and x86-centric advice like intel_pstate or ryzenadj is irrelevant.
Focus on:
- Asahi-specific kernel and userspace updates
- Platform-supported performance profiles
- Workload-level optimizations (browser/video codec paths, GPU acceleration settings)
Apply documentation from the Asahi project before generic Linux laptop tweaks.
Data center and rack systems
Server thermals are primarily airflow engineering and power budgeting problems, not laptop governor problems. Use BMC/IPMI telemetry, rack airflow planning, and CPU power caps at platform level.
Use laptop-style tuning only on actual mobile platforms. On servers, optimize fan zones, inlet temperature, and workload placement.
A Safe Implementation Order
If you want a practical execution plan, use this order:
- Update BIOS/UEFI and kernel.
- Capture baseline metrics (idle, load, post-resume).
- Choose one CPU policy manager (tlp or auto-cpufreq) and tune conservatively.
- Fix dGPU runtime power state if hybrid graphics are used.
- Test kernel parameter adjustments one at a time.
- Apply undervolt/power-limit tuning only after baseline policy is stable.
- If severe symptoms remain, perform physical cleaning and paste replacement.
After each step, re-measure. Thermal work is engineering, not folklore.
Conclusion
Fixing Linux laptop overheating is less about a single miracle command and more about sequencing. You diagnose first, classify severity, apply policy tuning carefully, then escalate to firmware and hardware interventions only when data shows you need them. The fastest path to a cooler machine is disciplined measurement: package power, temperatures, fan behavior, and throttle status before and after each change.
For most users, the biggest wins come from three actions: selecting a sane CPU policy strategy, ensuring the discrete GPU can truly sleep when idle, and addressing physical cooling health. Add vendor-specific tools when your platform exposes extra controls, and treat suspend/resume as its own troubleshooting branch when symptoms point there. With that method, Linux laptops can run cool, quiet, and stable even under demanding workloads.