Linux Thermal Management: Fix Laptop Overheating

Laptop overheating on Linux is rarely one bug. It’s a stack problem. Firmware, kernel power policy, the CPU governor, discrete GPU power, and plain dust in the heatsink all interact. The good news: Linux shows you every layer. Work through it in order and you can cut sustained temps by 8 to 20 C, quiet the fans, and stretch battery life without slowing the laptop down.
This guide reads as a full workflow, not a random list of tweaks. You’ll start with prereqs and a baseline, score how bad the issue is, then fix in order: software first, firmware and kernel next, hardware last.
Prerequisites and Baseline
Before you change anything, check that your platform can do sane thermal control on Linux. Skip this and you can burn hours tuning userspace tools while the real bug lives in firmware.
Minimum baseline for this guide:
- Kernel: 6.8+ recommended, 6.10+ preferred for newer Intel/AMD mobile platforms
- Userspace: lm-sensors, powertop, turbostat (Intel), nvtop or radeontop (GPU visibility)
- Firmware: latest BIOS/UEFI from vendor
- Power profile service: choose one primary CPU policy tool (tlp or auto-cpufreq) to avoid conflicts
Install core diagnostics:
# Debian/Ubuntu
sudo apt update
sudo apt install -y lm-sensors powertop linux-tools-common linux-tools-generic \
    stress-ng htop iotop nvtop
# Arch
sudo pacman -Syu --needed lm_sensors powertop linux-tools stress-ng htop iotop nvtop

Run sensor detection once:

sudo sensors-detect --auto
sensors

Capture a baseline so you can prove improvements later:

# Refresh every 5 seconds; leave it running for about 2 minutes (24 refreshes), then press Ctrl+C
watch -n 5 'sensors; echo; cat /sys/class/thermal/thermal_zone*/temp 2>/dev/null'

Record:
| Metric | Idle (plugged in) | Load (10 min compile/stress) | Notes |
|---|---|---|---|
| CPU package temp (C) | | | |
| GPU temp (C) | | | |
| Fan RPM | | | |
| CPU package power (W) | | | |
| Throttling observed | Yes/No | Yes/No | |
Skip this and you won’t know whether a tweak actually helped or just shifted the fan noise.
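To make the baseline repeatable, you can write the samples to a CSV file instead of reading them off the screen. The sketch below uses only the sysfs thermal zones. The function names (mc_to_c, sample_thermal_zones, log_thermal_zones) and the CSV layout are illustrative assumptions, not part of any existing tool.

```shell
# Convert a millidegree Celsius reading (the unit sysfs uses) to degrees C.
mc_to_c() {
    awk -v mc="$1" 'BEGIN { printf "%.1f\n", mc / 1000 }'
}

# Print one CSV row per thermal zone: epoch,zone,type,temp_c.
# $1 = thermal base directory (defaults to /sys/class/thermal).
sample_thermal_zones() {
    local base="${1:-/sys/class/thermal}" now z ztype
    now="$(date +%s)"
    for z in "$base"/thermal_zone*; do
        [ -r "$z/temp" ] || continue
        ztype="$(cat "$z/type" 2>/dev/null || echo unknown)"
        printf '%s,%s,%s,%s\n' "$now" "$(basename "$z")" "$ztype" "$(mc_to_c "$(cat "$z/temp")")"
    done
}

# Take $2 samples, $1 seconds apart (defaults: 5 s and 24 samples, about 2 minutes).
# $3 = thermal base directory (defaults to /sys/class/thermal).
log_thermal_zones() {
    local interval="${1:-5}" count="${2:-24}" base="${3:-/sys/class/thermal}" i=0
    echo "epoch,zone,type,temp_c"
    while [ "$i" -lt "$count" ]; do
        sample_thermal_zones "$base"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    done
}
```

For example, log_thermal_zones 5 24 > idle-baseline.csv captures an idle run. Repeat it during the load test with a different file name to fill in the second column of the table.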
Severity Triage: Mild, Moderate, Severe
Treat thermal work like incident triage. Not every machine needs an undervolt or a repaste.
| Severity | Typical symptoms | Likely class of fix | Start here |
|---|---|---|---|
| Mild | Warm chassis, occasional fan ramps, no throttle | Software policy tuning | TLP or auto-cpufreq, governor/power profile checks |
| Moderate | Frequent loud fans, sustained 90 C+ CPU under routine workloads | Mixed software + firmware | CPU boost limits, kernel params, GPU runtime PM, fan curve tuning |
| Severe | Immediate thermal throttle, shutdowns, hot at idle | Hardware intervention required | Dust cleanup, paste/pad replacement, fan integrity, heatsink seating |
A common mistake: throwing severe hardware fixes at a mild issue. The flip is just as common: endless kernel param tweaks on a laptop with a heatsink full of lint.
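If you want the triage to be mechanical, the table can be reduced to a small helper. This is a rough sketch only: the function name classify_severity is made up, and the cutoffs (70 C at idle, 90 C and 95 C under load) are assumptions loosely derived from the symptoms in this guide, not vendor limits. Inputs are whole degrees Celsius.

```shell
# Classify thermal severity from three baseline observations.
# $1 = idle CPU temp (C), $2 = sustained load temp (C), $3 = throttling seen (yes/no)
# All thresholds below are illustrative assumptions; adjust them for your CPU.
classify_severity() {
    local idle="$1" load="$2" throttle="$3"
    if [ "$idle" -ge 70 ] || { [ "$throttle" = "yes" ] && [ "$load" -ge 95 ]; }; then
        echo severe      # hot at idle, or throttling close to the thermal limit
    elif [ "$load" -ge 90 ] || [ "$throttle" = "yes" ]; then
        echo moderate    # sustained 90 C+ or some throttling under load
    else
        echo mild
    fi
}
```

Running classify_severity 48 93 no prints moderate, which points at the software and firmware fixes before any hardware work.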
Why Linux Laptops Overheat (The Root Causes)
Some laptop models run hotter under Linux than under Windows because OEMs ship vendor thermal logic tied to their Windows stack. On Linux, the generic kernel paths usually work fine. Edge cases show up when ACPI tables, WMI hooks, or fan chips are only half-exposed.
The most common root causes are:
- ACPI and WMI feature gaps: thermal zones show up, but vendor fan and boost controls are partial
- Poor C-state residency: the CPU doesn’t spend enough time in deep idle
- Pushy turbo on AC power: sustained high clocks beyond what the cooler can shed
- dGPU staying active: the NVIDIA or AMD discrete GPU never fully enters its low-power runtime state
- Suspend and resume regressions: post-resume power policy drifts into hot behavior
Use turbostat to check that your CPU is idling right. If package C-states stay shallow and package power stays high at idle, your heat problem starts before you launch a single app.

sudo turbostat --Summary --show Core,CPU,Avg_MHz,Bzy_MHz,PkgWatt,PkgTmp,POLL,Busy%,C1%,C6%,C10% -i 2

On healthy Intel mobile chips, C10 residency should be high at true idle. If C10% is near zero and package power stays up, check background wakeups, the governor, and kernel power params.
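When turbostat isn't installed, the kernel's cpuidle counters in sysfs give a coarse cross-check. The sketch below prints each idle state's share of the idle time recorded since boot for one CPU. That is a cumulative figure, not the per-interval residency turbostat reports. The function name cpuidle_shares is an assumption for illustration.

```shell
# Print "<state name> <share of recorded idle time in %>" for one CPU.
# $1 = cpuidle directory (defaults to cpu0's).
cpuidle_shares() {
    local dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}" s total=0
    # First pass: total idle time across all states (microseconds).
    for s in "$dir"/state*; do
        [ -r "$s/time" ] || continue
        total=$((total + $(cat "$s/time")))
    done
    [ "$total" -gt 0 ] || { echo "no cpuidle data in $dir" >&2; return 1; }
    # Second pass: each state's share of that total.
    for s in "$dir"/state*; do
        [ -r "$s/time" ] || continue
        awk -v n="$(cat "$s/name")" -v t="$(cat "$s/time")" -v tot="$total" \
            'BEGIN { printf "%s %.1f\n", n, 100 * t / tot }'
    done
}
```

If the deepest state listed holds only a few percent on a machine that has been mostly idle, something is keeping the CPU awake.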
Diagnosing Your Thermal Problem
Diagnosis is the high-leverage phase. You’re trying to answer three things:
- Which part is making most of the heat (CPU, GPU, storage, VRM)?
- Is the heat from load, from policy, or from bad heat transfer?
- Does it happen all the time, or only after events like suspend and resume?
Start with sensor and process visibility. Useful tools here are htop and iotop:
# CPU, board, and fan sensors
sensors
# CPU load and runaway processes
htop
# Per-process disk activity (can cause CPU package wakeups)
sudo iotop -oPa
# GPU utilization and memory clocks
nvtop

For thermal zones:

for z in /sys/class/thermal/thermal_zone*; do
    echo "$(basename "$z") type=$(cat "$z/type" 2>/dev/null) temp_mC=$(cat "$z/temp" 2>/dev/null)"
done

Run powertop calibration once on AC and once on battery. It helps you spot devices that won’t drop into low-power states.

sudo powertop --calibrate
sudo powertop

Then run a controlled load test with stress-ng:

# 5-minute CPU stress for thermal response curve
stress-ng --cpu "$(nproc)" --cpu-method matrixprod -t 300

Watch clocks and throttle behavior as it runs:

watch -n 2 'grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq | head -n 12; echo; sensors'

If clocks collapse early while package temp spikes fast, you’ve got pushy boost settings or weak heat transfer (dust, dry paste, bad fan curve).
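Watching clocks by eye is easy to misread, so it can help to count throttle events directly. On Intel x86 systems the kernel exposes per-CPU counters under thermal_throttle in sysfs. The sketch below sums core_throttle_count before and after a command. The function names are assumptions, and on platforms without these files the result is simply 0.

```shell
# Sum core_throttle_count over all CPUs (Intel thermal_throttle interface).
# $1 = CPU base directory (defaults to /sys/devices/system/cpu).
throttle_events() {
    local base="${1:-/sys/devices/system/cpu}" f sum=0
    for f in "$base"/cpu[0-9]*/thermal_throttle/core_throttle_count; do
        [ -r "$f" ] || continue
        sum=$((sum + $(cat "$f")))
    done
    echo "$sum"
}

# Run a command and print how many throttle events were counted while it ran.
# $1 = CPU base directory, remaining arguments = the command to run.
throttle_delta() {
    local base="$1" before after
    shift
    before="$(throttle_events "$base")"
    "$@" >/dev/null 2>&1 || true   # the load's own exit status is not the point here
    after="$(throttle_events "$base")"
    echo $((after - before))
}
```

A typical call would be throttle_delta /sys/devices/system/cpu stress-ng --cpu "$(nproc)" --cpu-method matrixprod -t 300. Any result above zero means the CPU hit its thermal limit during the run.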
Example Terminal Output: Before and After
Before tuning (common pattern):
CPU Package: 98.0 C
PkgWatt: 39.5 W
Fan: 5300 RPM
scaling_cur_freq (avg): 1.2 GHz
Thermal throttle: active

After policy tuning and cleanup:
CPU Package: 84.0 C
PkgWatt: 26.0 W
Fan: 3900 RPM
scaling_cur_freq (avg): 2.4 GHz
Thermal throttle: inactive

Lower temp with higher sustained frequency is exactly what you want.
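If you save each run in the same "Label: value unit" layout as the example output, a small awk helper can print the deltas for you. This is a sketch under that assumption. The function name compare_runs and the file names in the usage note are hypothetical.

```shell
# Print "label: before -> after (delta)" for every numeric "Label: value" line
# that appears in both files. $1 = before file, $2 = after file.
compare_runs() {
    awk -F': *' '
        # First file: remember the numeric values by label.
        NR == FNR { if ($2 ~ /^[0-9.]+/) before[$1] = $2 + 0; next }
        # Second file: print the change for labels seen in the first file.
        ($1 in before) && $2 ~ /^[0-9.]+/ {
            printf "%s: %.1f -> %.1f (%+.1f)\n", $1, before[$1], $2 + 0, ($2 + 0) - before[$1]
        }
    ' "$1" "$2"
}
```

With the two example outputs saved as before.txt and after.txt, compare_runs before.txt after.txt reports the package temperature as 98.0 -> 84.0 (-14.0). Non-numeric lines such as the throttle status are skipped.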
TLP: The Comprehensive Power Management Suite
TLP is the most complete laptop power tool on Linux. It does more than CPU governors. It also handles PCIe ASPM, runtime power for PCI devices, radios, and a few battery behaviors.
Install and enable:
# Debian/Ubuntu
sudo apt install -y tlp tlp-rdw
sudo systemctl enable --now tlp
# Arch
sudo pacman -S --needed tlp
sudo systemctl enable --now tlp

Key settings in /etc/tlp.conf for thermal control:

CPU_SCALING_GOVERNOR_ON_AC=powersave
CPU_BOOST_ON_AC=1
# Percentage of maximum performance; needs a P-state driver such as intel_pstate in active mode
CPU_MAX_PERF_ON_AC=85
CPU_SCALING_GOVERNOR_ON_BAT=powersave
CPU_BOOST_ON_BAT=0
CPU_MAX_PERF_ON_BAT=60
PCIE_ASPM_ON_AC=default
PCIE_ASPM_ON_BAT=powersupersave
RUNTIME_PM_ON_BAT=auto
WIFI_PWR_ON_BAT=on

Notes:
- CPU_BOOST_ON_BAT=0 is the biggest fast win on heat and battery for mobile CPUs.
- Capping AC max performance at 80-90% kills thermal spikes with little real-world slowdown. CPU_MAX_PERF_* takes a percentage. CPU_SCALING_MAX_FREQ_* takes an absolute frequency in kHz, not a percentage.
- PCIe ASPM and runtime PM can cut idle platform power by a few watts. That drops baseline temps.
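TLP's CPU_SCALING_MAX_FREQ_ON_AC and CPU_SCALING_MAX_FREQ_ON_BAT settings expect an absolute frequency in kHz. If you think in percentages, the sketch below converts a percentage of the hardware maximum into that unit. It reads cpuinfo_max_freq from sysfs. The function name pct_to_khz is an assumption.

```shell
# Convert a percentage of the hardware maximum frequency into kHz.
# $1 = percent (1-100), $2 = cpuinfo_max_freq file (defaults to cpu0's).
pct_to_khz() {
    local pct="$1" f="${2:-/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq}"
    [ -r "$f" ] || { echo "cannot read $f" >&2; return 1; }
    echo $(( $(cat "$f") * pct / 100 ))
}
```

On a CPU whose cpuinfo_max_freq reads 4000000, pct_to_khz 85 prints 3400000. That is the value you would put in the kHz-based setting.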
Check status and applied policy:
sudo tlp-stat -s
sudo tlp-stat -p
sudo tlp-stat -t

If your laptop ships vendor daemons that also touch power states, check for overlaps before you blame TLP.
auto-cpufreq: Intelligent CPU Governor Switching
auto-cpufreq focuses on the CPU. It shines when your workload shifts a lot between light and heavy bursts. It watches temp, battery state, and load, then adjusts governor and boost on the fly.
Install and test in monitor mode first:
# Snap path
sudo snap install auto-cpufreq
# Dry-run monitoring
sudo auto-cpufreq --monitor

Enable daemon mode once behavior looks sensible:

sudo auto-cpufreq --install

Example /etc/auto-cpufreq.conf:
[charger]
governor = powersave
turbo = auto
[battery]
governor = powersave
turbo = never

Important conflict rule:
- Don’t run TLP CPU scaling and auto-cpufreq CPU logic at the same time without a clear split.
- A common stable setup: auto-cpufreq for CPU policy, TLP for everything else.
If you do combine them, write down which tool owns which knob.
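A quick way to see who might be fighting over the CPU is to list which policy services are active. The sketch below is one possible check. The unit names are assumptions that depend on how each tool was installed (a snap install of auto-cpufreq uses a different unit name), and power-profiles-daemon is included only as a common third candidate. The IS_ACTIVE override exists purely so the logic can be exercised without systemd.

```shell
# Print every unit from the argument list that is currently active.
# IS_ACTIVE can name an alternative check command (useful for testing).
active_units() {
    local unit
    for unit in "$@"; do
        if ${IS_ACTIVE:-systemctl is-active --quiet} "$unit" 2>/dev/null; then
            echo "$unit"
        fi
    done
}

# Report whether more than one CPU policy manager is running.
# Unit names here are assumptions; edit the list to match your system.
check_policy_conflict() {
    local active count
    active="$(active_units tlp.service auto-cpufreq.service power-profiles-daemon.service)"
    count="$(printf '%s' "$active" | grep -c . || true)"
    if [ "$count" -gt 1 ]; then
        echo "conflict: $(echo $active)"
        return 1
    fi
    echo "ok: ${active:-none}"
}
```

A conflict result doesn't always mean breakage. It means you should decide which tool owns governor and boost settings and disable that part in the other.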
Kernel Parameters That Matter for Thermals
Kernel boot params aren’t magic fixes. They can settle down rough platforms when the defaults are poor.
Params worth testing:
- processor.max_cstate=<n>: caps the deepest C-state. Handy for diagnosing flaky deep-idle behavior. Set it too low and idle power goes up. On systems that use the intel_idle driver, the equivalent is intel_idle.max_cstate=<n>.
- intel_pstate=passive: switches intel_pstate to passive mode, so a generic cpufreq governor such as schedutil picks frequencies instead of the driver’s built-in logic. Can be more predictable on some Intel systems.
- amd_pstate=guided: the kernel sets minimum and maximum performance bounds and the platform picks the operating point within them. Often a better mix of snappy response and lower heat on newer Ryzen mobile CPUs.
Test temporarily by editing the GRUB entry at boot, then persist only if beneficial:

# Example persistent change in /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_pstate=guided"

# Regenerate the GRUB config, then reboot
# (Arch: sudo grub-mkconfig -o /boot/grub/grub.cfg)
sudo update-grub
sudo reboot

Validate after each change. Compare baseline temp, package power, and fan behavior before you keep a kernel param for good.
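Before you measure anything, confirm that the parameter actually reached the kernel. The sketch below checks /proc/cmdline and reports the active cpufreq driver. The function names are assumptions for illustration.

```shell
# Succeed if a kernel parameter (bare "name" or "name=value") is on the command line.
# $1 = parameter, $2 = cmdline file (defaults to /proc/cmdline).
has_cmdline_param() {
    awk -v want="$1" '
        { for (i = 1; i <= NF; i++) if ($i == want || index($i, want "=") == 1) found = 1 }
        END { exit !found }
    ' "${2:-/proc/cmdline}"
}

# Print the cpufreq scaling driver in use, or "unknown" if it can't be read.
# $1 = scaling_driver file (defaults to cpu0's).
scaling_driver() {
    cat "${1:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver}" 2>/dev/null || echo unknown
}
```

After the reboot, has_cmdline_param amd_pstate && scaling_driver should succeed and name the driver you expected. If the parameter is missing, the bootloader config was not regenerated.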
Undervolting Intel and AMD CPUs
Undervolting cuts power draw at a given clock by lowering voltage headroom. It works well. Stability testing is a must.
Intel path
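Use intel-undervolt where firmware allows it.

# Package availability varies by distro; build from the project source if yours lacks it
sudo apt install -y intel-undervolt
sudo nano /etc/intel-undervolt.conf

Typical starting points (conservative, offsets in mV):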
undervolt 0 'CPU' -70
undervolt 1 'GPU' -40
undervolt 2 'CACHE' -70
undervolt 3 'SYSTEM_AGENT' 0
undervolt 4 'ANALOG_IO' 0

Apply and test:

sudo systemctl enable --now intel-undervolt
stress-ng --cpu "$(nproc)" --cpu-method all -t 600

AMD path
For Ryzen mobile APUs, power-limit tuning with ryzenadj is often more practical than the classic undervolt sliders.
# Example conservative cap values (hardware-specific)
sudo ryzenadj --stapm-limit=18000 --fast-limit=22000 --slow-limit=20000

These values (in mW) lower sustained package power, and with it the temperature.
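Since ryzenadj takes milliwatts, it's easy to slip a digit. The sketch below builds the command from whole watts and only prints it, so you can review it before running it with sudo. The function name ryzenadj_cmd and the ordering check (STAPM <= slow <= fast) are assumptions made for this example, not rules enforced by ryzenadj.

```shell
# Build a ryzenadj command line from limits given in whole watts.
# $1 = STAPM limit, $2 = fast (short boost) limit, $3 = slow (sustained) limit.
# The command is printed, not executed.
ryzenadj_cmd() {
    local stapm="$1" fast="$2" slow="$3"
    # Sanity check (an assumption of this sketch): stapm <= slow <= fast.
    if [ "$slow" -gt "$fast" ] || [ "$stapm" -gt "$slow" ]; then
        echo "expected stapm <= slow <= fast" >&2
        return 1
    fi
    echo "ryzenadj --stapm-limit=$((stapm * 1000)) --fast-limit=$((fast * 1000)) --slow-limit=$((slow * 1000))"
}
```

ryzenadj_cmd 18 22 20 reproduces the example above. Step the limits down a watt or two at a time and re-run the load test after each change.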
BIOS lock caveat
Many laptops block undervolting due to Plundervolt-era fixes. Check the logs:

sudo dmesg | grep -Ei 'undervolt|voltage|plundervolt'

If the firmware blocks it, don’t chase software-only workarounds that weaken security.
GPU Thermal Management (NVIDIA and AMD)
On hybrid laptops, the discrete GPU is often the hidden heat source. You can have low CPU load and still run hot because the dGPU is stuck in a high-power state.
NVIDIA
Set power limits where supported with nvidia-smi:

nvidia-smi --query-gpu=name,power.draw,power.limit,temperature.gpu --format=csv
# Example: cap board power at 55 W
sudo nvidia-smi -pl 55

For hybrid graphics, use profile switching tools like envycontrol or optimus-manager. Force integrated mode when you don’t need the dGPU. Confirm the runtime state after switching.
AMD dGPU
Inspect and set power performance mode:
# card0 may be the integrated GPU on hybrid systems; adjust the index to match the dGPU
cat /sys/class/drm/card0/device/power_dpm_force_performance_level
echo auto | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

Use low or auto based on workload and how snappy you need things.
Hybrid mode sanity check
- Integrated only for office and dev battery work
- Hybrid for the odd burst of acceleration
- Discrete only for sustained gaming or rendering with cooling headroom
If your dGPU temp stays high at idle, check for compositor or driver issues, or a browser or media pipeline pinning it active by accident.
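To confirm whether the dGPU is really asleep, you can read the PCI runtime power state straight from sysfs. The sketch below lists every display-class PCI device with its vendor ID and runtime_status. The function name gpu_runtime_status is an assumption. The vendor IDs in the usage note are the standard PCI IDs.

```shell
# List display-class PCI devices with vendor ID and runtime PM status.
# $1 = PCI devices directory (defaults to /sys/bus/pci/devices).
gpu_runtime_status() {
    local base="${1:-/sys/bus/pci/devices}" dev class
    for dev in "$base"/*; do
        [ -r "$dev/class" ] || continue
        class="$(cat "$dev/class")"
        case "$class" in
            0x03*) ;;        # PCI base class 0x03 = display controller
            *) continue ;;
        esac
        printf '%s vendor=%s runtime_status=%s\n' "$(basename "$dev")" \
            "$(cat "$dev/vendor" 2>/dev/null)" \
            "$(cat "$dev/power/runtime_status" 2>/dev/null || echo unknown)"
    done
}
```

Vendor 0x10de is NVIDIA, 0x1002 is AMD, and 0x8086 is Intel. On an idle hybrid laptop the discrete GPU should show suspended. If it shows active, something is holding it awake.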
Vendor-Specific Tooling: ThinkPad, MSI, ASUS
Generic Linux power tools aren’t always enough. Vendor controllers can unlock fan curve and platform profile knobs that the generic ACPI paths miss.
ThinkPad: thinkfan
thinkfan lets you set a custom fan policy from thermal sensors. It’s useful on systems where the default fan hysteresis is too lazy.
sudo apt install -y thinkfan
sudo systemctl enable --now thinkfan

Tune /etc/thinkfan.conf carefully and test in small steps so you don’t grind the fan bearings.
MSI: isw (and related MSI tooling)
MSI gaming laptops often expose extra fan and performance knobs through vendor interfaces. isw can manage model-specific features that stock Linux profiles can’t reach.
ASUS: asus-wmi and ecosystem tools
The asus-wmi kernel module plus ASUS userspace tools can expose platform profiles and fan controls on ROG and Zephyrus devices.
Verify module presence:
lsmod | grep asus_wmi

If the missing feature is vendor-specific, no amount of generic tlp tuning can replace it.
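When a vendor driver does expose platform profiles, recent kernels surface them through a generic ACPI interface in sysfs. The sketch below shows the active profile and the available choices. The function name show_platform_profile is an assumption. If the file is missing, your firmware or driver doesn't expose the feature.

```shell
# Show the active ACPI platform profile and the choices the firmware offers.
# $1 = directory holding the two files (defaults to /sys/firmware/acpi).
show_platform_profile() {
    local dir="${1:-/sys/firmware/acpi}"
    if [ ! -r "$dir/platform_profile" ]; then
        echo "platform_profile not exposed"
        return 1
    fi
    echo "current: $(cat "$dir/platform_profile")"
    echo "choices: $(cat "$dir/platform_profile_choices" 2>/dev/null)"
}
```

To switch, write one of the listed choices back to the file, for example echo low-power | sudo tee /sys/firmware/acpi/platform_profile, then re-check temps under the same load.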
Sleep/Resume Thermal Issues
Some laptops act normal after a cold boot but overheat after suspend and resume. That’s a distinct class of bug. Debug it on its own.
Common signs:
- Fans spin hard after wake with no visible workload
- Idle package power doubles after resume
- The dGPU won’t go back to its runtime-suspended state
Check journal around resume:
# Current boot; use -b -1 instead if the machine has rebooted since the bad resume
journalctl -b | grep -Ei 'suspend|resume|acpi|thermal|nvidia|amdgpu|iwlwifi'

Practical fixes:
- Update kernel and firmware first. Many resume regressions are tied to a specific version.
- Reapply power policy on resume with systemd sleep scripts.
- Restart services after wake if they hold a high-power state. If certain units restart slowly and stall the resume path, profiling systemd boot timing can pin down which one blocks.
- Test s2idle vs deep sleep modes if firmware supports both.
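For the last item in the list, the kernel reports the available suspend modes in /sys/power/mem_sleep, with the active one in square brackets (for example "s2idle [deep]"). The sketch below parses that file. The function names are assumptions for illustration.

```shell
# Print the active suspend mode: the bracketed entry in /sys/power/mem_sleep.
# $1 = file to read (defaults to /sys/power/mem_sleep).
active_sleep_mode() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "${1:-/sys/power/mem_sleep}"
}

# Succeed if the given mode (s2idle, shallow, deep) is offered by the platform.
# $1 = mode name, $2 = file to read (defaults to /sys/power/mem_sleep).
sleep_mode_available() {
    tr -d '[]' < "${2:-/sys/power/mem_sleep}" | tr ' ' '\n' | grep -x "$1" >/dev/null
}
```

If sleep_mode_available deep succeeds, you can try it for one session with echo deep | sudo tee /sys/power/mem_sleep. If it behaves better, persist it with the mem_sleep_default=deep kernel parameter.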
Example sleep hook skeleton:
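#!/bin/sh
# Save as /lib/systemd/system-sleep/99-thermal-resume
# systemd runs this with $1 = pre|post and $2 = the sleep type
case "$1/$2" in
    post/*)
        # Reapply TLP settings after wake; adjust the path if tlp lives in /usr/sbin
        /usr/bin/tlp start >/dev/null 2>&1
        ;;
esac

Make executable:

sudo chmod +x /lib/systemd/system-sleep/99-thermal-resume

Physical and BIOS-Level Fixes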
When software is tuned and temps are still too high, you’re probably stuck on heat transfer or airflow.
High-impact physical actions:
- Heatsink and fan cleaning. Dust raises sustained temps a lot.
- Thermal paste swap. Worth doing on any laptop older than two years.
- Fan integrity check. Worn bearings cut airflow even if RPM looks fine.
- Intake and exhaust path check. Soft surfaces choke intake.
BIOS and UEFI actions:
- Update firmware first.
- Use the balanced or cool platform profile if it exists.
- Review fan curve controls when they’re exposed.
- Reset any extreme performance preset you set by accident.
A laptop stand that opens up airflow under the chassis can drop steady-state surface and internal temps for almost nothing.
Troubleshooting Matrix: Symptoms to Root Cause and Fix
| Symptom | Most likely root cause | Verify with | First fix |
|---|---|---|---|
| Hot at idle (70 C+) | Shallow C-states, background wakeups | turbostat, powertop | Governor/pstate tuning, disable noisy wake sources |
| Fans loud after resume | Resume power-policy drift | journalctl, turbostat post-wake | Resume hook to reapply TLP/CPU policy |
| CPU throttles in short bursts | Boost overshoot vs cooling limit | stress-ng + live sensors | Reduce boost/freq cap, improve fan curve |
| GPU hot while browsing | dGPU pinned active in hybrid setup | nvtop, nvidia-smi, DRM runtime state | Integrated mode, fix app forcing dGPU |
| Still throttling after tuning | Poor thermal transfer | Load test plus physical inspection | Clean heatsink, repaste, check fan health |
This table helps you avoid random tweaking. Always move from a measured symptom to a targeted fix.
Special Cases: ARM Laptops and Data Center Hosts
Thermal advice for x86 laptops doesn’t map cleanly to every system.
ARM laptops (including Asahi Linux on Apple Silicon)
Asahi uses a different thermal and power stack from regular ACPI x86 laptops. The tools and interfaces differ. x86 advice like intel_pstate or ryzenadj doesn’t apply.
Focus on:
- Asahi-specific kernel and userspace updates
- Performance profiles the platform actually supports
- Workload-level tweaks: browser and video codec paths, GPU acceleration settings
Follow the Asahi project docs before any generic Linux laptop tweaks.
Data center and rack systems
Server thermals are mostly airflow and power budget problems, not laptop governor problems. Use BMC and IPMI telemetry, rack airflow planning, and platform-level CPU power caps.
Use laptop-style tuning only on actual mobile platforms. On servers, tune fan zones, inlet temperature, and workload placement.
A Safe Implementation Order
If you want a practical plan, use this order:
- Update BIOS or UEFI, and the kernel.
- Capture baseline metrics: idle, load, post-resume.
- Pick one CPU policy manager (tlp or auto-cpufreq) and tune it carefully.
- Fix the dGPU runtime power state if you’ve got hybrid graphics.
- Test kernel param changes one at a time.
- Apply undervolt or power-limit tuning only after baseline policy is stable.
- If severe symptoms remain, clean and repaste.
Re-measure after each step. Thermal work is engineering, not folklore.
Conclusion
Fixing Linux laptop heat is less about a magic command and more about order. You diagnose first, rate the severity, tune policy with care, then move up to firmware and hardware only when the data says so. The fastest path to a cooler machine is steady measurement: package power, temps, fan behavior, and throttle status before and after each change.
For most users, three actions do the heavy lifting. Pick a sane CPU policy. Make sure the discrete GPU can truly sleep. Look after the physical cooling. Add vendor tools when your platform exposes extra knobs. Treat suspend and resume as their own branch when the symptoms point there. With that method, Linux laptops can run cool, quiet, and stable even under heavy load.
Botmonster Tech