PostgreSQL CPU占用100% 问题分析
1 使用nmon查看系统资源,或如下ps命令
ps auxw --sort=%cpu|tail -10
使用如下sql确定相应sql
SELECT procpid, START, now( ) - START AS lap, current_query FROM ( SELECT backendid, pg_stat_get_backend_pid ( S.backendid ) AS procpid, pg_stat_get_backend_activity_start ( S.backendid ) AS START, pg_stat_get_backend_activity ( S.backendid ) AS current_query FROM ( SELECT pg_stat_get_backend_idset ( ) AS backendid ) AS S ) AS S WHERE current_query <> '<IDLE>' AND procpid = 121231 --相应操作系统PID ORDER BY lap DESC;
2 追踪慢SQL
突然间的慢,基本都与特定的SQL的执行计划发生变化有关,如何确定慢SQL或者说问题SQL,有如下几种方法.
1)第一种方法是,pg_stat_statements插件
使用前,先复位下这个插件的统计信息
select pg_stat_reset(); select pg_stat_statements_reset();
等待一段时间(例如1分钟),使计数器积累足够的信息,然后查询最耗时的SQL(一般就是导致问题的直接原因)。
select userid::regrole, dbid, total_time,query from pg_stat_statements order by total_time desc limit 20;
查询读取Buffer次数最多的SQL,这些SQL可能由于所查询的数据没有索引,而导致了过多的Buffer读,也同时大量消耗了CPU。
SELECT userid :: regrole, dbid, ( shared_blks_hit + shared_blks_read ) AS shared_buffer_read, query FROM pg_stat_statements ORDER BY shared_blks_hit + shared_blks_read DESC LIMIT 5;
2)第二种方法是,直接通过pg_stat_activity视图
利用下面的查询,查看当前长时间执行,一直不结束的SQL。这些SQL对应造成CPU满,也有直接嫌疑。
SELECT 'select pg_terminate_backend(' || pid || ');' AS KILL, backend_xid, datname, usename, client_addr, application_name, STATE, backend_start, xact_start, xact_stay, query_start, query_stay, REPLACE ( QUERY, chr ( 10 ), ' ' ) AS QUERY FROM ( SELECT pgsa.pid, pgsa.datname AS datname, pgsa.usename AS usename, pgsa.client_addr client_addr, pgsa.application_name AS application_name, pgsa.STATE AS STATE, pgsa.backend_xid, pgsa.backend_start AS backend_start, pgsa.xact_start AS xact_start, EXTRACT ( epoch FROM ( now( ) - pgsa.xact_start ) ) AS xact_stay, pgsa.query_start AS query_start, EXTRACT ( epoch FROM ( now( ) - pgsa.query_start ) ) AS query_stay, pgsa.QUERY AS QUERY FROM pg_stat_activity AS pgsa WHERE pgsa.STATE != 'idle' AND pgsa.STATE != 'idle in transaction' AND pgsa.STATE != 'idle in transaction (aborted)' ) idleconnections WHERE backend_xid IS NULL -- and query like '%V1.organization_no%' ORDER BY query_stay DESC LIMIT 50;
3)第3种方法,是从数据表上表扫描(Table Scan)的信息开始查起
查找缺失索引的表。数据表如果缺失索引,数据库只能使用表扫描,并需要处理已在内存中的大量的无关记录,而耗费大量CPU。
通过下面的查询,查出使用表扫描最多的表:
select * from pg_stat_user_tables where n_live_tup > 100000 and seq_scan > 0 order by seq_tup_read desc limit 10;
查询当前正在运行的访问到上述表的慢查询:
select * from pg_stat_activity where query ilike '%<table name>%' and query_start - now() > interval '10 seconds';
也可以通过pg_stat_statements插件定位涉及到这些表的查询:
select * from pg_stat_statements where query ilike '%<table>%'order by shared_blks_hit+shared_blks_read desc limit 10;
3 处理慢SQL
对于上面的方法查出来的慢SQL,首先需要做的可能是Cancel或Kill掉他们,使业务先恢复:
select pg_cancel_backend(pid) from pg_stat_activity where query like '%<query text>%' and pid != pg_backend_pid(); select pg_terminate_backend(pid) from pg_stat_activity where query like '%<query text>%' and pid != pg_backend_pid();
4 优化慢SQL
如果这些SQL确实是业务上必需的,则需要对他们做优化,这方面有“三板斧”:
1)对查询涉及的表,执行ANALYZE table_name或VACUUM ANZLYZE table_name,更新表的统计信息,使查询计划更准确。注意,为避免对业务影响,最好在业务低峰执行。
2)执行explain (query text)或explain (buffers true, analyze true, verbose true) (query text)命令,查看SQL的执行计划(注意,前者不会实际执行SQL,后者会实际执行而且能得到详细的执行信息),对其中的Table Scan涉及的表,建立索引。
3)重新编写SQL,去除掉不必要的子查询、改写UNION ALL、使用JOIN CLAUSE固定连接顺序等到,都是进一步深度优化SQL的手段。
5 其它常用SQL
1)那么如何知道有哪些表、哪些进程被锁住了?
可以用如下SQL查出来:
SELECT * FROM pg_locks A JOIN pg_class b ON A.relation = b.oid JOIN pg_stat_activity C ON A.pid = C.pid WHERE A.MODE LIKE'%ExclusiveLock%';
杀掉指定表指定锁的进程
SELECT pg_cancel_backend ( A.pid ) FROM pg_locks A JOIN pg_class b ON A.relation = b.oid JOIN pg_stat_activity C ON A.pid = C.pid WHERE b.relname ILIKE'表名' AND A.MODE LIKE'%ExclusiveLock%'; --或者使用更加霸道的pg_terminate_backend(): SELECT pg_terminate_backend ( A.pid ) FROM pg_locks A JOIN pg_class b ON A.relation = b.oid JOIN pg_stat_activity C ON A.pid = C.pid WHERE b.relname ILIKE'表名' AND A.MODE LIKE'%ExclusiveLock%';
2)查看enqueue锁
SELECT pid, datname, usename, wait_event_type, wait_event, pg_blocking_pids ( pid ) AS blocked_by, query FROM pg_stat_activity WHERE--wait_event IS NOT NULL STATE != 'idle' AND STATE != 'idle in transaction' AND STATE != 'idle in transaction (aborted)';
近期评论