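PDB File Format Notes

While working on formatted output of protein PDB files, I found a description of the PDB format online. It is copied and recorded here.

ATOM record fields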

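| Columns | Field | Type, alignment | Description |
| --- | --- | --- | --- |
| 1-6 | ATOM | character, left | Record type (the name "ATOM" fills columns 1-4; columns 5-6 are blank) |
| 7-11 | serial | integer, right | Atom serial number. PDB files organize a structure into four levels: segment, chain, residue, and atom (chain is often unused). Every atom in the file takes its serial number from this five-digit field, so serial numbers run out at 99999. |
| 13-16 | name | character, left | Atom name. The element symbol is right-justified in columns 13-14, so most names start in column 14; only four-character atom names start in column 13. For example, iron (FE) and chlorine (CL) occupy columns 13-14, while carbon (C) is written in column 14 only. |
| 17 | altLoc | character | Alternate location indicator |
| 18-20 | resName | character | Residue name |
| 22 | chainID | character | Chain identifier |
| 23-26 | resSeq | integer, right | Residue sequence number |
| 27 | iCode | character | Code for insertion of residues |
| 28-30 | | | Blank |
| 31-38 | x | float, right, real(8.3) | Orthogonal coordinate for X in Angstroms |
| 39-46 | y | float, right, real(8.3) | Orthogonal coordinate for Y in Angstroms |
| 47-54 | z | float, right, real(8.3) | Orthogonal coordinate for Z in Angstroms |
| 55-60 | occupancy | float, right, real(6.2) | Occupancy |
| 61-66 | tempFactor | float, right, real(6.2) | Temperature factor |
| 67-72 | | | Blank |
| 73-76 | segID | character, left | Segment identifier (optional); VMD uses this field |
| 77-78 | element | character, right | Element symbol |
| 79-80 | charge | character | Charge on the atom (optional). Molecular simulations usually redefine the charges, so this column is often unused. PDB files written by VMD omit it. |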
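The alignment rule for columns 13-16 is easy to get wrong, so here is a minimal Python sketch of it. The function name `format_atom_name` and its `element` argument are assumptions made for illustration; the sketch also assumes the caller knows the element symbol, and it falls back to starting in column 14 when the element is not given.

```python
def format_atom_name(name: str, element: str = "") -> str:
    """Return the 4-character atom-name field for columns 13-16.

    Hypothetical helper: `element` is the element symbol, if known.
    """
    name = name.strip()
    element = element.strip()
    if len(name) >= 4:
        # Four-character names (e.g. 1HG1) start in column 13.
        return name[:4]
    if len(element) == 2 and name.upper().startswith(element.upper()):
        # Two-letter element symbols (FE, CL) fill columns 13-14.
        return name.ljust(4)
    # One-letter element symbols sit in column 14, so column 13 stays blank.
    return " " + name.ljust(3)


print(repr(format_atom_name("CA", "C")))   # alpha carbon
print(repr(format_atom_name("CA", "CA")))  # calcium
```

The two calls show why the element matters: the same name "CA" lands in different columns for an alpha carbon and for a calcium ion.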

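Hydrogen atom conventions

Hydrogen atoms in PDB files follow these conventions:
Hydrogen atoms that appear in ATOM records come after all other atoms of their residue.
Each hydrogen atom is named after the atom it is bonded to. The first position of the name (column 13) is an optional digit, used only when two or more hydrogen atoms are bonded to the same atom. The second position (column 14) is the element symbol H. The next two columns hold the remoteness and branch indicators (1 or 2 characters) of the atom the hydrogen is bonded to.
An example: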

12345678901234567890123456789012345678901234567890123456789012345678901234567890
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
ATOM      1  N   VAL     1     -13.090   1.966   9.741  1.00  0.00
ATOM      2  CA  VAL     1     -12.852   3.121   8.892  1.00  0.00
ATOM      3  C   VAL     1     -13.047   4.399   9.711  1.00  0.00
ATOM      4  O   VAL     1     -12.143   5.228   9.800  1.00  0.00
ATOM      5  CB  VAL     1     -13.753   3.058   7.658  1.00  0.00
ATOM      6  CG1 VAL     1     -13.930   4.446   7.036  1.00  0.00
ATOM      7  CG2 VAL     1     -13.208   2.063   6.631  1.00  0.00
ATOM      8  H   VAL     1     -13.919   1.449   9.527  1.00  0.00
ATOM      9  HA  VAL     1     -11.816   3.075   8.557  1.00  0.00
ATOM     10  HB  VAL     1     -14.734   2.707   7.977  1.00  0.00
ATOM     11 1HG1 VAL     1     -13.951   4.357   5.950  1.00  0.00
ATOM     12 2HG1 VAL     1     -14.866   4.883   7.384  1.00  0.00
ATOM     13 3HG1 VAL     1     -13.098   5.085   7.333  1.00  0.00
ATOM     14 1HG2 VAL     1     -12.623   1.298   7.142  1.00  0.00
ATOM     15 2HG2 VAL     1     -14.039   1.594   6.104  1.00  0.00
ATOM     16 3HG2 VAL     1     -12.575   2.588   5.917  1.00  0.00

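In the example above, all hydrogen atoms come after the other atoms of the residue. Atom 9 (HA) is bonded to atom 2 (CA), and the two share the remoteness indicator A. Three hydrogen atoms are bonded to CG1; they have the same remoteness and branch indicators, but a distinguishing digit in column 13 gives each one a unique name. When only one hydrogen atom is bonded to a given atom, no numeric prefix is needed.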
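As a rough illustration of this naming scheme, the sketch below splits a hydrogen name into its optional digit, remoteness indicator, and branch indicator. The function name `parse_hydrogen_name` and the returned dictionary keys are made up for this example, and the pattern only covers the convention described here (names such as H, HA, 1HG1); other naming schemes return None.

```python
import re

# Optional digit (column 13), element symbol H (column 14), then an optional
# remoteness letter and an optional branch digit (columns 15-16).
_H_NAME = re.compile(r"^(\d?)H([A-Z]?)(\d?)$")


def parse_hydrogen_name(name: str):
    """Split a hydrogen atom name into its parts, or return None."""
    match = _H_NAME.match(name.strip())
    if match is None:
        return None
    digit, remoteness, branch = match.groups()
    return {"digit": digit, "remoteness": remoteness, "branch": branch}


for atom_name in ("HA", "1HG1", "CG1"):
    print(atom_name, parse_hydrogen_name(atom_name))
```

For 1HG1 the remoteness and branch parts are G and 1, which matches the heavy atom CG1 it is bonded to.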

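Output control in Python

The formatting below writes the atom name centered in columns 13-16; it does not apply the rule about starting in column 13 or column 14.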

| Field | Definition | Length | Format | Columns | Python slice |
| --- | --- | --- | --- | --- | --- |
| 1 | "ATOM  " or "HETATM" | 6 | {:6s} | 01-06 | [0:6] |
| 2 | atom serial number | 5 | {:5d} | 07-11 | [6:11] |
| 3 | atom name | 4 | {:^4s} | 13-16 | [12:16] |
| 4 | alternate location indicator | 1 | {:1s} | 17 | [16:17] |
| 5 | residue name | 3 | {:3s} | 18-20 | [17:20] |
| 6 | chain identifier | 1 | {:1s} | 22 | [21:22] |
| 7 | residue sequence number | 4 | {:4d} | 23-26 | [22:26] |
| 8 | code for insertion of residues | 1 | {:1s} | 27 | [26:27] |
| 9 | orthogonal coordinates for X (in Angstroms) | 8 | {:8.3f} | 31-38 | [30:38] |
| 10 | orthogonal coordinates for Y (in Angstroms) | 8 | {:8.3f} | 39-46 | [38:46] |
| 11 | orthogonal coordinates for Z (in Angstroms) | 8 | {:8.3f} | 47-54 | [46:54] |
| 12 | occupancy | 6 | {:6.2f} | 55-60 | [54:60] |
| 13 | temperature factor | 6 | {:6.2f} | 61-66 | [60:66] |
| 14 | element symbol | 2 | {:>2s} | 77-78 | [76:78] |
| 15 | charge on the atom | 2 | {:2s} | 79-80 | [78:80] |
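The slices in the table translate directly into a small parser. The sketch below is only an illustration: the function name `parse_atom_line` and the dictionary keys are assumptions, and it assumes the numeric columns are filled in (a blank occupancy field, for instance, would make `float()` fail).

```python
def parse_atom_line(line: str) -> dict:
    """Parse one ATOM/HETATM record using the fixed column slices."""
    # Pad short lines so the optional trailing fields slice to blanks.
    line = line.rstrip("\n").ljust(80)
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "altLoc": line[16:17].strip(),
        "resName": line[17:20].strip(),
        "chainID": line[21:22].strip(),
        "resSeq": int(line[22:26]),
        "iCode": line[26:27].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "tempFactor": float(line[60:66]),
        "element": line[76:78].strip(),
        "charge": line[78:80].strip(),
    }


example = "ATOM      2  CA  VAL     1     -12.852   3.121   8.892  1.00  0.00"
atom = parse_atom_line(example)
print(atom["name"], atom["resName"], atom["x"], atom["y"], atom["z"])
```

Slicing by column keeps working when the chain ID is missing or when adjacent fields touch, which is where whitespace splitting breaks down.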

Output format

"{:6s}{:5d} {:^4s}{:1s}{:3s} {:1s}{:4d}{:1s}   {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2s}{:2s}".format(...)
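To make the `.format(...)` call concrete, here is a small sketch that fills the same format string with sample values. The constant name `PDB_ATOM_FORMAT` and the sample values (chain A, element C) are assumptions for illustration; empty strings stand in for the blank altLoc, iCode, and charge fields.

```python
PDB_ATOM_FORMAT = (
    "{:6s}{:5d} {:^4s}{:1s}{:3s} {:1s}{:4d}{:1s}   "
    "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2s}{:2s}"
)

line = PDB_ATOM_FORMAT.format(
    "ATOM",   # record type, padded to 6 characters
    2,        # atom serial number
    "CA",     # atom name, centered in columns 13-16
    "",       # alternate location indicator (blank)
    "VAL",    # residue name
    "A",      # chain identifier (sample value)
    1,        # residue sequence number
    "",       # insertion code (blank)
    -12.852, 3.121, 8.892,  # x, y, z in Angstroms
    1.00,     # occupancy
    0.00,     # temperature factor
    "C",      # element symbol, right-justified in columns 77-78
    "",       # charge (blank)
)
print(line)
print(len(line))
```

The resulting record is exactly 80 characters wide, so each field can be checked against the column table with a plain slice.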

Output control in awk

awk '{printf("%-6s%5d %-4s %3s A%4d    %8.3f%8.3f%8.3f\n",$1,$2,$3,$4,$5,$6,$7,$8)}' "${fd}/${pro}.pdb"
awk '{printf("%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f          %2s%-2s\n",$1,$2,...)}'

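Some awk scripts for processing PDB files

These scripts all come from bougui505's blog.

Formatting a PDB file

The following script keeps only these fields of a PDB file: [ATOM] [Atom serial number] [Atom name] [Residue name] [Chain identifier] [Residue sequence number] [x] [y] [z]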

cat > read_pdb.awk

#!/usr/bin/awk -f
# -*- coding: UTF8 -*-

# Author: Guillaume Bouvier -- guillaume.bouvier@pasteur.fr
# https://research.pasteur.fr/en/member/guillaume-bouvier/
# 2017-03-15 09:04:14 (UTC+0100)

BEGIN{
    # Reading Fixed-Width Data (see: https://goo.gl/SmjwUt)
    # FIELDWIDTHS is a gawk extension, so this script needs gawk
    FIELDWIDTHS = "6 5 1 4 1 3 1 1 4 1 3 8 8 8 6 6 6 4"
    # $2: Atom serial number
    # $4: Atom type
    # $5: altLoc; alternate location indicator.
    # $6: Resname
    # $8: ChainID
    # $9: Resid
    # $12: x
    # $13: y
    # $14: z
}

{
    # $1 is the full 6-character record field, so "ATOM" carries two trailing spaces
    if ($1 == "ATOM  "){
        printf("%-6s%5s%1s%4s%1s%3s%1s%1s%4s%1s%3s%8s%8s%8s\n", $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14)
    }
}

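Adding MODEL separators to a PDB file that contains multiple models

function mspdb () {
    # Start a new MODEL whenever the atom serial number is 1; only a bare END line becomes ENDMDL
    grep -v 'CRYST1' "$1" | awk '{if ($2==1){c+=1;print "MODEL "c}{print $0}}' | sed 's/^END[[:space:]]*$/ENDMDL/' > /dev/shm/tmp.pdb && mv -f /dev/shm/tmp.pdb "$1"
}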

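Computing the geometric center of a PDB file

Note: the field numbers in this command may need to be changed depending on whether the PDB file contains a chain ID.
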
awk '{if ($1 == "ATOM") {sx+=$6;sy+=$7;sz+=$8;n+=1}} END {print sx/n,sy/n,sz/n}' model1.pdb
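Because the whitespace-separated field numbers shift when the chain ID is present or absent, reading the coordinates by fixed column avoids the issue. The following is a minimal Python sketch of the same calculation; the function name `geometric_center` is an assumption, and the commented file name is simply the one used in the awk command.

```python
def geometric_center(lines):
    """Average the x, y, z columns (31-38, 39-46, 47-54) of all ATOM records."""
    sx = sy = sz = 0.0
    n = 0
    for line in lines:
        if line.startswith("ATOM"):
            sx += float(line[30:38])
            sy += float(line[38:46])
            sz += float(line[46:54])
            n += 1
    if n == 0:
        raise ValueError("no ATOM records found")
    return sx / n, sy / n, sz / n


# Possible usage:
# with open("model1.pdb") as handle:
#     print(geometric_center(handle))
```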

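Removing hydrogen atoms from a PDB file

mktemp -p /dev/shm creates a uniquely named temporary file, which matters when many PDB files are processed in parallel.
The function overwrites the original file; edit the final mv $tmpfile $1 step to change that.

function stripH () {
    tmpfile=$(mktemp -p /dev/shm)
    # Match only names that start with H or with a digit followed by H,
    # so heavy atoms such as OH, NH1 or CH2 are kept
    awk '{if ($1=="ATOM" && $3 !~ /^[0-9]?H/) {c+=1; printf("%-6s%5s %4s %3s %s%4s    %8s%8s%8s\n", $1,c,$3,$4,$5,$6,$7,$8,$9)} \
    else if ($1=="MODEL" || $1=="ENDMDL") {c=0; print $0}}' "$1" > "$tmpfile" && mv "$tmpfile" "$1"
}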
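A column-based variant in Python may be easier to adapt. This is a sketch under stated assumptions: the names `is_hydrogen` and `strip_hydrogens` are made up, the element column (77-78) is trusted when it is filled in, and otherwise the old-style naming rule is applied (H in column 14, with column 13 blank or a digit). Unlike the shell function, it does not renumber the remaining atoms.

```python
def is_hydrogen(line: str) -> bool:
    """Guess whether an ATOM/HETATM record describes a hydrogen atom."""
    element = line[76:78].strip().upper()
    if element:
        return element == "H"
    # No element column: fall back to the atom-name field (columns 13-16).
    name = line[12:16].ljust(4)
    return name[1] == "H" and (name[0] == " " or name[0].isdigit())


def strip_hydrogens(lines):
    """Return the lines with hydrogen coordinate records removed."""
    kept = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and is_hydrogen(line):
            continue
        kept.append(line)
    return kept
```

Checking the position of the H rather than searching the whole name keeps heavy atoms such as the tyrosine OH or the arginine NH1 and NH2.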

Printing a selection from a PDB file with awk

cat > pdbselect.sh 

#!/usr/bin/env sh
# -*- coding: UTF8 -*-

# Author: Guillaume Bouvier -- guillaume.bouvier@pasteur.fr
# https://research.pasteur.fr/en/member/guillaume-bouvier/
# 2017-04-20 16:05:00 (UTC+0200)

usage ()
{
    echo "Usage"
    echo "$0 'selection' pdbfile.pdb"
    echo "Selection string can be:"
    echo " • chain X ; with X a chain id (A, B, C, ...)"
    echo " • name X ; with X an atom-name (CA, CB, ...)"
    echo " • resname X ; with X a residue name (GLY, ASP, GLU, ...)"
    echo " • MODEL X ; with X a model id"
    echo " • protein ; select only the protein (20 standard amino acid residues)"
    echo "It's possible to get negative selection with '!':"
    echo "E.g.: 'chain !A' will select all the chains except chain A"
    echo "It's possible to use the 'or' boolean operator:"
    echo "E.g.: 'name CA or CB'"
    exit
}

# Both the selection string and the PDB file are required
if [ $# -lt 2 ]; then
    usage
fi

SELECTION=$1
PDB=$2

awk -v SELECTION="$SELECTION" '

function select(pdbfield,value,negate){
    # Function to select a given pdbfield and print the corresponding lines
    # If negate is 1, negate the selection
    if (negate){
        if (pdbfield!=value){
            atomid+=1
            printf("%-6s%5s%1s%4s%1s%3s%1s%1s%4s%1s%3s%8s%8s%8s%6s%6s%12s\n", $1,atomid,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17)
        }
    }
    else{
        if (pdbfield==value){
            atomid+=1
            printf("%-6s%5s%1s%4s%1s%3s%1s%1s%4s%1s%3s%8s%8s%8s%6s%6s%12s\n", $1,atomid,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17)
        }
    }
}

BEGIN{
    NEGATE=0
    N=split(SELECTION, S, " ") # N is the number of words in the selection string
    FIELD=S[1]
    VALUE=S[2]
    N=split(VALUE, S, "!") # Split negation (if presents)
    if (N>1){ #Negation
        NEGATE=1
        VALUE=S[2]
    }

    # FIELDWIDTHS is a gawk extension; fields keep their padding
    FIELDWIDTHS = "6 5 1 4 1 3 1 1 4 1 3 8 8 8 6 6 12"
    # $2: Atom serial number
    # $4: Atom type
    # $5: altLoc; alternate location indicator.
    # $6: Resname
    # $8: ChainID
    # $9: Resid
    # $12: x
    # $13: y
    # $14: z
    # $17: Element symbol
    atomid=0
}

{
    if (FIELD == "MODEL"){
        if ($1 == "MODEL " && $4 == VALUE){
            GETLINE=1
        }
        if (GETLINE==1 && $1 == "ENDMDL"){
            GETLINE=0
            exit 0
        }
        # The record field is 6 characters wide: ATOM plus two spaces
        if (GETLINE==1 && $1 == "ATOM  "){
            print $0
        }
    }
    if ($1 == "ATOM  "){
        if (SELECTION == "protein"){
            SELECTION="resname ARG or HIS or LYS or ASP or GLU or SER or THR or ASN or GLN or CYS or GLY or PRO or ALA or VAL or ILE or LEU or MET or PHE or TYR or TRP"
        }
        N=split(SELECTION, S, " | or ") # N is the number of words in the selection string
        FIELD=S[1]
        for (i in S){
            NEGATE=0
            if (i>1){
                VALUE=S[i]
                N2=split(VALUE, V, "!") # Split negation (if presents)
                if (N2>1){ #Negation
                    NEGATE=1
                    VALUE=V[2]
                }
                if (FIELD=="chain"){
                    CHAINID=$8
                    select(CHAINID,VALUE,NEGATE)
                }
                if (FIELD=="name"){
                    ATOM = $4
                    gsub(/ /, "", ATOM)
                    select(ATOM,VALUE,NEGATE)
                }
                if (FIELD=="resname"){
                    RESNAME = $6
                    select(RESNAME,VALUE,NEGATE)
                }
            }
        }
    }
}

END {
    print "END"
}' "$PDB"

Renumbering atoms or residues

cat >pdb_renumber.sh 

#!/usr/bin/env sh
# -*- coding: UTF8 -*-

# Author: Guillaume Bouvier -- guillaume.bouvier@pasteur.fr
# https://research.pasteur.fr/en/member/guillaume-bouvier/
# 2017-04-20 16:05:00 (UTC+0200)

usage ()
{
    echo "Usage"
    echo "$0 'selection' pdbfile.pdb"
    echo "Selection string can be:"
    echo " • residues: renumber the resids of the pdb"
    echo " • atoms: renumber the atom numbers"
    exit
}

if [ $# -ne 2 ]; then
    usage
fi

SELECTION=$1
PDB=$2

awk -v SELECTION="$SELECTION" '

function printpdb(selection){
    if (selection=="atoms"){
        atomid+=1
        printf("%-6s%5s%1s%4s%1s%3s%1s%1s%4s%1s%3s%8s%8s%8s%6s%6s%6s%4s\n", $1,atomid,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18)
    }
    if (selection=="residues"){
        if ($9 != resid_prev){
            resid+=1
            resid_prev = $9
        }
        printf("%-6s%5s%1s%4s%1s%3s%1s%1s%4s%1s%3s%8s%8s%8s%6s%6s%6s%4s\n", $1,$2,$3,$4,$5,$6,$7,$8,resid,$10,$11,$12,$13,$14,$15,$16,$17,$18)
    }
}

BEGIN{
    # FIELDWIDTHS is a gawk extension; fields keep their padding
    FIELDWIDTHS = "6 5 1 4 1 3 1 1 4 1 3 8 8 8 6 6 6 4"
    # $2: Atom serial number
    # $4: Atom type
    # $5: altLoc; alternate location indicator.
    # $6: Resname
    # $8: ChainID
    # $9: Resid
    # $12: x
    # $13: y
    # $14: z
    atomid=0
    resid=0
    resid_prev=0
}

{
    # Renumber coordinate records only; pass every other record through unchanged
    if ($1 == "ATOM  " || $1 == "HETATM"){
        printpdb(SELECTION)
    }
    else{
        print $0
    }
}' "$PDB"
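For comparison, renumbering atom serials by column takes only a few lines of Python. This is a sketch, not a port of the script above: the function name `renumber_atoms` and the `start` parameter are assumptions, only columns 7-11 of ATOM and HETATM records are rewritten, and CONECT records that refer to the old serials are not updated.

```python
def renumber_atoms(lines, start=1):
    """Rewrite the serial number (columns 7-11) of ATOM/HETATM records."""
    serial = start
    renumbered = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            # Keep the record name and everything from column 12 onward.
            renumbered.append(f"{line[:6]}{serial:5d}{line[11:]}")
            serial += 1
        else:
            renumbered.append(line)
    return renumbered
```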

Splitting a PDB file that contains multiple models

awk '$0 ~ /ATOM      1/ {i++} {print >> "pdbs/out_"i".pdb"} {fflush("pdbs/out_"i".pdb")}' filename.pdb

The csplit command also works. The example below splits a PDB file containing 50000 models, where each model in the original file is delimited by MODEL X and ENDMDL.

csplit -z -f /dev/shm/docking_ -n 5 docking_results.pdb '/ENDMDL/1' '{49999}'

`-n`: number of digits
`-z`: remove empty output files

The output files are:

docking_00000
docking_00001
...
docking_49999

To add a suffix (for example .pdb):

csplit -z -f /dev/shm/docking_ -b '%05d.pdb' docking_results.pdb '/ENDMDL/1' '{49999}'

The final script:

function splitpdb () {
    n_models=$(grep -c MODEL "$1")
    csplit -z -f /dev/shm/model_ -b '%04d.pdb' "$1" '/ENDMDL/1' "{$(expr $n_models - 1)}"
}

To put 100 models in each output file:

awk '$0 ~ /ATOM      1/ {i++} {print >> "pdbs/smap_"int(i/100)".pdb"} {fflush("pdbs/smap_"int(i/100)".pdb")}' smap.pdb
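The same batching can be sketched in Python when the models are delimited by ENDMDL. The function names `iter_models` and `batched_models` are assumptions for illustration. The sketch treats every line up to and including ENDMDL as one model and drops anything after the last ENDMDL (such as a final END line).

```python
from itertools import islice


def iter_models(lines):
    """Yield each block of lines that ends with an ENDMDL record."""
    block = []
    for line in lines:
        block.append(line)
        if line.startswith("ENDMDL"):
            yield block
            block = []


def batched_models(lines, size):
    """Yield lists of up to `size` models, in file order."""
    models = iter_models(lines)
    while True:
        batch = list(islice(models, size))
        if not batch:
            return
        yield batch


# Possible usage, writing 100 models per output file:
# with open("smap.pdb") as handle:
#     for index, batch in enumerate(batched_models(handle, 100)):
#         with open(f"pdbs/smap_{index}.pdb", "w") as out:
#             for model in batch:
#                 out.writelines(model)
```

Counting complete models rather than occurrences of atom serial 1 means every output file holds the same number of models, apart from the last one.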


Ref: