API References¶
Important
This project is renamed from pdbx2df. Please go to its documentation for historical features.
PDBx Reader¶
PDBx/mmCIF format reading.
Reads a PDBx file into a dict of Pandas DataFrame s.
The dict keys are read from the mmCIF category names automatically.
For each category, the column names of the corresponding DataFrame
are read form the category attributes automatically.
For example:
_audit_conform.dict_name mmcif_pdbx.dic
_audit_conform.dict_version 5.355
_audit_conform.dict_location http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic
In this category, the category name is _audit_conform, so the returned dict
key is _audit_conform. The category attributes are dict_name,
dict_version, and dict_location, so the returned dict value as a
DataFrame has the exact column names.
- moldf.read_pdbx.AF2_MODEL = 4¶
For AlphaFold structures, the version to use.
- moldf.read_pdbx.read_pdbx(pdbx_file: str | PathLike[str] | None = None, pdb_id: str | None = None, save_pdbx_file: bool = True, pdbx_file_dir: str | PathLike | None = None, category_names: list | None = None, convert_dtype: bool = False) dict[str, DataFrame][source]¶
Reads a
.ciffile’s categories into adictofPandas DataFrames.- Parameters:
pdb_id (optional) – PDB/Uniprot ID. Required if
pdbx_fileisNone. Defaults to None.pdbx_file (optional) – file name for a PDBx/mmCIF file. Used over pdb_id. Defaults to None.
category_names (optional) – a list of categories in the mmCIF file format. If
None,allis used and all categories will be processed. Defaults to None.save_pdbx_file (optional) – whether to save the fetched PDBx file from RCSB to
pdbx_file_dir. Defaults to False.pdbx_file_dir (optional) – directory to save fetched PDBx files. If
Nonebutsave_pdbx_fileisTrue, ‘./PDBx_files’ is used. Defaults to None.convert_dtype (optional) – whether to convert the data types according to the RCSB mmcif specifications. Defaults to False.
- Returns:
A dict of
Pandas DataFrames corresponding to required categories.- Raises:
ValueError – if none of
pdb_idorpdbx_fileis provided, or ifpdb_idis given but cannot the PDB file cannot be downloaded from RCSB, or the PDB file is corrupted like no end-line symbol, or some content is irregular.FileNotFoundError – if
pdbx_filecannot be found.
PDB Reader¶
PDB format reading.
Reads a PDB file, including Chimera compatible ones, into a dict of
Pandas DataFrame s.
Atom coordinates and sequences are currently supported by the following category names:
_atom_site:ATOM,HETATM,TER,NUMMDL,MODEL, andENDMDLlines.
_seq_res:SEQRESlines.
- moldf.read_pdb.IMPLEMENTED_PDB_CATS = ['_atom_site', '_seq_res', '_chem_comp']¶
PDB categories that are currently implemented.
- moldf.read_pdb.ATOM_SITE = ('ATOM', 'HETATM', 'TER')¶
_atom_siteprimary lines.
- moldf.read_pdb.NMR_MDL = ('NUMMDL', 'MODEL', 'ENDMDL')¶
_atom_siteadditional lines.
- moldf.read_pdb.CHEM_COMP = ('HET ', 'HETNAM', 'FORMUL', 'HETSYN')¶
_chem_complines.
- moldf.read_pdb.AF2_MODEL = 4¶
For AlphaFold structures, the version to use.
- moldf.read_pdb.read_pdb(pdb_file: str | PathLike[str] | None = None, pdb_id: str | None = None, category_names: list | None = None, save_pdb_file: bool = True, pdb_file_dir: str | PathLike | None = None, allow_chimera: bool = True, need_ter_lines: bool = True) dict[str, DataFrame][source]¶
Reads a
.pdbfile’s categories into adictofPandas DataFrames.- Parameters:
pdb_id (optional) – PDB/Uniprot ID. Required if
pdb_fileisNone. Defaults to None.pdb_file (optional) – file name for a PDB file. Used over pdb_id. Defaults to None.
category_names (optional) –
a list of categories similar to the mmCIF format. If
None,_atom_siteis used. To be consistent with the PDBx format, the following category names are used to refer to block(s) in a PDB file and only they are supported:1.
_atom_site:ATOM,HETATM, andTERlines and possibleNUMMDL,MODEL, andENDMDLlines._seq_res:SEQRESlines._chem_comp: all lines inCHEM_COMPabove. However, onlycompounds included in the ``HET `` lines are extracted which usually means the water molecule is not included.
Defaults to None.
save_pdb_file (optional) – whether to save the fetched PDB file from RCSB to
pdb_file_dir. Defaults to True.pdb_file_dir (optional) – directory to save fetched PDB files. If
Nonebutsave_pdb_fileisTrue, ‘./PDB_files’ is used. Defaults to None.allow_chimera (optional) – whether to allow Chimera-formatted PDB files. Defaults to True.
need_ter_lines (optional) – whether to read the
TERlines into theDataFrame. Defaults to True.
- Returns:
A dict of
Pandas DataFrames corresponding to required categories.- Raises:
ValueError – if none of
pdb_idorpdbx_fileis provided, or ifpdb_idis given but cannot the PDB file cannot be downloaded from RCSB,NotImplementedError – if category_names not a subset of allowed names.
FileNotFoundError – if
pdbx_filecannot be found.
- moldf.read_pdb._split_atom_line(line: str, nmr_model: int = -1, allow_chimera: bool = True, is_ter_line: bool = False) tuple[source]¶
Internal function to parse a single line belonging to
ATOM,HETATM, orTERlines.- Parameters:
line (required) – A
ATOM,HETATM, orTERline.nmr_model (optional) – the NMR model number for the line;
-1means not an NMR model. Defaults to -1.allow_chimera (optional) – try to parse as a Chimera-formatted PDB file. Defaults to True
is_ter_line (optional) – whether the line starts with
TER. Defaults to False.
- Returns:
parsed values as a tuple.
MOL2 Reader¶
Mol2 format reading.
Read a Tripos .mol2 file into a dictionary of pandas DataFrames. Different
categories like ‘ATOM’ and ‘BOND’ are read into different DataFrame objects.
- moldf.read_mol2.IMPLEMENTED_MOL2_CATS = ['ATOM', 'MOLECULE', 'BOND', 'HEADER']¶
MOL2 categories that are currently implemented.
- moldf.read_mol2.ATOM_COL_NAMES = ('atom_id', 'atom_name', 'x', 'y', 'z', 'atom_type', 'subst_id', 'subst_name', 'charge', 'status_bit')¶
MOL2
ATOMcolumn names.
- moldf.read_mol2.BOND_COL_NAMES = ('bond_id', 'origin_atom_id', 'target_atom_id', 'bond_type', 'status_bit')¶
MOL2
BONDcolumn names.
- moldf.read_mol2.read_mol2(mol2_file: str | PathLike, category_names: list | None = None) dict[str, DataFrame][source]¶
Reads a
.mol2file’s categories into adictofPandas DataFrames.- Parameters:
mol2_file (required) – file name for a PDB file.
category_names (optional) – a list of categories as to the
.mol2file format. IfNone, ['ATOM','MOLECULE','BOND','HEADER'] is used. Defaults to None.
- Returns:
A dict of
category_nameas keys(s) andpd.DataFrameas values.- Raises:
NotImplementedError – if
category_namesnot a subset of ['ATOM','MOLECULE','BOND','HEADER']
- moldf.read_mol2._get_header_df(header_lines: list[tuple]) DataFrame[source]¶
Turns the
HEADERlines into aPandas DataFrame. TheHEADERlines are those starting with one or multiple#symbols.- Parameters:
header_lines (required) – a list of tuples corresponding to each line’s content. Tuples are generate by splitting the lines by
:.- Returns:
Pandas DataFrameof TheHEADERcategory
- moldf.read_mol2._get_molecule_df(molecule_lines: list[tuple]) DataFrame[source]¶
Turns the
MOLECULElines into aPandas DataFrame.- Parameters:
molecule_lines (required) – a list of tuples corresponding to each line’s content.
- Returns:
Pandas DataFrameof TheMOLECULEcategory
JCSV Reader¶
JCSV format reading.
Reads a JCSV file into a dict of Pandas DataFrame s.
It is not limited to any molecular format.
- moldf.read_jcsv.read_jcsv(jcsv_file: str | PathLike, category_names: list | None = None) dict[str, DataFrame][source]¶
Reads a JCSV file by name.
Currently no molecular file repository has JCSV files so we can only read from a file name/path.
- Parameters:
jcsv_file (required) – JCSV file name/path.
category_names (optional) – a list of category names. If
None, all categories are read. Defaults to None.
- Returns:
a dict of Pandas DataFrames for each category.
- Raises:
TypeError – if
category_namesis not a list of strings.ValueError – if any of the
category_nameshas double quotes or if the number of items in any line does not match the number of column names in the same category.
- moldf.read_jcsv._read_jcsv_by_line(jcsv_file: str | PathLike, category_names: list | None = None) dict[str, DataFrame][source]¶
Reads JCSV file line by line when the file has no meta data to select blocks.
- Parameters:
jcsv_file (required) – JCSV file name/path.
category_names (optional) – a list of category names. If
None, all categories are read. Defaults to None. It is passed by theread_jcsvcaller, so it is not sanitized here.
- Returns:
a dict of Pandas DataFrames for each category.
- Raises:
ValueError – if the number of items in any line does not match the number of column names in the same category.
- moldf.read_jcsv._count_n_lines(file_name: str | PathLike)[source]¶
Gets the number of lines in a file. From https://stackoverflow.com/a/68385697/10094189
- Parameters:
file_name (required) – file name or path.
PDBDataFrame Class¶
PDBDataFrame as a subclass of Pandas DataFrame.
Several features are added to make PDB data more accessible and selectable:
1. Properties like sequences, heavy_atoms, backbone, and water are
directly accessed by . operation.
2. Atom selection by using methods whose names are just the column names plus s
(plural form). For example, selecting atoms by names is simply
df.atom_names([names]) where atom_name is the column name
and atom_names is the selection function. Each selection returns a
PDBDataFrame object as well, which means we can chain selections one by one
like df.atom_names([names]).residue_numbers([numbers]).
Distance matrix as a
@propertyand@classmethod.
- moldf.pdb_dataframe.RESIDUE_CODES = {'ALA': 'A', 'ARG': 'R', 'ASH': 'D', 'ASN': 'N', 'ASP': 'D', 'CYM': 'C', 'CYS': 'C', 'CYX': 'C', 'GLH': 'E', 'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HID': 'H', 'HIE': 'H', 'HIP': 'H', 'HIS': 'H', 'ILE': 'I', 'LEU': 'L', 'LYN': 'K', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P', 'PYL': 'O', 'SEC': 'U', 'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V'}¶
dict[str, str], turn 3-, 2-, and 1-letter residue codes to 1-letter codes.
- moldf.pdb_dataframe.PDBX_COLS = {'alt_loc': 'label_alt_id', 'atom_name': 'label_atom_id', 'atom_number': 'id', 'b_factor': 'B_iso_or_equiv', 'chain_id': 'label_asym_id', 'charge': 'pdbx_formal_charge', 'element_symbol': 'type_symbol', 'insertion': 'pdbx_PDB_ins_code', 'nmr_model': 'pdbx_PDB_model_num', 'occupancy': 'occupancy', 'record_name': 'group_PDB', 'residue_name': 'label_comp_id', 'residue_number': 'label_seq_id', 'segment_id': 'label_entity_id', 'x_coord': 'Cartn_x', 'y_coord': 'Cartn_y', 'z_coord': 'Cartn_z'}¶
dict[str, str], PDB and mmCIF column name dictionary.
- class moldf.pdb_dataframe.PDBDataFrame(*args, pdb_format: str | None = None, use_squared_distance: bool = True, use_square_form: bool = False, **kwargs)[source]¶
Bases:
DataFramePandas DataFrame with extended attributes and methods for PDB data.
It enables Pythonic atom selection methods and convenient
.accessing to common PDB structure properties.- Parameters:
*args – all
pd.DataFramepositional arguments. For example, the_atom_sitedataframe returned by reading a PDB file.pdb_format (optional) – PDB format in the underlying provided data. If
None,PDBis assumed. Defaults to None.use_squared_distance (optional) – whether to use squared distance when calculating distance matrix. Defaults to True.
use_square_form (optional) – whether to use a square matrix for the distance matrix. Defaults to False.
**kwargs – all
pd.DataFrameacceptable keyword arguments.
- Returns:
A
PDBDataFrameinstance.
Example
>>> from moldf import read_pdb, PDBDataFrame >>> pdb = read_pdb(pdb_id='1vii') >>> pdb_df = pdb['_atom_site'] >>> pdb_df = PDBDataFrame(pdb_df)
Warning
This subclass uses a custom
__hash__function for caching some calculations. And thus a custom__eq__function is also implemented. For other typicalDataFrameoperations, use those.all(),.any(),.bool()functions to do comparison.- _metadata: list[str] = ['_use_squared_distance', '_use_square_form', '_is_chimera', '_RESIDUE_CODES', '_ELEMENT_MASSES', '_pdb_format']¶
- property _constructor¶
Used when a manipulation result has the same dimensions as the original.
- _pdbx_to_pdb(keep_original: bool = False)[source]¶
Converts PDBx ‘_atom_site’ DataFrame to PDB format.
- Parameters:
keep_original (optional) – whether to keep the original columns in the PDBx ‘_atom_site’ DataFrame. Defaults to False.
- property RESIDUE_CODES: dict[str, str]¶
A dict of
residue_nameas keys andresidue_codeas values, whereresidue_codeis a 1-character code used in sequences. Settable.
- property ELEMENT_MASSES: dict[str, float]¶
A dict of
element_symbolas keys andelement_massas values, whereelement_massis taken from NIST. Settable.
- property is_chimera: bool¶
Whether the original read-in PDB was Chimera compatible format. The main effect is the
residue_namestr width is 4 in the Chimera compatible format instead of 3 as in the standard PDB format. Not settable.
- property use_squared_distance: bool¶
Whether R or R^2 is used in distance matrix calculations. Using R^2 saves computation time. Settable.
- property use_square_form: bool¶
Whether the distance matrix will be in a square form. Using square form consumes less memory. Settable.
- property atoms: Self¶
Gets atoms in the
ATOMandHETATMentries. In other words, removing ‘TER’ lines.- Returns:
sub
PDBDataFrame.
- property coords: Self¶
Gets the
x_coord,y_coord, andz_coordcolumns only. Usepdb_df.coords.valuesto get the underlying Numpy array of the coordinates.
- property bonds: dict¶
- Gets the list of bonds. Each bond is represented as a pair of
atom_numbervalues.
- Raises:
ValueError – if the list of
atom_numberis not a set.
- get_bonds_by_distance(single_radii_set: str | None = None, need_non_covalent: bool = False, non_covalent_cutoff: float = 4.5) dict[source]¶
Gets all the bonds purely by covalent radii constraints.
- Parameters:
single_radii_set (optional) – radii sets to use. If
None,single_Cis used as to Cordero (PMID 18478144). Another option issingle_PAwhich refers to Pyykkö’s studies (PMID 19058281;19856342;15832398, and doi:10.1103/PhysRevB.85.024115). Defaults to None.need_non_covalent (optional) – whether non-covalent ‘bonding’ should be included. Defaults to False.
non_covalent_cutoff (optional) – distance cutoff for non-covalent ‘bonding’.
- Raises:
ValueError – if the list of
atom_numberis not unique or if thesingle_radii_setis not valid.- Returns:
- a dictionary of bonds with tuple of
atom_numberas keys and bond types as values.
- a dictionary of bonds with tuple of
- property sequences: dict[str, str]¶
Gets the sequences for each chain as a dict of
chain_idas key(s) andchain_sequenceas value(s).
- property residue_list: list[tuple]¶
Gets all residues as a list of tuple (
chain_id,residue_name,residue_number).
- property center_of_geometry: ndarray¶
Gets the center of geometry as a
(3, )np.ndarray.
- property center_of_mass: ndarray¶
Gets the center of mass as a
(3, )np.ndarray.
- property distance_matrix: ndarray¶
Gets the distance matrix.
- rmsd(other: Self | ndarray | None = None, align: bool = True, weights: list | None = None, selection: Self | list | None = None) list | float[source]¶
Calculates RMSD 1) among sets of coordinates in one
PDBDataFramewith multiplenmr_models or 2) two sets of coordinates in twoPDBDataFrames.- Parameters:
other (optional) – the other
PDBDataFrameor (N, 3)numpy.ndarrayto calculate RMSD against. IfNone,selfshould contain at least two sets of coordinates (nmr_modelhas >= 2 unique values). Defaults to None.align (optional) – whether to align the structures before calculating RMSD. If
False, theweightsandselectionkeywords are ignored. Defaults to True.weights (optional) – a list of weights for all the atoms in
selectionto do structure alignment. IfNone, all coordinates in theselectionorselfhave the same weights. Defaults to None.selection (optional) – a list of
atom_numbers inselfor a PDBDataFrame after the filtering methods. IfNone, all coordinates inselfare used for structure alignment. Defaults to None.
- Returns:
RMSD or a list of RMSD’s.
- Raises:
ValueError – if dimensionalities mismatch among
self,other,weights, andselectionif they are notNone; oratom_numbers inselfare not unique.TypeError – if
other,weights, andselectionhave unsupported types.
- record_names(names: list[str], invert: bool = False) Self[source]¶
Filter by
record_name.- Parameters:
names (required) – a list of
record_names.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- atom_numbers(numbers: list[int] | int, relation: str | None = None, invert: bool = False) Self[source]¶
Filter by
atom_number.- Parameters:
numbers (required) – one or a list of
atom_numbers.relation (optional) –
atom_numberrelationship tonumbers. Ifnumbersis an integer, it has to be one of<,<=,=,>=, and>. IfNone,<=is used. Ignored if a list of integers are provided tonumbers. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- atom_names(names: list[str], names_2c: list[str] | None = None, invert: bool = False, suppress_warning: bool = False) Self[source]¶
Filter by
atom_name.- Parameters:
names (required) – a list of
atom_names whoseelement_symbolshave only one character. Atoms in common residues and ligands should be provide here likeC, H, O, N, S, P, F.names_2c (optional) – a list of
atom_names whoseelement_symbolshave two characters like ion (FE) and chloride (CL). Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
suppress_warning – whether to suppress the warning message about possible conflicts between
namesandnames_2c. Defaults to False.
- Returns:
sub
PDBDataFrame
- alt_locs(locs: list[str], invert: bool = False) Self[source]¶
Filter by
alt_loc.- Parameters:
locs (required) – a list of
alt_locs.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub PDBDataFrame
- residue_names(names: list[str], invert: bool = False) Self[source]¶
Filter by
residue_names.- Parameters:
names (required) – a list of
residue_namesinvert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- chain_ids(ids: list[str], invert: bool = False) Self[source]¶
Filter by
chain_id.- Parameters:
ids (required) – a list of
chain_ids.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- residue_numbers(numbers: list[int] | int, relation: str | None = None, invert: bool = False) Self[source]¶
Filter by
residue_number.- Parameters:
numbers (required) – one or a list of
residue_numbers.relation (optional) –
residue_numberrelationship tonumbers. Ifnumbersis an integer, it has to be one of<,<=,=,>=, and>. IfNone, ‘<=’ is used. Ignored if a list of integers are provided tonumbers. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- insertions(codes: list[str], invert: bool = False) Self[source]¶
Filter by
insertion.- Parameters:
codes (required) – a list of
insertioncodes.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- x_coords(value: float, relation: str | None = None, invert: bool = False, epsilon: float = 0.01) Self[source]¶
Filter by
x_coord.- Parameters:
value (required) – value to select
x_coords.relation (optional) –
x_coordrelationship tovalue. It has to be one of'<','<=','=','>=', and'>'. IfNone,'<='is used. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs(x_coord - value)<=epsilonare selected wheninvert=False. Defaults to 0.01.
- Returns:
sub
PDBDataFrame
- y_coords(value: float, relation: str | None = None, invert: bool = False, epsilon: float = 0.01) Self[source]¶
Filter by
y_coord.- Parameters:
value (required) – value to select
y_coords.relation (optional) –
y_coordrelationship tovalue. It has to be one of'<','<=','=','>=', and'>'. IfNone,'<='is used. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs(y_coord - value)<=epsilonare selected wheninvert=False. Defaults to 0.01.
- Returns:
sub
PDBDataFrame
- z_coords(value: float, relation: str | None = None, invert: bool = False, epsilon: float = 0.01) Self[source]¶
Filter by
z_coord.- Parameters:
value (required) – value to select
z_coords.relation (optional) –
z_coordrelationship tovalue. It has to be one of'<','<=','=','>=', and'>'. IfNone,'<='is used. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs(z_coord - value)<=epsilonare selected wheninvert=False. Defaults to 0.01.
- Returns:
sub
PDBDataFrame
- occupancies(value: float, relation: str | None = None, invert: bool = False, epsilon: float = 0.01) Self[source]¶
Filter by
occupancy.- Parameters:
value (required) – value to select
occupancys.relation (optional) –
occupancyrelationship tovalue. It has to be one of'<','<=','=','>=', and'>'. IfNone,'<='is used. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs(occupancy - value)<=epsilonare selected wheninvert=False. Defaults to 0.01.
- Returns:
sub
PDBDataFrame
- b_factors(value: float, relation: str | None = None, invert: bool = False, epsilon: float = 0.01) Self[source]¶
Filter by
b_factor.- Parameters:
value (required) – value to select
b_factors.relation (optional) –
b_factorrelationship tovalue. It has to be one of'<','<=','=','>=', and'>'. IfNone,'<='is used. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs(b_factor - value)<=epsilonare selected wheninvert=False. Defaults to 0.01.
- Returns:
sub
PDBDataFrame
- segment_ids(ids: list[str], invert: bool = False) Self[source]¶
Filter by
segment_id.- Parameters:
ids (required) – a list of
segment_ids.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- element_symbols(symbols: list[str], invert: bool = False) Self[source]¶
Filter by
element_symbol.- Parameters:
symbols (required) – a list of
element_symbols.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- charges(charges: list[str], invert: bool = False) Self[source]¶
Filter by
charge.- Parameters:
charges (required) – a list of
charges.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
Notes:
chargeis2-charstring in the PDB specifications.
- nmr_models(models: list[int] | int, relation: str | None = None, invert: bool = False) Self[source]¶
Filter by
nmr_model.- Parameters:
models (required) – one or a list of
nmr_modelids.relation (optional) –
nmr_modelrelationship tomodels. Ifmodelsis an integer, it has to be one of'<','<=','=','>=', and'>'. IfNone,'<='is used. Ignored if a list of integers are provided tomodels. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- distances(other: ndarray | Self | Iterable, cut_off: float = inf, to: str | None = None, invert: bool = False) Self[source]¶
Filter by
distanceto a reference point or group of atoms.- Parameters:
other – the other group’s coordinate(s).
cut_off – the distance cutoff to filter.
to –
if
otheris a group atoms, using which method to determine whether the atoms meet the cut_off distance. If None,COMor center of mass is used ifotherisPDBDataFrame, andCOGor center of geometry is used ifotherisnp.ndarrayorIterable. The following are allowed:com,center of mass,center_of_mass:use the center of mass for the
other.cog,center of geometry,center_of_geometry:use the center of geometry for the
other.all:whether all the pair-distances meet the
cut_offdistance criteria.any:whether any of the pair-distances meets the
cut_offdistance criteria.
invert – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- _filter_num_col(value: int | float | list[int], num_col_name: str, relation: str | None = None, invert: bool = False, epsilon: float | int = 0.01, suppress_warning: bool = False) Self[source]¶
Generic function to do filter by a numerical column.
- Parameters:
value (required) – value(s) to select by the column given by the
num_col_nameinput.num_col_name (required) – one of
atom_number,residue_number,x_coord,y_coord, orz_coord,occupancy,b_factor, andnmr_model. Note: thechargecolumn is not numerical byPDBformat.relation (optional) –
x/y/z_coordrelationship tovalue. It has to be one of'<','<=','=','>=', and'>'. IfNone,'<='is used. Ignored if a list of integers are provided tovalue. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs``(``num_col_value-value) <=epsilonare selected wheninvert=Falseandrelation='='. Ignored if a list of integers are provided tovalue. Defaults to 0.01.suppress_warning (optional) – whether to suppress warnings. Defaults to False.
- Returns:
sub
PDBDataFrame- Raises:
ValueError – if xyz not in [
atom_number,residue_number,x_coord,y_coord, orz_coord,occupancy,b_factor,nmr_model] orrelationnot in ['<','<=','=','>=','>'] when selecting on float cols.
- classmethod get_chain_list(pdb_df: Self) list[source]¶
Gets the list of chain ids given a
PDBDataFrameobject.- Parameters:
pdb_df (required) – a
PDBDataFrameobject.- Returns:
a list of chain ids.
Warning
- This method relies on the original and standard numbering of the atoms and
residues in the DataFrame. Therefore, any selecting or sorting of the it can lead to wrong results.
- classmethod get_residue_list(pdb_df: Self, include_heteros: bool = False) list[tuple][source]¶
Gets the list of residues given a
PDBDataFrameobject.- Parameters:
pdb_df (required) – a
PDBDataFrameobject.include_heteros (optional) – whether to include hetero ligands. Defaults to False.
- Returns:
a list of residues as (
chain_id,residue_name,residue_number).
Warning
- This method relies on the original and standard numbering of the atoms and
residues in the DataFrame. Therefore, any selecting or sorting of the it can lead to wrong results.
- classmethod get_distance_matrix(pdb_df: Self, other_data: Self | tuple | None = None, use_r2: bool = True, square_form: bool = False) ndarray[source]¶
Calculates the distance matrix given a
PDBDataFrameobject and (optional) reference data.- Parameters:
pdb_df (required) – a
PDBDataFrameobject.other_data (optional) – the coordinates of to calculate the distances against. Defaults to None.
use_r2 (optional) – whether to use r^2 or r for distance matrix. Defaults to True.
square_form (optional) – whether to output a square form of the density matrix. If two
PDBDataFrame``s are different or ``other_datais not aPDBDataFrame,square_formis ignored. Defaults to False.
- Returns:
distance matrix (squared or condensed form)
- Raises:
ValueError – if
other_datais not ofPDBDataFrame|tuple|Nonetype or wrong shape if it is atuple.
PDBx Writer¶
PDBx/mmCIF format writing.
Write a dict of Pandas DataFrame back to a PDBx file.
- moldf.write_pdbx.write_pdbx(pdbx: dict[str, DataFrame], file_name: str | PathLike | None = None) None[source]¶
Writes a
dictofPandas DataFrames into a PDBx file.- Parameters:
pdbx (required) – a
dictofPandas DataFrames to write.file_name (optional) – file name to write a PDBx file. If
None,moldf_output.cifwill be used as the file name. Defaults to None.
- Raises:
TypeError – if
pdbxis not a valid dict ofDataFrame.
PDB Writer¶
PDB format writing.
Write a dict of Pandas DataFrame back to a PDB file.
Currently, only the _atom_site category can be written back.
- moldf.write_pdb.IMPLEMENTED_PDB_CATS = ['_atom_site']¶
PDB categories that are currently implemented.
- moldf.write_pdb.write_pdb(pdb: dict[str, DataFrame], file_name: str | PathLike | None = None, allow_chimera: bool = False) None[source]¶
Write a dict of
Pandas DataFrames into a PDB file.- Parameters:
pdb (required) – a
dictofPandas DataFrames to write.file_name (optional) – file name to write a PDB file. If
None,moldf_output.pdbwill be used as the file name. Defaults to None.allow_chimera (optional) – whether to allow writing to Chimera-formatted PDB files. Defaults to False.
- Raises:
TypeError – if
pdbis not a valid dict ofDataFrame.ValueError – if the
pdbcontains other than supported categories.
MOL2 Writer¶
MOL2 format writing.
Write a dict of Pandas DataFrame back to a MOL2 file.
Currently, only the MOLECULE, ATOM, and BOND categories can be written back.
- moldf.write_mol2.IMPLEMENTED_MOL2_CATS = ['MOLECULE', 'ATOM', 'BOND', 'HEADER']¶
MOL2 categories that are currently implemented.
- moldf.write_mol2.write_mol2(mol2: dict[str, DataFrame], file_name: str | PathLike | None = None) None[source]¶
Write a dict of
Pandas DataFrames into a MOL2 file. See https://is.muni.cz/th/fzk5s/dp_jakub_Vana.pdf p19 for column definitions.- Parameters:
mol2 (required) – a
dictofPandas DataFrames to write.file_name (optional) – file name to write a MOL2 file. If
None,moldf_output.mol2will be used as the file name. Defaults to None.
- Raises:
TypeError – if
mol2is not a valid dict ofDataFrame.ValueError – if the
mol2contains other than supported categories.
JCSV Writer¶
Write any dict of Pandas DataFrame to JCSV.
- moldf.write_jcsv.write_jcsv(data: dict[str, DataFrame], file_name: str | PathLike | None = None, write_meta: bool = True, **kwargs) None[source]¶
Write a dict of
Pandas DataFrames into a JCSV file. See https://github.com/Ruibin-Liu/JCSV for definitions.- Parameters:
data (required) – a
dictofPandas DataFrames to write.file_name (optional) – file name to write a JCSV file. If
None,moldf_output.jcsvwill be used as the file name ifpath_or_bufis not specified in**kwargs. Defaults to None.write_meta (optional) – whether to write meta data into the first category. Currently, only the first line number for each category is recorded. Defaults to True.
**kwargs – keyword arguments for
pd.DataFrame.to_csv. Invalid ones are ignored. Check https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
- Raises:
TypeError – if
datais not a valid dict ofDataFrame.ValueError – if
"in any of the column names.
Covalent Bond¶
Gets covalent bonds by covalent radii cutoff or ligand templates.
- moldf.covalent_bond.get_covalent_bond_cutoffs(element_symbols: list | set, single_radii_set: str | None = None) tuple[dict, dict, dict][source]¶
Gets the DataFrame for additive covalent radii of relevant elements.
- Parameters:
element_symbols (required) – a list or set of element symbols whose covalent radii cutoffs are returned.
single_radii_set (optional) – radii sets to use. If
None,single_Cis used as to Cordero (PMID 18478144). Another option issingle_PAwhich refers to Pyykkö’s studies (PMID 19058281;19856342;15832398, and doi:10.1103/PhysRevB.85.024115). Defaults to None.
- Raises:
ValueError – if the
single_radii_setis not valid.- Returns:
- a tuple of the three dictionaries of the distance cutoffs for the pairs of
the queried element symbols: (single_bonds, double_bonds, triple_bonds).
- moldf.covalent_bond.get_residue_template(residue_name: str, parent_name: str | None = None, residue_template_file: str | PathLike | None = None, save_template_file: bool = True, template_file_dir: str | PathLike | None = None) dict[source]¶
Gets the intra- covalent bonds of a residue or ligand.
- Parameters:
residue_name (required) – residue or ligand name to query.
parent_name (optional) – parent name of the residue or ligand. This is useful if the
residue_namein the PDB file is used by RCSB for a different residue or ligand. IfNone, it is the same asresidue_name. Defaults to None.residue_template_file (optional) – residue or ligand template file name/path. If
None, this function queries RCSB by theparent_name. If it’s not a path, this function looks at the current working directory first and thentemplate_file_dirif it is notNone. Defaults to None.save_template_file (optional) – whether to save the downloaded (from RCSB) template file. Defaults to True.
template_file_dir (optional) – directory to save fetched template files. If
Nonebutsave_template_fileisTrue, ‘./template_files’ directory is used. Defaults to None.
- Returns:
- a dictionary of bonds for the residue. The dictionary key format is
(atom_id_1: str, atom_id_2: str), and the dictionary value format is (bond_order: str, is_aromatic: bool, stereo_flag: str)
- Raises:
ValueError – if a
.ciffile for the residue/ligand cannot be downloaded from RCSB for any reason ifresidue_template_fileisNone. Check the url provided in the error message to download manually if possible.FileNotFoundError – if
residue_template_filecannot be found if notNone.
Warning
RuntimeWarning: if the template file contains other residue/ligands.
PDBx Line Splitter¶
- moldf.split_line(line: str, delimiter: str | None = None) list[source]¶
Splits a string line into tokens separated by
delimiters, assuming all'and"in the start character or following a delimiter are paired to quote a token.- Parameters:
line (required) – line as a string
delimiter (optional) –
delimiterto split the line. IfNone,' '(one space) is used. Defaults to None.
- Returns:
A list of tokens