Source code for ocelot.routines.disparser

from collections import OrderedDict
from itertools import groupby

from pymatgen.core.sites import Site
from pymatgen.core.structure import Lattice
from pymatgen.core.structure import Structure
from pymatgen.core.structure import lattice_points_in_supercell

from ocelot.routines.disparser_functions import *
from ocelot.routines.pbc import PBCparser
from ocelot.schema.conformer import ConformerError
from ocelot.schema.conformer import MolConformer

"""
DisParser: prepare_data cif file into a list of disorder-free configurations

## notions and terms

- cifstring: 
    the raw string of the cif file
    
- configuration: 
    a disorder-free structure describing the crystal, 
    it has an occupancy,
    it could be a supercell
    
- unit_configuration: 
    a configuration that is not a supercell
    
- asymmetric_unit: 
    the atomic sites specified in cifstring, 
    please notice that the real, crystallographic asymmetric unit may be a subset of asymmetric_unit defined here

- disordered_sites:
    every atomic site with disorder

- counterpart:
    if atomic site X cannot co-exist with Y, Y is a counterpart of X and vice versa

- disorder_group:
    the largest subset of disordered_sites, 
    the atomic sites in disorder_unit can co-exist,
    *usually* there are two disorder_groups in a disordered cif file
    
- disorder_group_label: 
    an atomic site may be assigned a string to indicate the disorder_group it belongs to 
    defaults are ".", "1", "2".
    
- disorder_unit:
    a subset of a disorder_group
    each atomic site in the disorder_unit must share the same occupancy
    the sites in disorder_unit must be chemically connected/related (subject to a cutoff).

- disorder_portion:
    a list of disorder_units, each of them belongs to a different disorder_group
    in a certain configuration only one of the disorder_units will present to represent the disorder_portion

## disorder represenstation

- situation a: 

    the cif file may contain the following disorder-related fields

    _atom_site_occupancy 
    _atom_site_disorder_group 
    
    if both of the two keys exist, then we can extract configuration from the cif file with explicit occupancies
    if only **_atom_site_disorder_group** exists, **_atom_site_occupancy** will be set to 0.5 for disordered sites
    
- situation b:

    if not, it is still possible to prepare_data the cif file contains disorder, 
    this depends on the the values of **_atom_site_label**, 
    btw, '_atom_site_label' may not exist when using jmol to write cif

    the values of **_atom_site_label** form a list of *unique* labels, a label can be considered as a concatenation of 
    
    **E (an element symbol) + I (an index) + T (the third tag)**
    
    while CSD claims the alternative configurations should be labelled 
    with a suffix question mark (?), this is certainly not true for e.g. ASIXEH
    
#### classifications of situation b. cif files (that I've seen)

    1. EI is unique for each site, 
        only two disorder_groups
        sites of one disorder_group (the minor configuration) share the same T
        e.g. ALOVOO
        
    2. EI is **not** unique for each site, 
        only two disorder_groups
        sites of one disorder_group (the minor configuration) share the same T
        a disordered site has only one counterpart, and they share the same EI
        e.g. x17059
    
    3. EI is **not** unique for each site, 
        there is no site with a non-word T 
        Sites sharing the same EI representing disorder, 
        e.g. ASIXEH
    
    4. disordered sites are simply ignored in the cif file 
        s.t. the molecule is not complete, 
        e.g. ABEGET01
    
    5. disordered sites are simply ignored in the cif file, 
        but the molecule looks legit, 
        e.g. AGAJUO
    
    6. disordered sites are present in the cif file 
        but not labeled at all, e.g. ANOPEA

    notes: in case 1. it is possible to get the occupancies via CSD API, not always tho.

I cannot find a way to process 3., 4., 5., 6. without checking deposition records 
or chemistry analysis, so this parser will ignore them. 
That is, there can be only one alternative configuration.

## convert b1/b2 to a.
The idea is to find disorder_groups in b1 and b2, 
then artificially set (average) **_atom_site_occupancy** and **_atom_site_disorder_group**, 
this should be done intentionally.


## parsing a.

1. pairing/subgrouping disordered sites in an asymmetric unit:

    i. for each disordered atomic site, 
        find its counterpart if the other site is disordered and share the same EI
        
    ii. if i. failed at leaset once, find the nearest disordered site that share the same element

2. find disorder units:

    within one disorder_group, get connected components as disorder_units, 
    then find their pairs to form disorder_portion

3. config generation:
    
    - config_asymm = asymm_inv + asymm_disorder_portion1[0] + asymm_disorder_portion2[1] + ...
    - config_cell = config_asymm1 + config_asymm2 + ... 
    - config = config_cell1 + config_cell2 + ...
    
    the instruction for generating a configuration is a dictionary:
    
        conf_instruction[icell][iasymm][idisorder_portion] = j

    configuration as a cif file will be written with keys:
    ['_cell_length_a'],
    ['_cell_length_b'],
    ['_cell_length_c'],
    ['_cell_angle_alpha'],
    ['_cell_angle_beta'],
    ['_cell_angle_gamma'],
    ['_space_group_symop_operation_xyz'] or other symm labels
    ['_atom_site_label'],
    ['_atom_site_type_symbol'],
    ['_atom_site_fract_x'],
    ['_atom_site_fract_y'],
    ['_atom_site_fract_z'],

Ref:
[CSD API convention](https://downloads.ccdc.cam.ac.uk/documentation/API/descriptive_docs/crystal.html#disorder)  

"""


[docs]class DisorderParserError(Exception): pass
[docs]class AtomLabel:
[docs] def __init__(self, label: str): """ a class for atom label in cif file, overkill I guess... :param label: """ self.label = label tmplabel = self.label self.element = re.findall(r"^[a-zA-Z]+", tmplabel)[0] tmplabel = tmplabel.lstrip(self.element) try: self.index = re.findall(r"^\d+", tmplabel)[0] except IndexError: self.index = "0" tmplabel = tmplabel.lstrip(str(self.index)) self.tag = tmplabel # if len(self.tag) > 1: # raise ValueError('tag for {} is {}, this is unlikely'.format(label, self.tag)) self.index = int(self.index) self.ei = "{}{}".format(self.element, self.index)
def __str__(self): return self.label def __repr__(self): return self.label def __hash__(self): return hash(self.label) def __eq__(self, other): return self.label == other.label
[docs] @staticmethod def get_labels_with_tag(tag, als): sametag = [] for alj in als: if tag == alj.tag: sametag.append(alj) return sametag
[docs] def get_labels_with_same_ei(self, als): """ get a list of atomlabel whose ei == al.ei :param als: :return: """ sameei = [] for alj in als: if self.ei == alj.ei and self != alj: sameei.append(alj) return sameei
[docs] @staticmethod def get_psite_by_atomlable(psites: [PeriodicSite], al): for s in psites: if s.properties['label'] == str(al): return s raise ValueError('cannot find psite with atomlable {}'.format(str(al)))
[docs] @classmethod def from_psite(cls, s: PeriodicSite): return cls(s.properties['label'])
[docs]class AsymmUnit:
[docs] def __init__(self, psites: [PeriodicSite]): """ an asymmetric unit with paired disorder units :param psites: # :param supress_sidechain_disorder: can be a function takes a psite and returns bool, # if True the psite will always be considered as a non-disordered site """ self.sites = psites self.labels = [AtomLabel(s.properties['label']) for s in self.sites] self.symbols = [l.element for l in self.labels] self.composition = tuple(sorted(self.symbols)) # make sure occu and label and disg are present in prop for s in self.sites: for k in ['occu', 'label', 'disg']: if k not in s.properties.keys(): raise KeyError('{} is not in the properties of site {}!'.format(k, s)) self.inv = [] self.disordered_sites = [] for ps in self.sites: if ps.properties['disg'] == '.' or abs(ps.properties['occu'] - 1) < 1e-5: self.inv.append(ps) else: self.disordered_sites.append(ps) # group by disg disordered_groups = [list(v) for l, v in groupby(sorted(self.disordered_sites, key=lambda x: x.properties['disg']), lambda x: x.properties['disg'])] self.disordered_groups = {} # within one group, group again by connection for i in range(len(disordered_groups)): units = [] group = disordered_groups[i] disg_label = group[0].properties['disg'] block_list = find_connected_psites(group) """ the problem here is it thinks a(a')-b-c(c')-d(d') has two pairs of disunit, as there is no disorder at b if a pblock is not far away from another, they should be one pblock """ connected_blocks = get_connected_pblock(block_list, cutoff=4.0) for pblock in connected_blocks: units.append(DisorderUnit(pblock, disg=disg_label)) self.disordered_groups[disg_label] = units # pairing self.disordered_portions = [] init_disg_label = sorted(list(self.disordered_groups.keys()))[0] for u1 in self.disordered_groups[init_disg_label]: portion = [u1] for k in self.disordered_groups.keys(): if k != init_disg_label: u2 = DisorderUnit.find_counterpart(u1, self.disordered_groups[k]) portion.append(u2) self.disordered_portions.append(portion)
[docs] def infodict(self, is_sidechain=None): au_labels = self.labels au_inv_labels = [AtomLabel(ps.properties['label']) for ps in self.inv] disordered_portions = [] if is_sidechain is None: disordered_portions = [[u for u in portion] for portion in self.disordered_portions] else: for portion in self.disordered_portions: p = [] for disorderunit in portion: if all(is_sidechain(l) for l in disorderunit.labels): au_inv_labels += disorderunit.labels break else: p.append(disorderunit) if len(p) != 0: disordered_portions.append(p) return au_labels, au_inv_labels, disordered_portions
[docs]class DisorderUnit: def __repr__(self): return "DisorederUnit -- disg: {}\n".format(self.disg) + " ".join([str(l) for l in self.labels]) def __str__(self): return self.__repr__() def __eq__(self, other): return set(self.labels) == set(other.labels) def __len__(self): return len(self.sites) def __hash__(self): return hash(set(self.labels))
[docs] def is_likely_counterpart(self, other): if not isinstance(other, DisorderUnit): return False if self == other: return False if self.composition != other.composition: return False if set(self.eis) == set(other.eis): return True if abs(self.occu + other.occu - 1) < 1e-5: return True
[docs] @staticmethod def find_counterpart(u1, u2s): """ :type u1: DisorderUnit :type u2s: [DisorderUnit] """ k = lambda x: np.linalg.norm(x.geoc - u1.geoc) potential_u2 = sorted(u2s, key=k) potential_u2 = [u2 for u2 in potential_u2 if u1.is_likely_counterpart(u2)] try: u2 = potential_u2[0] if np.linalg.norm(u2.geoc - u1.geoc) > 1.5: warnings.warn('at least one disunit is paired with another that is >1.5 A far away, unlikely') return u2 except IndexError: raise DisorderParserError('cannot find counterpart for DisorderUnit: {}'.format(u1))
[docs] def __init__(self, psites: [PeriodicSite], disg: str): """ a portion of an asymmetric unit representing one possibility a pair of DisUnit is the basis of disorder, assuming maximal entropy """ self.disg = disg self.sites = psites self.labels = [AtomLabel(s.properties['label']) for s in self.sites] self.eis = [al.ei for al in self.labels] self.symbols = [l.element for l in self.labels] self.composition = tuple(sorted(self.symbols)) occus = [s.properties['occu'] for s in self.sites] disgs = [s.properties['disg'] for s in self.sites] if len(set(occus)) != 1: raise DisorderParserError('occu is not uniform for {}'.format(str(self))) if len(set(disgs)) != 1: raise DisorderParserError('disg is not uniform for {}'.format(str(self))) self.occu = occus[0] self.disg = disgs[0]
@property def geoc(self): """ in cart """ c = np.zeros(3) for s in self.sites: c += s.coords return c / len(self.sites)
[docs]class DisParser: # chaos parser sounds cooler?
[docs] def __init__(self, cifstring: str): """ this can only handle one alternative configuration for the asymmetric unit one asymmetric unit == inv_conf + disg1 + disg2 disg1 = disunit_a + disunit_b + ... disg2 = disunit_a' + disunit_b' + ... DisorderPair_a = disunit_a + disunit_a' if the cif file contains previouly fitted occu and disg, we call it dis-0 and we use occu/disg info to get inv_conf, disg1, disg2 if there is no previously fitted info, we deal with the following situations: dis-1: set(self.tags) is ["", <non-word>, <word>, ...], EI<non-word> -- EI, note x17059.cif is dis-1 but it has hydrogens like H12A -- H12D, this can only be captured by previously fitted disg and occu, dis-2: set(self.tags) is ["", <non-word>, <word>, ...], EI<non-word> -- E'I' e.g. ALOVOO.cif nodis-0: no dup in self.eis, set(self.tags) is {""} nodis-1: dup in self.eis, set(self.tags) is ["", <word>, ...], this could be a dis as in ASIXEH weird: else for dis-1, dis-2, we fill the occu, disg fields in cifdata, so they can be coonverted to dis-0 Attributes: data[atomlabel] = [x, y, z, symbol, occu, disgrp] this will be used to write config cif file """ # prepare_data into cifdata self.classification = None self.cifstring = cifstring self.identifier, self.cifdata = get_pmg_dict(self.cifstring) self.cifdata['_atom_site_fract_x'] = [braket2float(x) for x in self.cifdata['_atom_site_fract_x']] self.cifdata['_atom_site_fract_y'] = [braket2float(x) for x in self.cifdata['_atom_site_fract_y']] self.cifdata['_atom_site_fract_z'] = [braket2float(x) for x in self.cifdata['_atom_site_fract_z']] for i in range(len(self.cifdata['_atom_site_type_symbol'])): if self.cifdata['_atom_site_type_symbol'][i] == 'D': warnings.warn('D is considered as H in the ciffile _atom_site_type_symbol!') self.cifdata['_atom_site_type_symbol'][i] = 'H' # check cif file try: labels = self.cifdata['_atom_site_label'] except KeyError: raise CifFileError('no _atom_site_label field in the cifstring!') if len(labels) != len(set(labels)): warnings.warn('duplicate labels found in the cifstring, reassign labels!') for i in range(len(self.cifdata['_atom_site_label'])): label = AtomLabel(self.cifdata['_atom_site_label'][i]) self.cifdata['_atom_site_label'][i] = label.element + str(i) + label.tag # raise CifFileError('duplicate labels found in the cifstring!') # "global" info self.labels = [AtomLabel(lab) for lab in labels] self.eis = [al.ei for al in self.labels] self.tags = [al.tag for al in self.labels] self.latparams = [self.cifdata[k] for k in latt_labels] self.lattice = Lattice.from_parameters(*[braket2float(p) for p in self.latparams], True) if '_atom_site_disorder_group' in self.cifdata.keys(): self.was_fitted = True # deal with e.g. k06071 for i in range(len(self.cifdata['_atom_site_disorder_group'])): if self.cifdata['_atom_site_disorder_group'][i] != ".": dv = self.cifdata['_atom_site_disorder_group'][i] dv = abs(int(dv)) self.cifdata['_atom_site_disorder_group'][i] = dv disgs = self.cifdata['_atom_site_disorder_group'] disg_vals = list(set([disg for disg in disgs if disg != "."])) disg_vals.sort() for i in range(len(self.cifdata['_atom_site_disorder_group'])): for j in range(len(disg_vals)): dv = disg_vals[j] if self.cifdata['_atom_site_disorder_group'][i] == dv: self.cifdata['_atom_site_disorder_group'][i] = str(j + 1) break if '_atom_site_occupancy' not in self.cifdata.keys(): occus = [] for disg in self.cifdata['_atom_site_disorder_group']: if disg == '.': occus.append(1) else: if int(disg) & 1: occus.append(0.51) else: occus.append(0.49) self.cifdata['_atom_site_occupancy'] = occus else: self.was_fitted = False if self.was_fitted: self.cifdata['_atom_site_occupancy'] = [braket2float(x) for x in self.cifdata['_atom_site_occupancy']] # self.cifdata['_atom_site_disorder_group'] = [braket2float(x) for x in # self.cifdata['_atom_site_disorder_group']] # prepare self.data to be used in parsing data = map(list, zip( self.cifdata['_atom_site_fract_x'], self.cifdata['_atom_site_fract_y'], self.cifdata['_atom_site_fract_z'], self.cifdata['_atom_site_type_symbol'], )) data = list(data) data = OrderedDict(zip(self.labels, data)) self.data = data for i in range(len(self.labels)): al = self.labels[i] if self.was_fitted: self.data[al].append(self.cifdata['_atom_site_occupancy'][i]) self.data[al].append(self.cifdata['_atom_site_disorder_group'][i]) else: self.data[al].append(None) self.data[al].append(None)
[docs] def get_psites_from_data(self): """ get psites from self.data, each psite will be assigned properties with fields occu disg label :return: """ ps = [] for k in self.data.keys(): x, y, z, symbol, occu, disgrp = self.data[k] if occu is None or disgrp is None: raise DisorderParserError('getting psites with None occu/disgrp') ps.append(PeriodicSite(symbol, [x, y, z], properties={'occu': occu, 'disg': disgrp, 'label': k.label}, lattice=self.lattice)) return ps
[docs] @classmethod def from_ciffile(cls, fn): with open(fn, 'r') as f: s = f.read() return cls(s)
[docs] def classify(self): """ one cif file belongs to one of the following categories: dis-alpha: was fitted with, at least, disorder group nodis-0: no dup in eis, one type of unique tag dis-beta: dup in eis, EIx -- EIy, x or y can be empty, e.g. c17013.cif dis-gamma: no dup in eis, EIx -- EJy, x or y can be empty, e.g. ALOVOO.cif weird: else note x17059.cif has hydrogens like H12A -- H12D, this can only be captured by previously fitted disg and occu (dis-a), in general there should be a check on whether disgs identified by the parser are identical to previously fitted """ if self.was_fitted: return "dis-alpha" # we look at labels of non-hydrogen sites labels = [l for l in self.labels if l.element != 'H'] tags = [l.tag for l in labels] eis = [l.ei for l in labels] unique_tags = list(set(tags)) unique_eis = list(set(eis)) if len(unique_tags) == 1 and len(unique_eis) == len(eis): for al in self.labels: self.data[al][4] = 1 self.data[al][5] = '.' return "nodis-0" u_tags = list(set(self.tags)) u_nonalphabet_nonempty_tags = [t for t in u_tags if not re.search(r"[a-zA-Z]", t) and t != ""] if len(u_nonalphabet_nonempty_tags) == 1: minor_tag = u_nonalphabet_nonempty_tags[0] minor_als = AtomLabel.get_labels_with_tag(minor_tag, self.labels) disg2 = [] disg1 = [] classification = "" # try pairing based on ei, EIx -- EIy for minor_al in minor_als: potential_aljs = [l for l in self.labels if l.ei == minor_al.ei and l != minor_al] if len(potential_aljs) == 0: classification = "dis-gamma" break potential_aljs.sort(key=lambda x: x.tag) alj = potential_aljs[0] disg1.append(alj) disg2.append(minor_al) if classification == "": classification = "dis-beta" else: # EIx -- EJy disg2 = [] disg1 = [] cutoff = 2.5 warnings.warn( 'W: trying to prepare_data disorder in dis-gamma with cutoff {}, this is unreliable!'.format( cutoff)) major_als = [l for l in self.labels if l not in minor_als] assigned_major = [] for minor_al in minor_als: alj = self.get_nearest_label(minor_al, [l for l in major_als if l not in assigned_major], self.lattice, cutoff) assigned_major.append(alj) disg1.append(alj) disg2.append(minor_al) classification = "dis-gamma" inv = [l for l in self.labels if l not in disg1 + disg2] for al in inv: self.data[al][4] = 1 self.data[al][5] = '.' for al in disg1: self.data[al][4] = 0.51 self.data[al][5] = '1' for al in disg2: self.data[al][4] = 0.49 self.data[al][5] = '2' return classification if len(unique_tags) > 0 and len(eis) != len(unique_eis): # try pairing based on ei, EIx -- EIy for ei in self.eis: als = [l for l in self.labels if l.ei == ei] if len(als) == 1: al = als[0] self.data[al][4] = 1 self.data[al][5] = '.' else: als.sort(key=lambda x: x.tag) ali = als[0] alj = als[1] self.data[ali][4] = 0.51 self.data[ali][5] = '1' self.data[alj][4] = 0.49 self.data[alj][5] = '2' return "dis-beta" raise NotImplementedError("this structure is classified as weird, better to check ciffile")
[docs] def get_nearest_label(self, ali: AtomLabel, neighbor_labels: [AtomLabel], lattice: Lattice, cutoff=2.5): """ given a list of potential neighbouring AtomLable, get the one that is closest and has the same element """ data = self.data ali_nbs_dictance = [] xi, yi, zi, symboli, occu, disgrp = data[ali] fci = [xi, yi, zi] for alj in neighbor_labels: if alj == ali: continue xj, yj, zj, symbolj = data[alj][:4] fcj = [xj, yj, zj] if symboli != symbolj: continue v, d2 = pbc_shortest_vectors(lattice, fci, fcj, return_d2=True) dij = math.sqrt(d2[0, 0]) ali_nbs_dictance.append([alj, dij]) if len(ali_nbs_dictance) == 0: raise ValueError( 'cannot find any same-symbol neighbors of label {}'.format(ali.label)) ali_nbs_dictance = sorted(ali_nbs_dictance, key=lambda x: x[1]) nnb, nnbdis = ali_nbs_dictance[0] if nnbdis > cutoff: raise ValueError( 'cannot find any same-symbol neighbors of label {} within {}'.format( ali.label, cutoff)) return nnb
[docs] @staticmethod def data2cifdata(data, cifdata): """ data[atomlable] = x, y, z, symbol, occu, group """ newdata = OrderedDict() for ciflabel in possible_symm_labels + latt_labels + chemistry_labels: if ciflabel in cifdata.keys(): newdata[ciflabel] = cifdata[ciflabel] labs = list(data.keys()) xs = [] ys = [] zs = [] symbols = [] occus = [] idisgs = [] for lab in labs: x, y, z, symb, occu, idisg = data[lab] xs.append(x) ys.append(y) zs.append(z) symbols.append(symb) occus.append(occu) idisgs.append(idisg) newdata['_atom_site_label'] = [l.label for l in labs] newdata['_atom_site_type_symbol'] = symbols newdata['_atom_site_fract_x'] = xs newdata['_atom_site_fract_y'] = ys newdata['_atom_site_fract_z'] = zs newdata['_atom_site_occupancy'] = occus newdata['_atom_site_disorder_group'] = idisgs return newdata
[docs] def prepare_data(self): self.classification = self.classify() print('{} thinks this cif file belongs to class {}'.format(self.__class__.__name__, self.classification))
# from pprint import pprint # pprint(self.data)
[docs] @staticmethod def get_vanilla_configs(psites: [PeriodicSite]): """ must have 'disg' in properties """ disg_vs = [] for i in range(len(psites)): disg = psites[i].properties['disg'] disg_vs.append(disg) disg_vs = list(set(disg_vs)) disg_vs.sort() if len(disg_vs) == 1: return [Structure.from_sites(psites)] disgs = [] inv = [s for s in psites if s.properties['disg'] == disg_vs[0]] for disg in disg_vs[1:]: disg_sites = [] for s in psites: if s.properties['disg'] == disg: disg_sites.append(s) disgs.append(Structure.from_sites(disg_sites + inv)) return disgs
[docs] @staticmethod def get_site_location_by_key(psites: [PeriodicSite], key='label'): """ loc[key] = 'bone'/'sidechain' must have 'imol' 'disg' 'siteid' and <key> assigned """ res = {} k = lambda x: x.properties['imol'] psites.sort(key=k) for imol, group in groupby(psites, key=k): obc_sites = [] for ps in group: obc_sites.append(Site(ps.species_string, ps.coords, properties=ps.properties)) try: molconformer = MolConformer.from_sites(obc_sites, siteids=[s.properties['siteid'] for s in obc_sites]) except ConformerError: raise DisorderParserError('cannot init legit conformer to identify site location') for sid in molconformer.siteids: site = molconformer.get_site_byid(sid) if molconformer.backbone is None: res[site.properties[key]] = 'sidechain' else: if sid in molconformer.backbone.siteids: res[site.properties[key]] = 'bone' else: res[site.properties[key]] = 'sidechain' return res
[docs] def to_configs(self, write_files=False, scaling_mat=(1, 1, 1), assign_siteids=True, vanilla=True, supressdisorder=True): """ return pstructure, pmg structure is the unit cell structure with all disordered sites unwrap_str, pmg unwrap structure, with all disordered sites mols, a list of pmg mol with disordered sties confs, [[conf1, occu1], ...], conf1 is a clean structure """ self.prepare_data() # edit self.data asym_psites = self.get_psites_from_data() # assign fields: occu, disg, label raw_symmops = get_symmop(self.cifdata) psites, symmops = apply_symmop(asym_psites, raw_symmops) # assign field: iasym pstructure = Structure.from_sites(psites, to_unit_cell=True) if write_files: pstructure.to('cif', 'dp_wrapcell.cif') # assign siteids here if unit cell if np.prod(scaling_mat) == 1.0 and assign_siteids: print('siteid is assigned to unitcell in dp.to_configs') for isite in range(len(pstructure)): pstructure[isite].properties['siteid'] = isite # sc, n_unitcell = ConfigConstructor.build_supercell_full_disorder(pstructure, scaling_mat) mols, unwrap_str, unwrap_pblock_list = PBCparser.unwrap_and_squeeze(pstructure) # assign field: imol if write_files: unwrap_str.to('cif', 'dp_unwrapunit.cif') sc, n_unitcell = ConfigConstructor.build_supercell_full_disorder(unwrap_str, scaling_mat) # assign field: icell if write_files: sc.to('cif', 'dp_supercell.cif') if np.prod(scaling_mat) != 1.0 and assign_siteids: warnings.warn( 'siteid is assigned to supercell in dp.to_configs, notice in this case, sites in mols, unwrap_str, unwrap_pblock_list mols do not have siteids!' ) for isite in range(len(sc)): sc[isite].properties['siteid'] = isite if vanilla: iconf = 0 conf_structures = self.get_vanilla_configs(sc.sites) # # it is possible to have the situation where unwrapping disordered sites lead to super-large molecule # # e.g. x15029, this can be dealt with unwrpa again, I do not think this is common tho. # # if use this you may want to reassign imol field # for c in conf_structures: # _, struct, _ = PBCparser.unwrap(c) # unwrap_again.append(struct) # confs = [[c, 0.5] for c in conf_structures] def get_average_occu(structure: Structure): try: return np.mean([s.properties['occu'] for s in structure.sites]) except KeyError: return 0.5 confs = [[c, get_average_occu(c)] for c in conf_structures] if write_files: for conf in conf_structures: conf.to('cif', 'vconf_{}.cif'.format(iconf)) # pymatgen somehow does not write disg field in the cif iconf += 1 if len(set(c.composition for c in conf_structures)) > 1: warnings.warn('different compositions in confs! better use vconf_0!') # raise DisorderParserError('different compositions in confs!') return pstructure, unwrap_str, mols, sorted(confs, key=lambda x: x[1], reverse=True) if supressdisorder: conf_structures = self.get_vanilla_configs(sc.sites) locbysitelabel = {} for conf in conf_structures: try: loc_conf = self.get_site_location_by_key(conf.sites, 'label') except DisorderParserError: warnings.warn('cannot get site location, supressdisorder is disabled') loc_conf = {} for k in loc_conf: locbysitelabel[k] = loc_conf[k] def is_sidechain(label: AtomLabel): if locbysitelabel[str(label)] == 'sidechain': return True return False else: is_sidechain = None iconf = 0 confs = [] conf_structures = [] au = AsymmUnit(asym_psites) au_labels, au_inv_labels, disordered_portions = au.infodict(is_sidechain) inv_labels = [l.label for l in au_inv_labels] conf_ins = ConfigConstructor.gen_instructions(disordered_portions, len(symmops), n_unitcell) for confin in conf_ins: conf, conf_occu = ConfigConstructor.dissc_to_config(sc, inv_labels, disordered_portions, confin) confs.append([conf, conf_occu]) conf_structures.append(conf) if write_files: conf.to('cif', 'conf_{}.cif'.format(iconf)) # pymatgen somehow does not write disg field in the cif iconf += 1 if len(set(c.composition for c in conf_structures)) > 1: warnings.warn('different compositions in confs! better use vconf_0!') # raise DisorderParserError('different compositions in confs!') return pstructure, unwrap_str, mols, sorted(confs, key=lambda x: x[1], reverse=True)
[docs]class ConfigConstructor:
[docs] @staticmethod def gen_instructions(asym_portions, n_asym, n_cell): """ get all possible instructions, notice this is exponentially scaled # of configs = 2^n_port*n_asym*n_cell where 2 comes from pairwise disordered occupancies """ pair_size = [] for ipair in range(len(asym_portions)): pair_size.append(len(asym_portions[ipair])) # should always be 2 dis_ins_combinations = itertools.product(*[list(range(n)) for n in pair_size]) results = itertools.product(dis_ins_combinations, repeat=n_asym) results = list(itertools.product(results, repeat=n_cell)) return results
[docs] @staticmethod def build_supercell_full_disorder(pstructure, scaling_matrix): """ get a supercell with all disordered sites inside, should be used to generate a certain config based on instruction :param pstructure: :param scaling_matrix: :return: """ scale_matrix = np.array(scaling_matrix, np.int16) if scale_matrix.shape != (3, 3): scale_matrix = np.array(scale_matrix * np.eye(3), np.int16) new_lattice = Lattice(np.dot(scale_matrix, pstructure._lattice.matrix)) f_lat = lattice_points_in_supercell(scale_matrix) c_lat = new_lattice.get_cartesian_coords(f_lat) new_sites = [] for site in pstructure.sites: icell = 0 for v in c_lat: site_properties = deepcopy(site.properties) site_properties['icell'] = icell s = PeriodicSite(site.species, site.coords + v, new_lattice, properties=site_properties, coords_are_cartesian=True, to_unit_cell=False) new_sites.append(s) icell += 1 new_charge = pstructure._charge * np.linalg.det(scale_matrix) if pstructure._charge else None return Structure.from_sites(new_sites, charge=new_charge), len(c_lat)
[docs] @staticmethod def dissc_to_config(sc: Structure, inv_labels: [str], disportions, instruction): """ take instructions to generate a certain config from super cell """ pool = [] config_occu = 1 for icell in range(len(instruction)): for iasym in range(len(instruction[icell])): for iport in range(len(instruction[icell][iasym])): idis = instruction[icell][iasym][iport] disu = disportions[iport][idis] config_occu *= disu.occu for label in disu.labels: pool.append((label.label, iasym, icell)) conf_sites = [] for ps in sc.sites: prop = ps.properties if prop['label'] in inv_labels: conf_sites.append(ps) else: plabel = prop['label'] piasym = prop['iasym'] picell = prop['icell'] if (plabel, piasym, picell) in pool: conf_sites.append(ps) return Structure.from_sites(conf_sites), config_occu