Error with resuming dmrg calculation

Gun · Post by **Gun** » 22 Jul 2024, 12:54

Hello!

I have been using TeNPy for investigating a custom model on a cluster provided by my university. For this reason, I have been using the ‘simulation’ interface of TeNPy. However, I have been having trouble continuing calculations that are interrupted before converging due to the time limits on my jobs.

From previous posts 1, 2, my understanding is that resume_from_checkpoint() cannot be invoked from a YAML file; it has to be called from the terminal or from a Python script, such as the one below.

Python: Select all

import tenpy
import square_MPO #my custom model

h5name = 'dmrg_chi_0500.h5'
tenpy.resume_from_checkpoint(filename=h5name)

However, I get an IndexError stating:

Python: Select all

 Traceback (most recent call last):
  File "/Users/gungunal/Downloads/dmrg_longrange/periodic_7x7/alpha=2.2/lambda=3/chi_0500/results/test/resume.py", line 5, in <module>
    tenpy.resume_from_checkpoint(filename=h5name)
  File "/Users/gungunal/Library/Python/3.9/lib/python/site-packages/tenpy/simulations/simulation.py", line 1330, in resume_from_checkpoint
    results = sim.resume_run()
  File "/Users/gungunal/Library/Python/3.9/lib/python/site-packages/tenpy/simulations/simulation.py", line 366, in resume_run
    self.resume_run_algorithm()  # continue with the actual algorithm
  File "/Users/gungunal/Library/Python/3.9/lib/python/site-packages/tenpy/simulations/ground_state_search.py", line 73, in resume_run_algorithm
    E, psi = self.engine.resume_run()
  File "/Users/gungunal/Library/Python/3.9/lib/python/site-packages/tenpy/algorithms/algorithm.py", line 158, in resume_run
    return self.run()
  File "/Users/gungunal/Library/Python/3.9/lib/python/site-packages/tenpy/algorithms/dmrg.py", line 459, in run
    return super().run()
  File "/Users/gungunal/Library/Python/3.9/lib/python/site-packages/tenpy/algorithms/mps_common.py", line 775, in run
    if self.stopping_criterion(iteration_start_time=iteration_start_time):
  File "/Users/gungunal/Library/Python/3.9/lib/python/site-packages/tenpy/algorithms/mps_common.py", line 869, in stopping_criterion
    if self.sweeps > min_sweeps and self.is_converged():
  File "/Users/gungunal/Library/Python/3.9/lib/python/site-packages/tenpy/algorithms/dmrg.py", line 405, in is_converged
    E = self.sweep_stats['E'][-1]

Could someone help me understand what might be causing this error and how I can successfully resume my DMRG calculations?

When I load the HDF5 file myself, I can access the last element of the energy array:

Python: Select all

h5file['sweep_stats']['E'][-1]
-294.56640787413

Thanks in advance for your help!
Gun

Gun · Post by **Gun** » 24 Jul 2024, 00:03

Previously, I had this message as a post-script, but I have decided to submit it as a separate message:

I think I should also provide my attempt at debugging. Feel free to ignore my attempt.

Short explanation:

In the method tenpy.simulations.ground_state_search.GroundStateSearch.init_algorithm, 'resume_data' of the instance is copied. However, 'resume_data' does not include information about energies after each sweep, which is stored in ‘sweep_stats’, but does contain the number of sweeps performed under 'sweeps'.

When attempting to re-run the engine, the first sweep of the re-run tries to check for convergence. This is because the sweep number is read from ‘sweeps’, which is not 1, and ‘min_sweeps’, if not defined, defaults to 1. However, it can’t check for convergence in energy because the energy array was deleted in the previous step, which causes the error.
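
The failure mode described above can be reproduced without TeNPy. What follows is a minimal toy sketch, not TeNPy's actual code: `ToyEngine`, its attributes, and the simplified `stopping_criterion` are hypothetical stand-ins that only mimic the logic visible in the traceback (a sweep counter restored from the checkpoint, an empty energy list, and a default `min_sweeps` of 1).

```python
class ToyEngine:
    """Hypothetical stand-in for a sweep-based engine; not TeNPy's implementation."""

    def __init__(self, resume_data=None, min_sweeps=1):
        resume_data = resume_data or {}
        # The sweep counter is restored from the checkpoint ...
        self.sweeps = resume_data.get("sweeps", 0)
        # ... but the per-sweep statistics start out empty.
        self.sweep_stats = {"E": []}
        self.min_sweeps = min_sweeps

    def is_converged(self):
        # Same access pattern as in the traceback: needs at least one stored energy.
        E = self.sweep_stats["E"][-1]
        return abs(E) < 1e-12  # placeholder criterion, only for this toy

    def stopping_criterion(self):
        # With sweeps=4 and min_sweeps=1 the convergence check runs immediately.
        return self.sweeps > self.min_sweeps and self.is_converged()


def try_resume(resume_data):
    """Return the name of the exception raised by the first check, or None."""
    engine = ToyEngine(resume_data=resume_data)
    try:
        engine.stopping_criterion()
    except IndexError as err:
        return type(err).__name__
    return None
```

Calling `try_resume({"sweeps": 4})` hits the empty list, while a fresh start (`try_resume({})`) never reaches `is_converged` because the first condition short-circuits.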

Long explanation:

tenpy.resume_from_checkpoint takes an h5 file as input.

My impression is that tenpy.simulations.simulation.init_simulation_from_checkpoint initializes the simulation correctly. In my case, I am running a TwoSiteDMRG calculation, so it initializes the sim object as an instance of the tenpy.simulations.ground_state_search.GroundStateSearch class.

The first traceback is at line 1330 of simulation.py

Python: Select all

results = sim.resume_run()

Stepping inside sim.resume_run, it first initializes the model and the state. In my case, the model is imported from another file, and I believe it is initialized correctly. If necessary, I can provide my code. However, I believe the problem occurs at line 360, during the algorithm initialization.

Since my sim is an object of tenpy.simulations.ground_state_search.GroundStateSearch, we need to look at tenpy.simulations.ground_state_search.GroundStateSearch.init_algorithm. This method inherits the init_algorithm method from the Simulation class.

init_algorithm copies 'resume_data' to initialize the algorithm and deletes it afterwards. Note that results['resume_data'] contains the number of sweeps performed so far, but has no information about the energies found after each sweep, which are kept under results['sweep_stats']['E'].

Python: Select all

self.results['resume_data']['sweeps']
4
self.results['sweep_stats']['E'][-1]
-294.56640787413

The engine is initialized with the information of sweep number 4 (in my case) and without the information of energies.

Returning to tenpy.simulations.ground_state_search.GroundStateSearch.init_algorithm: after initializing the algorithm using only what is contained in 'resume_data', this function deletes 'sweep_stats' and 'update_stats'. In other words, before entering the loop at line 59 we can still access the energy values using

Python: Select all

self.results['sweep_stats']['E'][-1]
-294.56640787413

After the loop terminates the 'sweep_stats' are erased.

The algorithm is re-started at line 366 with resume_run_algorithm(). Since I am doing TwoSiteDMRG, a child of tenpy.algorithms.mps_common.IterativeSweeps class, it utilizes the run method.

The problem occurs for me at line 775 when executing the stopping_criterion method, which is defined at line 834.

As mentioned previously, 'sweeps' is loaded from the first run and equals 4 (in my case), so the second condition on line 870, is_converged(), is also evaluated. However, since the energies were discarded, energy convergence cannot be checked, and line 405 raises the IndexError.

To solve this problem, we can either update the min_sweeps parameter accordingly when loading the h5 file, or keep sweep_stats.
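
As an illustration of the second option (keeping sweep_stats), here is a small sketch on plain dictionaries. It is not TeNPy code: `restore_sweep_stats` and `last_energy` are hypothetical helpers, and whether a real engine can be handed its old statistics this way is an assumption. The sketch only shows the idea of copying the saved per-sweep lists back before the first convergence check, plus a guard for the case where no energy is available.

```python
def restore_sweep_stats(engine_stats, saved_results):
    """Copy saved per-sweep lists into the engine's statistics dict.

    Both arguments are plain dicts in this toy; `saved_results` plays the
    role of the `results` loaded from the checkpoint file.
    """
    saved = saved_results.get("sweep_stats", {})
    for key, values in saved.items():
        # Copy, so later appends do not modify the loaded results.
        engine_stats[key] = list(values)
    return engine_stats


def last_energy(sweep_stats):
    """Return the most recent energy, or None if no sweep has been recorded."""
    energies = sweep_stats.get("E", [])
    return energies[-1] if len(energies) > 0 else None
```

With the statistics restored, a check like `last_energy(stats)` has a value to compare against; with an empty list it returns `None` instead of raising, so the caller can skip the convergence test for that sweep.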

Gun · Post by **Gun** » 24 Jul 2024, 11:51

My dirty workaround solution would be to add a line between 544 and 545 in simulation.py as follows:

Python: Select all

params = self.options.subconfig('algorithm_params')  # line 544
if 'min_sweeps' in params: params['min_sweeps'] = kwargs['resume_data']['sweeps']  # added line: update min_sweeps according to the previous calculation
self.engine = AlgorithmClass(self.psi, self.model, params, **kwargs)  # line 545

This change updates the min_sweeps condition based on the number of sweeps from the previous calculation, allowing the convergence check to function correctly when resuming the simulation.
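
The same idea can be written as a standalone helper on plain dictionaries. This is only a sketch under assumptions: `raise_min_sweeps` is a hypothetical function, `params` and `resume_data` are ordinary dicts rather than TeNPy's option objects, and the fallback value of 1 is taken from the default mentioned earlier in this thread. Unlike the one-line patch, it also sets min_sweeps when the key is absent, and it never lowers a larger value the user already chose.

```python
def raise_min_sweeps(params, resume_data):
    """Return a copy of `params` with min_sweeps >= sweeps already performed."""
    updated = dict(params)  # leave the caller's dict untouched
    sweeps_done = resume_data.get("sweeps", 0)
    current = updated.get("min_sweeps", 1)  # assumed default, see lead-in
    updated["min_sweeps"] = max(current, sweeps_done)
    return updated
```

For example, `raise_min_sweeps({}, {"sweeps": 4})` yields a min_sweeps of 4, so a condition of the form `sweeps > min_sweeps` is false on the first check after resuming.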

Edit 1: This does not work! Because we also need the entropy values from previous iterations when performing sweeps with run_iteration() method.

Edit 2: Maybe the previous fix I proposed could work for others. [strikeout]But in my case, the wavefunction loaded from the previous save raises ValueError("entropy with non-diagonal schmidt values").

I am confused why this is happening. For the same model, I can have more sweeps if the job is not cancelled due to the time limit. However, when it is canceled due to the time limit, I can’t restart it. The problem occurs when I load the wavefunction; psi.entanglement_entropy() raises the non-diagonal schmidt values error.[/strikeout]

Edit 3: [strikeout]My suspicion is that the problem is caused by how the jobs are killed when they reach the time limit by SLURM on the HPC cluster that I am using.[/strikeout] I have run some low-effort calculations on my laptop and specified in the parameters a maximum number of sweeps. After applying the “fix” mentioned above and commenting out the line that checks if a calculation is “finished,” I successfully managed to continue the calculations. If interested, I can provide the custom model, simulation parameters, and log files.

Edit 4: I could not reproduce the error mentioned in Edits 2 and 3, so that error was probably due to a corrupted h5 file.

Gun · Post by **Gun** » 31 Dec 2024, 14:14

Hi everyone,

I’m running DMRG on an HPC cluster that allows jobs of up to 24 hours. I request 24 hours of walltime on a single node. In my YAML file, I set 'max_hours=20' to give TeNPy a buffer to finish cleanly, so that the job is not killed mid-sweep.

After the run stops, I get an .h5 file called 'dmrg_chi_0500.h5', which I use as input to the 'tenpy.resume_from_checkpoint' function.

Upon resuming the calculations I get:

Code: Select all

ValueError: entropy with non-diagonal schmidt values

Has anyone else encountered this issue?
