Strategies to run large-memory calculations?

DiracString
Posts: 7
Joined: 28 Jun 2023, 23:37

Strategies to run large-memory calculations?

Post by DiracString »

Hi,
I am trying to run some DMRG simulations on a cluster. However, I eventually run out of memory. In the SLURM output file, I get the error:

Python: Select all

Traceback (most recent call last):
  File "/mmfs1/gscratch/.../conda_env/lib/python3.7/site-packages/tenpy/algorithms/dmrg.py", line 1831, in mix_rho_L
    LHeff = engine.LHeff
AttributeError: 'TwoSiteDMRGEngine' object has no attribute 'LHeff'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 191, in <module>
    info = eng.run() # the main work; modifies psi in place
....
....
File "tenpy/linalg/_npc_helper.pyx", line 684, in tenpy.linalg._npc_helper.Array_itranspose_fast
numpy.core._exceptions.MemoryError: Unable to allocate 22.5 GiB for an array with shape (19431, 2, 1, 19431, 2) and data type complex128
Since I am unfamiliar with parallelization, the only thing I am doing right now is putting the following line in my batch script, before loading Python and running the Python script:

Bash: Select all

export OMP_NUM_THREADS=192

As for the nodes, I tried both of the following and got similar errors either way:

Bash: Select all

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=96
#SBATCH --ntasks=192
#SBATCH --mem=1495GB
and

Bash: Select all

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=96
#SBATCH --ntasks=1
#SBATCH --mem=1495GB
My script is quite standard; the simulation actually runs with:

Python: Select all

eng = dmrg.TwoSiteDMRGEngine(psi, M, dmrg_params)
info = eng.run()  # the main work; modifies psi in place and returns (E, psi)

E = info[0]
psi = info[1]
Do you have any suggestions on how I could make this job run? Is there a way to perhaps write arrays to disk during the process instead of doing everything in memory? Or some other strategy?

Thanks!
Johannes
Site Admin
Posts: 469
Joined: 21 Jul 2018, 12:52
Location: TU Munich

Re: Strategies to run large-memory calculations?

Post by Johannes »

I guess I'm too late to help you solve this problem, but maybe as a hint for people running into it in the future:
The #SBATCH --nodes=2 asks SLURM for 2 nodes for MPI parallelization. TeNPy cannot utilize that (unless you're working in the mpi_parallel branch, which is for very specific use cases only), so by doing that you only request twice the resources from the cluster that you can actually use.
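
For reference, a single-node request that TeNPy can actually make use of would look more like the sketch below - the core count and memory are only placeholders, so adjust them to your cluster, and keep OMP_NUM_THREADS consistent with --cpus-per-task:

Bash: Select all

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=96   # all cores of a single node for one process
#SBATCH --mem=700GB          # placeholder - whatever a single node of your cluster offers

export OMP_NUM_THREADS=96    # match --cpus-per-task, not the total core count of several nodes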

What TeNPy can do is indeed cache some intermediate data to disk - the big part in memory is the DMRG environments, and we support caching them with the cache system in tenpy.tools.cache, see init_cache and open.
Using the Simulation setup, you "just" need to add the following parameters (globally in your simulation parameters):

YAML parameters: Select all

cache_params:
    storage_class: PickleStorage
    use_threading: True  # reduce the OMP_NUM_THREADS if you use this!
    directory: "/scratch/user_12abc/tmp/cache" # specify `directory` or `tmpdir` on the cluster node's local file system (not network storage!!!)
...  # other parameters like dmrg_params etc
Make sure you create that directory/tmpdir on the cluster node in a path unique to your user, and clean it up / remove it again at the end of your job - ideally directly before and after the python call in your SLURM job script.
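
In the SLURM job script, that could look roughly like the sketch below; the /scratch path and the use of $SLURM_JOB_ID are assumptions about your cluster, and the directory created here has to match the `directory` you set in cache_params:

Bash: Select all

CACHE_DIR="/scratch/$USER/tmp/cache_$SLURM_JOB_ID"  # unique per user and job
mkdir -p "$CACHE_DIR"                               # create it before the python call
python run.py                                       # your TeNPy script, with cache_params pointing to $CACHE_DIR
rm -rf "$CACHE_DIR"                                 # remove the cache again at the end of the job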
jama7168
Posts: 1
Joined: 25 Mar 2025, 10:23

Re: Strategies to run large-memory calculations?

Post by jama7168 »

Hi,

I am new to this forum and I am facing a very similar problem to the one mentioned here. I tried to add this

YAML parameters: Select all

cache_params:
    storage_class: PickleStorage
    use_threading: True  # reduce the OMP_NUM_THREADS if you use this!
    directory: "/scratch/user_12abc/tmp/cache" # specify `directory` or `tmpdir` on the cluster node's local file system (not network storage!!!)
...  # other parameters like dmrg_params etc
to my configuration file. However, when I check the RAM usage on the cluster with and without the cache, it does not seem to make a difference: in both cases the RAM usage is around 91 GB.
This is what I have in my parameters file:

YAML parameters: Select all

cache_params:
    storage_class: PickleStorage
    use_threading: True  # reduce the OMP_NUM_THREADS if you use this!
    # further specify `directory` or `tmpdir` on the cluster node's local file system
    tmpdir: /scratch/local/tmp 
    delete: True
cache_threshold_chi: 600  # use cache for chi larger than that
In the log file it looks like this:

Code: Select all

INFO    : use trivial cache (keeps everything in RAM)
INFO    : GroundStateSearch: reading 'algorithm_params'={'mixer': True, 'mixer_params': {'amplitude': 1e-05, 'decay': 2.0, 'disable_after': 5}, 'chi_list': {0: 10, 2: 50, 5: 800}, 'lanczos_params': {'N_max': 3, 'N_min': 3, 'N_cache': 20, 'reortho': False}, 'max_E_err': 1e-07, 'N_sweeps_check': 1, 'max_hours': 72, 'combine': True}
INFO    : GroundStateSearch: reading 'cache_params'={'storage_class': 'PickleStorage', 'use_threading': True, 'tmpdir': '/scratch/local/tmp', 'delete': True}
INFO    : new non-trivial cache with storage PickleStorage
INFO    : PickleStorage: create directory /scratch/local/tmp/tenpy_cache_PickleStoragep8gg4gzg
INFO    : tenpy worker thread starting
It seems that first the trivial cache is created and only later the PickleStorage. Is that correct, or is the problem somewhere in the initialization of the cache?
Do you have any recommendations on what to do and how to make it work?

My full simulation file looks like this:

YAML parameters: Select all

simulation_class : GroundStateSearch

directory: results
output_filename_params:
    prefix: two_site_dmrg
    parts:
        model_params.L: 'L_{0:d}'
        model_params.beta: 'beta_{0:.2f}' 
        model_params.Nmax: 'Nmax_{0:.1f}'
        #algorithm_params.trunc_params.chi_max: 'chi_{0:04d}'
    suffix: .h5
#skip_if_output_exists: True
save_every_x_seconds: 14400
save_psi: True  # save the full wave function - needs more disk space, but allows resuming / redoing measurements
save_resume_data: True

log_params:
    to_stdout: WARNING  # always check this output - empty is good
    to_file: INFO
    # format: "{levelname:.4s} {asctime} {message}"

cache_params:
    storage_class: PickleStorage
    use_threading: True  # reduce the OMP_NUM_THREADS if you use this!
    # further specify `directory` or `tmpdir` on the cluster node's local file system
    tmpdir: /scratch/local/tmp 
    delete: True
cache_threshold_chi: 600  # use cache for chi larger than that

model_class : Spin_Holstein_Model
model_params :
    L: 24 #define number of sites
    beta: 100 #stiffness of the trap
    c_z : 2 
    J : 1.0 #coupling spin-spin
    F_z : 1.5 #pre factor interaction term
    Nmax: 12 #maximal boson occupation
    #L_cutoff : L #cut off interactions after half the spins
    #displacement: True

initial_state_params:
    method : lat_product_state
    product_state : [[up, 0], [down, 0]]

algorithm_class: TwoSiteDMRGEngine
algorithm_params:
    mixer: SubspaceExpansion
    mixer_params:
        amplitude: 1.e-6 #amplitude of the mixer
        decay: 2.0 #amplitude is divided by factor decay after each sweep
        disable_after: 5 #disable mixer after this number of sweeps
    chi_list:
        0: 10
        2: 50
        5: 1000
    lanczos_params:
        N_max: 3 #parameters from https://github.com/ITensor/ITensorBenchmarks.jl/blob/main/src/tenpy_itensor_comparison/densempo_tenpy_1d_dmrg.py
        N_min: 3
        N_cache: 20
        reortho: False
    max_E_err: 1.e-10
    N_sweeps_check: 1 #check for convergence after every N sweeps
    max_hours: 44
    combine: True


connect_measurements:
    - - tenpy.simulations.measurement
      - m_onsite_expectation_value
      - opname: Sz
        fix_u: 0
    - - tenpy.simulations.measurement
      - m_onsite_expectation_value
      - opname: Sigmaz
        fix_u: 0
    - - tenpy.simulations.measurement
      - m_onsite_expectation_value
      - opname: N
        fix_u: 1
    - - simulation_method
      - wrap walltime               # "measure" wall clock time it took to run so far
    - - tenpy.tools.process
      - wrap memory_usage           # "measure" the current RAM usage in MB
Could it be related to the chi_list? Do I maybe have to specify a chi_max?

I would appreciate any hint or advice! Thank you very much.

Best,
Jakob