Strategies to run large-memory calculations?

Post by DiracString »

Hi,
I am trying to run some DMRG simulations on a cluster, but I eventually run out of memory. In the SLURM output file, I get the error:

Python:

Traceback (most recent call last):
  File "/mmfs1/gscratch/.../conda_env/lib/python3.7/site-packages/tenpy/algorithms/dmrg.py", line 1831, in mix_rho_L
    LHeff = engine.LHeff
AttributeError: 'TwoSiteDMRGEngine' object has no attribute 'LHeff'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 191, in <module>
    info = eng.run() # the main work; modifies psi in place
....
....
File "tenpy/linalg/_npc_helper.pyx", line 684, in tenpy.linalg._npc_helper.Array_itranspose_fast
numpy.core._exceptions.MemoryError: Unable to allocate 22.5 GiB for an array with shape (19431, 2, 1, 19431, 2) and data type complex128
Since I am unfamiliar with parallelization, the only thing I am doing right now is setting the following line in my bash script, before loading Python and running the Python script:

Bash:

export OMP_NUM_THREADS=192

As for the nodes, I tried both of the following configurations but got similar errors:

Bash:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=96
#SBATCH --ntasks=192
#SBATCH --mem=1495GB
and

Bash:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=96
#SBATCH --ntasks=1
#SBATCH --mem=1495GB
My script is quite standard; the simulation itself runs with:

Python:

from tenpy.algorithms import dmrg

eng = dmrg.TwoSiteDMRGEngine(psi, M, dmrg_params)
info = eng.run()  # the main work; modifies psi in place

E = info[0]    # ground-state energy
psi = info[1]  # optimized MPS
Do you have any suggestions on how I could make this job run? For example, is there a way to write arrays to disk during the process instead of keeping everything in memory? Or some other strategy?

Thanks!

Re: Strategies to run large-memory calculations?

Post by Johannes »

I'm probably too late to help you solve this, but maybe this serves as a hint for people running into the same problem in the future:
The #SBATCH --nodes=2 asks SLURM for 2 nodes for MPI parallelization. TeNPy cannot utilize that (unless you are working with the mpi_parallel branch, which is for very specific use cases only), so with that setting you just request twice the resources from the cluster that you can actually use.
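For a multi-threaded run on a single node, the SLURM header could instead look something like the following sketch (the 96 cores and the memory are just taken from your example; partition names etc. depend on your cluster):

Bash:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=96
#SBATCH --mem=1495GB

# OpenMP threads can only span a single node,
# so match them to the cores allocated on that node:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK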

What TeNPy can do is indeed cache some intermediate data to disk - the biggest part of the memory are the DMRG environments, and we support caching them using the cache system in tenpy.tools.cache, see init_cache and open.
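If you set up the engine yourself rather than through a Simulation, a minimal sketch could look like this (assuming a TeNPy version recent enough that the engine accepts the cache keyword argument; the directory is just a placeholder):

Python:

from tenpy.algorithms import dmrg
from tenpy.tools.cache import CacheFile

# keep the DMRG environments in pickle files on node-local disk
# instead of holding everything in RAM
with CacheFile.open(storage_class="PickleStorage",
                    use_threading=True,  # reduce OMP_NUM_THREADS if you use this!
                    directory="/scratch/user_12abc/tmp/cache") as cache:
    eng = dmrg.TwoSiteDMRGEngine(psi, M, dmrg_params, cache=cache)
    E, psi = eng.run()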
Using the Simulation setup, you "just" need to add the following parameters (globally in your simulation parameters):

YAML parameters:

cache_params:
    storage_class: PickleStorage
    use_threading: True  # reduce the OMP_NUM_THREADS if you use this!
    directory: "/scratch/user_12abc/tmp/cache" # specify `directory` or `tmpdir` on the cluster node's local file system (not network storage!!!)
...  # other parameters like dmrg_params etc
Make sure you create that directory/tmpdir on the cluster node in a path unique to your user, and clean it up / remove it at the end of your job - ideally right before and after the Python call in your SLURM job script.
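In the job script, that could look something like this sketch (the path matches the placeholder in the YAML above; appending the job ID to keep it unique is just my suggestion):

Bash:

CACHE_DIR="/scratch/user_12abc/tmp/cache_$SLURM_JOB_ID"  # node-local disk, unique per job
mkdir -p "$CACHE_DIR"   # create before the run
python run.py           # the cache_params directory should point to $CACHE_DIR
rm -rf "$CACHE_DIR"     # clean up afterwards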