Parallel iDMRG with mpi4py

mircomarahrens · Post by **mircomarahrens** » 23 Aug 2018, 09:05

Hey there,

I am working on a parallelization of iDMRG following https://arxiv.org/abs/1606.06790, which could be interesting for running TeNPy on a cluster and which, once it works, I would like to contribute to the package. So far I have implemented a quick-and-dirty running version using the multiprocessing module and partial from the functools module. Here is a minimal code snippet (without tenpy) as a basis for further discussion of the parallelization scheme:


import multiprocessing as mp
from functools import partial

# define a worker function for doing some calculations
def worker_function(data, iterable):
    # do something; in this toy example `data` is unused
    result = 1 + iterable
    #print(mp.current_process())
    return result

# the guard is needed on platforms that start worker processes with "spawn"
if __name__ == "__main__":
    # prepare pool with 4 processes
    pool = mp.Pool(4)

    # prepare list of iterables, e.g. sites in the system
    iterables = [0, 1, 2, 3]

    # some data here, tensor network data like mps, mpo, etc. in the final code
    data = None

    # prepare worker and do calculations
    worker = partial(worker_function, data)
    results = pool.map(worker, iterables)

    # close and join
    pool.close()
    pool.join()

    # print the content of results
    for result in results:
        print(result)

This is not very efficient, since data can be very large and is sent in full to every process. My idea is instead to split data by the iterables, scatter the pieces across the processes, and collect (gather) the results afterwards. For that I want to use mpi4py (Message Passing Interface for Python, https://mpi4py.readthedocs.io/en/stable/, also included in the Intel Distribution for Python, https://software.intel.com/en-us/distri ... for-python, which is pretty fast and runs at least on the bwunicluster). I'll keep working on it. In the meantime, anyone who is familiar with mpi4py or has other remarks or hints on how to proceed is welcome to comment here.
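A minimal sketch of the scatter/gather idea could look as follows. It is only an illustration, not working iDMRG code: `split_into_chunks`, `local_update` and `run` are hypothetical names, the per-site "data" are plain integers, and the script falls back to a serial run if mpi4py is not installed. The MPI calls used are `comm.scatter` and `comm.gather` on `MPI.COMM_WORLD`.

```python
try:
    from mpi4py import MPI  # optional; requires an MPI installation
except ImportError:
    MPI = None


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous, nearly equal parts."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def local_update(site_data):
    """Placeholder for the real per-site work (hypothetical)."""
    return [x + 1 for x in site_data]


def run(all_site_data):
    """Scatter the data by site, update locally, gather on rank 0."""
    if MPI is None:
        return local_update(all_site_data)  # serial fallback
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    # only rank 0 builds the list of chunks; the others pass None
    chunks = split_into_chunks(all_site_data, size) if rank == 0 else None
    my_chunk = comm.scatter(chunks, root=0)
    my_result = local_update(my_chunk)
    gathered = comm.gather(my_result, root=0)
    if rank == 0:
        return [x for part in gathered for x in part]
    return None


if __name__ == "__main__":
    result = run([0, 1, 2, 3])
    if result is not None:
        print(result)
```

With mpi4py available this would typically be started with something like `mpirun -n 4 python script.py` (the script name is a placeholder). For brevity every rank is handed the full list here; in a real code only rank 0 would hold it, so each process only ever receives its own chunk.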

Best, mm

Johannes · Post by **Johannes** » 29 Aug 2018, 10:09

Hey Mirco,
parallelization of DMRG is a great idea which will probably benefit many people!

Using mpi4py seems like a good idea to me. As you said, it's included in the Intel Python distribution, and also in the Anaconda distribution. These two are the ones recommended in the installation instructions, so that's fine.

I looked at the paper by Ueda you've cited, and I wonder what's the advantage of their approach compared to https://arxiv.org/abs/1301.3494? I couldn't really understand it from the paper.
The latter approach seems more intuitive to me and easier to implement when distributing the memory over different nodes: if you have 40 sites in the MPS and 10 nodes, each node just stores the part for its own 4 sites, and only the boundaries of the state/environment are shared. (Since Miles Stoudenmire is one of the authors, I would strongly guess that it's also the approach used in ITensor.)
As far as I can see from the paper by Ueda, Hida's approach requires exchanging the states all the time, which would be a serious drawback. Or did I miss something?
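To make the 40-sites-on-10-nodes example concrete, here is a small sketch of the bookkeeping only. The function names are hypothetical, sites are numbered from 0, and the site count is assumed to be a multiple of the node count.

```python
def partition_sites(n_sites, n_nodes):
    """Assign contiguous blocks of sites to nodes (assumes even division)."""
    if n_sites % n_nodes != 0:
        raise ValueError("n_sites must be a multiple of n_nodes in this sketch")
    per_node = n_sites // n_nodes
    return [list(range(k * per_node, (k + 1) * per_node))
            for k in range(n_nodes)]


def shared_bonds(segments):
    """Bonds (i, i+1) that straddle two neighboring segments.

    Only the state/environment at these bonds has to be exchanged.
    """
    return [(seg[-1], seg[-1] + 1) for seg in segments[:-1]]


if __name__ == "__main__":
    segments = partition_sites(40, 10)
    print(segments[0])             # sites stored on the first node
    print(shared_bonds(segments))  # bonds where nodes have to communicate
```

Each node then sweeps inside its own block, and communication is restricted to the 9 boundary bonds of this example.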

mircomarahrens · Post by **mircomarahrens** » 30 Aug 2018, 09:57

Hey Johannes,

the approach by Ueda separates the system into disconnected parts, works on those in parallel and then updates the environments. The information flow is different in this approach, since one does not sweep through the system in the sense of left-right sweeping. I guess this is valid because one starts with random coefficients for the projected bond states anyway.

I would further say that this approach is built on top of the parallelization scheme by Miles/White. To stay with the example, take 4 sites on each node, enumerated as [1,2,3,4] on node A, [5,6,7,8] on node B, etc., and say each node has 12 cores. Following Ueda, one can separate the 4 sites further into blocks, e.g. [1,2] and [3,4] on node A and [5,6] and [7,8] on node B, and so on, with the corresponding left and right environments (in the sense of two-site update DMRG), and give each block update 6 cores. In the next step one can then update the blocks [2,3] and [6,7] on the corresponding nodes with the previously updated environments. What remains are the updates of the bonds crossing the nodes, so we have to exchange the information about the updated parts and re-separate the system. So the point at which the state is exchanged is the same as in the approach by Miles/White, right? At least this is how I understood the approach.
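The order of updates described above can be written down as a small schedule generator. This is only a sketch of the bookkeeping for a finite open chain with sites numbered from 1, with a hypothetical function name; it assumes an even number of sites per node and a total site count that is a multiple of it.

```python
def update_schedule(n_sites, sites_per_node):
    """Return three lists of two-site blocks (i, i+1), one per step.

    step 1: disjoint blocks inside each node, e.g. (1,2) and (3,4)
    step 2: shifted blocks inside each node, e.g. (2,3)
    step 3: blocks crossing node boundaries, e.g. (4,5)
    """
    step1, step2, step3 = [], [], []
    for first in range(1, n_sites + 1, sites_per_node):
        last = first + sites_per_node - 1
        step1 += [(i, i + 1) for i in range(first, last, 2)]
        step2 += [(i, i + 1) for i in range(first + 1, last, 2)]
        if last < n_sites:
            # this bond needs data from two nodes, hence communication
            step3.append((last, last + 1))
    return step1, step2, step3


if __name__ == "__main__":
    for step, blocks in enumerate(update_schedule(8, 4), start=1):
        print(step, blocks)
```

All blocks within one step are disjoint, so they can be updated in parallel. Only step 3 touches sites on two different nodes.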

Another Python-related thing I ran into, which you may be able to help me with: with NumPy/SciPy installed on top of a multithreaded library like MKL, I guess one should set the number of threads to avoid competition between the processes. What do you think is the best way to do this? Should one set the libraries to single-threading at the beginning, or would it be better to set the number of threads somewhere in the script and then reload the libraries?

Johannes · Post by **Johannes** » 30 Aug 2018, 12:43

Hi Mirco,
I see, so they are basically the same in the limit of half as many nodes as sites.
I wonder whether the full parallelization of Ueda's approach requires more or fewer iterations/sweeps to reach convergence. As far as I can see, compared to sweeping, it needs more (two-site) updates to carry information from left to right, so my naive guess would have been that the usual sweeping would actually perform better, but I might be wrong. Do you know of any direct comparisons?

Don't get me wrong, I don't want to stop you from implementing this version if you think it's better. It just seems harder to me, so I'd like to understand why.

At least on Linux (I don't have much experience on Windows), I think it's easier to just set the number of cores beforehand in the bash script which starts the job, using the suitable environment variables like MKL_NUM_THREADS or OMP_NUM_THREADS, especially since the details of how to set it depend on the Python distribution used.
If you want to set it dynamically, take a look at the wrapper functions provided in tenpy.tools.process.
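As an illustration of the environment-variable route, here is a sketch that starts a job from Python with the thread limits already in place, which mirrors what the bash script would do. `thread_limited_env` and `run_with_threads` are hypothetical helpers, and the child process here just prints the variables instead of running a simulation.

```python
import os
import subprocess
import sys


def thread_limited_env(n_threads):
    """Copy of the current environment with MKL/OpenMP thread limits set."""
    env = dict(os.environ)
    env["MKL_NUM_THREADS"] = str(n_threads)
    env["OMP_NUM_THREADS"] = str(n_threads)
    return env


def run_with_threads(python_args, n_threads):
    """Start a new Python process that sees the limits from the beginning."""
    return subprocess.run(
        [sys.executable, *python_args],
        env=thread_limited_env(n_threads),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # stand-in for the real job: report what the child process sees
    child_code = ("import os; "
                  "print(os.environ['MKL_NUM_THREADS'], "
                  "os.environ['OMP_NUM_THREADS'])")
    done = run_with_threads(["-c", child_code], n_threads=1)
    print(done.stdout.strip())
```

Because the variables are set before the child interpreter starts, they are in place before numpy/scipy are imported there, so nothing has to be reloaded.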

mircomarahrens · Post by **mircomarahrens** » 31 Aug 2018, 09:05

Hey Johannes,

I don't know which version is better, nor of any direct comparisons. It could be interesting to do that, but I guess they converge equivalently. The Ueda approach reminds me of a block decimation approach like TEBD. This helps me to better understand what is going on during the simulation, e.g. it takes number-of-sites steps to distribute the information of one site to all the other ones once. I think that this is the same in the sweeping approach and that it is the bottleneck for the convergence. Anyway, I guess the details of how to perform the DMRG should be more or less independent of the parallelization scheme.

I am not a Windows user either. I took a look at the wrapper functions in tenpy.tools.process. Am I right that one has to reload preloaded modules like numpy/scipy after setting the number of threads, or is there any other way to do that?

Johannes · Post by **Johannes** » 31 Aug 2018, 09:50

mircomarahrens wrote: ↑31 Aug 2018, 09:05 Am I right that one has to reload preloaded modules like numpy/scipy after setting the number of threads or is there any other way to do that?

I don't think you need to reload scipy or numpy. It's still the same library, using the same underlying MKL; it's just a single parameter, which can be changed dynamically. I haven't checked this in a while, but I remember that it worked as I expected without reloading numpy/scipy.
Just check for yourself.

Johannes · Post by **Johannes** » 31 Aug 2018, 10:08

mircomarahrens wrote: ↑31 Aug 2018, 09:05 I don't know which version is better, nor of any direct comparisons. It could be interesting to do that, but I guess they converge equivalently. The Ueda approach reminds me of a block decimation approach like TEBD. This helps me to better understand what is going on during the simulation, e.g. it takes number-of-sites steps to distribute the information of one site to all the other ones once. I think that this is the same in the sweeping approach and that it is the bottleneck for the convergence. Anyway, I guess the details of how to perform the DMRG should be more or less independent of the parallelization scheme.

Yes, Ueda's scheme is like a Suzuki-Trotter decomposition as in TEBD, but doing DMRG updates. But that means that after updating each bond exactly twice (even, odd, even, odd), you have transported information by at most 4 sites (or maybe 5 for two-site updates). In contrast, in the usual DMRG you do a full left->right->left sweep with the same number of bond updates, so you have transported information throughout the whole system. If you use the parallelization of sweeps, say for \(n\) segments of \(L/n\) sites, you can still transport information by \(2L/n\) sites with a single sweep updating each bond twice. That's why I thought they don't perform equally well if \(L/n\) is still more than just 2 sites.
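The counting argument can be checked with a toy reachability model. This is not DMRG: a bond update on (i, i+1) is simply assumed to pass "information" between the two sites if either one already has it, and all function names are hypothetical. Both schedules below update each bond exactly twice.

```python
def spread(bond_updates, source=0):
    """Sites reached from `source` after applying the bond updates in order."""
    informed = {source}
    for i, j in bond_updates:
        if i in informed or j in informed:
            informed |= {i, j}
    return informed


def even_odd_layers(n_sites, n_rounds):
    """Trotter-like schedule: all even bonds, then all odd bonds, repeated."""
    even = [(i, i + 1) for i in range(0, n_sites - 1, 2)]
    odd = [(i, i + 1) for i in range(1, n_sites - 1, 2)]
    return (even + odd) * n_rounds


def full_sweep(n_sites):
    """Usual sweep: left -> right over all bonds, then right -> left."""
    right = [(i, i + 1) for i in range(n_sites - 1)]
    return right + right[::-1]


if __name__ == "__main__":
    L = 16
    print(max(spread(even_odd_layers(L, 2))))  # furthest site, even/odd layers
    print(max(spread(full_sweep(L))))          # furthest site, full sweep
```

In this toy model, information starting at site 0 gets as far as site 4 with the even/odd layers and all the way to site L-1 = 15 with the sweep.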
