Notes 

----------02 June 2001----------

Using the gaussian_taketa.py program now. What works:

he_2s:
Everything works, although GAMESS's values for (22|11) match mine for
(21|21) and vice versa. Their integrals may use physicists' notation
rather than chemists'. Not sure, though; I never noticed this before.
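If so, the mapping between the two conventions is simple: physicists' <ij|kl> puts electron 1 in slots 1 and 3, so <ij|kl> = [ik|jl] in chemists' notation -- which would turn their (22|11) into my (21|21) exactly as observed. A throwaway sketch of the relabeling (standard convention, not read out of either program):

```python
def phys_to_chem(i, j, k, l):
    """Relabel a physicists' <ij|kl> as the chemists' [ik|jl].

    <ij|kl> = int fi(1) fj(2) (1/r12) fk(1) fl(2) dr1 dr2:
    electron 1 carries orbitals (i, k), electron 2 carries (j, l).
    """
    return (i, k, j, l)

# their <22|11> is my (21|21), and their <21|21> is my (22|11)
assert phys_to_chem(2, 2, 1, 1) == (2, 1, 2, 1)
assert phys_to_chem(2, 1, 2, 1) == (2, 2, 1, 1)
```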

h2_2s:
S,t work.
V doesn't

he_sp:
S,t,v for 1s and 2p on the same center
(1s,1s|1s,1s) on the same center

h_sp:
S,t
Nuclear doesn't work when shifting center.


----------03 June 2001----------
Fixed a bug in _Fgamma: the minimum value of x should be set with
abs(x) rather than x, since I believe values < 0 are valid.

Didn't fix the problems in v for h2_2s.

Make an easier test case: proton at 0,0,0, 1s gaussian at 1,0,0.
Call this h_off.

Fixed a bug in nuclear: p. 2317 of THO wasn't clear whether eps =
1/(4*gamma) or eps = gamma/4. I had assumed the latter, but it turns
out that it's the former.

h_off now matches.
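For h_off-type cases there is also a closed form worth keeping as an independent check: the nuclear attraction of a normalized 1s gaussian (exponent alpha) with a unit point charge a distance R away is -erf(sqrt(2*alpha)*R)/R. A sketch with a brute-force quadrature cross-check (standard textbook result, not taken from THO's general formula):

```python
import math

def v_1s(alpha, R):
    """Closed-form <1s|1/r_C|1s> (Z = 1) for a normalized 1s gaussian
    with exponent alpha, a distance R > 0 from the nucleus."""
    return -math.erf(math.sqrt(2.0 * alpha) * R) / R

def v_1s_numeric(alpha, R, n=200000, rmax=10.0):
    """Brute-force radial quadrature of the same integral, using the
    shell average 1/max(r, R) for the 1/|r - C| kernel."""
    beta = 2.0 * alpha  # exponent of the squared (density) gaussian
    norm = (beta / math.pi) ** 1.5
    dr = rmax / n
    s = 0.0
    for i in range(n):
        r = (i + 0.5) * dr
        s += r * r * math.exp(-beta * r * r) / max(r, R) * dr
    return -4.0 * math.pi * norm * s
```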

h2_2s s,t,v now match.

The twoe integrals don't match, though. The ones where all four are on
the same center match, but ones with mixed centers don't.

Reproduced this in coul_1s_test: two 1s gaussians 1 bohr apart.
Found one mistake: gaussian computes the 2e integral as <ij|ij>,
gaussian_taketa computes [ii|jj]. Fixed this -- which also explains
the confusion I had yesterday regarding the 2e integrals. But there
is still a problem somewhere.

Oddly enough, also fixed a mistake in gaussians. Typo:
(alphaa+alphab+alphac+alphad) had been entered as
(alphaa+alphab+alphad+alphad). Still didn't affect the outcome.

Fixed another bug:
delta = 0.25*(1/gamma1+1/gamma2) (it was 0.25*(gamma1+gamma2)).

h2_2s now matches for the two e integrals as well.


he_sp
stv match.

Twoe ints:
        Mine	    Correct
(00|00)=1.595769    match
(10|10)=12.500187   1.329808
(11|00)=0.265961    match
(11|11)=98.113173   1.303211
(20|20)=12.500187   1.329808
(21|21)=97.953596   1.143635
(22|00)=0.265961    match
(22|11)=0.079788    match
(22|22)=98.113173   1.303211
(30|30)=12.500187   1.329808
(31|31)=97.953596   1.143635
(32|32)=97.953596   1.143635
(33|00)=0.265961    match
(33|11)=0.079788    match
(33|22)=0.079788    match
(33|33)=98.113173   1.303211

----------10 June 2001----------

It seems like either the B terms are incorrect, or the F terms are
incorrect.

In all of the cases that the integrals work, all of the B terms are
either 0 or 1, and the F term is 1.

Other observations: only the gser subfunction is tested. gcf never
gets called. All of the arguments appear to be very small.

For want of a better idea, I'm going to start looking at the B's,
since the Fgamma code is also tested a little by the one-e int code
(although by no means completely). Also, I copied the Fgamma code
from Numerical Recipes (though again, perhaps not correctly). Might
be worthwhile to compare against the compiled fortran code at some
point.

Testing the B's...

----------11 June 2001----------

Did some more testing of F_gamma. Programmed Shavitt's expression for
small t (p.8 of the paper, eq 24). This matches my results for
multiple m's for everything I've tested.

Clearly this isn't the mistake.
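For the record, here's the plain small-t Taylor series for the Boys function (my reference version -- the arrangement may differ from Shavitt's eq 24, but it sums to the same function):

```python
import math

def boys_series(m, t, nterms=30):
    """F_m(t) = integral_0^1 u**(2m) * exp(-t*u*u) du, expanded
    termwise as sum_k (-t)**k / (k! * (2m + 2k + 1)); converges
    quickly for small t."""
    return sum((-t) ** k / (math.factorial(k) * (2 * m + 2 * k + 1))
               for k in range(nterms))
```

Against the m = 0 closed form 0.5*sqrt(pi/t)*erf(sqrt(t)) this agrees to machine precision for small t.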

Back to testing the B's

x:  0 0 0 0 0
y:  0 0 0 0 0
z:  0 0 0 0 0
1.0 1.0 1.0 0.999999996667
(00|00)=1.595769
(10|00)=0.000000
x:  2 0 0 0 1
y:  0 0 0 0 0
z:  0 0 0 0 0
-0.0625 1.0 1.0 0.333333331333
x:  2 0 1 0 0
y:  0 0 0 0 0
z:  0 0 0 0 0
1.0 1.0 1.0 0.999999996667
(10|10)=12.500191
x:  1 1 0 0 1
y:  0 0 0 0 0
z:  0 0 0 0 0
0.0625 1.0 1.0 0.333333331333
(11|00)=0.265962
(11|10)=0.000000
x:  2 2 0 0 2
y:  0 0 0 0 0
z:  0 0 0 0 0
0.01171875 1.0 1.0 0.199999998571
x:  2 2 0 1 1
y:  0 0 0 0 0
z:  0 0 0 0 0
-0.0625 1.0 1.0 0.333333331333
x:  2 2 1 0 1
y:  0 0 0 0 0
z:  0 0 0 0 0
-0.0625 1.0 1.0 0.333333331333
x:  2 2 1 1 0
y:  0 0 0 0 0
z:  0 0 0 0 0
1.0 1.0 1.0 0.999999996667
(11|11)=98.113205


----------13 June 2001----------
Rewrote the coulomb routine to have a maximum 5th order loop instead
of the 15th order I had before.

Get the same (incorrect) results, but much easier to read.

It works! It turned out that once I had simplified the code enough and
reused as many small routines as possible, I could see that I had
dropped the _fact(u) from the denominator.

Testing h2_sp

One_E ints work!
TwoE ints that don't work in GAMESS notation:
             I get	  GAMESS gets
5 2 1 1	     0.729788	  0.533009
5 1 2 1      ???          0.983 (mine should be the same?)
5 3 1 3      .983	  .983 (5 3 3 1)

But other integrals that might be expected not to match,
e.g. 5 5 2 2, do match.

Could GAMESS be averaging the exchanges?

5 1 1 1 = 0.631398	matches
5 2 1 1 = 0.729788	0.533008578071E+00
5 2 1 2 = -0.028665	???
5 3 1 3 = 0.098390	5 3 3 1 matches
5 4 1 4 = 0.098390	5 4 4 1 matches
5 5 1 1 = 0.842701	
5 1 2 1 = -0.098390	wrong sign
5 2 2 1 = -0.028665	???
5 2 2 2 = 0.584140	0.539206246552E+00
5 3 2 3 = -0.028665
5 4 2 4 = -0.028665
5 5 2 1 = -0.213797
5 5 2 2 = 0.848944
5 3 3 1 = 0.098390
5 3 3 2 = -0.028665
5 5 3 3 = 0.735802
5 4 4 1 = 0.098390
5 4 4 2 = -0.028665
5 5 4 4 = 0.735802
5 1 5 1 = 0.415107
5 2 5 1 = 0.415107
5 2 5 2 = 0.484292
5 3 5 3 = 0.069185
5 4 5 4 = 0.069185
5 5 5 1 = 0.631398
5 5 5 2 = 0.533009
5 5 5 5 = 1.128379
6 1 1 1 = -0.533009	-0.729787961457E+00
6 2 1 2 = 0.075923	wrong sign

So there are several discrepancies: wrong signs. Possible averaged
exchanges.


----------14 June 2001----------
Rewrote the nuclear and kinetic routines to be shorter and easier to
understand. I also reuse more code, to make maintenance easier and
bugs less likely.

Wrote contracted (CGBF) and primitive (renamed gaussian_taketa to
PGBF) basis-function packages. Am now writing programs to test them.

he_631.py tests 1s contracted functions. All 1e and 2e ints work.

h2_631s.py tests 1s, 2s and 2p contracted functions, on two different
centers. Most 1e integrals work, although there is something wrong
with:
<1s,1|t+v|2px,2>
<2px,1|t+v|1s,2>
<2px,1|t+v|2s,2>
Since the corresponding T terms are correct, the problem must be with
the nuclear attraction.

The 2e ints with all terms on 1 center are correct. Many of the ones
on mixed centers are wrong, starting with (63|11). This may be related
to the 1e problems I'm having.

To debug this problem, go back to h2_sp.py

1e ints match, as do the 2e ints on the same center. But 2e ints on
different center do not match.

----------19 June 2001----------
Fixed a bug in the 1e V ints:

        #Ax = _A_array(l1,l2,P[0]-A[0],P[0]-B[0],C[0]-P[0],gamma)
should have been:
        Ax = _A_array(l1,l2,P[0]-A[0],P[0]-B[0],P[0]-C[0],gamma)
(etc. for Ay and Az)

All of the 1e ints in h2_631s.py now work.
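The sign matters because the C-to-P displacement enters with an odd power in half the terms. Here's a hedged Python sketch of a THO-style A term (my reconstruction of the p. 2317 formula, so treat the details as assumptions; the point is that flipping the sign of CPx flips every term with an odd i - 2r - 2u):

```python
from math import comb, factorial as fact

def binomial_prefactor(s, ia, ib, xpa, xpb):
    # coefficient of x**s in (x + xpa)**ia * (x + xpb)**ib
    return sum(comb(ia, s - t) * comb(ib, t)
               * xpa ** (ia - s + t) * xpb ** (ib - t)
               for t in range(s + 1) if s - ia <= t <= ib)

def A_term(i, r, u, l1, l2, PAx, PBx, CPx, gamma):
    # one term of the nuclear-attraction A array (THO-style,
    # with eps = 1/(4*gamma) as fixed on 03 June)
    return ((-1) ** i * binomial_prefactor(i, l1, l2, PAx, PBx)
            * (-1) ** u * fact(i) * CPx ** (i - 2 * r - 2 * u)
            * (0.25 / gamma) ** (r + u)
            / fact(r) / fact(u) / fact(i - 2 * r - 2 * u))
```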

The first 2e ints to not work in h2_631s are 63-11 and 61-31.
Changing from abs(p-a) to p-a in _fB has no effect.

Changed:
          *pow(Px-Qx,i1+i2-2*(r1+r2)-2*u) \
to:
          *pow(Qx-Px,i1+i2-2*(r1+r2)-2*u) \
in _B_term, and now several more terms work.

In fact, now only one term doesn't work: (63|31) works, but (64|41)
doesn't. My program gets these two terms to be equal to each other,
but GAMESS has them different. Could be that GAMESS changes its
reference frame.

Here's the part that differs:
  (63|31)     0.1712  0.1713  
**(64|41)     0.1065  0.1713**
**(65|51)     0.1065  0.1713**
  (61|33)     0.5974  0.5975  
**(61|44)     0.5691  0.5975**
**(61|55)     0.5691  0.5975**
  (63|32)     0.1121  0.1121  
**(64|42)     0.0682  0.1121**
**(65|52)     0.6820  0.1121**
  (62|33)     0.4364  0.4364  
**(62|44)     0.4127  0.4364**
**(62|55)     0.4127  0.4364**
  (63|33)     0.4973  0.4973  
**(63|44)     0.4241  0.4973**
**(63|55)     0.4241  0.4973**

Go back to h2_sp to see how it looks: quite a few are wrong.

Removed the abs() from _fB, and everything matches.

Went back to check h2_631s. Still wrong in the same places. Odd that
nothing changed.

Found the problem: I had actually entered the px functions three times
instead of entering separate functions for the x, y, and z
bases. Everything works now.


----------20 June 2001----------
Tested h2o/6-31G**, which reproduces the GAMESS energy. The
integrals are very slow, however.


----------22 June 2001----------
Profiling h2o:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 83992428 1967.240    0.000 1967.240    0.000 PGBF.py:178(_fact)
 11547252 1255.600    0.000 2074.810    0.000 PGBF.py:283(_fact_ratio2)
  7770861 1029.430    0.000 -4654.105   -0.001 PGBF.py:227(_binomial_prefactor)
        9  814.440   90.493 1758.530  195.392 hartree_fock.py:104(get2JmKfromInts)
  3849084  744.950    0.000 -613.865   -0.000 PGBF.py:257(_B_term)
  7698168  744.140    0.000 -2058.355   -0.000 PGBF.py:278(_fB)
   441414  563.730    0.001 -2307.102   -0.005 PGBF.py:145(coulomb)
  7698168  512.360    0.000 1887.630    0.000 PGBF.py:281(_B0)
 10599850  479.820    0.000  479.820    0.000 hartree_fock.py:68(ijkl2intindex)
  2680150  428.050    0.000  676.410    0.000 PGBF.py:322(_gser)
  7050000  307.720    0.000  307.720    0.000 dmatrix.py:117(__getitem__)
  2880198  267.000    0.000  267.000    0.000 PGBF.py:291(_gammln)
  2880198  233.120    0.000 1374.370    0.000 PGBF.py:285(_Fgamma)
  2880198  208.320    0.000  948.980    0.000 PGBF.py:311(_gammp)
  2880198  192.270    0.000 1141.250    0.000 PGBF.py:306(_gamm_inc)


I had thought the gamma function stuff was the slow part, but it looks
like the factorial routines are. Move these routines into cutil.
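A cheap intermediate step (my sketch, not the cutil code): the angular-momentum arithmetic only ever needs small arguments, so a lookup table removes nearly all of the multiplication:

```python
# precompute factorials once; the basis-function loops rarely need n > ~16
_FACT = [1] * 17
for _n in range(1, 17):
    _FACT[_n] = _FACT[_n - 1] * _n

def fact(n):
    """Table lookup for small n, falling back to explicit multiplication."""
    if n < 17:
        return _FACT[n]
    result = _FACT[16]
    for k in range(17, n + 1):
        result *= k
    return result
```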

Here's the profiling for h2 using the factorial, binomial, gamma
functions from pyutil

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   780204   17.670    0.000   17.670    0.000 pyutil.py:9(fact)
   154416   16.640    0.000   26.980    0.000 pyutil.py:39(binomial)
   103572   11.240    0.000   18.440    0.000 PGBF.py:250(fact_ratio2)
    75492    9.700    0.000   36.680    0.000 pyutil.py:30(binomial_prefactor)
        4    9.450    2.363   20.170    5.043 hartree_fock.py:104(get2JmKfromInts)
     6372    7.160    0.001   94.010    0.015 PGBF.py:146(coulomb)
    69048    6.590    0.000   57.140    0.001 PGBF.py:245(_fB)
    34524    6.310    0.000   69.200    0.002 PGBF.py:224(_B_term)
   121540    5.370    0.000    5.370    0.000 hartree_fock.py:68(ijkl2intindex)
    69048    4.430    0.000   17.120    0.000 PGBF.py:248(_B0)
    19116    4.220    0.000   73.420    0.004 PGBF.py:232(_B_array)
    81400    3.510    0.000    3.510    0.000 dmatrix.py:117(__getitem__)
    20275    3.230    0.000    4.900    0.000 pyutil.py:80(_gser)
    40400    1.970    0.000    1.970    0.000 dmatrix.py:129(__setitem__)
    21716    1.930    0.000    7.280    0.000 pyutil.py:69(gammp)

Okay, have now reproduced fact, fact2, binomial_coefficient, and
Fgamma in C. The time for h2 dropped from 23 to 14 secs.

Here's the new profiling:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    9.620    2.405   19.570    4.893 hartree_fock.py:104(get2JmKfromInts)
    34524    6.520    0.000   19.420    0.001 PGBF.py:225(_B_term)
     6372    6.200    0.001   32.050    0.005 PGBF.py:147(coulomb)
    69048    5.230    0.000   11.720    0.000 PGBF.py:246(_fB)
   121540    4.930    0.000    4.930    0.000 hartree_fock.py:68(ijkl2intindex)
    69048    4.470    0.000    6.490    0.000 PGBF.py:249(_B0)
    19116    3.980    0.000   23.400    0.001 PGBF.py:233(_B_array)
    81400    3.340    0.000    3.340    0.000 dmatrix.py:117(__getitem__)
   103572    3.200    0.000    3.200    0.000 PGBF.py:251(fact_ratio2)
    40400    1.860    0.000    1.860    0.000 dmatrix.py:129(__setitem__)
     1540    1.750    0.001   34.500    0.022 CGBF.py:70(coulomb)
    14730    0.660    0.000    0.660    0.000 PGBF.py:180(_gaussian_product_center)
    25488    0.590    0.000    0.590    0.000 PGBF.py:52(exp)
    27108    0.560    0.000    0.560    0.000 PGBF.py:56(coef)
    25488    0.470    0.000    0.470    0.000 PGBF.py:54(powers)

I'm going to move fB, fact_ratio2 & ijkl2intindex into cutil:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6372    5.990    0.001   16.170    0.003 PGBF.py:147(coulomb)
        4    5.500    1.375   11.040    2.760 hartree_fock.py:94(get2JmKfromInts)
    19116    4.380    0.000    7.480    0.000 PGBF.py:233(_B_array)
    81400    3.700    0.000    3.700    0.000 dmatrix.py:117(__getitem__)
    34524    3.100    0.000    3.100    0.000 PGBF.py:225(_B_term)
    40400    1.890    0.000    1.890    0.000 dmatrix.py:129(__setitem__)
     1540    1.490    0.001   18.290    0.012 CGBF.py:70(coulomb)
    25488    0.690    0.000    0.690    0.000 PGBF.py:52(exp)
    14730    0.610    0.000    0.610    0.000 PGBF.py:180(_gaussian_product_center)
    27108    0.570    0.000    0.570    0.000 PGBF.py:56(coef)
    25488    0.490    0.000    0.490    0.000 PGBF.py:55(norm)
    25488    0.490    0.000    0.490    0.000 PGBF.py:54(powers)
    25488    0.470    0.000    0.470    0.000 PGBF.py:53(origin)
     4782    0.340    0.000    0.340    0.000 PGBF.py:199(_overlap_kernel_1D)
     1594    0.300    0.000    0.690    0.000 PGBF.py:186(_overlap_kernel_3D)


Just tried running h2o, and there is still a problem with the new
routines. I.e.,

H2O: -74.764101
Energy =  1563.19315451
Energy =  139.402946893
Energy =  1396.67898276

Putting in the pyutil version of ijkl2intindex. Still didn't work, so
going back to the cutil version of ijkl2intindex, and will use the
pyutil version of fB.

H2: -1.083260
H2O: -74.764101
Energy =  1563.19315451
Energy =  139.402946893
Energy =  1396.67898276

Still doesn't work.

----------23 June 2001----------

Now use pyutil version of ijk2..., fB, and fact_ratio2. Still didn't
work. 

Now go back to pyutil versions of everything.
H2O: -74.764101
Energy =  1563.19315451
Energy =  139.402946893
Energy =  1396.67898276

Still doesn't work. Damn damn damn.

Test h2: still works with everything in pyutil. Also works with
everything in cutil.

Interestingly, the cutil and pyutil versions give exactly the same
wrong answer.

I have replaced gammln with the C library function lgamma. This
shouldn't improve anything, but should be faster, and should be less
code to maintain. There is also a "tgamma" function in the -lm
library, but I don't know what that does.
 
Still gives the same wrong answer.

Okay, go back to the python versions of everything, and see what I've missed.

See if I can find a simpler case: LiH?

LiH also doesn't work:
LiH: -7.900068
Energy =  21.4723372454
Energy =  85.3812870746

For LiH, the overlap matrix matches.
The one-electron integrals do not match, for the <s|t+v|xx> terms!

Okay, here's the line that seems to be causing the problem:
if val < SMALL: return 0
in Fgamma.

I had put this line in to get the C and Py versions of Fgamma to
agree.

LiH now works!
LiH: -7.900068
Energy =  -7.55238503575
Energy =  -7.89744258211
Energy =  -7.89987188533
Energy =  -7.90003411775
Energy =  -7.90006044044

Put in the C routines, but comment out the val<SMALL part.

Works: exactly the same results. Here's the profiling:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   220485  312.600    0.001  947.190    0.004 PGBF.py:146(coulomb)
   661455  298.800    0.000  525.140    0.001 PGBF.py:232(_B_array)
  2084397  226.340    0.000  226.340    0.000 PGBF.py:224(_B_term)
  1606800   89.010    0.000   89.010    0.000 dmatrix.py:94(__getitem__)
    22155   68.950    0.003 1043.100    0.047 CGBF.py:70(coulomb)
   801600   46.540    0.000   46.540    0.000 dmatrix.py:106(__setitem__)
   891930   24.740    0.000   24.740    0.000 PGBF.py:55(coef)
   881940   23.040    0.000   23.040    0.000 PGBF.py:52(origin)
   881940   22.900    0.000   22.900    0.000 PGBF.py:54(norm)
   881940   22.270    0.000   22.270    0.000 PGBF.py:53(powers)
   881940   21.870    0.000   21.870    0.000 PGBF.py:51(exp)
   453315   19.820    0.000   19.820    0.000 PGBF.py:179(_gaussian_product_center)
     9895    3.370    0.000    6.590    0.001 PGBF.py:185(_overlap_kernel_3D)
    29685    2.830    0.000    2.830    0.000 PGBF.py:198(_overlap_kernel_1D)
    91860    2.660    0.000    2.660    0.000 CGBF.py:37(norm)

Guess I need to move coulomb etc. into pyutil. Actually, what I need
to do is move all of the work functions in PGBF into pyutil.

Moved overlap_kernel into pyutil. Still works.
Moved nuclear_repulsion into pyutil. Still works.
Moved coulomb_repulsion into pyutil. Still works. 1263.24 sec.

Moved coulomb_repulsion into cutil. Doesn't work, but only takes 100
sec.

Getting closer, although some of the 2eints still don't work:
Should be:               I get:
(19,19|18,12)=0.249542   (19,19|18,12)=0.292583 
(19,19|18,18)=0.848142	 (19,19|18,18)=0.848142 
(19,19|19,4)=0.118960	 (19,19|19,4)=0.122570  
(19,19|19,8)=0.049367	 (19,19|19,8)=0.049409  
(19,19|19,14)=0.282706	 (19,19|19,14)=0.330886 
(19,19|19,19)=0.966487	 (19,19|19,19)=0.966487 

Found the bug: it now works! 182 sec.

Profiling:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   220485  351.630    0.002  541.640    0.002 PGBF.py:130(coulomb)
        5  157.130   31.426  293.480   58.696 hartree_fock.py:93(get2JmKfromInts)
  1606800   92.710    0.000   92.710    0.000 dmatrix.py:94(__getitem__)
    22155   62.150    0.003  630.390    0.028 CGBF.py:70(coulomb)
  1763880   48.700    0.000   48.700    0.000 PGBF.py:53(exp)
  1763880   47.630    0.000   47.630    0.000 PGBF.py:56(norm)
  1763880   47.090    0.000   47.090    0.000 PGBF.py:54(origin)
  1763880   46.590    0.000   46.590    0.000 PGBF.py:55(powers)
   801600   44.180    0.000   44.180    0.000 dmatrix.py:106(__setitem__)
   891930   24.280    0.000   24.280    0.000 PGBF.py:57(coef)
    94140   13.040    0.000   21.200    0.000 pyutil.py:39(binomial)
   333444    9.520    0.000    9.520    0.000 pyutil.py:9(fact)
    45342    7.180    0.000   28.380    0.001 pyutil.py:30(binomial_prefactor)
    29685    5.930    0.000   27.140    0.001 pyutil.py:157(overlap_kernel_1D)
    12756    3.510    0.000   12.860    0.001 pyutil.py:195(A_term)

Sigh. The coulomb stuff still takes a long time.

Putting the rest of the functions in C, since they're effectively
done. Overlap_kernel is there, and the time has increased slightly, to
179 sec.

Putting in nuclear_attraction, and the time is 170 sec.

Made it a tiny bit faster by removing some redundant calls from the
top of coulomb. 156 sec.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   220485  201.940    0.001  303.550    0.001 PGBF.py:116(coulomb)
        5  160.600   32.120  296.460   59.292 hartree_fock.py:93(get2JmKfromInts)
  1606800   90.860    0.000   90.860    0.000 dmatrix.py:94(__getitem__)
    22155   57.290    0.003  387.430    0.017 CGBF.py:70(coulomb)
   801600   45.480    0.000   45.480    0.000 dmatrix.py:106(__setitem__)
   881940   26.030    0.000   26.030    0.000 PGBF.py:49(origin)
   881940   25.290    0.000   25.290    0.000 PGBF.py:51(norm)
   881940   25.270    0.000   25.270    0.000 PGBF.py:48(exp)
   881940   25.020    0.000   25.020    0.000 PGBF.py:50(powers)
   891930   24.560    0.000   24.560    0.000 PGBF.py:52(coef)
    91860    2.370    0.000    2.370    0.000 CGBF.py:37(norm)
        1    2.290    2.290  389.720  389.720 hartree_fock.py:46(get2ints)
      800    0.490    0.001    1.010    0.001 CGBF.py:63(nuclear)
     1225    0.390    0.000    0.390    0.000 PGBF.py:59(kinetic)
     2450    0.370    0.000    0.370    0.000 PGBF.py:84(nuclear)

Why is the slow part in the PGBF coulomb rather than the cutil
coulomb?? Are function calls to get variables slower than direct
access? Try out direct access to see the difference. But this may make
a bigger difference in profiling than it does in the normal code.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5  164.860   32.972  304.390   60.878 hartree_fock.py:93(get2JmKfromInts)
  1606800   91.920    0.000   91.920    0.000 dmatrix.py:94(__getitem__)
    22155   61.380    0.003  143.520    0.006 CGBF.py:70(coulomb)
   220485   54.890    0.000   54.890    0.000 PGBF.py:116(coulomb)
   801600   48.060    0.000   48.060    0.000 dmatrix.py:106(__setitem__)
   891930   25.130    0.000   25.130    0.000 PGBF.py:52(coef)
    91860    2.480    0.000    2.480    0.000 CGBF.py:37(norm)
        1    2.280    2.280  145.800  145.800 hartree_fock.py:46(get2ints)
      800    0.490    0.001    0.970    0.001 CGBF.py:63(nuclear)
     1225    0.390    0.000    0.390    0.000 PGBF.py:59(kinetic)
        5    0.370    0.074    0.680    0.136 hartree_fock.py:109(get_energy)
     2450    0.330    0.000    0.330    0.000 PGBF.py:84(nuclear)
      420    0.270    0.001    0.480    0.001 CGBF.py:49(overlap)
        1    0.240    0.240    2.530    2.530 hartree_fock.py:31(get1ints)
      400    0.240    0.001    0.730    0.002 CGBF.py:56(kinetic)

Amazing! Took the time for PGBF::coulomb from 200 to 50! Does this
have the same impact on the non-profiling runtime? 146 sec.

I guess the moral is that if you do something 200,000 times, even
small things have an impact.
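A toy benchmark of the accessor-vs-attribute question (illustrative sketch only; PGBFLike and the counts are made up, not the real PGBF class):

```python
import timeit

class PGBFLike:
    """Stand-in object: one float reachable via an accessor or directly."""
    def __init__(self):
        self._exp = 1.5
    def exp(self):
        return self._exp

p = PGBFLike()
n = 200_000  # roughly the call counts showing up in the profiles
t_method = timeit.timeit('p.exp()', globals={'p': p}, number=n)
t_direct = timeit.timeit('p._exp', globals={'p': p}, number=n)
print(f'accessor: {t_method:.4f}s   direct attribute: {t_direct:.4f}s')
```

On CPython the direct attribute read skips the method-call machinery entirely, which is consistent with the drop seen in the profiles.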


----------25 June 2001----------
Renamed *util* to *ints*, and confirmed that it works. I think
cints/pyints makes more sense given what the routines do. Am also
considering writing a C++ interface to the code, and may actually
convert everything to C++ (using it as a "better C").

Also broke off all of the wrapper functions to a file called
cints_wrap.c. This will make things a bit cleaner when I write a C++
wrapper. 

----------18 July 2001----------
Got the C version of the code working (subdirectory "c"). It's
surprisingly slow, which is reassuring in its own way. The pure C code
takes 75 seconds for LiH (4 iterations, converged to -7.892107, when
the true answer is -7.900068).

In 4 iterations the python code had already gotten to -7.900034, and
in five it had converged to -7.900060, taking 107 sec. This indicates
that there is probably a small numerical mistake somewhere in the C
code. But the good news is that the python code isn't that slow
after all, which indicates that I can use this code for my development
without worrying too much that I'm wasting time.

Interestingly, the integrals take about the same amount of time in
both (eyeball test). The noticeable difference between the two is the
scf iterations. I had assumed that it was the integrals that were the
slow part, and that the SCF was fast in python. Might be worth
spending some time to figure out why this is the case.

Taking out debugging flags and compiling with -O3 yields 54 secs.

Run H2 to see if I can find the source of the numerical error. The
results are exactly the same.


Now trying to convert the C code to C++, to (1) simplify the code, and
(2) learn more about C++. The first step is to leave the data
structures the same, and update the code to C++.

A. Remove the struct's. Done.
B. Remove the linked-list data structures. This involves removing next
   pointers from the structs, and removing structs that consist
   entirely of pointers to heads and tails. Done, although there are
   now lots of compile errors.
C. Starting at the smallest level, make the structs into classes.
   Done; cbf and pbf tests work, but dmatrix and hartree_fock tests
   crash. 

Got the test cases to work; I now have a working HF code in C++.

ToDo List:
1. obj.print() with overloaded operator<< friend function
   DONE
2. Replace C-style casts with C++ casts.
3. Profile code.
4. Talk to Julian or Michael to see how to improve code.
5. Run the code through Parasoft tools to see if there are problems.
6. Figure out the source of the numerical errors.
7. Make private variables private

Profiling
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 15.28      8.71     8.71 144935496     0.00     0.00  fact(int)
 12.37     15.76     7.05  4769174     0.00     0.00  gser(double *, double
 10.65     21.83     6.07 13355782     0.00     0.00  binomial_prefactor(in
 10.35     27.73     5.90 28326022     0.00     0.00  binomial(int, int)
  9.65     33.23     5.50 19969530     0.00     0.00  fact_ratio2(int, int)
  7.42     37.46     4.23  6656510     0.00     0.01  B_term(int, int, int,
  5.30     40.48     3.02 13313020     0.00     0.00  B0(int, int, double)

Redoing profiling with inlined fact:
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 18.74     11.23    11.23 144935496     0.00     0.00  fact(int) 
 12.05     18.45     7.22  4769174     0.00     0.00  gser(double *, do
 10.48     24.73     6.28 13355782     0.00     0.00  binomial_prefacto
  9.73     30.56     5.83 28326022     0.00     0.00  binomial(int, int)
  9.21     36.08     5.52 19969530     0.00     0.00  fact_ratio2(int, int)
  7.44     40.54     4.46  6656510     0.00     0.01  B_term(int, int, 

Slower. Go figure.

Non-inlined iterative factorial.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 18.68     10.63    10.63 144935496     0.00     0.00  fact(int)     
 12.03     17.48     6.85  4769174     0.00     0.00  gser(double *, 
 10.07     23.21     5.73 28326022     0.00     0.00  binomial(int, int)
  9.72     28.74     5.53 13355782     0.00     0.00  binomial_prefac
  9.52     34.16     5.42 19969530     0.00     0.00  fact_ratio2(int, int)
  7.84     38.62     4.46  6656510     0.00     0.01  B_term(int, int
  4.88     41.40     2.78  2031936     0.00     0.02  B_array(int, in
  4.69     44.07     2.67 13313020     0.00     0.00  fB(int, int, in
  4.62     46.70     2.63 13313020     0.00     0.00  B0(int, int, double)

Hmm. Don't understand this.

Caching factorials through 9.

Timing compare for LiH: no cache 53.1 sec (-O3 opt); cache: 52.5 sec.

Take out optimization and reprofile:
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 16.17      5.86     5.86 19969530     0.00     0.00  fact_ratio2(int, int)
 15.21     11.37     5.51 13355782     0.00     0.00  binomial_prefactor(in
 12.53     15.91     4.54 28326022     0.00     0.00  binomial(int, int)
  9.80     19.46     3.55  4769174     0.00     0.00  gser(double *, double
  9.27     22.82     3.36 144935496     0.00     0.00  fact(int)
  8.11     25.76     2.94  6656510     0.00     0.00  B_term(int, int, int,
  6.35     28.06     2.30   677312     0.00     0.05  coulomb_repulsion(dou
  5.77     30.15     2.09  2031936     0.00     0.01  B_array(int, int, int

Replacing fact_ratio2 with a more intelligent version:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 14.90      5.10     5.10 13355782     0.00     0.00  binomial_prefactor(int
 13.47      9.71     4.61 28326022     0.00     0.00  binomial(int, int)
 10.81     13.41     3.70  4769174     0.00     0.00  gser(double *, dou
 10.37     16.96     3.55 19969530     0.00     0.00  fact_ratio2(int, int)
  9.50     20.21     3.25  6656510     0.00     0.00  B_term(int, int, int,
  7.39     22.74     2.53 104996436     0.00     0.00  fact(int)
  6.78     25.06     2.32   677312     0.00     0.05  coulomb_repulsion(dou
  6.46     27.27     2.21  2031936     0.00     0.01  B_array(int, int, int
  3.83     28.58     1.31 13313020     0.00     0.00  fB(int, int, int, dou


Also look at which values of fact are called most, to consider
caching some more.

Can also accelerate binomial in much the same way.
 %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 16.85      5.68     5.68 13355782     0.00     0.00  binomial_prefactor(
 14.09     10.43     4.75 28326022     0.00     0.00  binomial(int, int)
 11.24     14.22     3.79  4769174     0.00     0.00  gser(double *, doub
 10.38     17.72     3.50 19969530     0.00     0.00  fact_ratio2(int, int)
  8.72     20.66     2.94  6656510     0.00     0.00  B_term(int, int, in
  7.65     23.24     2.58   677312     0.00     0.05  coulomb_repulsion(d
  5.75     25.18     1.94  2031936     0.00     0.01  B_array(int, int, i
  4.78     26.79     1.61 48295552     0.00     0.00  fact_interval(int, int)
  3.98     28.13     1.34 13313020     0.00     0.00  fB(int, int, int, d

According to this, I've slowed the code down, but I'm not sure I can
trust these results.

Put back in optimization and see what the timing is now. 50.3 sec. Seems
like a lot of work for 3 seconds speedup. Does caching 2 more values
help? time now = 49.3 sec. 

Any more than this and ints probably aren't useful anyway. This might
be a problem. See what values are called.

The largest value appears to be 3. Most of the values (by far) are
zero. I put in the line
  if (a+b+c == 0) return 1;
in both binomial and fact_ratio2 to see if this speeds the code
up. 48.9 seconds.

Look at profiling now:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 17.61      5.28     5.28 13355782     0.00     0.00  binomial_prefactor(in
 11.57      8.75     3.47  4769174     0.00     0.00  gser(double *, double
 11.31     12.14     3.39 28326022     0.00     0.00  binomial(int, int)
 10.41     15.26     3.12 19969530     0.00     0.00  fact_ratio2(int, int)
  9.97     18.25     2.99  6656510     0.00     0.00  B_term(int, int, int,
  8.04     20.66     2.41   677312     0.00     0.04  coulomb_repulsion(dou
  6.07     22.48     1.82  2031936     0.00     0.01  B_array(int, int, int
  4.50     23.83     1.35 13313020     0.00     0.00  fB(int, int, int, dou

It looks like this is all high-hanging fruit. Stop worrying about
this. If you want to speed something up, look at Rys quadrature for
the coulomb repulsion.



----------20 July 2001----------
Things to improve: 
(1) numerical difference btw python and C/C++ codes.
(2) understand why python is taking more time during the SCF
iterations. Is it computing D in python?
(3) try out a recursion relationship for Fgamma


Working on Fgamma first. Look at what the typical arguments are:

Again, (0,0) is the most common call. Fgamma(0,0) should be 1. See
if this speeds things up. 

I've simplified the Fgamma function a great deal:
double Fgamma(int m, double x){
  if (fabs(x) < SMALL) return 1./(2.*m+1.);
  return 0.5*pow(x,-m-0.5)*gamm_inc(m+0.5,x);
}
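For m = 0 there is a closed form worth keeping around as a check (standard identity, using math.erf). Note the small-x branch returns 1/(2m+1), never 0 -- which is exactly why the old `if val < SMALL: return 0` line broke LiH:

```python
import math

def F0(x, SMALL=1e-12):
    """Boys function F_0(x) = 0.5*sqrt(pi/x)*erf(sqrt(x)), with the
    x -> 0 limit F_0(0) = 1 (in general F_m(0) = 1/(2m+1))."""
    if abs(x) < SMALL:
        return 1.0
    return 0.5 * math.sqrt(math.pi / x) * math.erf(math.sqrt(x))
```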

I've also made an array of calls from Fgamma.

Today's changes decreased the execution time to 34 sec.

Binomial prefactor and binomial take up most of the rest of the time.
Look at the arguments passed in. 
First rule:
  if (ia == 0 && ib == 0) return 1.;
down to 32 sec.
  if ( ((fabs(xpa)+fabs(xpb)) < SMALL) && s == 0) return 0.;
down to 30 sec.
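Both shortcut rules can be sanity-checked against the full definition -- binomial_prefactor(s, ia, ib, xpa, xpb) is just the coefficient of x**s in (x+xpa)**ia * (x+xpb)**ib (my reference sketch, not the C code):

```python
from math import comb

def binomial_prefactor(s, ia, ib, xpa, xpb):
    """Coefficient of x**s in (x + xpa)**ia * (x + xpb)**ib."""
    return sum(comb(ia, s - t) * comb(ib, t)
               * xpa ** (ia - s + t) * xpb ** (ib - t)
               for t in range(s + 1) if s - ia <= t <= ib)

# rule 1: with ia == ib == 0 the only surviving term is s == 0, value 1
assert binomial_prefactor(0, 0, 0, 0.7, -0.3) == 1

# rule 2: with both centers on P (xpa == xpb == 0) the product is just
# x**(ia+ib), so the s == 0 coefficient vanishes whenever ia + ib > 0
assert binomial_prefactor(0, 1, 1, 0.0, 0.0) == 0
```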

Here's the profiling:
 17.86      3.06     3.06  6656510     0.00     0.00  B_term(int, int, int
 16.81      5.94     2.88 13355782     0.00     0.00  binomial_prefactor(i
 14.01      8.34     2.40  2031936     0.00     0.01  B_array(int, int, in
 11.85     10.37     2.03   677312     0.00     0.02  coulomb_repulsion(do
  6.65     11.51     1.14 13313020     0.00     0.00  fB(int, int, int, do
  5.90     12.52     1.01 19969530     0.00     0.00  fact_ratio2(int, int)
  5.25     13.42     0.90 13313020     0.00     0.00  B0(int, int, double)
  4.14     14.13     0.71 14546790     0.00     0.00  binomial(int, int)
  3.50     14.73     0.60   679624     0.00     0.00  G_array(int, double)
  2.39     15.14     0.41  2045808     0.00     0.00  dist2(double, double, 
  1.98     15.48     0.34   365518     0.00     0.00  gser(double, double)
  1.75     15.78     0.30  2480534     0.00     0.00  ijkl2intindex(int, int
  1.46     16.03     0.25  4098552     0.00     0.00  product_center_1D(doub
  1.17     16.23     0.20  3210000     0.00     0.00  DMatrix::getel(int, in
  0.93     16.39     0.16   677312     0.00     0.02  PBF::coulomb(PBF &, PB
  0.70     16.51     0.12  2879836     0.00     0.00  fact_interval(int, int)

Still holding at 30 sec.

Did similar tests with GAMESS, which only took 1 sec for LiH.


----------19 Sept 2001----------
Okay, wrote and debugged an implementation of Head-Gordon and Pople's
horizontal and vertical recursion relationships. I've checked the
accuracy for a number of different integrals. The problem is that the
code is incredibly slow.

For an integral with 4 d_x2 orbitals, the cints technique takes 0.002
sec, and the hgp technique takes 2.17 sec.

I have not, of course, moved anything into C yet.

Profiling.
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 15818/16    6.480    0.000    8.470    0.529 hgp.py:84(vrr)
    47454    1.990    0.000    1.990    0.000 pyints.py:168(gaussian_product_center)
     31/1    0.020    0.001    8.490    8.490 hgp.py:14(hrr)
        1    0.010    0.010    8.500    8.500 <string>:1(?)

Okay, put vrr into cints.

Testing accuracy: works

Got the timing down from 2.17 sec to .12 sec, but it's still ~60 times
slower than the old way (which I already know is too slow).

Here's the profiling for a larger case (ff|ff):
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    127/1   13.700    0.108   13.700   13.700 hgp.py:15(hrr)
        1    0.000    0.000   13.700   13.700 <string>:1(?)
        4    0.000    0.000    0.000    0.000 PGBF.py:50(powers)

(this case took something like 273 seconds before, so it's a big
improvement).

using the C version of hrr gives:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   13.500   13.500   13.500   13.500 <string>:1(?)
        4    0.000    0.000    0.000    0.000 PGBF.py:50(powers)
        4    0.000    0.000    0.000    0.000 PGBF.py:49(origin)
        1    0.000    0.000   13.500   13.500 profile:0(hrr(b.origin(),b.norm(),

which means that the hrr step takes the same amount of time in C as it
does in Python.

--------20 Sept 2001--------------

The timing is slower for the contracted hrr calls:
0.0262349843979 1.46051096916

I'm going to put everything in C++ and profile there.

Here are the results, profiling 1000 calls to <z2,z2|z2,z2> using both
hrr and the pbf.coulomb functions:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 57.49     22.71    22.71    16000  1419.38  2428.71  vrr(double, double, doubl
 25.75     32.88    10.17 142368000    0.07     0.07  product_center_1D(double
 10.81     37.15     4.27 47457000     0.09     0.09  dist2(double, double, dou
  4.33     38.86     1.71  9998000     0.17     0.17  Fgamma(int, double)
  0.46     39.04     0.18   364000     0.49     0.94  binomial_prefactor(int, i
  0.33     39.17     0.13  1200000     0.11     0.13  binomial(int, int)
  0.15     39.23     0.06   364000     0.16     1.31  fB(int, int, int, double,
  0.15     39.29     0.06     1000    60.00 38919.30  hrr(double, double, doubl
  0.13     39.34     0.05   546000     0.09     0.13  fact_ratio2(int, int)
  0.13     39.39     0.05   182000     0.27     3.02  B_term(int, int, int, int
  0.08     39.42     0.03   623000     0.05     0.05  fact_interval(int, int)
  0.08     39.45     0.03   364000     0.08     0.21  B0(int, int, double)
  0.05     39.47     0.02   623000     0.03     0.03  fact(int)
  0.05     39.49     0.02     3000     6.67   190.00  B_array(int, int, int, in
  0.03     39.50     0.01     1000    10.00   580.70  coulomb_repulsion(double,
  0.00     39.50     0.00     1000     0.00     0.00  G_array(int, double)
  0.00     39.50     0.00     1000     0.00   580.70  PBF::coulomb(PBF &, PBF &,
  0.00     39.50     0.00       12     0.00     0.00  fact2(int)
  0.00     39.50     0.00        4     0.00     0.00  PBF::~PBF(void)
  0.00     39.50     0.00        4     0.00     0.00  PBF::PBF(double, double, d

My reading of this is that the hrr algorithm (about 39 sec for the
1000 calls) takes significantly longer than coulomb_repulsion (0.58 sec).

Sigh. Maybe I have to cache the calls. But at this point it doesn't
look like I'm getting any kind of savings, and thus there doesn't seem
to be any point in looking at the hrr calls on contracted functions.

Wouldn't be a bad idea to look at the Gill/Head-Gordon/Pople
technique. Maybe they can explain how they cache things.

I should also finish up the rys code.

---------1 May 2002-------------------

Wow. Big layoff since I last worked on this. Got both the rys and hgp
code working. The C version of HGP is also done. Studying timings for
integrals only:

                -------Python-----	 --------C---------
		Taketa	Rys	HGP	 Taketa	Rys	HGP
h2/6-31G**	18.8	12.0	22.8	 1.2		1.4
h2o/6-31G**	2155		10851	 89.6		324.9

Still can't do rys on h2o, since it needs higher order roots than I've
programmed thus far.

Profile the working C h2o cases:

Taketa:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9  245.950   27.328  245.950   27.328 hartree_fock.py:94(get2JmK)
    52975   96.240    0.002  221.080    0.004 CGBF.py:79(coulomb)
   441414   77.610    0.000   77.610    0.000 PGBF.py:126(coulomb)
  1783512   42.540    0.000   42.540    0.000 PGBF.py:58(coef)
   218200    5.370    0.000    5.370    0.000 CGBF.py:37(norm)
        1    4.450    4.450  225.530  225.530 hartree_fock.py:47(get2ints)
     1875    0.950    0.001    1.810    0.001 CGBF.py:72(nuclear)
     1764    0.490    0.000    0.490    0.000 PGBF.py:69(kinetic)
     5292    0.480    0.000    0.480    0.000 PGBF.py:94(nuclear)
       10    0.320    0.032    0.370    0.037 LA2.py:30(SymOrth)
        1    0.280    0.280    3.490    3.490 hartree_fock.py:32(get1ints)
      650    0.250    0.000    0.600    0.001 CGBF.py:58(overlap)
     1872    0.220    0.000    0.220    0.000 PGBF.py:64(overlap)
      625    0.200    0.000    0.860    0.001 CGBF.py:65(kinetic)
       20    0.100    0.005    0.120    0.006 LinearAlgebra.py:246(Heigenvectors)

HGP:
   441414  323.240    0.001  323.240    0.001 PGBF.py:126(coulomb)
        9  220.710   24.523  220.710   24.523 hartree_fock.py:94(get2JmK)
    52975  102.500    0.002  473.890    0.009 CGBF.py:79(coulomb)
  1783512   43.370    0.000   43.370    0.000 PGBF.py:58(coef)
   218200    5.300    0.000    5.300    0.000 CGBF.py:37(norm)
        1    4.810    4.810  478.700  478.700 hartree_fock.py:47(get2ints)
     1875    0.840    0.000    1.620    0.001 CGBF.py:72(nuclear)
     5292    0.480    0.000    0.480    0.000 PGBF.py:94(nuclear)
     1764    0.430    0.000    0.430    0.000 PGBF.py:69(kinetic)
        1    0.340    0.340    3.210    3.210 hartree_fock.py:32(get1ints)
       10    0.300    0.030    0.350    0.035 LA2.py:30(SymOrth)
      650    0.280    0.000    0.560    0.001 CGBF.py:58(overlap)
      625    0.220    0.000    0.710    0.001 CGBF.py:65(kinetic)
     1872    0.120    0.000    0.120    0.000 PGBF.py:64(overlap)
       20    0.050    0.003    0.090    0.004 LinearAlgebra.py:246(Heigenvectors)

It strikes me that I can construct a call to a C routine that does the
contracted integrals entirely in C, to avoid the Python loops in
CGBF.coulomb(). I can do this for the Taketa, Rys, and HGP integrals
alike.
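The idea, sketched in Python (the C version would have the same shape; `prim_coulomb` stands in for whichever primitive routine is being contracted, and the argument layout here is hypothetical):

```python
def contr_coulomb(prim_coulomb,
                  aexps, acoefs, anorms, xyza,
                  bexps, bcoefs, bnorms, xyzb,
                  cexps, ccoefs, cnorms, xyzc,
                  dexps, dcoefs, dnorms, xyzd):
    """Contract primitive (ab|cd) integrals in one quadruple loop, so the
    per-primitive work never goes back through Python-level CGBF methods."""
    val = 0.0
    for aexp, ac, an in zip(aexps, acoefs, anorms):
        for bexp, bc, bn in zip(bexps, bcoefs, bnorms):
            for cexp, cc, cn in zip(cexps, ccoefs, cnorms):
                for dexp, dc, dn in zip(dexps, dcoefs, dnorms):
                    # contraction coefs and primitive norms multiply the
                    # primitive integral
                    val += (ac*bc*cc*dc*an*bn*cn*dn
                            * prim_coulomb(aexp, xyza, bexp, xyzb,
                                           cexp, xyzc, dexp, xyzd))
    return val
```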

Here's the profiling, calling contr_coulomb():
Taketa:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10  256.880   25.688  256.880   25.688 hartree_fock.py:94(get2JmK)
    52975   96.110    0.002  210.760    0.004 CGBF.py:95(coulomb)
   211900   23.860    0.000   33.220    0.000 CGBF.py:43(exps)
   211900   23.730    0.000   32.940    0.000 CGBF.py:53(pnorms)
   211900   23.570    0.000   32.760    0.000 CGBF.py:48(coefs)
   373848    9.590    0.000    9.590    0.000 PGBF.py:58(coef)
   355992    9.360    0.000    9.360    0.000 PGBF.py:54(exp)
   355992    9.210    0.000    9.210    0.000 PGBF.py:57(norm)
   211900    5.320    0.000    5.320    0.000 CGBF.py:39(origin)
   211900    5.260    0.000    5.260    0.000 CGBF.py:40(powers)
   218200    5.260    0.000    5.260    0.000 CGBF.py:38(norm)
        1    3.890    3.890  214.650  214.650 hartree_fock.py:47(get2ints)
     1875    0.940    0.001    1.580    0.001 CGBF.py:88(nuclear)
     1764    0.540    0.000    0.540    0.000 PGBF.py:69(kinetic)
     5292    0.370    0.000    0.370    0.000 PGBF.py:94(nuclear)

Doesn't look like it changed very much.

I wonder how all of this would change with optimization? I'm compiling
now without any.

                -------Python-----	 --------C---------
		Taketa	Rys	HGP	 Taketa	Take2	Rys	HGP	HGP2
h2/6-31G**	18.8	12.0	22.8	 1.1	1.2		1.2
h2o/6-31G**	2155		10851	 75.7	114.3		263.8	429.1

Don't understand why it's so much slower doing the iteration in
C. HGP2/Take2 both should have been much faster than HGP/Taketa.


--------------3 May 2002-----------------------------------

                -------Python-----	 --------C---------
		Taketa	Rys	HGP	 Taketa	Take2	Rys	HGP	HGP2
h2/6-31G**	18.8	12.0	22.8	 1.1	1.2		1.2
h2o/6-31G**	2155		10851	 75.7	66.6		263.8	259.1

I'm beginning to think that the slowdowns with the "2" versions I saw
were simply due to fluctuations in the runtimes. I've gotten runtimes
ranging from 80 to 68. So that explains that mystery, but I still
don't understand why I don't get more of a speedup when I call HGP2,
for example.

Redoing the PySequence calls with PyList calls to see if it speeds
things up. It might, but causes a Seg Fault. Trying
PySequence_Fast_GET_ITEM also SegFaults.


See if I can speed the code up by allocating one large block of memory
in the module and then reusing it.

                -------Python-----	 --------C---------
		Taketa	Rys	HGP	 Taketa	Take2	Rys	HGP	HGP2
h2/6-31G**	18.8	12.0	22.8	 1.1	1.2		1.2
h2o/6-31G**	2155		10851	 75.7	65.6		263.8	259.0

Not much difference in performance, but I guess that every bit helps.

From here on, the "2" versions will be the default.

Profile:
Taketa:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9  226.110   25.123  226.110   25.123 hartree_fock.py:94(get2JmK)
    52975   86.570    0.002  203.440    0.004 CGBF.py:97(coulomb)
   211900   24.980    0.000   33.520    0.000 CGBF.py:50(coefs)
   211900   24.870    0.000   34.080    0.000 CGBF.py:55(pnorms)
   211900   24.520    0.000   33.540    0.000 CGBF.py:45(exps)
   355992    9.210    0.000    9.210    0.000 PGBF.py:57(norm)
   355992    9.020    0.000    9.020    0.000 PGBF.py:54(exp)
   373848    8.880    0.000    8.880    0.000 PGBF.py:58(coef)
   211900    5.390    0.000    5.390    0.000 CGBF.py:41(origin)
   218200    5.330    0.000    5.330    0.000 CGBF.py:40(norm)
   211900    5.110    0.000    5.110    0.000 CGBF.py:42(powers)
        1    4.660    4.660  208.100  208.100 hartree_fock.py:47(get2ints)
     1875    0.840    0.000    1.510    0.001 CGBF.py:90(nuclear)
     1764    0.490    0.000    0.490    0.000 PGBF.py:69(kinetic)
      650    0.410    0.001    0.620    0.001 CGBF.py:76(overlap)

Realized that I'm recomputing the CGBF pnorms, exps, etc. Eliminating
this by doing it all at construction/normalization:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9  233.370   25.930  233.370   25.930 hartree_fock.py:94(get2JmK)
    52975   85.090    0.002  116.420    0.002 CGBF.py:92(coulomb)
   211900    5.380    0.000    5.380    0.000 CGBF.py:44(origin)
   218200    5.370    0.000    5.370    0.000 CGBF.py:43(norm)
   211900    5.320    0.000    5.320    0.000 CGBF.py:50(pnorms)
   211900    5.260    0.000    5.260    0.000 CGBF.py:45(powers)
   211900    5.110    0.000    5.110    0.000 CGBF.py:48(exps)
   211900    5.090    0.000    5.090    0.000 CGBF.py:49(coefs)
        1    3.580    3.580  120.000  120.000 hartree_fock.py:47(get2ints)
     1875    0.860    0.000    1.680    0.001 CGBF.py:85(nuclear)
     1764    0.510    0.000    0.510    0.000 PGBF.py:69(kinetic)
    17856    0.430    0.000    0.430    0.000 PGBF.py:58(coef)
     5292    0.430    0.000    0.430    0.000 PGBF.py:94(nuclear)
      625    0.360    0.001    0.980    0.002 CGBF.py:78(kinetic)
      650    0.330    0.001    0.590    0.001 CGBF.py:71(overlap)

Much better. Redo timings without profiling.


May 4, 2002
Getting crys to work:

                -------Python-----	 --------C---------
		Taketa	Rys	HGP	 Taketa	Rys	HGP
h2/6-31G**	18.8	12.0	22.8	 0.77	0.49	0.88
h2o/6-31G**	2064	1191	10837	 44.2	21.5	238.0

Great! Now I have to speed up the get2JmK part.

Brought the overall scf time from 230 to 72 by doing dot products.
Brought the overall scf timing down to 38 by only doing half of the
dot products.
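For the record, here's what get2JmK computes, as a minimal pure-Python sketch (assuming `Ints` is the canonically packed two-electron integral list; "half of the dot products" presumably works because G is symmetric, so only the lower triangle needs computing):

```python
def ijkl2intindex(i, j, k, l):
    """Canonical packed index for the (ij|kl) integral."""
    if i < j: i, j = j, i
    if k < l: k, l = l, k
    ij = i*(i + 1)//2 + j
    kl = k*(k + 1)//2 + l
    if ij < kl: ij, kl = kl, ij
    return ij*(ij + 1)//2 + kl

def get2JmK(Ints, D):
    """G_ij = sum_kl D_kl [2(ij|kl) - (ik|jl)], filling only the lower
    triangle and mirroring it."""
    nbf = len(D)
    G = [[0.0]*nbf for _ in range(nbf)]
    for i in range(nbf):
        for j in range(i + 1):
            val = 0.0
            for k in range(nbf):
                for l in range(nbf):
                    val += D[k][l]*(2*Ints[ijkl2intindex(i, j, k, l)]
                                    - Ints[ijkl2intindex(i, k, j, l)])
            G[i][j] = G[j][i] = val
    return G
```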

Got the code to work under Cygwin! A little bit slower, requiring
30/100 seconds as opposed to 22/38 on intel5.

Profiling using crys on cygwin for h2o:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5537350  322.336    0.000  322.336    0.000 pyints.py:130(ijkl2intindex)
        9  317.986   35.332  637.980   70.887 hartree_fock.py:94(get2JmK)
    52975   81.494    0.002  122.820    0.002 CGBF.py:95(coulomb)
   218200    7.345    0.000    7.345    0.000 CGBF.py:46(norm)
   211900    7.161    0.000    7.161    0.000 CGBF.py:47(origin)
        1    7.025    7.025  132.588  132.588 hartree_fock.py:47(get2ints)
   211900    7.021    0.000    7.021    0.000 CGBF.py:48(powers)
   211900    6.992    0.000    6.992    0.000 CGBF.py:51(exps)
   211900    6.617    0.000    6.617    0.000 CGBF.py:53(pnorms)
   211900    6.460    0.000    6.460    0.000 CGBF.py:52(coefs)
     1875    1.151    0.001    2.493    0.001 CGBF.py:88(nuclear)
     5292    0.861    0.000    0.861    0.000 PGBF.py:95(nuclear)
    17856    0.633    0.000    0.633    0.000 PGBF.py:59(coef)
     1764    0.601    0.000    0.601    0.000 PGBF.py:70(kinetic)
     2982    0.401    0.000    0.401    0.000 Numeric.py:330(dot)

Oops. I was using the pyints version of ijkl2intindex. With that
fixed, this now runs in 30/45 seconds.

Here's the updated profiling:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    52975   94.148    0.002  137.349    0.003 CGBF.py:95(coulomb)
        9   48.949    5.439   49.330    5.481 hartree_fock.py:94(get2JmK)
   211900    7.607    0.000    7.607    0.000 CGBF.py:51(exps)
   211900    7.582    0.000    7.582    0.000 CGBF.py:52(coefs)
   218200    7.521    0.000    7.521    0.000 CGBF.py:46(norm)
   211900    7.220    0.000    7.220    0.000 CGBF.py:53(pnorms)
   211900    6.840    0.000    6.840    0.000 CGBF.py:47(origin)
   211900    6.621    0.000    6.621    0.000 CGBF.py:48(powers)
        1    5.476    5.476  142.825  142.825 hartree_fock.py:47(get2ints)
     1875    1.280    0.001    2.611    0.001 CGBF.py:88(nuclear)
     5292    0.881    0.000    0.881    0.000 PGBF.py:95(nuclear)
     1764    0.680    0.000    0.680    0.000 PGBF.py:70(kinetic)
    17856    0.520    0.000    0.520    0.000 PGBF.py:59(coef)
     2982    0.431    0.000    0.431    0.000 Numeric.py:330(dot)
      650    0.420    0.001    0.720    0.001 CGBF.py:74(overlap)


May 21, 2002
Working on grids for DFT.

Went through a long frustrating experiment with RectangularGrid. Using
very dense grids I was able to get the norm of the orbital down to 4.1
(it should have been 1.0).

But now I see that I must be making a mistake somewhere, because using
my radial grids I now get exactly the same result?! Clearly I have a
constant wrong somewhere.

What follows is the norm of <orb[0]|orb[0]>, computed on the grid:

Rectangular grids:
11^3: 127.97
21^3: 17.66
41^3: 4.86
81^3: 4.1014

Gauss-Legendre Radial, Lebedev angular:
default: 4.10045
high-ang: 4.10045

Looks like we're converging. But why the factor of 4.1??

Fixed a bug (needed to multiply by the CGBF normalization and by the
PGBF coefficient):

Grid:        S00	Nel
R(11^3)	     2.62	2.62
R(21^3)	     1.107	1.107
R(41^3)	     1.003	1.003
Sphere	     0.99999	1.0

Cool!
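The lesson is that the gridded amplitude has to carry every normalization factor. A toy check of the S00 idea (not the PyQuante API): integrate a single properly normalized Gaussian over a rectangular grid and confirm the norm comes out 1.

```python
import math

def gauss_norm_on_grid(alpha=1.0, L=4.0, n=41):
    """Riemann-sum <phi|phi> for phi(r) = (2*alpha/pi)**0.75 * exp(-alpha*r^2)
    on an n^3 rectangular grid spanning [-L, L]^3."""
    h = 2.0*L/(n - 1)
    N = (2.0*alpha/math.pi)**0.75       # drop this factor and the norm
    total = 0.0                         # comes out wrong, as it did above
    pts = [-L + i*h for i in range(n)]
    for x in pts:
        for y in pts:
            for z in pts:
                amp = N*math.exp(-alpha*(x*x + y*y + z*z))
                total += h**3*amp*amp
    return total
```

With the default 41^3 grid this returns 1.0 to high accuracy, since a Gaussian is about the friendliest integrand a rectangular grid can see.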

2002-07-09
Comparing the result of this to molpro. Almost exactly the same, now.
Printing out rho, ex, vx gives nearly the same results. Still, the
values derived from these differ significantly.
They're multiplying all of the vx's by a constant
 (fac=dftfac(idftfu))
...except when I check, the constant is 1. And I still can't figure out
how the energies are different, since my energy at each point is very
close to the MP value, and yet the total is very different.

2002-07-12
Comparing my results to DeFT to make sure I'm doing everything right.

First step: atomic grids for He.

I had to do several hacks to make this work.
1. Deft only uses 26 points per angular grid. I changed my code to do
   this.
2. Deft does a random spin of each angular grid, which I removed.
3. Deft prunes the grids based (?) on the character of the basis
   functions at each grid point.

Working around these differences, we are now making exactly the same
grid.

Hard to figure out what's going on in deft. Can't find a simple place
where they sum the xc energy. And since the programs are different,
it's hard to really compare what's going on. Still, it seems that I'm
in the right ballpark for the xc energy and potential. Strange that I
would be so far off on the sums, though.

2002-07-13
What do I know. The grids and the weights are correct. I'm inputting
the orbitals from jaguar, so those are correct. And the densities that
I form at each grid point appear to be correct to within an order of
magnitude or so. I know that ex and vx are incorrect, despite the fact
that I've used every possible way to make them.

The Pople form seems to be the closest to the correct energy. But the
potential is really off.

Oooh. I just realized that if I use the Molpro version, and if I
remove the dens in the quadrature of the energy, that I get the right
value of the energy. But the integration of the potential is still
off. The potential integration seems to be better if I omit the dens,
and in a way it makes sense, because bf[a].amp(x,y,z)*bf[b].amp(x,y,z)
is kinda density like.

Using a fudge factor of 0.5 gets me much closer to the proper values,
0.25 is even better, 0.1 is even better.

Ah, it works. I had J in instead of 2J. Now I don't need a fudge
factor anyway.


2002-07-15
Tried using the VWN function from Perdew's files. Doesn't quite
work. The function converges, which is probably a good sign, and may
indicate that vc is alright even though ec is not:

-9.02822489392 -3.85528309878 2.00366966161 -7.17661145675
-8.98410193437 -3.86641823781 2.02373847181 -7.14142216838
-8.99989777603 -3.86252180718 2.0165996113 -7.15397558015
-8.99430791081 -3.86391193417 2.01913168742 -7.14952766406
-8.99629423245 -3.86341938294 2.01823265712 -7.15110750663
-8.99558943653 -3.86359433115 2.0185517467 -7.15054685209
-8.9958396452 -3.86353224555 2.01843847863 -7.15074587828
-8.99575083539 -3.86355428526 2.01847868384 -7.15067523397

Some questions to answer:
1. How do I cut off the seitz radius?
2. There is an x term (and a xx(x) array) in the molpro dftfun.f
routine. What is this and how do I work with it?

2002-07-16
Molpro VWN works. Looks like I match what Jaguar calls VWN5. In fact,
I match it to many decimal places!

Priorities from here on out:

- Grids for many atom systems
- DFT for many atom systems
- Clean up dft modules!
- Simple speed improvements for DFT/LDA
----Release 1.0---
- Forces
- Divide and Conquer
----Release 1.1---
- LSD
  - HH hybrid functionals (since they use LDA only?)?
- ROHF
- GVB
----Release 1.2---
- GGA
- Hybrid functionals
----Release 1.3---
- CI/MCSCF
- MPn/CCSD
----Release 1.4---

Testing on Ne before I go on to many-atom molecules:

	Jaguar conv:  My results:
Etot   -128.141 902   -128.142
Eone   -182.504 570   -182.499
Etwo     54.362 668    
 Ej	 66.130 186	66.124
 Exc	-11.767 518    -11.767

Pretty damn good! Getting an error message about Nel being off in
setdens.

Can we do some more efficient orbital updating scheme, where the
averaging parameter is optimized each iteration? We needed averaging
to converge Ne.

Fixed the setdens bug: it was really just a print bug, since the
program already correctly made sure that tot_ne = Z, but then printed
out tot_ne and Z/2. This has now been corrected. Have also decreased
the tolerance above which the program complains about the density
integration to 0.05 e. Seems silly to complain if we're only missing
0.0001 electron!

Retesting with the full grid. For He it raises the number of points from
832 to 1792. I have renamed what used to be 'coarse' to 'medium', and
have introduced a new 'coarse'. The new coarse is still more than
good enough for He.

Running different grids for Ne. Seems to work perfectly well with the
new coarse (556 points).

Working on a grid pruning scheme. The idea is to remove a gridpoint if
(w_i * dens_i) is less than 0.00001. This is a very small cutoff, so I
think it's a pretty safe bet. Pruning cuts the coarse grid sizes down
by 1/2 without changing the answer.
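The pruning rule as a sketch (assuming each grid point carries its quadrature weight, and the density has already been evaluated there):

```python
def prune_grid(points, cutoff=1e-5):
    """Drop grid points whose weighted density w_i*dens_i falls below the
    cutoff; such points contribute essentially nothing to any quadrature.
    points is a list of ((x, y, z, w), dens) pairs."""
    kept = [(xyzw, dens) for (xyzw, dens) in points
            if xyzw[3]*dens >= cutoff]
    print("Pruning grid of %d out of %d points"
          % (len(points) - len(kept), len(points)))
    return kept
```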

Made a new module called MolecularGrid. Put everything from
atomic_dft into dft. Confirmed that dft works for He and Ne atoms.

Here's the profiling for He using the atomic code:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   368620   18.440    0.000   18.440    0.000 PGBF.py:127(amp)
   263300   16.490    0.000   34.930    0.000 CGBF.py:122(amp)
        8    5.520    0.690   26.030    3.254 atomic_dft.py:43(getXC)
        8    4.830    0.604   23.300    2.912 atomic_dft.py:64(setdens)
     7536    1.100    0.000    1.680    0.000 dft.py:89(vwn_eps)
     7536    0.940    0.000    1.280    0.000 dft.py:96(vwn_deps)
    52752    0.920    0.000    0.920    0.000 dft.py:47(xx)
     2512    0.520    0.000    3.480    0.001 dft.py:49(VWN)
    10532    0.200    0.000    0.200    0.000 AtomicGrid.py:83(__getitem__)
     5266    0.150    0.000    0.150    0.000 GridPoint.py:42(xyzw)
      120    0.120    0.001    0.140    0.001 CGBF.py:128(coulomb)
     2512    0.120    0.000    0.120    0.000 dft.py:40(S)
     2754    0.070    0.000    0.070    0.000 GridPoint.py:44(setdens)
        1    0.050    0.050    0.070    0.070 AtomicGrid.py:58(__init__)
      640    0.030    0.000    0.030    0.000 CGBF.py:68(norm)

Here's the profiling for Ne:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  6301680  326.400    0.000  326.400    0.000 PGBF.py:127(amp)
  3375900  262.590    0.000  588.990    0.000 CGBF.py:122(amp)
       12   70.120    5.843  354.310   29.526 atomic_dft.py:43(getXC)
       12   56.850    4.737  366.660   30.555 atomic_dft.py:64(setdens)
     7260   11.360    0.002   13.860    0.002 CGBF.py:128(coulomb)
       12    1.780    0.148    1.830    0.153 hartree_fock.py:102(getJ)
    10872    1.340    0.000    2.070    0.000 dft.py:89(vwn_eps)
    76104    1.310    0.000    1.310    0.000 dft.py:47(xx)
    10872    1.120    0.000    1.700    0.000 dft.py:96(vwn_deps)
     3624    0.610    0.000    4.380    0.001 dft.py:49(VWN)
    29040    0.460    0.000    0.460    0.000 CGBF.py:75(pnorms)
    30420    0.440    0.000    0.440    0.000 CGBF.py:68(norm)
    29040    0.430    0.000    0.430    0.000 CGBF.py:74(coefs)
    29040    0.430    0.000    0.430    0.000 CGBF.py:70(powers)
    29040    0.410    0.000    0.410    0.000 CGBF.py:73(exps)

No real surprises here. Also tested the dft.py routine to ensure that
it doesn't introduce any additional sources of error. It gives the
same results (no surprise, really). Will soon remove atomic_dft.py
altogether and use dft.py in its place.

Am not worrying too much about speed now, but would like things to be
a bit quicker. Need to figure out how to make fewer calls to the
bf.amp() functions. I could build a collocation matrix like PSGVB did.

Also need to figure out a more elegant way of dealing with the basis
functions. Would be nice if we had one grid loop rather than looping
over the atomic grids and then each point in the grids.

Okay, finally tried water. It does not work. Starting from the
converged orbitals yields:

      Jaguar	    PyQuante
Nel   10	    8.26
Etot   -75.842 134   -82.349
Eone  -121.514 214  -111.514
Etwo    37.318 570    
 Ej     46.003 565    37.006
 Ek	-8.684 995    -7.841 

Trying to figure out what's wrong. Running H2 now. The energy doesn't
change much when I remove the points, but maybe that's because I've
already set the weights to zero.

Taking out the part of the code that sets the weights to zero if
they're closer to another atom:

Here's the convergence of the whole program:
Pruning grid of 207 out of 556 points
Pruning grid of 207 out of 556 points
-1.79455124228 -2.55981568889 1.55268909101 -0.787424644398
-1.84905401535 -2.46890672837 1.25504607477 -0.635193361748
-1.85348960466 -2.51278224677 1.33569207109 -0.676399428983

Here's the convergence with the point deletion removed:

Pruning grid of 154 out of 556 points
Pruning grid of 154 out of 556 points
Warning: setdens: Nel is 4.309743, should be 2
-2.79464600771 -2.55981568889 1.55268909101 -1.78751940982
Warning: setdens: Nel is 4.572841, should be 2
-2.92924220311 -2.43680485827 1.71481521335 -2.20725255819
Warning: setdens: Nel is 4.645622, should be 2
-2.93811726907 -2.36806580836 1.73072694247 -2.30077840319

So it's doing something. 

Tried a variety of other things. I've tried removing the points in
different ways, to make sure that the right points are removed. I've
also tried running with denser grids. All give essentially the same
results. 

2002-07-18
Trying to get H2 to work with patched atomic grids.
Jaguar gets an energy of -1.114 079 h
 Energy components, in hartrees:
   (A)  Nuclear repulsion............      .52917724900
   (E)  Total one-electron terms.....    -2.20508096119
   (I)  Total two-electron terms.....      .56182470887
   (J)    Coulomb....................     1.15371319964
   (K)    Exchange+Correlation.......     -.59188849076
   (L)  Electronic energy............    -1.64325625232  (E+I)
   (N)  Total energy.................    -1.11407900332  (A+L)
  
Here's what PyQuante gets:
-1.0997336019 -2.24695554787 1.32835404381 -0.710312097843 0.52918
-1.13437918065 -2.1984383215 1.14036225861 -0.605483117757 0.52918
-1.13620796314 -2.21789752039 1.18078081958 -0.628271262327 0.52918
-1.13630031017 -2.21394192282 1.17153051464 -0.62306890199 0.52918

Not far off. Try starting from converged result:
-1.10958253654 -2.14486542145 1.09309311878 -0.586990233869 0.52918
-1.13583964717 -2.22148089392 1.18979409859 -0.633332851842 0.52918
-1.1362811831 -2.21303573172 1.16950106377 -0.621926515152 0.52918
-1.13630410018 -2.21506342198 1.17408643984 -0.624507118039 0.52918

Try running with a denser grid:
-1.08598324092 -2.14486542145 1.09309311878 -0.563390938248 0.52918
-1.10441427271 -2.20664736462 1.15699335278 -0.583940260865 0.52918
-1.10461440659 -2.19969522289 1.14378348062 -0.577882664322 0.52918
-1.10462466103 -2.20131744044 1.14676673635 -0.579253956936 0.52918

Getting pretty close. Probably still a few bugs, but looking good. Try
without Jaguar input guess.


Things to try:
1. Rotating the grids as DeFT does
2. Becke projection


Rotating the grids helps a great deal:
-1.0684549559 -2.24695554787 1.32835404381 -0.679033451841 0.52918
-1.1130065046 -2.18421965876 1.11747099219 -0.57543783804 0.52918
-1.11537969801 -2.20949582376 1.16241814453 -0.597482018775 0.52918
-1.11549828959 -2.2043823584 1.15213781528 -0.592433746474 0.52918
-1.11550438883 -2.20556948221 1.15445791636 -0.593572822983 0.52918

Consider computing a collocation matrix: the amplitude of each basis
function at each grid point. This would replace all of the bf.amp()
calls that I do during the XC computation.

I can do this in a number of different ways. I can compute a huge
collocation matrix, or I can store the info on each grid point.
The latter might be the most natural.

Ended up implementing the collocation matrix, because the code is too
slow for molecules. I distributed the collocation elements over the
grid points. The downside to this is that it increases the memory use,
perhaps unnecessarily, but I can worry about memory later.
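In matrix form the idea is: precompute A[i][mu] = amplitude of basis function mu at grid point i once, and then the density at every point is a quadratic form in A. A small sketch with plain lists (hypothetical shapes, not the actual PyQuante classes):

```python
def setup_bfgrid(points, bfs):
    """Collocation matrix: amplitude of each basis function at each point,
    computed once instead of on every density/XC evaluation."""
    return [[bf(x, y, z) for bf in bfs] for (x, y, z) in points]

def density_on_grid(bfgrid, D):
    """rho_i = sum_{mu,nu} A[i][mu] * D[mu][nu] * A[i][nu]."""
    nbf = len(D)
    dens = []
    for amps in bfgrid:
        rho = 0.0
        for mu in range(nbf):
            for nu in range(nbf):
                rho += amps[mu]*D[mu][nu]*amps[nu]
        dens.append(rho)
    return dens
```

The memory cost is the npts x nbf matrix, which is exactly the tradeoff noted above.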

Here's the new profiling on Ne:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       12   71.570    5.964  104.290    8.691 dft.py:126(getXC)
       12   56.540    4.712   85.080    7.090 dft.py:103(setdens)
  2579039   41.700    0.000   41.700    0.000 AtomicGrid.py:86(__getitem__)
   824014   13.940    0.000   13.940    0.000 GridPoint.py:42(xyzw)
     7260   12.690    0.002   15.520    0.002 CGBF.py:128(coulomb)
       12    2.090    0.174    2.160    0.180 hartree_fock.py:107(getJ)
    10872    1.710    0.000    2.430    0.000 dft.py:89(vwn_eps)
    10872    1.600    0.000    2.140    0.000 dft.py:96(vwn_deps)
    76104    1.260    0.000    1.260    0.000 dft.py:47(xx)
    15568    0.910    0.000    0.910    0.000 PGBF.py:127(amp)
     3624    0.820    0.000    5.390    0.001 dft.py:49(VWN)
     8340    0.800    0.000    1.710    0.000 CGBF.py:122(amp)
    29040    0.600    0.000    0.600    0.000 CGBF.py:69(origin)
    29040    0.510    0.000    0.510    0.000 CGBF.py:75(pnorms)
    29040    0.510    0.000    0.510    0.000 CGBF.py:73(exps)

Reversed the loops in getXC and made it a lot faster:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       12   59.210    4.934   87.460    7.288 dft.py:103(setdens)
  1764575   28.180    0.000   28.180    0.000 AtomicGrid.py:86(__getitem__)
       12   19.470    1.622   25.770    2.147 dft.py:126(getXC)
     7260   12.940    0.002   15.300    0.002 CGBF.py:128(coulomb)
    10872    1.930    0.000    2.690    0.000 dft.py:89(vwn_eps)
       12    1.890    0.158    1.960    0.163 hartree_fock.py:107(getJ)
    10872    1.560    0.000    2.260    0.000 dft.py:96(vwn_deps)
    76104    1.460    0.000    1.460    0.000 dft.py:47(xx)
    15568    0.970    0.000    0.970    0.000 PGBF.py:127(amp)
     3624    0.900    0.000    5.850    0.002 dft.py:49(VWN)
     8340    0.830    0.000    1.800    0.000 CGBF.py:122(amp)
    29040    0.460    0.000    0.460    0.000 CGBF.py:74(coefs)
        1    0.450    0.450   15.750   15.750 hartree_fock.py:89(get2ints)
    29040    0.430    0.000    0.430    0.000 CGBF.py:73(exps)
    30420    0.430    0.000    0.430    0.000 CGBF.py:68(norm)


Got the program much faster:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       12   20.380    1.698   33.120    2.760 dft.py:103(getXC)
     7260   12.600    0.002   15.520    0.002 CGBF.py:128(coulomb)
     3878    7.060    0.002    7.060    0.002 GridPoint.py:55(getdens)
       12    1.980    0.165    2.070    0.172 hartree_fock.py:107(getJ)
    11028    1.670    0.000    2.620    0.000 dft.py:89(vwn_eps)
    77196    1.470    0.000    1.470    0.000 dft.py:47(xx)
    11028    1.290    0.000    1.810    0.000 dft.py:96(vwn_deps)
     3878    0.890    0.000    5.320    0.001 dft.py:49(VWN)
    15568    0.810    0.000    0.810    0.000 PGBF.py:127(amp)
     8340    0.790    0.000    1.600    0.000 CGBF.py:122(amp)
    29040    0.590    0.000    0.590    0.000 CGBF.py:74(coefs)
    29040    0.570    0.000    0.570    0.000 CGBF.py:73(exps)
    29040    0.480    0.000    0.480    0.000 CGBF.py:69(origin)
    29040    0.460    0.000    0.460    0.000 CGBF.py:70(powers)
    30420    0.430    0.000    0.430    0.000 CGBF.py:68(norm)
...basically by combining getXC and setdens


Running water now:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       12  165.930   13.828  244.250   20.354 dft.py:41(getXC)
    52975   66.360    0.001   86.450    0.002 CGBF.py:128(coulomb)
    11898   58.670    0.005   58.670    0.005 GridPoint.py:55(getdens)
       12   15.000    1.250   15.280    1.273 hartree_fock.py:107(getJ)
    35088    6.180    0.000    8.830    0.000 DFunctionals.py:84(vwn_eps)
   245616    4.650    0.000    4.650    0.000 DFunctionals.py:42(xx)
    35088    4.600    0.000    6.600    0.000 DFunctionals.py:91(vwn_deps)
    70056    4.220    0.000    4.220    0.000 PGBF.py:127(amp)
    41700    3.600    0.000    7.820    0.000 CGBF.py:122(amp)
   218200    3.570    0.000    3.570    0.000 CGBF.py:68(norm)
   211900    3.450    0.000    3.450    0.000 CGBF.py:73(exps)
        1    3.410    3.410   89.860   89.860 hartree_fock.py:89(get2ints)
   211900    3.390    0.000    3.390    0.000 CGBF.py:69(origin)
   211900    3.350    0.000    3.350    0.000 CGBF.py:75(pnorms)
   211900    3.230    0.000    3.230    0.000 CGBF.py:70(powers)
Much slower. May not be able to speed up getXC much more.


---------------------------2002-07-21---------------------------------------
Looking at the water energy. Having a hard time converging the result,
even with averaging = 0.5. Plus I'm not integrating the density
accurately, even when I use the medium grids for everything.

I should derive decent grid parameters for each atom. Something like

h_grid = (Rmax,[24,24,26,26,32,32,24,24,24,24])

or maybe (Rmax,AngGridListCoarse,AngGridListMed,AngGridListFine)

Should also code up Becke projections, which will make a difference
when we start to move atoms around and want smooth scaling.
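As a sketch of the parameterization above (hypothetical names; the H angular-grid list is copied from the note, but the Rmax value of 10.0 bohr is an assumed placeholder):

```python
# Hypothetical per-atom grid parameters: Rmax (bohr) plus the angular
# grid size for each radial shell. Only H has real values here (from
# the note above); the Rmax is an assumed placeholder.
atom_grid_params = {
    1: (10.0, [24, 24, 26, 26, 32, 32, 24, 24, 24, 24]),  # H
}

def get_grid_params(atno, default=(10.0, [24] * 10)):
    """Look up (Rmax, angular grid list) for an atom, with a fallback."""
    return atom_grid_params.get(atno, default)
```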

My results for water (1.0,90)
Term	   PyQuante	   Jaguar
Etot	   -75.6750432671  -75.850044
Eone	   -122.7649602	   -122.403039
EJ	   46.8619817374   46.4458
Ek	   -8.61313157115  -8.733785
Enuke	   8.84106676647   8.841020

Should also test whether I'm pruning too many points.

But there may be some problems with water. I don't get the same result
as Jaguar when I start from the converged wfn:

-73.8921087218 -111.786676747 36.896660271 -7.84315901198 8.84106676647
-75.8650732967 -122.456252222 46.5061055792 -8.75599341997 8.84106676647

which makes me think that my grid isn't good enough. I noticed that
when I start with Jaguar's result, not as many points are pruned from
the H potential, which suggests that I may be a bit too aggressive
there.

Looks like I eventually get the right results, though.

I'll change the code to not do the main prune, and to only prune the
points that have their weight set to zero by the patching code.

Here are the noprune results:
First iter:
-73.8761511325 -111.786676747 36.896660271 -7.8272014227 8.84106676647
Last:
-75.854837765 -122.410716443 46.4546582514 -8.73984634016 8.84106676647


2002-07-20

I'm still getting the wrong number of electrons on some of these
iterations. Should I renormalize via the weights?

The timing for the above water case is: 1286.870u 10.550s 26:43.23 
...this version is saved in getXC_old

Using a dot product for the XC matrix: 612.910u 6.640s 13:05.76 78.8%
...but the final energy was a little bit different (-75.844). This
version is saved in getXC_2

Writing a few functions now like w(), v(), e(), and dab() so that I
can do the operations in getXC in a more Matlab-like vector manner.

Interesting that as the wave function converges the tot_nel goes to
10. Maybe that's because the minimum energy is the one that has the
proper densities on the gridpoints.

Timing with vector code: 616.940u 9.240s 13:53.84 75.0%
Close enough that I'll keep it.

Maybe I can think of a better way to do the randomization of each grid
shell. Just staggering each grid by a specified amount, so that there
isn't a variation between grids. Another reason to rethink the way
that I'm putting grids together.

Putting in better timers.
The .bfprod routine took ~49-55 sec/iteration
The older way when I iterated over the grid took 43.8 sec per
iteration

Down to 39.6 sec per iteration with a grid.bf(i) vector function.

A tiny bit faster, 39.1 sec per iter, doing the math a little
differently.

Putting the nel renormalization back in takes a little bit of time
(1/2 sec per iteration), but is worth it because it provides less
variation between different grid orientations.

2002-07-22

Here's the current profiling on the water molecule
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    79781  359.940    0.005  359.940    0.005 GridPoint.py:63(setdens)
    33150  198.850    0.006  198.850    0.006 AtomicGrid.py:159(bfs)
    52975   58.560    0.001   76.690    0.001 CGBF.py:128(coulomb)
   232437   32.670    0.000   48.200    0.000 DFunctionals.py:84(vwn_eps)
  1627059   26.830    0.000   26.830    0.000 DFunctionals.py:42(xx)
   232437   26.670    0.000   37.970    0.000 DFunctionals.py:91(vwn_deps)
       17   19.820    1.166   20.130    1.184 hartree_fock.py:107(getJ)
    79781   17.060    0.000  103.230    0.001 DFunctionals.py:44(VWN)
    33354   12.930    0.000   12.930    0.000 Numeric.py:229(concatenate)
   197106    9.960    0.000    9.960    0.000 PGBF.py:127(amp)
   117325    8.680    0.000   18.640    0.000 CGBF.py:122(amp)
    79781    7.160    0.000  113.230    0.001 GridPoint.py:80(setxc)
    11050    6.310    0.001  217.960    0.020 MolecularGrid.py:139(bfs)
       17    4.120    0.242  716.020   42.119 dft.py:42(getXC)
     4693    3.510    0.001   22.180    0.005 GridPoint.py:73(set_bf_amps)

Got the per iter timing down below 30 (29.6!) by rewriting
GridPoint.setdens() to use dot products. May even be able to speed
this up more by using matrix multiplies. Don't have any great ideas
with regard to AtomicGrid.bfs() yet.

For some reason the code slowed down a great deal when I put in NumPy
arrays for the basis functions. Don't know why. Maybe the accessor
function is too slow.

Turns out it was the accessor function, not the NumPy array. The
matrix multiply brings the per iteration time to 12.5 sec! For a full
simulation the time went up to 16.8 sec/iter, with a total time of 300
sec.

Current profiling:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    29250  190.090    0.006  190.090    0.006 AtomicGrid.py:151(bfs)
    52975   56.260    0.001   73.480    0.001 CGBF.py:128(coulomb)
   204795   28.340    0.000   42.140    0.000 DFunctionals.py:84(vwn_eps)
   204795   24.210    0.000   34.200    0.000 DFunctionals.py:91(vwn_deps)
  1433565   23.790    0.000   23.790    0.000 DFunctionals.py:42(xx)
       15   17.320    1.155   17.540    1.169 hartree_fock.py:107(getJ)
    70380   14.070    0.000   90.410    0.001 DFunctionals.py:44(VWN)
    29430   10.710    0.000   10.710    0.000 Numeric.py:229(concatenate)
   197064    9.880    0.000    9.880    0.000 PGBF.py:127(amp)
   117300    8.340    0.000   18.220    0.000 CGBF.py:122(amp)
   155130    6.990    0.000    6.990    0.000 Numeric.py:330(dot)
    70380    5.540    0.000   98.430    0.001 GridPoint.py:78(setxc)
     9750    5.420    0.001  206.170    0.021 MolecularGrid.py:130(bfs)
    70380    4.580    0.000    8.990    0.000 GridPoint.py:65(setdens)
     4692    3.520    0.001   21.850    0.005 GridPoint.py:70(set_bf_amps)

Formed a collocation matrix each iteration to avoid the
AtomicGrid.bfs() bottleneck. Down to 13.4 sec/iter.  Realize that I
only have to form this matrix once. Down to 3.7 sec/iter, and 119 sec
total.
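The collocation-matrix idea can be sketched in a few lines of numpy (hypothetical names, not the actual PyQuante code; B is the npts x nbf matrix of basis-function amplitudes at the grid points, formed once):

```python
import numpy as np

def density_on_grid(B, D):
    """rho_g = sum_ij B[g,i] D[i,j] B[g,j], done as array contractions
    instead of a Python loop over grid points."""
    return np.einsum('gi,ij,gj->g', B, D, B)

def xc_matrix(B, w, vxc):
    """Vxc_ij = sum_g w_g vxc_g B[g,i] B[g,j] = B^T diag(w*vxc) B."""
    return (B * (w * vxc)[:, None]).T @ B
```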

Profiling again:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    52975   59.200    0.001   76.570    0.001 CGBF.py:128(coulomb)
   204210   30.100    0.000   43.470    0.000 DFunctionals.py:84(vwn_eps)
   204210   24.980    0.000   35.040    0.000 DFunctionals.py:91(vwn_deps)
  1429470   23.430    0.000   23.430    0.000 DFunctionals.py:42(xx)
       15   17.560    1.171   17.840    1.189 hartree_fock.py:107(getJ)
    70140   15.170    0.000   93.680    0.001 DFunctionals.py:44(VWN)
   196392   10.050    0.000   10.050    0.000 PGBF.py:127(amp)
   116900    8.360    0.000   18.410    0.000 CGBF.py:122(amp)
   154650    8.260    0.000    8.260    0.000 Numeric.py:330(dot)
     4859    7.760    0.002    7.760    0.002 Numeric.py:229(concatenate)
    70140    5.880    0.000  102.300    0.001 GridPoint.py:78(setxc)
    70140    4.570    0.000    8.820    0.000 GridPoint.py:65(setdens)
     4676    3.560    0.001   22.100    0.005 GridPoint.py:70(set_bf_amps)
   211900    3.140    0.000    3.140    0.000 CGBF.py:75(pnorms)
   218200    3.110    0.000    3.110    0.000 CGBF.py:68(norm)

Cool. Now turn the calls to S and VWN into vector calls. Turns out
that this is harder than I thought. S is trivial, but VWN is a complex
enough function that it's taking some time.

Down to 2.7 sec/iter, though.

I may have introduced an error, though. The energies I'm getting
aren't as low as they should be: -75.8311 rather than -75.853. It
could just be random variation in the grids. I should take out the
random part, so that I at least get the same results every time and
can tell when I do something that breaks the code.

Seems to be better now that I moved back to an older version of the
code. Taking 4.8 sec/iter, now, and I don't know what has caused the
increase. Looks like it's the additional call to VWN1.

Removing the random spin feature for the anggrids: final energy was
-75.8439 and the time per iter was 6.4, then 4.75, then 4.23?

Got the time back to 2.6. 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    52975   61.290    0.001   80.440    0.002 CGBF.py:128(coulomb)
   199584   11.030    0.000   11.030    0.000 PGBF.py:127(amp)
        9   10.200    1.133   10.370    1.152 hartree_fock.py:107(getJ)
    41013    9.060    0.000   11.930    0.000 DFunctionals.py:69(vwn_eps)
   118800    9.000    0.000   20.030    0.000 CGBF.py:122(amp)
     4809    8.170    0.002    8.170    0.002 Numeric.py:229(concatenate)
    41013    6.330    0.000    8.600    0.000 DFunctionals.py:76(vwn_deps)
    94158    5.330    0.000    5.330    0.000 Numeric.py:330(dot)
   287091    5.140    0.000    5.140    0.000 DFunctionals.py:67(xx)
     4752    4.100    0.001   24.240    0.005 GridPoint.py:64(set_bf_amps)
   211900    3.320    0.000    3.320    0.000 CGBF.py:73(exps)
   211900    3.220    0.000    3.220    0.000 CGBF.py:70(powers)
   211900    3.180    0.000    3.180    0.000 CGBF.py:69(origin)
   218200    3.170    0.000    3.170    0.000 CGBF.py:68(norm)
   211900    3.170    0.000    3.170    0.000 CGBF.py:75(pnorms)

Not bad -- this might be the best I can do for a while.

2002-07-22

Working on staggering rather than randomizing.

deg/shell:     Energy
0	       -75.8439
2.5	       -75.8129
5	       -75.8495
10	       -75.8347
45	       -75.8368
90	       -75.8439 (not surprising, null operation)
90*	       -75.8439
5*	       -75.8571
5.1*	       -75.8359 (bizarre, very ill-conditioned)

* mod(i,4) determines whether not to rotate (0), rotate x by 90 (1),
  rotate y by 90 (2), or rotate z by 90 (3)

So I'm going with 5*. But I shouldn't kid myself about the grids being
converged at this point.

Released PyQuante-1.0, and set up a PyQuante-users mailing list.


Attempted to do a simple DIIS routine. Ignored the stuff about
canonical orbitals, and just averaged DF-FD for AOs. Well, it isn't
working. Reasons? (1) Ignoring COs, (2) Outside of the radius of
convergence. I played around a bit trying to get into the radius of
convergence, but never made it. Looks like I have to read Pulay's
paper a bit more carefully and try again with COs.


2002-07-26

I'm still having problems with DIIS. I'm currently running He by just
computing the error vector and returning. The max value of the error
vector only gets as low as 19, even when the matrix is converged.

For some reason, He appears to work properly -- the error matrix
starts at 0.5 and goes down to 0.04 over ~10 iterations. However, the
convergence doesn't work.

I tried using the whole error matrix rather than the lower
triangle. No improvement. I also checked out the code for the
similarity transformation and traceproperty and it looks alright.
No idea.

Spent some time rereading Pulay's papers on DIIS. He derives
everything starting from a Davidson-type way of approximating the
Hessian. If that is right, though, it doesn't explain why the energy
ever goes up.

Also spent some time on D. D. Johnson's convergence routines, none of
which I managed to get working.

Finally, I spent some time figuring out the fastest way to do the
traceproperty calculation TP(A,B) = sum_ij A_ij*B_ij. There are 3 ways
to do this. Here are the results averaged over 10 runs with random
1000x1000 matrices:

sum(diagonal(matrixmultiply(A,B)))          178
sum(sum(A*B))                               0.65
dot(reshape(A,(n*m,)),reshape(B,(n*m,)))    0.057

It appears that the results aren't exactly the same, though, so I'm
sticking with the first one even though it's much slower.

2002-07-27
Still don't understand why the methods give different results. I
thought they were supposed to be the same. Still, I've found a method
that is almost as fast as the pure dot and gives the same results as
the trace-mm:

For 10 runs of size n = 200 
  trace-mm 100050.080  1.777 
   sum-sum 100205.585  0.178 
       dot 100205.585  0.022 
      loop 100205.585  4.021 
fast-trace 100050.080  0.082 

For 10 runs of size n = 500 
  trace-mm 624492.830 138.046 
   sum-sum 624389.675  1.139 
       dot 624389.675  0.186 
      loop 624389.675 31.340 
fast-trace 624492.830  0.591 
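The discrepancy between trace-mm and sum-sum has a simple explanation: sum(diagonal(matrixmultiply(A,B))) computes trace(AB) = sum_ij A_ij*B_ji, while sum(sum(A*B)) computes sum_ij A_ij*B_ij = trace(A*Bt); the two agree only when one of the matrices is symmetric (as the Fock and density matrices usually are). A quick check (numpy here standing in for Numeric):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [0.0, 0.0]])   # deliberately not symmetric

trace_mm = np.trace(A @ B)        # sum_ij A_ij * B_ji -> 3.0
sum_sum = (A * B).sum()           # sum_ij A_ij * B_ij -> 2.0
dot_flat = A.ravel() @ B.ravel()  # same sum as sum_sum -> 2.0

# Symmetrizing B makes all three agree:
Bs = B + B.T
trace_sym = np.trace(A @ Bs)      # now equals (A * Bs).sum()
```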

2002-07-29
Have had no luck getting convergence acceleration to work. Perhaps
rather than using sqrt<n-m|n-m> to determine the difference between
two density matrices, we should think about <n-m|S|n-m> or
sqrt<n-m|S|n-m>?

Using the fact that both n&m should be idempotent and thus <n|S|n> =
<m|S|m> = 1, D = sqrt(2-2<n|S|m>).
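A quick numeric sanity check of that expansion (a sketch with hypothetical density vectors; the inner product here is just <a|S|b> = a.S.b):

```python
import numpy as np

def dmat_distance(n, m, S):
    """D = sqrt(<n-m|S|n-m>), expanded as sqrt(2 - 2<n|S|m>) when
    <n|S|n> = <m|S|m> = 1. Both forms are computed and compared."""
    d = n - m
    direct = np.sqrt(d @ S @ d)
    cross = n @ S @ m
    shortcut = np.sqrt(2.0 - 2.0 * cross)
    return direct, shortcut
```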

Todo:
* Becke projection operators
* Redo the way grids are parameterized and fit parameterization for
  all atoms
* Two-point DIIS, or 2 point D matrix averaging, or maybe full DIIS.
* Can we speed up TraceEnergy by using a dot-product of the 1-d reps
  of the matrices?


2002-08-21

I'm ready for an official release. Sent a prerelease notice to
different people to get enough feedback to make a version 1.0.1 before
I release it everywhere.

Randy Hall suggested that I put a kinetic function in pyints/cints, to
make the computation of this a little easier to understand. The way
that I do this now, where PGBF makes subsequent calls to
overlap_kernel, is confusing. Having separate overlap and kinetic
functions would be preferable.
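For what it's worth, a standalone kinetic function is easy at least for s-type primitives; closed forms for a pair of unnormalized s Gaussians with exponents a and b and squared center separation R2 (a sketch of the textbook formulas, not PyQuante's actual code):

```python
from math import pi, exp

def overlap_ss(a, b, R2):
    """Overlap of two unnormalized s Gaussians whose centers are
    separated by sqrt(R2)."""
    g = a + b
    return (pi / g) ** 1.5 * exp(-a * b / g * R2)

def kinetic_ss(a, b, R2):
    """Kinetic-energy integral for the same pair, in closed form:
    T = mu * (3 - 2*mu*R2) * S, with mu = a*b/(a+b)."""
    g = a + b
    mu = a * b / g
    return mu * (3.0 - 2.0 * mu * R2) * overlap_ss(a, b, R2)
```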

Jeremy Kua suggested semiempirical techniques.

Konrad Hinsen suggested making this a package.

2002-09-03
Made some of the changes that Konrad suggested. The code is now a
proper package. Working on the other suggestions.

Wrote the pure python code to compute the kinetic energy. Testing on
h2o. 

Old code:
nbf =  25
int time =  21.9047989845
Final energy = -76.005909
scf time =  13.8176730871


New python code:
nbf =  25
int time =  23.9974490404
Final energy = -76.005909
scf time =  15.2314189672

Works! Also note the difference in CPU time, since the old code uses
the C-based overlap_kernel.

New C code:
In cints_wrap
int time =  21.9297950268
Final energy = -76.005909
scf time =  13.8325129747

Cool.

Since now I'm only using the overlap_kernel routines to do real
overlaps, I've now renamed them 'overlap', which makes more sense.

Checked in code, and rebuilt the source installation.

2002-09-08 
Charlie Schur noted that CGBF:coulomb has
    return a.norm()*a.norm()*a.norm()*a.norm()*Jij
when it should have
    return a.norm()*b.norm()*c.norm()*d.norm()*Jij

How did I miss this? 

I wrote a test suite to keep things like this from happening again. It
looks like everything is alright, although it seems that correcting
the bug didn't change anything (which I don't understand).

The test suite is called "TestSuite" and is in the tests directory.

Schur also asked for more documentation of the techniques for the XC
functionals and the DFT grid in the modules.


2002-09-09
Think I've fixed Moritz Braun's problem. Turns out it was based on a
glibc problem that occurred when python went from libc to glibc.

On py1.5.2, 1.e-310**2 works, on py2.2 it underflows.

I put in the command
>>> dens = where(dens < 1.e-10, 0, dens)
to cut off small values of the density.
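In numpy terms (hedged: the note used Numeric, whose where() has the same semantics), the cutoff looks like:

```python
import numpy as np

def clamp_density(dens, cutoff=1e-10):
    """Zero out density values below the cutoff so that later powers
    of the density (e.g. rho**(1/3) in the exchange functional) don't
    hit subnormal-underflow territory."""
    return np.where(dens < cutoff, 0.0, dens)
```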

Also put in a comment about the provenance of the VWN functional.

Also brushed up the test suite a bit:
Times:
 OS X: 722.6 sec
 Linux/Dell12: 265.4 sec

Removing the NO/FD from the test suite, since I think it is unstable.

Downloaded v1.0.3 and installed/tested it on linux -- it works.


2002-09-26
Test suite on OS X
Using CINTS: 443.66 sec

Reports from Patrick Nichols at University of New Orleans suggest that
the RYS code has a bug with larger basis sets.

Compiled/installed new rys code from PNichols. Can't test, because of
a problem with importing multiple symbols. This is related to a
problem that Konrad suggested a while back:

>There is one remaining portability problem: the C modules. They work
>fine on Linux, plus a few other platforms, but not everywhere. The
>problem is that all three C modules link the file util.c, so the
>exported symbols of that file will be present in memory three times if
>all modules are imported. On some platforms, that is not allowed.
>
>In fact, for optimal portability, all C symbols except for the module
>initialization functions should be "static". It's not always possible,
>but in any case the number of externally visible C symbols should be
>as small as possible.

Removed the link dependence of the util.c routines and rebuilding to
see what goes wrong:

cints needs: binomial_prefactor, fact_ratio2, dist2,
product_center_1D, Fgamma, fact2, fact

chgp needs: product_center_1D, dist2, Fgamma

crys needs: product_center_1D, dist2, binomial

satisfied crys,chgp,cints

Testing again. Didn't work. Looks like I have to move this around and
make a single module.

I tried two easy things: (1) simply defining everything in util
static. I don't think I did this right, because I got some comments
like "module xxx declared static but never defined" or some such.
(2) tried putting the individual routines into the different modules,
but this didn't work either.


Backed out all of my changes, and rerunning the diagnostics to make
sure that everything worked.

2002-09-27
Looks like if I define mods as static I have to put all of the
functions in the same file. Poked around the modules in the Python
distribution and it appears that these are all made that way. Each
function in that file should be static.

Playing around with the python -O option. The test suite now takes 466
sec. Guess it didn't change anything.

2002-09-28
Fixed the bug with the statically defined functions. Decided to define
each of the C modules as a single file, so combined all of the
appropriate functions appropriately.

Ran the test suite using the crys module:
with crys runs in 379 sec.

Confirmed that the new crys module works.

Set to release these files as PyQuante-1.0.4.

2002-09-29
Released files. Tested on Linux, found a bug in the c files, a
reference to util.h. Fixed and confirmed the tests work. Files
officially released now.

2002-10-01
Celebrating a release by doing some work. Working on convergence
acceleration.
Simple averaging results: (Converging h2o(1\AA,90deg) to 10^-5)
Alpha  #it
1      18
0.75   11
0.5    15
0.25   20 (oscillated?!)
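Simple averaging here is just damped mixing of the new and old density matrices; a minimal sketch (names hypothetical):

```python
import numpy as np

def average_density(D_new, D_old, alpha=0.5):
    """Damped SCF update: keep a fraction alpha of the new density
    matrix. alpha=1 is an undamped step; smaller alpha damps the
    oscillations seen in the table above."""
    return alpha * D_new + (1.0 - alpha) * D_old
```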

Anderson averaging 
Alpha  #it
1.0    26 slow, but well-behaved, a few values below 0
0.75   dnc
0.5    dnc (oscillated, values below 0)
0.25   dnc ditto

Put cutoffs of [0.1,1.0] on beta to see if this works better. Also
took out the alpha parameter, since I didn't find it doing anything.
Took 16 iterations.

Changed cutoffs to [0.01,1.5], which made the results oscillate before
converging in 14 iterations.

I'm uncomfortable with the Anderson averager because (1) the fact that
values < 0 are cropping up seems wrong, and (2) it doesn't appear
particularly stable, and (3) I don't see any advantage over simple
averaging.

Working on DIIS. Something strange is happening when I compute the
error matrix values: When the wfn is converged, the max err is still
20.3! Must be doing something wrong.

2002-10-03
Still trying to figure out what's wrong with the max err calculation. 
Ran the following tests from within Convergence.py:test_conv.py
    X = SymOrth(S)
    A = SimilarityTransform(S,X)
    mtx2file(A,'XtSX.dat')
This gives the Identity matrix I
    A = matrixmultiply(transpose(X),X)
    mtx2file(A,'XtX.dat')
Neither Diagonal nor I.
    A = matrixmultiply(X,transpose(X))
    mtx2file(A,'XXt.dat')
Diagonal but not I

This is very wrong!

Ran some simple tests: 
val,vec = Heigenvectors(S)
b = matrixmultiply(vec,transpose(vec))
c = matrixmultiply(transpose(vec),vec)

both b and c are identity (as they should be).

Must be doing something wrong in SymOrth that's breaking the
unitarity of X.

Starting to think about using LinearAlgebra:cholesky_decomposition for
the generalized eigenvalue problem.

If our problem is 
  Fc = ScE
and we can decompose
  S = LLt
then
  C = L^-1 F (L^-1)t
Multiplying Fc = ScE on the left by L^-1,
  L^-1 F c = L^-1 S c E = L^-1 L Lt c E = (Lt c) E
and since L^-1 F c = C (Lt c), thus
  C (Lt c) = (Lt c) E
where the eigenvalues E are those of the original generalized problem,
and the eigenvectors are Lt c.
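The reduction can be sketched with numpy (assuming S symmetric positive definite and F symmetric):

```python
import numpy as np

def geigh_cholesky(F, S):
    """Solve F c = S c E via the Cholesky factorization S = L L^T.
    C = L^-1 F L^-T has the same eigenvalues as the generalized
    problem; eigenvectors map back through c = L^-T y."""
    L = np.linalg.cholesky(S)
    Linv = np.linalg.inv(L)
    C = Linv @ F @ Linv.T
    E, y = np.linalg.eigh(C)
    c = np.linalg.solve(L.T, y)   # c = L^-T y
    return E, c
```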

2002-10-04
Found a confusion in LA2.SymOrth: I was confusing the canonical
transformation method of orthogonalizing the orbitals with the
symmetric orthogonalization method. Both now work; I'm using the
SymOrth one.

I was also confused about XXt = XtX = I. This is not generally the
case with either the symmetric or the canonical orthogonalizations.
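Both schemes, and the observation about XXt, can be checked in a few lines (a sketch, not the LA2 code itself): symmetric orthogonalization is X = S^(-1/2), canonical is X = U s^(-1/2) from the eigendecomposition S = U s Ut; both satisfy Xt S X = I, and in fact both give X Xt = S^-1, which is not I unless S is.

```python
import numpy as np

def sym_orth(S):
    """Symmetric (Loewdin) orthogonalization: X = S^(-1/2)."""
    s, U = np.linalg.eigh(S)
    return U @ np.diag(s ** -0.5) @ U.T

def canonical_orth(S):
    """Canonical orthogonalization: X = U s^(-1/2)."""
    s, U = np.linalg.eigh(S)
    return U @ np.diag(s ** -0.5)
```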

I'm bothered by the definitions of SimilarityTransform (XHXt) and
SimilarityTransformT (XtHX), which seem to be the opposite of what
they should be. However, this may just be a result of the stupid way
that Numpy defines matrices.

However, this does suggest why DIIS wasn't working: I had assumed that 
X(FDS-SDF)Xt = F'D'-D'F', but this isn't the case. Doesn't explain why
GVB-DIIS worked, though.

With this correction, DIIS is working much better now. But it always
gives energies that are too low (i.e. lower than the correct answer)
during the convergence. Something is definitely wrong.

Here's the convergence, along with the coefficient vector c:
dell12(PyQuante)418 % python Convergence.py
Starting DIIS: Max Err =  0.844120204135
[  1.          13.14745016]
1 -68.3247857593
[ 0.37764541  0.62235459  2.12088478]
2 -66.5869035049
[-0.10782951  0.14868808  0.95914143  0.23492074]
3 -78.0855182461
[ 0.0190478   0.00170012 -0.37228985  1.35154193  0.00352612]
4 -76.3368947949
[  8.83942974e-04  -2.81958491e-03   5.18111568e-02  -1.04526309e-01
        1.05465079e+00   1.99839722e-04]
5 -75.9395304695
[ -5.95994120e-04   2.30863570e-04  -2.85576802e-03   1.12016619e-02
       -4.32735259e-01   1.42475450e+00   7.63319889e-06]
6 -76.0027734256
[  2.39564405e-05   1.81617664e-04  -1.61070843e-03  -1.13883977e-03
        5.51677984e-02  -2.53729517e-01   1.20110569e+00   6.21831038e-07]
7 -76.0152518101
[  1.77235798e-05  -5.45575756e-06   7.20332022e-04  -1.49601670e-03
        1.40195148e-02  -1.93396087e-02  -3.62938428e-01   1.36902194e+00
        2.89551341e-08]
8 -76.0121583637
[ -6.73887828e-07  -4.13856594e-06  -7.68775073e-05   4.29879021e-04
       -5.19940893e-03   1.04695714e-02   4.41650329e-02  -3.11764495e-01
        1.26198111e+00   2.36170939e-10]
9 -76.0116514703
[ -5.94828907e-07  -3.14929047e-08   4.35438540e-05  -1.89999767e-04
        2.07353806e-03  -3.87667942e-03  -1.92046255e-02   1.27739610e-01
       -5.28894327e-01   1.42230957e+00   8.52591549e-12]
10 -76.0117489773
[  1.43365913e-07  -8.09996479e-09  -5.98113535e-06   2.37057056e-05
       -3.00222612e-04   6.76562009e-04   1.97978259e-03  -1.60603182e-02
        6.72272797e-02  -3.04140950e-01   1.25060001e+00   6.71820602e-14]
11 -76.0117595853
[ -1.09519686e-08  -6.30983614e-09   4.22260193e-07  -1.04680179e-06
        2.69350595e-05  -7.92030589e-05  -8.17298292e-05   1.23193722e-03
       -5.17189830e-03   4.12120862e-02  -2.19168764e-01   1.18203128e+00
        9.56323403e-16]
12 -76.011758993

This may simply be an artifact of the small radius of convergence of
the DIIS method, although I can't say that I understand what's going
on. The problem seems to be minimized when I set the errcutoff value
to a lower value, e.g. 0.5.

I'm also bothered by the fact that there are negative coefficients of
the fock matrices. What if I take the absolute value of every element
in the matrix A used in the least-squares fit? Will that force each of
the coefficients to be positive?

Nope, this makes the convergence much worse.

Tried some other tricks from gvbdiis: (i) the factor of 4 before the
elements of A, (ii) changing the -1 to 1. No difference.
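For reference, a minimal version of the Pulay coefficient solve (a sketch of the standard DIIS formulation, not the code discussed above): given stored error matrices e_i, minimize |sum_i c_i e_i|^2 subject to sum_i c_i = 1, via a Lagrange multiplier.

```python
import numpy as np

def diis_coefficients(errors):
    """Solve the Pulay least-squares system. The bordered matrix holds
    the error-overlap Gram matrix B_ij = <e_i|e_j> plus the constraint
    row/column enforcing sum_i c_i = 1."""
    m = len(errors)
    B = np.empty((m + 1, m + 1))
    for i in range(m):
        for j in range(m):
            B[i, j] = np.dot(errors[i].ravel(), errors[j].ravel())
    B[m, :m] = -1.0
    B[:m, m] = -1.0
    B[m, m] = 0.0
    rhs = np.zeros(m + 1)
    rhs[m] = -1.0
    return np.linalg.solve(B, rhs)[:m]
```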

2002-10-31
Started a new module, Ints.py, to contain many of the functions from
hartree_fock.py like getbasis and getints that had nothing special to
do with hartree_fock calculations. After I moved the routines I spent
a long time rerunning the test suite to ensure that I fixed everything
I broke. It could be that Util might be a better name than Ints, but
most of the functions here are integral related.

Started a new module, Quest.py, to contain a Quest-like LDA, which
should construct the Hamiltonian in linear time.

Started new modules MNDO.py and AM1.py to contain an implementation of
Dewar's MNDO and AM1 techniques.

2002-11-09
Working on MINDO3.py. Comparing results to Mopac7 for 90deg
water. Currently I have working the core hamiltonian, and it looks
like the diagonal part of the iterative hamiltonian. The energies
match before the first diagonalize -- but the density matrix is
diagonal here, and I'm not testing the offdiag parts. 

2002-11-11
Getting most of the functions in MINDO3 to work. H atom works, for
both the E and the Hf terms. h2o gives a similar energy (-481 instead
of -476) and I assume the difference is because I'm using a different
overlap term. Not sure how to adjust the heats of formation to make
this all work out.

Removed MNDO.py and AM1.py -- I think that MINDO is something of a
Pauling point for semiempirical methods, and I don't much like the
coulomb corrections that the nddo methods use.

H atom works, B atom works, O atom works. 

I'm still getting the bonding wrong for h2o. This may be due to the
overlap term. I should either renew my efforts to reproduce the slater
term, or I should get a different scaling function.

2002-11-12
Working on the Slater overlap. Got routines bfn and ss working, now
need to figure out how to build the overlap from ss.

Decided to throw out all of that stuff, and base my approach on a
Gaussian expansion of the Slater functions (Stewart, JCP 52, 431
(1970)). This is the work referenced in setupg/gover functions of
Mopac, although I rewrote the function from scratch. These functions
are now in Slater.py. These functions are what Mopac uses when it
needs to take derivatives of the functions, which I'll need to do
eventually.

For h2o, I get:
    Eel = -476.932024 eV
    Hf = -49.143422 kcal/mol
Mopac gives:
    Eel = -476.93277 eV
    Hf = -48.82849 kcal/mol

For oh I get:
    Eel = -387.323837971 eV
    Hf = 17.8148594438 kcal/mol
Mopac gives:
    Eel = -387.32616 eV
    Hf = 18.08088 kcal/mol

I think this is close enough for now. I'm going to start looking into
computing the forces now.

The differences are most likely due to either a later correction that
Mopac makes, or due to a truncation error. In either case, the
difference is small, and I'm not going to worry about 0.2 or 0.3
kcal/mol.

(Later) Fixed a bug which corrected most of the above
discrepancies. There was an ev2kcal constant that was 23.06 rather
than 23.061, and that small difference accounted for the difference
with Mopac.

2002-11-13
Did some housecleaning of the code. Tried to pull all of the parameter
specific stuff into the initialize routine. Also introduced the Atom
and the Bunch modules. Atom is a generic container for atoms, Bunch is
a generic container for anything that doesn't fit into other
containers; I'm using it here for Basis functions.

2002-11-14
Put the h2o and oh test programs into the test suite.

2002-11-17
Added some routines to simplify the program:
Minimizers.py has a simple SteepestDescent program
Bunch.py is a generic container object
Atom.py is an object to hold atom objects

2002-11-18
Added CIS.py, which currently is a naive implementation of CI-singles
(scaling as N^5-N^6), but which ultimately will be replaced with the
Foresman, Head-Gordon, and Pople approach, which scales as only
O(N^4).
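The O(N^5) version of the AO-to-MO integral transformation works by contracting one index at a time instead of all four at once; a numpy sketch (not the TransformInts code itself):

```python
import numpy as np

def transform_ints(ao_ints, C):
    """AO -> MO two-electron integral transformation, one index at a
    time: four O(N^5) quarter-transformations instead of one O(N^8)
    contraction."""
    tmp = np.einsum('pqrs,pi->iqrs', ao_ints, C)
    tmp = np.einsum('iqrs,qj->ijrs', tmp, C)
    tmp = np.einsum('ijrs,rk->ijks', tmp, C)
    return np.einsum('ijks,sl->ijkl', tmp, C)
```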

Doing some basic testing. For H2 I get Ecis = 
[-0.55917495 -0.24685179  0.98250093  1.90966663  2.18051587  2.56957927
       3.32261163  4.04053978  4.21071413]
using the N^5 integral transformation, and Ecis = 
[ 0.43022164  1.08983795  1.29963302  2.17256997  2.17256997  2.22880735
       2.6624887   2.6624887   3.61130629]
The latter one makes much more sense (negative excitation energy?).

Jaguar gives:
-1.088 h for the closed-shell energy
-0.885 h for the triplet state
-0.914 h for the open-shell singlet (ihamtyp=2)

Gamess gives:
-1.088 h for the cshf energy and the CI ground state
-0.653 h for the CI excited state
+0.017 h for the second excited state
+0.197 h for the third excited state

PyQuante gives:
-1.088 for the closed shell energy
-1.088 + 0.430 = -0.658 for the first excited state?! Is this right?
-1.088 + 1.089 = 0.001 for the second excited state
-1.088 + 1.300 = 0.212 h for the third excited state

Looks like I'm in the right ballpark, at least with the
TransformIntsSlo routines. Something still wrong in the other.

Compare the CI vectors between the two cases
Here are the GAMESS vectors:
 STATE #    1  ENERGY =      -1.089903570
        1    1.000000  2000000000
 STATE #    2  ENERGY =      -0.653007343
       10    0.999530  1100000000
 STATE #    3  ENERGY =       0.017092451
        8    0.130220  1001000000
        9    0.990101  1010000000
 STATE #    4  ENERGY =       0.196583576
        8    0.989855  1001000000
        9   -0.127297  1010000000

Looks like I'm not getting the vectors correct, and that I'm just
getting out the diagonal values of the CI matrix, which are pretty
close to the correct values. Any stabilization that comes from mixing
these states would be lost. Thus, I get state 2 nearly correct, since
it's a pure state, but make bigger mistakes in states 3 and 4, that
have a small amount of mixing.

Also working on the force optimization. Here are the forces that mopac
calculates on warped ch4 in dcart/analyt:
     1     6    52.426099    52.426099     0.000000
     2     1  -179.708874    22.335532     0.000000
     3     1    22.335532  -179.708874     0.000000
     4     1    52.473621    52.473621  -127.945763
     5     1    52.473621    52.473621   127.945763
Here are the numeric forces that I compute:
(52.387421701496351, 52.38742167421151, -9.0949470177292824e-09)
(-179.71367318750708, 22.323410257740761, 9.0949470177292824e-09)
(22.323410262288235, -179.71367318750708, 0.0)
(52.501420955195499, 52.501420946100552, -127.92967916539055)
(52.501420946100552, 52.501420955195499, 127.92967915629561)

So we're at least on the same page. Now I just have to figure out
exactly what goes into these terms and reverse engineer it.
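The numeric forces above come from finite differences of the energy; a generic central-difference sketch (hypothetical names; `energy` is any function mapping a coordinate list to an energy):

```python
def numeric_force(energy, coords, i, h=1e-4):
    """-dE/dx_i by central differences: displace coordinate i by +/- h
    and difference the energies."""
    cp = list(coords); cm = list(coords)
    cp[i] += h
    cm[i] -= h
    return -(energy(cp) - energy(cm)) / (2 * h)
```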

2002-11-20
In the process of reverse engineering the forces. There are three
terms to the force components: The nuclear repulsion, the
one-electron, and the two-electron. Mopac calls these TERMNC, TERMAB
and TERMAA. I get the nuclear repulsion term correct, the one-electron
term a little bit off, and the two-electron term completely off.

It's actually worse than that: Most of the individual elements that go
into termaa and termab are completely wrong. In particular, it looks
like cgbf.doverlap isn't computing the proper derivatives.

Got everything working, but I had to use ders's overlap derivatives
rather than my own. Would be nice to figure out the difference.
In any case, here's what I get:
[  5.39620039e+01   5.39620039e+01   2.84217094e-14]
[ -1.79877268e+02   2.22525122e+01   5.68434189e-14]
[  2.22525122e+01  -1.79877268e+02   6.57252031e-14]
[  51.83137589   51.83137589 -127.58431674]
[  51.83137589   51.83137589  127.58431674]
which is more or less correct.

Also created new modules Constants.py (which holds various conversion
units and other constants) and Elements.py (which holds data related
to elements, such as mass, atomic numbers, etc.).

2002-11-21
Profiling the Force calculations:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6165   29.810    0.005   45.500    0.007 MINDO3.py:511(mopac_doverlap)
   984960   10.610    0.000   10.610    0.000 PGBF.py:78(exp)
     4176    7.600    0.002   14.510    0.003 CGBF.py:99(overlap)
   744552    7.460    0.000    7.460    0.000 PGBF.py:82(coef)
   150336    3.960    0.000    3.960    0.000 PGBF.py:88(overlap)
       10    3.220    0.322    3.540    0.354 MINDO3.py:177(get_F2)

Need to rederive and speed up the mopac_doverlap routine, since it
takes most of the time. I think the problem with my routine is that I
took the derivative of a basis function and then took the overlap of
it, rather than directly taking the derivative of the overlap.

Tried to speed this up by pulling the big if statement out of the
inner loop over primitives, but it didn't change the timing at all.
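The suspicion above (differentiating the basis function and then overlapping, versus differentiating the overlap directly) can be checked in one dimension, where s-type primitive overlaps have a closed form. A toy sketch with my own helper names, not CGBF's API:

```python
import math

def overlap_1d(a, A, b, B):
    """Overlap of two unnormalized 1D s-type Gaussians
    exp(-a(x-A)^2) and exp(-b(x-B)^2), via the Gaussian product rule."""
    g = a + b
    return math.sqrt(math.pi / g) * math.exp(-a * b / g * (A - B) ** 2)

def doverlap_dA(a, A, b, B):
    """Analytic derivative of the overlap with respect to center A:
    differentiate the closed-form overlap directly."""
    g = a + b
    return -2.0 * a * b / g * (A - B) * overlap_1d(a, A, b, B)

# Central-difference check: the two must agree closely.
a, A, b, B, h = 1.2, 0.0, 0.8, 0.9, 1e-6
num = (overlap_1d(a, A + h, b, B) - overlap_1d(a, A - h, b, B)) / (2 * h)
```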

2002-11-25
Getting close to figuring out the source of the errors in the doverlap
calculation. Wrote a test suite that would compare the mindo to the
cgbf doverlaps for rotating h2:

ang  ------mindo3------   --------cgbf------
 0   0.60   0.00   0.00    0.63  -0.00  -0.00
10  -0.50  -0.32   0.00   -0.53  -0.34  -0.00
20   0.24   0.54   0.00    0.26   0.58  -0.00
30   0.09  -0.59   0.00    0.10  -0.62  -0.00
45   0.31   0.51   0.00    0.33   0.54  -0.00
55   0.14  -0.58   0.00    0.15  -0.61  -0.00
90  -0.27   0.53   0.00   -0.28   0.56  -0.00

so I'm only off by 5% or so. Here's what C2 looks like:
 0   0.61   0.00   0.00    0.64  -0.00  -0.00
10  -0.51  -0.33   0.00   -0.54  -0.35  -0.00
20   0.25   0.55   0.00    0.26   0.59  -0.00
30   0.09  -0.60   0.00    0.10  -0.64  -0.00
45   0.32   0.52   0.00    0.34   0.55  -0.00
55   0.14  -0.59   0.00    0.15  -0.63  -0.00
90  -0.27   0.54   0.00   -0.29   0.58  -0.00

Here are the forces on distorted ch4 with MINDO3 routines:
[  5.39620039e+01   5.39620039e+01   1.42108547e-13]
[ -1.79877268e+02   2.22525122e+01  -2.84217094e-14]
[  2.22525122e+01  -1.79877268e+02  -1.42108547e-14]
[  51.83137589   51.83137589 -127.58431674]
[  51.83137589   51.83137589  127.58431674]

and here are the forces using cgbf doverlap:

[  3.66922888e+01   3.66922888e+01   1.42108547e-14]
[ -1.67536058e+02   2.27383823e+01  -7.46069873e-14]
[  2.27383823e+01  -1.67536058e+02   5.32907052e-15]
[  54.05269363   54.05269363 -112.85913699]
[  54.05269363   54.05269363  112.85913699]

and here are the forces using doverlap_num, a numerical
differentiation routine:
[  3.66923701e+01   3.66923701e+01   1.42108547e-14]
[ -1.67536119e+02   2.27383800e+01  -7.63833441e-14]
[  2.27383800e+01  -1.67536119e+02   5.32907052e-15]
[  54.05268432   54.05268432 -112.85919323]
[  54.05268432   54.05268432  112.85919323]

so for some reason I'm off by more here.

Distorting further by giving the C2 molecule some z character:

 0  0.6043  0.0000  0.0604   0.6396 -0.0000  0.0640
10 -0.5071 -0.3288  0.0604  -0.5367 -0.3480  0.0640
20  0.2466  0.5517  0.0604   0.2610  0.5839  0.0640
30  0.0932 -0.5971  0.0604   0.0987 -0.6319  0.0640
45  0.3175  0.5142  0.0604   0.3360  0.5442  0.0640
55  0.1415 -0.5876  0.0604   0.1497 -0.6218  0.0640
90 -0.2708  0.5403  0.0604  -0.2866  0.5718  0.0640

Here's a bit more detailed example, again of C2:
ang -------mindo-------  ------cgbf---------   -----cgbf_num-------
 0  0.604  0.000  0.060   0.640 -0.000  0.064  -0.640  0.000 -0.064
10 -0.507 -0.329  0.060  -0.537 -0.348  0.064   0.537  0.348 -0.064
20  0.247  0.552  0.060   0.261  0.584  0.064  -0.261 -0.584 -0.064
30  0.093 -0.597  0.060   0.099 -0.632  0.064  -0.099  0.632 -0.064
45  0.317  0.514  0.060   0.336  0.544  0.064  -0.336 -0.544 -0.064
55  0.141 -0.588  0.060   0.150 -0.622  0.064  -0.150  0.622 -0.064
90 -0.271  0.540  0.060  -0.287  0.572  0.064   0.287 -0.572 -0.064

where the last triad represents numerical differentiation of the
overlap. Whatever mistake I'm making in doverlap, I am also making in
doverlap_num.

The questions I have:
1. If doverlap is wrong, why does it agree with doverlap_num so
exactly?
2. If mopac_doverlap is wrong, why does it agree with the MOPAC
results so well?
3. If neither is wrong, why don't they agree with each other?
Clearly I'm missing something here.

Here's the Cpx-Cpx overlaps:
 0  0.630  0.000  0.063   0.667 -0.000  0.067   0.667 -0.000  0.067
10 -0.231 -0.835  0.028  -0.244 -0.884  0.029  -0.244 -0.884  0.029
20 -0.151  0.813 -0.037  -0.160  0.860 -0.039  -0.160  0.860 -0.039
30 -0.083 -0.711 -0.054  -0.088 -0.752 -0.057  -0.088 -0.752 -0.057
45 -0.125  0.869 -0.024  -0.133  0.920 -0.025  -0.133  0.920 -0.025
55 -0.118 -0.735 -0.050  -0.125 -0.778 -0.053  -0.125 -0.778 -0.053
90  0.147  0.833 -0.033   0.156  0.881 -0.035   0.156  0.881 -0.035

Ah, found it. All of the values were off by 1.058, which equals
2*0.52918. I had put in an additional factor of 2 to make the numbers
closer; the Bohr radius (0.52918 A) converts from hartree/bohr to
hartree/A.
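For reference, a sketch of the conversion the fix implies (the constant and function name here are mine):

```python
BOHR_IN_ANGSTROM = 0.52918  # Bohr radius in Angstroms

def force_to_per_angstrom(f_hartree_per_bohr):
    # A gradient dE/dx in hartree/bohr becomes hartree/Angstrom by
    # dividing by the Bohr radius: dE/d(A) = dE/d(bohr) / 0.52918.
    return f_hartree_per_bohr / BOHR_IN_ANGSTROM
```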

Here are the forces now on distorted CH4:
[  5.39620039e+01   5.39620039e+01  -4.26325641e-14]
[ -1.79877268e+02   2.22525122e+01   5.86197757e-14]
[  2.22525122e+01  -1.79877268e+02   5.50670620e-14]
[  51.8313759    51.8313759  -127.58431672]
[  51.8313759    51.8313759   127.58431672]

This speeds up the force calculations. Previously the profiling for
rdx forces gave:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6165   29.810    0.005   45.500    0.007 MINDO3.py:511(mopac_doverlap)
   984960   10.610    0.000   10.610    0.000 PGBF.py:78(exp)
     4176    7.600    0.002   14.510    0.003 CGBF.py:99(overlap)
   744552    7.460    0.000    7.460    0.000 PGBF.py:82(coef)
   150336    3.960    0.000    3.960    0.000 PGBF.py:88(overlap)
       10    3.220    0.322    3.540    0.354 MINDO3.py:177(get_F2)


Here's what my analytic forces look like. I'm not sure this is an
improvement.
861840/7695   27.350    0.000   56.530    0.007 copy.py:152(deepcopy)
     6165   17.440    0.003   91.700    0.015 CGBF.py:135(doverlap)
   427356   11.850    0.000   11.850    0.000 PGBF.py:88(overlap)
53865/7695   10.820    0.000   54.830    0.007 copy.py:242(_deepcopy_dict)
   854712    9.050    0.000    9.050    0.000 PGBF.py:82(coef)
     4176    7.100    0.002   14.410    0.003 CGBF.py:99(overlap)
   406260    6.200    0.000    6.200    0.000 copy.py:252(_keep_alive)
    30780    4.080    0.000   47.450    0.002 copy.py:215(_deepcopy_list)
53865/7695    3.570    0.000   55.960    0.007 copy.py:268(_deepcopy_inst)
       10    3.110    0.311    3.510    0.351 MINDO3.py:179(get_F2)
   252360    2.740    0.000    2.740    0.000 copy.py:193(_deepcopy_atomic)
   221940    2.330    0.000    2.330    0.000 PGBF.py:78(exp)
    15390    1.770    0.000    4.590    0.000 copy.py:223(_deepcopy_tuple)

Here's what the numeric forces look like:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1380960/12330   42.570    0.000   89.550    0.007 copy.py:152(deepcopy)
86310/12330   17.670    0.000   87.100    0.007 copy.py:242(_deepcopy_dict)
   594216   16.660    0.000   16.660    0.000 PGBF.py:88(overlap)
     6165   15.130    0.002  124.090    0.020 CGBF.py:187(doverlap_num)
   650340   10.060    0.000   10.060    0.000 copy.py:252(_keep_alive)
   744552    7.920    0.000    7.920    0.000 PGBF.py:82(coef)
     4176    6.750    0.002   14.540    0.003 CGBF.py:99(overlap)
86310/12330    6.390    0.000   88.710    0.007 copy.py:268(_deepcopy_inst)
    49320    6.260    0.000   75.190    0.002 copy.py:215(_deepcopy_list)
   403740    3.970    0.000    3.970    0.000 copy.py:193(_deepcopy_atomic)
       10    2.990    0.299    3.490    0.349 MINDO3.py:179(get_F2)
    24660    2.630    0.000    7.050    0.000 copy.py:223(_deepcopy_tuple)
    12330    1.400    0.000    2.380    0.000 CGBF.py:129(move_center)
    73980    0.980    0.000    0.980    0.000 PGBF.py:117(move_center)

I'm going to work from the analytic forces and try to speed things
up. Working on ch4 for a while, since it's faster:

  8736/78    0.220    0.000    0.540    0.007 copy.py:152(deepcopy)
   546/78    0.190    0.000    0.520    0.007 copy.py:242(_deepcopy_dict)
       66    0.140    0.002    0.850    0.013 CGBF.py:135(doverlap)
     4680    0.110    0.000    0.110    0.000 PGBF.py:88(overlap)

Here's the sped up version. I inlined the call to cints.overlap so I
didn't have to do any deepcopies:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       66    0.460    0.007    0.790    0.012 CGBF.py:135(doverlap)
     9360    0.170    0.000    0.170    0.000 PGBF.py:82(coef)
     7992    0.110    0.000    0.110    0.000 PGBF.py:78(exp)
     5616    0.080    0.000    0.080    0.000 PGBF.py:79(origin)
       52    0.070    0.001    0.200    0.004 CGBF.py:99(overlap)
     1872    0.060    0.000    0.060    0.000 PGBF.py:88(overlap)

Here's what RDX looks like:
     6165   43.730    0.007   71.530    0.012 CGBF.py:135(doverlap)
   854712    8.610    0.000    8.610    0.000 PGBF.py:82(coef)
   775980    7.740    0.000    7.740    0.000 PGBF.py:78(exp)
     4176    6.700    0.002   13.460    0.003 CGBF.py:99(overlap)
   554040    5.650    0.000    5.650    0.000 PGBF.py:79(origin)
   554436    5.620    0.000    5.620    0.000 PGBF.py:81(norm)
   150336    3.720    0.000    3.720    0.000 PGBF.py:88(overlap)
       10    2.980    0.298    3.390    0.339 MINDO3.py:179(get_F2)
   277020    2.920    0.000    2.920    0.000 PGBF.py:80(powers)

really reduced the overhead as well. Probably still too slow,
though. Here's the numeric version:
     6165   52.500    0.009   84.130    0.014 CGBF.py:164(doverlap_num)
   887760    8.870    0.000    8.870    0.000 PGBF.py:80(powers)
   887760    8.700    0.000    8.700    0.000 PGBF.py:78(exp)
   744552    7.790    0.000    7.790    0.000 PGBF.py:82(coef)
     4176    6.680    0.002   13.850    0.003 CGBF.py:99(overlap)
   444276    4.600    0.000    4.600    0.000 PGBF.py:81(norm)
   443880    4.500    0.000    4.500    0.000 PGBF.py:79(origin)
   150336    3.970    0.000    3.970    0.000 PGBF.py:88(overlap)
       10    3.040    0.304    3.500    0.350 MINDO3.py:179(get_F2)

2002-12-12
Summarizing new features for an upcoming 1.0.5 release. Here are the
new features:
* Added a MINDO3.py module, which performs semi-empirical calculations
  using Dewar's MINDO/3 Hamiltonian
* Added a Dynamics.py module to run molecular dynamics. Currently
  only the MINDO3.py module supplies forces.
* Added a Minimizers.py module, where a steepest descent minimizer
  currently resides.

Additionally, several features have been started but are not yet
finished:
* Convergence acceleration
* Configuration Interaction Singles
* Quest linear scaling

2002-12-13
Confirmed that PyQuante-1.0.5 works on Linux. Going to distribute now.
Confirmed that the PyQuante-1.0.5 downloaded from the website runs the
test suite.

UGH! I'm inconsistent with where I'm putting the IO readers. I have
included them in IO.py, but I also have a separate XYZ.py. I should
remove XYZ.py and standardize on the version in IO.py.

2002-12-17
Playing with CIS while I have a few minutes to kill.

Gamess gives:
-1.088 h for the closed-shell HF energy and the CI ground state
-0.653 h for the CI excited state
+0.017 h for the second excited state
+0.197 h for the third excited state

PyQuante gives:
-1.088 h for the closed shell energy
-0.645 h for the first excited state
+0.017 h for the second excited state
+0.197 h for the third excited state

These are computed using the slow integral transformation routine.
The reason I'm doing so well is these energies come without any
mixing:

0 0 (0, 1) 0.999704611586
1 1 (0, 2) 0.99996719908
2 2 (0, 3) 0.999651572511

Gamess has:
 STATE #    1  ENERGY =      -1.089903570
        1    1.000000  2000000000
 STATE #    2  ENERGY =      -0.653007343
       10    0.999530  1100000000
 STATE #    3  ENERGY =       0.017092451
        8    0.130220  1001000000
        9    0.990101  1010000000
 STATE #    4  ENERGY =       0.196583576
        8    0.989855  1001000000
        9   -0.127297  1010000000
so I'm missing out on the mixing.

...and it's understandable why, since my CI matrix is essentially
diagonal.


2003-04-07 Haven't really worked on the code in a while. Having
children does that, I guess. Trying to work on convergence
acceleration first.

Geometry: h2o, 90 deg + 1 Ang

Simple averaging: 
1.0    -76.0118	  18
0.75   -76.0118	  11
0.5    -76.0118	  15
0.25   -76.0117	  20

DIIS:  -76.0118, 12 iterations (DIIS started at #4)
What is worrisome is that several of the intermediate cases had lower
energies than this: #7 had -76.057 and #8 had -76.021.

Jaguar claims that the correct energy is -76.0117 (which they get
after only 8 iterations, although their first iteration is -75.755, so
their guess is a lot better).

The energies that are lower than the correct one all come from
iterations where there were negative coefficients. 
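For context, here is the standard Pulay DIIS coefficient solve that the experiments below are tinkering with. Negative coefficients are legitimate solutions of this constrained linear system, which is why simply zeroing them misbehaves. A generic numpy sketch, not PyQuante's exact code:

```python
import numpy as np

def diis_coeffs(errors):
    """Pulay DIIS: minimize |sum_i c_i e_i|^2 subject to sum_i c_i = 1,
    via the bordered linear system with B_ij = <e_i|e_j>.
    Nothing constrains the c_i to be positive."""
    n = len(errors)
    B = np.empty((n + 1, n + 1))
    for i in range(n):
        for j in range(n):
            B[i, j] = np.dot(errors[i], errors[j])
    B[n, :n] = -1.0          # Lagrange-multiplier border enforcing
    B[:n, n] = -1.0          # the sum-to-one constraint
    B[n, n] = 0.0
    rhs = np.zeros(n + 1)
    rhs[n] = -1.0
    return np.linalg.solve(B, rhs)[:n]

# Example: two orthonormal error vectors -> equal weights
c = diis_coeffs([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
```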

Default DIIS case:
1 -68.3247857593
2 -71.2574774349
3 -73.6750387034
4 -74.6774894422
Starting DIIS: Max Err =  0.384793592311
5 -75.4153908173
6 -74.1099858518
7 -76.0566582732
8 -76.0214206227
9 -76.0110709464
10 -76.0114862219
11 -76.0117545604
12 -76.0117547638

Zeroing out negative coefficients:
1 -68.3247857593
2 -71.2574774349
3 -73.6750387034
4 -74.6774894422
Starting DIIS: Max Err =  0.384793592311
5 -75.4153908173
6 -74.1099858518
7 -79.6853982252
8 -86.0566942848
9 -106.045188842

Clearly a bad idea...I think because it changed the overall magnitude
of the sum of the elements. 

Idea 2: zero the negative elements and renormalize:
1 -68.3247857593
2 -71.2574774349
3 -73.6750387034
4 -74.6774894422
Starting DIIS: Max Err =  0.384793592311
5 -60.6675779832
6 -73.3561552221
7 -75.9970292431
8 -75.9925541998
9 -76.0158197863
10 -76.0143359202
11 -76.0125909574
12 -76.0120954495
13 -76.0114471473
14 -76.0117474592
15 -76.0117942691
16 -76.0117681938
17 -76.0117787514
18 -76.0117621301
19 -76.0117579592

Not a good solution...

Guess I'm going to stick with the default option.

I felt that DIIS was working well enough to put it into all of the
software. Here are the results for the test suite:

Case	  Old		       New
-----	  ------------	       --------------
H2	  -1.082097/3	       -1.082098/4
He	  -2.855156/2	       -2.855609/3
He/DFT	  -2.826639/12	       -2.826698/4
H2O	  -76.005912/9	       -76.011751/8
H2O/DFT	  -75.876804/16	       -76.876901/6
Ne	  -128.474391/8	       -128.474406/7
----------------------------------------------
time	101.3 sec		80.42 sec

2003-04-08
Making progress with CIS:
Gamess has:
 STATE #    1  ENERGY =      -1.089903570
        1    1.000000  2000000000
 STATE #    2  ENERGY =      -0.653007343
       10    0.999530  1100000000
 STATE #    3  ENERGY =       0.017092451
        8    0.130220  1001000000
        9    0.990101  1010000000
 STATE #    4  ENERGY =       0.196583576
        8    0.989855  1001000000
        9   -0.127297  1010000000

Here are my results:
State 0, energy    -1.0883
0  --    1.0
State 1, energy    -0.6452
1 (0, 1) 0.999705126298
3 (0, 3) 0.0242809944955
State 2, energy     0.0171
2 (0, 2) 0.999967202847
6 (0, 6) 0.00809896482679
State 3, energy     0.1976
1 (0, 1) 0.0242828605577
3 (0, 3) -0.999652064826
9 (0, 9) -0.010300095731

Looks like this works somewhat, although I'd imagine that my orbitals
are different from what GAMESS uses.

Got the O(N^5) integral transformation working as well.

Getting ready to release v1.1.0. Released but not announced.

2003-04-09
Renamed CIS.py -> CI.py, just in case I want to code up other CI
methods. The TransformInts routine is still here, although one could
make a good argument that it should be somewhere else, such as
Ints.py.

2003-04-11
Started QMC.py, to hold info for fixed-node QMC calculations.
Started MP.py, for MP perturbation theory calculations.

MP2 works for H2, and should work for all closed shell systems. Not
going to worry about open shell systems just yet.

Running MP2 for CH4. The integral transform is incredibly slow. I'll
have to figure out a way to speed this up sometime.  Probably should
just rewrite in C, although it also isn't too hard to make the inner
loop into a dot-product.
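The inner-loop-as-dot-product idea generalizes: each quarter transformation of the AO integrals is a matrix contraction, giving four sequential O(N^5) steps instead of one O(N^8) loop nest. A numpy sketch on toy arrays (not PyQuante's actual integral storage):

```python
import numpy as np

def transform_ints(aoints, C):
    """Four sequential quarter-transformations (pq|rs) -> (ij|kl).
    Each tensordot is the 'inner loop as a dot product'."""
    t = np.tensordot(aoints, C, axes=([0], [0]))  # p -> i, axes now (q,r,s,i)
    t = np.tensordot(t, C, axes=([0], [0]))       # q -> j, axes now (r,s,i,j)
    t = np.tensordot(t, C, axes=([0], [0]))       # r -> k, axes now (s,i,j,k)
    t = np.tensordot(t, C, axes=([0], [0]))       # s -> l, axes now (i,j,k,l)
    return t

# Check against the naive O(N^8) contraction on a tiny random case
n = 3
rng = np.random.default_rng(0)
aoints = rng.random((n, n, n, n))
C = rng.random((n, n))
ref = np.einsum('pqrs,pi,qj,rk,sl->ijkl', aoints, C, C, C, C)
```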

I should also think about breaking Emp2 into electron pair contributions.

Had to kill the CH4 MP2 run since it was too slow.

H2 integral transforms:
No dot products: 10.2 sec.
1 dot prod: 7.344 sec, same energy
3 dot prods: 3.35 sec, same energy
4 dot prods: 1.74 sec, same energy
Looks like I got a factor of 10 for free.

Could probably get a bit more by putting in a matrix multiply for two
of the indices.

It took 420.8 sec for the integral transform for ch4. But I get the
MP2 energy wrong -- by quite a bit, it turns out! 

3.5^x = 420.8/1.74 = 241.8
x = log(241.8)/log(3.5) = 4.4 So the timing is alright.

MP2 Results
Job	JagSCF	JagMP2	PyQSCF	PyQMP2	GMSMP2
H2	-1.0883	-0.0276	-1.0883	-0.0271	
He	-2.8552	-0.0255	-2.8552	-0.0254
LiH	-7.9813	-0.0204	-7.9813	-0.0209	-0.0203
CH4	-40.202	-0.1612	-40.202	-0.1682	-0.1629

So I make a mistake for everything other than h2. Ugh.

Substituted the slow integrals, but got the same result. So the
mistake probably isn't in the integrals.

Ah, fixed the bug. In MP2, to range over the unoccupied orbitals I was
looping over range(nclosed,nvirt), rather than
range(nclosed,nclosed+nvirt). Fixed.
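For the record, here is a sketch of the closed-shell MP2 loop with the corrected virtual range, in chemist's notation (ia|jb). The integral accessor and argument order are my assumptions, not PyQuante's actual interface:

```python
def mp2_energy(moints, orbe, nclosed, nvirt):
    """Closed-shell MP2 in spatial orbitals: E2 = sum over occ i,j and
    virt a,b of (ia|jb)[2(ia|jb) - (ib|ja)] / (e_i + e_j - e_a - e_b).
    `moints(i,a,j,b)` is a stand-in returning the MO integral (ia|jb)."""
    e2 = 0.0
    for i in range(nclosed):
        for j in range(nclosed):
            # The bug above: this must be nclosed..nclosed+nvirt,
            # not range(nclosed, nvirt).
            for a in range(nclosed, nclosed + nvirt):
                for b in range(nclosed, nclosed + nvirt):
                    iajb = moints(i, a, j, b)
                    ibja = moints(i, b, j, a)
                    e2 += iajb * (2.0 * iajb - ibja) / (
                        orbe[i] + orbe[j] - orbe[a] - orbe[b])
    return e2

# Toy check: one occupied, one virtual orbital, constant integral 0.2:
# E2 = 0.2*(2*0.2 - 0.2)/(-2) = -0.02
e2 = mp2_energy(lambda i, a, j, b: 0.2, [-0.5, 0.5], 1, 1)
```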

2003-04-12
Reworked the integral transform routine for MP2 so that it only
computes the ones I need, rather than the full transform. This speeds
things up substantially.

Job     PyQMP2		told	tnew
H2      -0.0271		1.347	0.0882
He      -0.0254		0.0955	0.00827
LiH     -0.0209		38.3	1.74
CH4     -0.1682		380	34.4

Very cool.


2004-01-06
Working on accelerating the HGP integrals.  The code is in
PyQuante/Tests/integrals.py.

Here are the times for the H2O/6-31G** ints running on a 867 MHz PB G4 laptop:

CHGP Ints:     160.359099984
CRys Ints:    10.8592420816
CINTS Ints:    26.781692028

Much too slow right now; I had planned to compare the timing to the
pure python routines, work on improving the code there, and then
implement the changes in the C versions, but the CHGP code is too slow
for this. 

Down shift to H2O/STO-3G on the same machine:
CHGP Ints:     2.42264199257
CRys Ints:    0.400830984116
CINTS Ints:    1.19664001465
HGP Ints:      68.4531810284
Rys Ints:   27.1079589128
Py Ints:   47.6266140938
I've got to be able to speed this up if I'm slower than the slow way
of doing the integrals.


2004-01-07
Strategy is to evolve the hgp routine into a more intelligent form. I
had initially hoped to write an hgp2 routine from scratch, but I was
intimidated by getting the code working.

Currently working from the test routine inside of hgp. Computing the
(px,px|px,px) integral from the water case.

Original hgp integral (time,value): 
1.22837793827 0.880158468466

Modifying to include contr_vrr routine:
1.24885296822 0.880158468466


Here's the profiling:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
10287/324    1.730    0.000    3.540    0.011 <stdin>:214(vrr)
     6480    0.460    0.000    0.710    0.000 pyints.py:115(_gser)
    30861    0.450    0.000    0.450    0.000 pyints.py:221(gaussian_product_center)
     6480    0.330    0.000    1.360    0.000 pyints.py:78(Fgamma)
     6480    0.250    0.000    0.250    0.000 pyints.py:85(gammln)
     6480    0.160    0.000    1.030    0.000 pyints.py:99(gamm_inc)
     6480    0.160    0.000    0.870    0.000 pyints.py:104(gammp)
        4    0.020    0.005    3.560    0.890 <stdin>:123(contr_vrr)

Clearly the bulk of the time is spent in vrr calls. What if I simply
use the existing contr_vrr routine to precompute all of the required
values?

The simple way of doing this didn't work, since I call contr_hrr
recursively, which means that I'm recomputing all of the terms at each
recursive call. I'm going to modify contr_hrr to remove the
recursion.

Modifying contr_hrr:
1.50819897652 0.880158468466 
(sigh)

Here's the profiling:
12312/729    1.870    0.000    4.220    0.006 <stdin>:299(vrr)
    36936    0.660    0.000    0.660    0.000 pyints.py:221(gaussian_product_center)
     7857    0.400    0.000    1.690    0.000 pyints.py:78(Fgamma)
     7857    0.400    0.000    0.670    0.000 pyints.py:115(_gser)
     7857    0.380    0.000    1.050    0.000 pyints.py:104(gammp)

(sigh^2)

The full h2o/sto-3g integral tests with this code now take:
HGP Ints:      97.1718000174

So, after a day of clever coding, I've managed to slow the code down
by 50%. My aim in this was to reduce the number of calls to vrr, by
ensuring that contr_vrr was only called once for each appropriate
value. For some reason, this didn't work, and I actually *increased*
the number of calls to vrr by 20% or so.

Go back to the recursive version of contr_hrr, and see if replacing
the recursive vrr with an iterative one helps.
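One generic way to get "each value computed only once" while keeping the recursive structure is memoization. A toy illustration with a call counter, using fib as a stand-in for vrr (the real routine would need hashable arguments):

```python
from functools import lru_cache

calls = 0

def fib(n):
    """Naive recursion: recomputes the same subproblems many times,
    like the repeated vrr calls described above."""
    global calls
    calls += 1
    return n if n < 2 else fib(n - 1) + fib(n - 2)

fib(20)
uncached = calls  # tens of thousands of redundant calls

@lru_cache(maxsize=None)
def fib_memo(n):
    """Same recursion, but each distinct argument is evaluated once."""
    global calls
    calls += 1
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

calls = 0
fib_memo(20)
cached = calls  # one call per distinct argument (0..20)
```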


2004-01-20

I've been working on a more clever way of doing the VRR integrals in
hgp. 

Debugging for different input values:

XYZ a,b,c,d:  (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
lmn a,c:  (0, 0, 0) (0, 0, 0)
4.37335456733 4.37335456733  (works)

XYZ a,b,c,d:  (1.0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
lmn a,c:  (0, 0, 0) (0, 0, 0)
2.44716366803 2.44716366803  (works)

XYZ a,b,c,d:  (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
lmn a,c:  (1, 0, 0) (0, 0, 0)
0.0 0.0  (works)

XYZ a,b,c,d:  (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
lmn a,c:  (1, 0, 0) (1, 0, 0)
0.0 0.182223106486  (fails)

Also fails in exactly the same way for the py,py and pz,pz cases,
which is strange, since if it was a simple typo it probably would only
exist for one of them.

XYZ a,b,c,d:  (1.0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
lmn a,c:  (1, 0, 0) (0, 0, 0)
-1.83537275102 -1.41425045812  (fails)


2004-01-21
Reworked one of these cases, and I'm now off in a different way:
XYZ a,b,c,d:  (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
lmn a,c:  (1, 0, 0) (1, 0, 0)
M:  0
0.546669320916 0.182223106486
...which is exactly a factor of 3.

Fixed that bug, purely by luck.
XYZ a,b,c,d:  (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
lmn a,c:  (1, 0, 0) (1, 0, 0)
M:  0
0.182223106486 0.182223106486

XYZ a,b,c,d:  (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
lmn a,c:  (1, 0, 0) (1, 0, 0)
M:  1
0.109333863767 0.109333863767

XYZ a,b,c,d:  (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
lmn a,c:  (1, 1, 0) (1, 1, 0)
M:  1
0.00976195211583 0.00976195211583

XYZ a,b,c,d:  (0, 0, 0) (0, 0, 0) (1, 0, 0) (0, 0, 0)
lmn a,c:  (1, 1, 0) (1, 1, 0)
M:  1
-0.00187800627934 -0.00187800627934

Looks pretty good!

Seems to be almost a factor of 2 faster.

Testing the full code:

	Timing        Integral
Old	1.13044500351 0.880158468466
New	0.238894939423 0.880158468466

Overall a factor of 5!

Timing of the overall code for h2o/sto-3g integrals:
(old results)
CHGP Ints:     2.42264199257
CRys Ints:    0.400830984116
CINTS Ints:    1.19664001465
HGP Ints:      68.4531810284
Rys Ints:   27.1079589128
Py Ints:   47.6266140938

(new)
HGP Ints:      30.2865087986

So I've sped up the python integrals so that they're comparable with
the Rys quadrature.


Working on putting the code into chgp.c.

Step 1. Putting in the contr_vrr routine. (mostly for convenience)
 Testsuite works. Total time = 507.5 sec
 Compare to the total time with crys = 160.2.

Step 2. The code is starting to work. I was a little wasteful in
implementing scratch space for the integrals. Worry about this later.

Currently the code works for everything except for h2o/hf, h2o/dft,
ne. Looks like the d-orbitals are messing it up.

The timing was slow as well (400 sec).

However, when I try to test out the d-integrals (see
Tests/integrals.py), everything looks like it's alright. At least the
c version matches the python version.



2004-01-22
Ran the h2o/hf test case. I get the correct result with HGP (after
fixing a bug), but it takes 1627 sec.

The CHGP version did *not* give the right answer, and it took 467 sec.

Running a few specific test cases. 
All two-e ints for OH/sto-3g (oriented in all directions) match
All two-e ints for O/6-31G** match except for a few of them; I'm
trying to figure out which ones now.

Looks like (2,4,9,14) is one, which is (px,pz,x2,xz). However,
(9,14,2,4) is done correctly.


Found the bug -- I hadn't set MAXAM high enough, and my assert
statements were being skipped because distutils compiles by default
with the -DNDEBUG flag in place.

h2o now works, but takes 516.5 sec, which is about as slow as it was
before the improvement.

Allocating one gigantic array for the vrr_terms and removing the
malloc/free calls speeds the code to only 48.6 sec!

This is still not as fast as the rys routine (which takes 32.4 sec),
but is at least competitive.

The total time for the test suite is now 125 sec, which is actually a
bit faster than the Rys timing (145.6), for reasons that I don't
understand. But this is enough for me to commit to the HGP algorithm
for all upcoming integrals.

2004-01-23
Here's the profiling for one of the integrals:
      324    0.280    0.001    0.530    0.002 <stdin>:462(vrr)
     1296    0.110    0.000    0.130    0.000 pyints.py:115(_gser)
     1296    0.040    0.000    0.240    0.000 pyints.py:78(Fgamma)
     1296    0.040    0.000    0.170    0.000 pyints.py:104(gammp)
     1296    0.030    0.000    0.200    0.000 pyints.py:99(gamm_inc)
Looks like I might do well to speed up the Fgamma calculations.

Using the MD recursion relationship:
    Fgterms = [0]*(mtot+1)
    Fgterms[mtot] = Fgamma(mtot,T)
    for im in range(mtot-1,-1,-1):
        Fgterms[im]=(2.*T*Fgterms[im+1]+exp(-T))/(2.*im+1)
Seems to speed things up quite a bit, at least in pure python.
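A self-contained version of the same downward recursion, with a stable series used for the topmost value (function names here are mine, not pyints'):

```python
import math

def boys_reference(m, T, tol=1e-14):
    """Boys function F_m(T) from the always-positive series
       F_m(T) = exp(-T) * sum_k (2T)^k / ((2m+1)(2m+3)...(2m+2k+1))."""
    term = 1.0 / (2 * m + 1)
    total = term
    k = 1
    while term > tol:
        term *= 2.0 * T / (2 * m + 2 * k + 1)
        total += term
        k += 1
    return math.exp(-T) * total

def boys_downward(mtot, T):
    """One series evaluation at the top, then the MD downward recursion
       F_m = (2T F_{m+1} + exp(-T)) / (2m+1) used above."""
    F = [0.0] * (mtot + 1)
    F[mtot] = boys_reference(mtot, T)
    eT = math.exp(-T)
    for m in range(mtot - 1, -1, -1):
        F[m] = (2.0 * T * F[m + 1] + eT) / (2 * m + 1)
    return F

# F_0 has a closed form against which the recursion can be checked:
# F_0(T) = (1/2) sqrt(pi/T) erf(sqrt(T))
F = boys_downward(4, 1.2)
```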

Whoops, I may have been timing the crys all along.

TessSweet times:
chgp	  128.3
crys	  92.4	Guess that's still a bit faster

h2o/sto-3g test
hgp/old	  31.802
hgp/new	  23.062
chgp/old  1.127
chgp/new  0.916

Total TessSweet with chgp/new: 108.1; not quite as fast, but perhaps
fast enough to use it over crys. I really don't know why it isn't
faster than Rys.


How many times are vrr called with recursive hrr?
324 (for the simple test case)
The iterative one:
729 (Hmm. Something must be wrong).


2004-01-27
Doing some tests with the calling order:
crys: ijkl	 89.023
      klij	 88.544
chgp: ijkl	 106.14
      klij	 105.98
      ijkl/sw	 106.63
Guess there was no low-hanging fruit to be gotten from rearranging the
call order.

2004-02-01
There may be some more low-hanging fruit with interchanging the x,y,z
coordinates so as to minimize the recursion in hrr and vrr.

Testing out vrr for (6,0,0)|(0,0,0)
Values:   0.647949576728 0.647949576728
Timings:  0.000955104827881 0.000870943069458

Now trying (0,6,0)|(0,0,0)
Values:   0.647949576728 0.647949576728
Timings:  0.000964164733887 0.000799894332886

Now trying (0,0,6)|(0,0,0)
Values:   0.647949576728 0.647949576728
Timings:  0.00108194351196 0.000892877578735

Looks inconclusive: Run each test 100 times:
Testing out vrr for (6,0,0)|(0,0,0)
Values:   0.647949576728 0.647949576728
Timings:  0.0775139331818 0.085841178894

Now trying (0,6,0)|(0,0,0)
Values:   0.647949576728 0.647949576728
Timings:  0.0772438049316 0.0800070762634

Now trying (0,0,6)|(0,0,0)
Values:   0.647949576728 0.647949576728
Timings:  0.0812270641327 0.0880320072174

Still looks pretty inconclusive. Now try out with 10000 calls of the
C-version of VRR:
(6,0,0):
Values:   0.647949576728 0.647949576728
Timings:  0.264533996582 0.275687932968

(0,6,0):
Values:   0.647949576728 0.647949576728
Timings:  0.261992931366 0.300312995911

(0,0,6):
Values:   0.647949576728 0.647949576728
Timings:  0.275708913803 0.264741182327

Run the same test for (2,0,0):
(2,0,0):
Values:   0.911115537897 0.911115537897
Timings:  0.20596408844 0.21582698822

(0,2,0):
Values:   0.911115537897 0.911115537897
Timings:  0.203845977783 0.225471019745

(0,0,2):
Values:   0.911115537897 0.911115537897
Timings:  0.219904899597 0.21017408371

There are obviously minor differences here, but nothing really stands
out. There still might be a 1-5% difference, but probably not even
worth implementing right now.

Now trying a 6,0,0 case with xa!=0:
(6,0,0):
Values:   2.92921202968 2.92921202968
Timings:  0.279041051865 0.304389953613

(0,6,0):
Values:   2.92921202968 2.92921202968
Timings:  0.275798082352 0.291442871094

(0,0,6):
Values:   2.92921202968 2.92921202968
Timings:  0.308526039124 0.290766954422


Still no big difference. I don't know why putting the large values in
the y-direction wins; I would have thought it would be either the
x-dir or the z-dir. Perhaps there are two opposing effects, one of
which favors the x and one of which favors the z, so that when you add
them together, neither wins.


Seeing if there's a difference in the hrr integrals:
Timings, all on one center:
(px,s,s,s): 0.018
(s,px,s,s): 0.0362
(s,s,px,s): 0.0183
(s,s,s,px): 0.0381

(py,s,s,s): 0.0204
(s,py,s,s): 0.0515
(s,s,py,s): 0.0190
(s,s,s,py): 0.0340

(pz,s,s,s): 0.0190
(s,pz,s,s): 0.0354
(s,s,pz,s): 0.0210
(s,s,s,pz): 0.0378

Timings, multiple centers:
(px,s,s,s): 0.0217
(s,px,s,s): 0.0388
(s,s,px,s): 0.0246
(s,s,s,px): 0.0600

(py,s,s,s): 0.0239
(s,py,s,s): 0.0534
(s,s,py,s): 0.0241
(s,s,s,py): 0.0406

(pz,s,s,s): 0.0206
(s,pz,s,s): 0.0408
(s,s,pz,s): 0.0215
(s,s,s,pz): 0.0433

I'm not sure there really is any low-hanging fruit. Here are all of
the integral calls for h2o. 

Whoops! Can't print them all out. But it's clear that there is a
wide-enough range of integrals that a routine to put integrals in the
fastest form would be worthwhile.

Here are the rules for such a routine:
1. Large values should be in the a or c position of (ab|cd) rather
than the b or d positions, which are almost twice as slow.
2. There may be some minor advantage to having large values in
x-direction rather than y- or z-directions, but this may be too small
to make a difference.
3. There may be some advantage to having large values in the ab
position of (ab|cd) rather than cd, but this also may be too small.
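Rules 1-3 amount to a small canonicalization pass that exploits the permutational symmetry (ab|cd) = (ba|cd) = (ab|dc) = (cd|ab). A sketch over abstract shells, where the interface and names are hypothetical:

```python
def canonical_order(a, b, c, d, am):
    """Reorder the four shells of (ab|cd) into the faster form.
    `am` maps a shell to its total angular momentum; the shells
    themselves are opaque objects (bare ints in the test below)."""
    if am(a) < am(b):
        a, b = b, a              # rule 1: large values in the a slot
    if am(c) < am(d):
        c, d = d, c              # rule 1: large values in the c slot
    if am(a) + am(b) < am(c) + am(d):
        a, b, c, d = c, d, a, b  # rule 3 (tentative): bigger pair first
    return a, b, c, d

# Example with shells represented by their angular momenta directly
shells = canonical_order(0, 1, 2, 0, am=lambda s: s)
```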

Unaltered routine for h2o test:
-76.0117511531 -76.011751 48.0270109177

AB/CD swaps:
40.36 seconds! Nice improvement.

Total timings for the test suite:
        No swaps  AB/CD swaps
crys    95.97     101.24
chgp	116.6	  114.44

Hmm. There is another wrinkle that I didn't think of, since evidently
the swaps don't help as much as I thought they would.

Guess I still have to stick with crys/noswaps.

2004-03-28
Attempting to do a grid-based integration for the DFT J operator.
The routine is in dft.getJ_dft, and basically uses the nuclear
attraction integrals. I can rewrite this routine to pass in the entire
vector of grid points so that I only need a single call; such a
routine would also speed the nuclear attraction integrals, although
this is a negligible amount of time.
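The assembly described here can be sketched as follows; all names (getJ_grid, vpot, etc.) are stand-ins rather than the actual dft.getJ_dft interface:

```python
import numpy as np

def getJ_grid(points, weights, dens, vpot):
    """J_ij ~ sum_g w_g * rho(r_g) * V_ij(r_g), where V_ij(r_g) is a
    nuclear-attraction-like matrix for a unit charge at grid point r_g.
    One Python-level call per point; passing the whole vector of grid
    points into the integral code at once would collapse this loop."""
    J = None
    for r, w, rho in zip(points, weights, dens):
        term = w * rho * vpot(r)
        J = term if J is None else J + term
    return J

# Tiny smoke test with a mock potential matrix standing in for the
# real nuclear attraction integrals
pts = [np.zeros(3), np.ones(3)]
J = getJ_grid(pts, [0.5, 0.5], [1.0, 2.0],
              lambda r: (1.0 + r.sum()) * np.eye(2))
```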

I had to put in an arbitrary factor of -0.5, but now the two J
matrices are within 1% or so, which is probably about right. The J_dft
is still much slower, but I know how to speed this up.

I got an answer of -75.7552 (rather than -75.8769). I think if I can
pull out the reference density the way Quest does this will work
better, since it will be less dependent on the grid density.

Tried to run this in the Quest module, and I should have had exactly
the same results as before, but instead I got a convergence that
looked like:
rmuller-mac.local.(PyQuante)127 % python Quest.py
  -75.5381   -19.2885   -56.4434    -8.6473     8.8411
  -67.8529   -14.9586   -49.9942   -11.7412     8.8411
  -70.5473   -16.4814   -51.7287   -11.1782     8.8411
  -69.9710   -16.1995   -51.2946   -11.3180     8.8411
  -70.1230   -16.2740   -51.4081   -11.2820     8.8411
  -70.0835   -16.2550   -51.3781   -11.2915     8.8411
  -70.0940   -16.2600   -51.3861   -11.2890     8.8411
  -70.0912   -16.2587   -51.3839   -11.2896     8.8411
  -70.0920   -16.2591   -51.3845   -11.2895     8.8411
  -70.0918   -16.2590   -51.3844   -11.2895     8.8411
  -70.0918   -16.2590   -51.3844   -11.2895     8.8411
I'm going to retest the h2o_dft.py module before doing further hacking
of Quest.py. Here's what h2o_dft puts out:
  -67.8110  -135.1371    70.2385   -11.7534     8.8410
  -69.0193  -100.3954    29.4116    -6.8766     8.8410
  -75.7066  -122.7374    46.9585    -8.7688     8.8410
  -75.7432  -121.3799    45.4159    -8.6202     8.8410
  -75.7415  -122.9286    47.1586    -8.8125     8.8410
  -75.7555  -122.3268    46.4611    -8.7309     8.8410
  -75.7553  -122.3431    46.4804    -8.7336     8.8410
  -75.7552  -122.3457    46.4834    -8.7339     8.8410
Here is Quest.py, with the terms regrouped to match what h2o_dft.py prints:
  -63.0657   -90.3172    24.8292    -6.4187     8.8411
  -70.6737  -134.2882    65.9109   -11.1374     8.8411
  -69.7190  -101.7558    30.1601    -6.9644     8.8411
  -71.2325  -133.8235    64.7252   -10.9754     8.8411
  -69.9365  -102.2607    30.4863    -7.0032     8.8411
  -71.2774  -133.7758    64.6184   -10.9610     8.8411
  -69.9559  -102.3054    30.5150    -7.0066     8.8411
Looks like the initial density is wrong. Here is another rewrite of
Quest that does exactly what h2o_dft does.
  -69.0192  -100.3954    29.4117    -6.8766     8.8411
  -71.0959  -133.9339    65.0108   -11.0138     8.8411
  -75.5511  -124.9383    49.6328    -9.0867     8.8411
  -75.2462  -116.7985    40.8187    -8.1075     8.8411
  -75.7501  -122.6441    46.8260    -8.7730     8.8411
  -75.7548  -122.3530    46.4921    -8.7349     8.8411
  -75.7551  -122.3474    46.4854    -8.7341     8.8411
Looks good; back up and see if the TF densities can achieve this:
  -63.0657   -90.3172    24.8292    -6.4187     8.8411
  -70.6737  -134.2882    65.9109   -11.1374     8.8411
  -75.3868  -125.9730    51.0128    -9.2676     8.8411
  -74.9264  -115.0601    39.2130    -7.9203     8.8411
  -75.7452  -122.7627    46.9611    -8.7846     8.8411
  -75.7545  -122.3746    46.5169    -8.7379     8.8411
Whooo! It will probably be faster if I renormalize the sum of dens0
over the atomic grid to the atomic number.
  -72.8593  -132.2248    61.0453   -10.5208     8.8411
  -70.6942  -103.6558    31.1935    -7.0730     8.8411
  -75.4598  -125.5551    50.4354    -9.1811     8.8411
  -74.9119  -115.3005    39.5154    -7.9678     8.8411
  -75.7476  -122.7112    46.9038    -8.7812     8.8411
  -75.7545  -122.4021    46.5475    -8.7410     8.8411
  -75.7550  -122.3572    46.4967    -8.7355     8.8411
  -75.7551  -122.3472    46.4851    -8.7341     8.8411
Turns out that the grids were surprisingly un-normalized:

Atno  Initial  Renorm
8     18.286   8
1     8.396    1
1     8.650    1
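The renormalization itself is just a rescaling of the density values by Z over the weighted sum; a minimal sketch with illustrative names:

```python
def renormalize(weights, dens, Z):
    """Rescale the grid density values so that the weighted sum
    integrates to the atomic number Z.  Illustrative names."""
    total = sum(w * d for w, d in zip(weights, dens))
    scale = Z / total
    return [d * scale for d in dens]
```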

Can't imagine why the TF densities are so far off. This was with
Sommer's rho. The Tietz rho is even worse:

8     21.068   8
1     10.013   1
1     10.244   1

Okay, fixed that bug; don't know where it came from. Here are the
current integrated densities:

8 5.63148221941    8.0
1 0.645113649623   1.0
1 0.658004445484   1.0

The smaller values probably come from the fact that I'm truncating
my grid at too small a value.

Here is the new convergence behavior, first not renormalized, then
renormalized.
  -72.3331  -132.9079    62.4156   -10.6818     8.8411
  -70.3846  -103.0869    30.9077    -7.0464     8.8411
  -75.5237  -125.1425    49.8949    -9.1171     8.8411
  -75.1231  -116.2361    40.3278    -8.0558     8.8411
  -75.7496  -122.6436    46.8258    -8.7729     8.8411
  -75.7549  -122.3699    46.5107    -8.7368     8.8411
normalized:
  -75.1270  -116.6675    40.7624    -8.0631     8.8411
  -74.2015  -129.8798    56.7957    -9.9584     8.8411
  -75.7577  -122.1051    46.2067    -8.7003     8.8411
  -75.7547  -122.3081    46.4422    -8.7298     8.8411
  -75.7553  -122.3299    46.4654    -8.7319     8.8411
  -75.7551  -122.3433    46.4807    -8.7336     8.8411
  -75.7551  -122.3451    46.4828    -8.7338     8.8411
Nice!

-----------------2004-03-31-----------------
I've written the cints routine that groups all of the nuclear centers
into a single call; this code is running fairly slowly, I think because
it's swapping. But it appears to have the right values.

-----------------2004-04-01-----------------
Here are the timings for one iteration using the nuclear and
nuclear_vec routines.

nuclear:
  -75.1270   -71.8019    -4.1032    -8.0631     8.8411
480.510u 10.310s 10:02.56 81.4% 0+0k 0+6io 0pf+0w

nuclear_vec:
  -75.1270   -71.8019    -4.1032    -8.0631     8.8411
333.360u 11.090s 7:08.62 80.3%  0+0k 0+6io 0pf+0w
Well, a little bit faster. 

I put in a function that excludes points with charge < 1e-8:
  -75.1270   -71.8022    -4.1029    -8.0631     8.8411
296.440u 10.220s 7:17.90 70.0%  0+0k 0+10io 0pf+0w

The right way to do this is to compute the distance of the grid point
to the gaussian center, and then exclude all points with
abs(w*q/r)<1e-10.
  -75.1270   -71.8019    -4.1032    -8.0631     8.8411
342.660u 13.940s 7:55.40 75.0%  0+0k 0+5io 0pf+0w

1e-8:
  -75.1270   -71.8019    -4.1032    -8.0631     8.8411
351.700u 11.860s 7:10.08 84.5%  0+0k 1+9io 0pf+0w
Huh? Don't understand. Maybe the overhead with computing the distance
is simply too high. It's easy to speed up using a blocked grid, or
using the atom distances. Also, this is a pretty small molecule, and
there probably isn't enough distance to justify doing too much work
on each call.

1e-6:
  -75.1271   -71.8025    -4.1026    -8.0631     8.8411
344.340u 11.930s 7:03.60 84.1%  0+0k 0+11io 0pf+0w
Wow. I don't get it. 

Try just using w*q now: 1e-8:
  -75.1270   -71.8019    -4.1032    -8.0631     8.8411
322.710u 12.220s 10:30.47 53.1% 0+0k 2+8io 0pf+0w

Seems like abs(q)>1e-8 is the best way to do this.
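For the record, the winning screen is just a filter on the effective charge; a sketch assuming q = w*rho and illustrative (x, y, z, w, rho) point tuples:

```python
def screen_points(points, cutoff=1e-8):
    """Drop grid points whose effective charge q = w*rho is negligible.
    Per the timings above, this bare abs(q) test beat the fancier
    distance-weighted abs(w*q/r) screen.  Point layout is illustrative."""
    return [pt for pt in points if abs(pt[3] * pt[4]) > cutoff]
```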

I'm going to profile the code with the nuclear_vec call:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1872  181.030    0.097  181.030    0.097 PGBF.py:106(nuclear_vec)
        2   89.970   44.985  332.550  166.275 dft.py:66(getJ_dft_vec)
  3046848   32.020    0.000   32.020    0.000 GridPoint.py:48(xyzw)
  1518400   15.830    0.000   15.830    0.000 GridPoint.py:50(ddens)
  1546432   12.810    0.000   12.810    0.000 GridPoint.py:49(dens)
   196224    6.550    0.000    6.550    0.000 PGBF.py:114(amp)
     4695    5.110    0.001    5.110    0.001 Numeric.py:229(concatenate)
   116800    4.950    0.000   11.500    0.000 CGBF.py:138(amp)
     5927    2.230    0.000    2.230    0.000 __init__.py:40(dot)
     4672    2.040    0.000   13.590    0.003 GridPoint.py:78(set_bf_amps)
        3    1.460    0.487    6.560    2.187 AtomicGrid.py:173(allbfs)
     9186    1.240    0.000    1.590    0.000 DFunctionals.py:78(vwn_eps)
     9186    0.960    0.000    1.300    0.000 DFunctionals.py:85(vwn_deps)
      650    0.740    0.001    0.740    0.001 MolecularGrid.py:90(points)
    64302    0.690    0.000    0.690    0.000 DFunctionals.py:76(xx)


Here's the problem: nuclear_vec is called inside the PGBF unpacking
routines, which means that I'm still looping over all of the PGBF
functions before calling it; hence the many, many calls.

Before I code up that routine, I've rewritten the PGBF version
slightly, unpacking the grid only once and passing in separate
arrays for xc, yc, zc, wc, and qc. I'm going to see how much this
speeds things up; it should be a nice speed bump, I think. I should
have been tipped off by the fact that in the profiling, nuclear_vec
amounted to 181 sec total time, whereas getJ_dft_vec amounted to 333
sec. I believe that it is this difference that I've removed now.

  -75.1270   -71.8022    -4.1029    -8.0631     8.8411
199.070u 7.000s 4:04.75 84.1%   0+0k 0+11io 0pf+0w

Nice. Here's the profile:

     1872  166.130    0.089  166.130    0.089 PGBF.py:106(nuclear_vec)
   196224    7.020    0.000    7.020    0.000 PGBF.py:114(amp)
   116800    4.960    0.000   11.980    0.000 CGBF.py:146(amp)
     4695    4.550    0.001    4.550    0.001 Numeric.py:229(concatenate)
     4672    2.450    0.001   14.460    0.003 GridPoint.py:78(set_bf_amps)
     5927    1.970    0.000    1.970    0.000 __init__.py:40(dot)
        3    1.410    0.470    5.930    1.977 AtomicGrid.py:173(allbfs)
     9186    1.220    0.000    1.550    0.000 DFunctionals.py:78(vwn_eps)
     9186    0.770    0.000    1.110    0.000 DFunctionals.py:85(vwn_deps)
    64302    0.670    0.000    0.670    0.000 DFunctionals.py:76(xx)
        1    0.540    0.540    1.420    1.420 MolecularGrid.py:64(patch_atoms)
        2    0.350    0.175    3.010    1.505 DFunctionals.py:48(VWN)
    18037    0.350    0.000    0.350    0.000 Vec3.py:49(tuple)
    18037    0.340    0.000    0.690    0.000 Atom.py:51(pos)
        2    0.310    0.155  167.070   83.535 dft.py:66(getJ_dft_vec)

Only 1 sec in getJ_dft_vec, now!


Looks like it's actually slower with all of the calls inside the C
routine:
  -75.1270   -71.8022    -4.1029    -8.0631     8.8411
in 222 sec

Here's the profiling:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      650  187.000    0.288  187.090    0.288 CGBF.py:130(nuclear_vec)
   196224    7.300    0.000    7.300    0.000 PGBF.py:114(amp)
     4695    6.010    0.001    6.010    0.001 Numeric.py:229(concatenate)
   116800    5.560    0.000   12.860    0.000 CGBF.py:151(amp)
     4672    2.310    0.000   15.240    0.003 GridPoint.py:78(set_bf_amps)

Obviously I'm going back to the original one, since the code is a lot
easier. 


trying abs(q*w)>1e-4:
  -75.1097   -71.9004    -3.9789    -8.0714     8.8411
in 183.5 sec

Looks like the overhead isn't worth the multiply cost.


------------------2004-10-19------------------
Relicensed the entire code under the modified BSD license.

------------------2005-03-11-------------------
Removed get_nel() and get_enuke() from the hartree_fock module, and
made these instance methods of Molecule.

------------------2005-07-07--------------------
Moved the installation of cints, crys, and chgp into the PyQuante
directory. These routines now need to be called as PyQuante.cints, etc.

------------------2005-09-13--------------------
MAJOR: made 'charge' a property of Molecule (default=0).
Tried to make this change in a reasonably reliable manner
by putting a deprecation warning in Molecule.get_nel(charge).

Removed 'charge' from scf argument list in hartree_fock.py and dft.py

-----------------2005-10-13----------------------
Added Travis Oliphant's optimize.py package.

Cleaned up the EXX.py module in preparation for adding it.

Rearranged some of the entries in hartree_fock.py and dft.py. The
major change is that optional arguments are now largely passed 
through an **opts dictionary, which makes it much easier to 
pass these in. But this may break some existing code.

Wrote a C version of the three_center_1D function, which is much 
faster.

----------------2005-10-24------------------------
Started to work on a spin-averaged, open-shell DFT. 

Replaced mkdens with mkdens_spinavg in dft.py. This allows dft
calculations when nopen != 0. This should also be fairly easy to
extend if I want to do a fermi-dirac occupation, which will now
be a trivial extension.
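The occupation-weighted density matrix underneath all of this is D = sum_i f_i c_i c_i^T; a sketch in plain Python, not the actual mkdens_spinavg code, with orbital i assumed to be the i-th row of orbs:

```python
def mkdens_occs(orbs, occs):
    """Density matrix with fractional occupations:
    D[mu][nu] = sum_i f_i c[i][mu] c[i][nu].
    Sketch only; orbital i is assumed to be row i of orbs."""
    n = len(orbs[0])
    D = [[0.0] * n for _ in range(n)]
    for f, c in zip(occs, orbs):
        for mu in range(n):
            for nu in range(n):
                D[mu][nu] += f * c[mu] * c[nu]
    return D
```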

Need two test cases: (1) h2o, to make sure it's still the same, and 
(2) no_dft, to see whether it works.

MAJOR: Fixed a bug in the grid construction. When the spingrid
routines were called in AtomicGrid they were modifying the Lebedev
data structure, which means that the angular grids were slightly
different each time the shells were constructed. This manifested
itself in different energies for the h2o_dft case, depending upon
whether it was run as part of the test suite, where it gave -75.8769,
or alone, where it gave -75.8572. Both versions now give
-75.857635. Whew!

----------------2005-10-25------------------------
Created Molecule.get_closedopen() and Molecule.get_alphabeta()
functions. 

Modified the uhf module to use **opts options and to use the 
Molecule.get_alphabeta() function.

Added no_uhf to the test suite. 

Tagged everything with Release_1_4, in preparation for new point
release. 

*** Released v1.4.0. ***

Modified test suite to keep track of test failures.

Changed basis -> basis_data for imports from basis set data files.
There were too many variables that could have been named 'basis'
in the program, and I think this clarifies things.

Also removed all extraneous uses of basis imports (since getbasis
now uses 6-31G** as the default).

----------------2005-10-26------------------------
Created Molecule.add_basis() and Molecule.add_grid(). Both
use much nicer, **kwarg based inputs. Neither is used right
now, but the plan is to eventually use these as the default
constructors for grids and basis sets. Neither function returns
anything, but rather defines Molecule.basis and Molecule.grid.

---------------2005-10-28------------------------
Made some changes to the AtomicGrids module, and now the Pople
style SG1 grids are much closer to working. The AtomicGrids module
is also much nicer now.

---------------2005-11-04------------------------
Programmed up the Becke projection operator for patching grids
together. This has dramatically increased the accuracy of the
calculations. Here's a comparison of the old way of patching grids to
the new way, for H2 at different radii.

R/A      E_old/h          E_becke/h
0.650 -1.13002187576    -1.12533931817  
0.675 -1.13024829978    -1.12972578301  
0.700 -1.13973398848    -1.13277161916  
0.725 -1.14376880262    -1.1346726221   
0.750 -1.12933707909    -1.13559518374  
0.775 -1.12775123871    -1.13568116218  
0.800 -1.1290794449     -1.13505176912  
0.825 -1.12602677648    -1.13381068973  
0.850 -1.13026387361    -1.1320469249   
0.875 -1.12724445575    -1.12983713496  
0.900 -1.12701602849    -1.12724778462  
0.925 -1.12350721616    -1.12433676769  
0.950 -1.11993123631    -1.12115453167  
0.975 -1.11903716591    -1.11774506436  
1.000 -1.10636018922    -1.11414655987  

This is so dramatically better that I plan to release a new version of
the code with this feature activated much sooner than I normally
would. Still need to code up the hack Becke uses for patching atoms of
different types, but as soon as that's done I'll release a new version
of PyQuante. 
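For reference, the Becke projection works roughly like this sketch (homonuclear case only, so none of the size corrections yet; the iterated cutoff polynomial f(mu) = 1.5*mu - 0.5*mu^3 is Becke's, but the function and variable names here are mine):

```python
from math import sqrt

def becke_weights(point, centers):
    """Becke projection weights for one grid point (homonuclear case).
    The iterated polynomial builds a smooth step that partitions space
    among the atoms; the weights sum to 1 at every point."""
    def step(mu):
        for _ in range(3):             # Becke's k=3 iteration
            mu = 1.5 * mu - 0.5 * mu ** 3
        return 0.5 * (1.0 - mu)
    def dist(p, q):
        return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    cell = []
    for A in centers:
        pA = 1.0
        for B in centers:
            if B is A:
                continue
            mu = (dist(point, A) - dist(point, B)) / dist(A, B)
            pA *= step(mu)
        cell.append(pA)
    total = sum(cell)
    return [p / total for p in cell]
```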

Coded up the heteroatom corrections. Here are results for LiH:
R      No hetero      Heteroatom corrections
0.65  -7.47130290273  -7.46902627618
0.70  -7.54989675061  -7.54866165454
0.75  -7.60078188382  -7.61488157148
0.80  -7.65401308461  -7.67005929919
0.85  -7.70184453064  -7.71604102355
0.90  -7.75241970831  -7.75429643437
0.95  -7.77862190619  -7.78602878926
1.00  -7.79469184381  -7.81223566871

I'm also going to set this by default, although note that Pople
doesn't use these corrections. I'm mainly choosing them because they
"look" smoother, but this is by no means a rigorous comparison.

Also expanded the test suite to allow do_hf, do_dft, do_only_hf, etc,
flags to be passed in as keyword arguments.

---------------2005-11-08------------------------
Tested removing the spingrid corrections. They don't really do much
now that we have the Becke projection weights in there, and no one
else uses them.

Here's the H2 dissociation test:
       Spin      Nospin
0.650 -1.12534  -1.12527
0.700 -1.13277  -1.13271
0.750 -1.13560  -1.13555
0.800 -1.13505  -1.13502
0.850 -1.13205  -1.13203
0.900 -1.12725  -1.12726
0.950 -1.12115  -1.12120
1.000 -1.11415  -1.11423

Here's the LiH dissociation
       Spin      NoSpin
0.650 -7.46903  -7.47103
0.700 -7.54866  -7.55069
0.750 -7.61488  -7.61693
0.800 -7.67006  -7.67208
0.850 -7.71604  -7.71798
0.900 -7.75430  -7.75612
0.950 -7.78603  -7.78773
1.000 -7.81224  -7.81381

I'm going to take out the spingrid option, since I'm going to be
modifying the grid stuff a lot as I code up the Becke Poisson solver,
and I want the code to be as simple as possible. I would normally just
change the options, but in this case I want to make the constructors
easier. The spingrids routines are still in AtomicGrid.py if anyone
wants to use them.

Also updated the benchmark energies in the dft test cases. 

In a similar type of cleaning, I removed the setdens0 and dens0 type
functions and variables. This will break the Quest module, but that
was essentially broken anyway. I'm keeping the Quest module around for
historical reasons, but I should probably remove it.

Does pruning the points really speed up the code? The dft test suite
took 109.47 sec. Commenting out the pruning the suite takes 106.49
sec. I'm going to remove all of the pruning code to simplify the grid
code. 

Decided that the patch_atoms_becke worked well enough that I just
killed the old-fashioned patching scheme.

---------------2005-11-14------------------------
Wrote some test cases for the spin-averaged open-shell DFT.
Appears to work well.


---------------2005-11-14------------------------
Working on Fermi-Dirac occupations of DFT.

Removed the fdscf and fdhf routines (actually, just commented out).
Moved fdhf.py to fermi_dirac.py.

Testing closed shell system: h2_ft_dft.py.
Setting occs to:
 [1.0,0.0]       -1.1327  (correct energy)
 [0.99,0.01]     Diverges positive
 [0.999,0.001]   Diverges positive

Fixed a bug. Changed:
        d = c[i,:]
to
        d = c[i:i+1,:]
in mkdens_occs. The former is a vector, the latter is a matrix.
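The distinction is easy to demonstrate; with numpy (the behavior is the same in Numeric), the 1-D slice loses the row/column orientation that the density build needs:

```python
import numpy as np

c = np.array([[1.0, 2.0],
              [3.0, 4.0]])

v = c[0, :]       # shape (2,): a vector; transposing it is a no-op
m = c[0:1, :]     # shape (1, 2): a one-row matrix; m.T has shape (2, 1)

inner = np.dot(v, v)      # a scalar: the wrong thing for a density build
outer = np.dot(m.T, m)    # a (2, 2) outer product: the right thing
```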
Setting occs to:
 [1.0,0.0]       -1.1327  (correct energy)
 [0.999,0.001]   -1.1318
 [0.99,0.01]     -1.1233
 [0.9,0.1]       -1.0376

Now testing with Temps:
  T=1      -1.1327
  T=100    -1.1327
  T=10000  -1.1323
  T=100000 -0.6467

Test open shell system (Li)
  T=0      -7.3321
  T=1      -7.3332 (conv probs)
  T=1e2    -7.3332 (conv probs)
  T=1e4    -7.3084
  T=1e5    -7.0211
Conv problems go away when I tighten the tolerance in get_efermi.

Removed Vec3 stuff.

Added li_ft_dft and h2_ft_dft to the Test Suite.

---------------2005-11-21------------------------
Working on Fermi-Dirac occupations of HF.
Seems to be working. See h2_ft_hf.py test case.


---------------2005-12-05------------------------
Started working on density gradients and GGA functionals.

Fixed a bug in PGBF.grad().

Looks like density gradients now work.

---------------2005-12-06------------------------
Reworked the interface to rhf and dft, which, among other things,
involved removing the scf and dft_scf modules.

Added the EXX.py module to the repository. Reworked the interface so
that one no longer needs to instantiate a class to run the code.

Added test_exx.py module to the tests directory.

--------------2005-12-07-------------------------
Started looking at writing GGA functionals, but it's harder than 
I had initially thought.

The code to compute the gradients is now put into the grid point at
construction, so I don't have to keep passing in the option.

I put together xfuncs and cfuncs, two dictionaries that return the
appropriate DFT X or C function given a string. These can return
None for the correlation function in, say, S0.

------------2005-12-08---------------------------
Got spin-polarized versions of the S and VWN functionals
working. These functions are now called by default.

------------2005-12-12---------------------------
Got BLYP working over the weekend. Still have to look at some of the
higher-order terms to see how I need to change things.

------------2005-12-16---------------------------
Getting PBE working. Currently only have the first order terms
working for PBE and BLYP, but the energies seem to be pretty good thus
far. 

Currently trying to get a "readable" PBE working. 

Added an option where guess orbitals can be passed into rhf or dft.
Added an option where the code returns en,orbe,orbs from hf/dft
routines. Made sure that all of the test suite worked with this.

Updated the EN2 routine to use the UHF routines and to use calling
conventions consistent with the rest of the code.

------------2005-12-17---------------------------
Added hartree_fock.hf, which is an interface to both rhf and uhf.
Ultimately this will check spin_type to alternately call rohf for open
shell systems.

------------2005-12-18---------------------------
Got a 'readable' PBE correlation functional together. Verified that it
reproduces the correct results.

Started work on the ROHF routines

------------2005-12-19---------------------------
Renamed EXX.py to OEP.py, since it's a more appropriate name.

------------2005-12-25---------------------------
Molecules can now be constructed using atomic symbols instead
of atomic numbers in their constructor lists, e.g.

    h2o = Molecule('h2o',
                   [('O',(0.,0.,0.)),('H',(1.,0.,0.)),('H',(0.,1.,0.))],
                   units='Angstrom')

-----------2006-03-15----------------------------
Playing around with numpy-0.9.6 instead of Numeric.

Here's the test suite on Powerbook using Numeric: 347.1 - 363.1 sec
(range of multiple runs).

Wrote a wrapper routine that can import either from numpy or Numeric
based upon the value of use_numpy. This lets me use the switch to
modify the actions of the entire program.

I may have developed an error in H2/FT/DFT and Li/FT/DFT. Oh, I think
this is due to the inclusion of the entropy term.

Wow! Using numpy is a lot slower, at 615 sec!! Guess I'll stay with
Numeric for now.

While I'm mucking around, I'm going to switch back to Numeric, and
then run the entire test suite through a profiler.

Here's the profile:
   266290  162.990    0.001  250.450    0.001 CGBF.py:185(coulomb)
       40   44.420    1.110   44.850    1.121 Ints.py:159(get2JmK)
   398701   36.750    0.000   36.750    0.000 Numeric.py:330(dot)
   777728   35.230    0.000   35.230    0.000 PGBF.py:92(amp)
       49   34.880    0.712   35.580    0.726 Ints.py:122(getJ)
   456960   30.770    0.000   66.000    0.000 CGBF.py:110(amp)
    27181   23.070    0.001   23.070    0.001 Numeric.py:229(concatenate)
   149875   18.240    0.000   27.670    0.000 DFunctionals.py:386(vwn_eps)
  1049125   17.080    0.000   17.080    0.000 DFunctionals.py:383(vwn_xx)
  1097094   15.240    0.000   15.240    0.000 CGBF.py:47(norm)
   149875   15.150    0.000   22.800    0.000 DFunctionals.py:395(vwn_deps)
  1065160   14.650    0.000   14.650    0.000 CGBF.py:54(pnorms)
  1065160   14.620    0.000   14.620    0.000 CGBF.py:52(exps)
  1065160   14.510    0.000   14.510    0.000 CGBF.py:49(powers)
  1065160   14.500    0.000   14.500    0.000 CGBF.py:48(origin)

-----------2006-03-16----------------------------
Okay, found the mistake. Array elements in numpy are special types,
and scalar operations on these new types can be much slower. The big
killers were the grid routines in DFunctionals; by replacing
dens[i] with float(dens[i]), forcing conversion to a standard python
float object, the time went from 615 sec to 297 sec, which is even
faster than the old routines. One nice thing about numpy is that the
new setup routines correctly identify the math libraries, which were a
pain to compile properly using Numeric 23.1.
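The fix in isolation looks like this: indexing a numpy array yields a numpy scalar type, and one float() call per element avoids paying the numpy dispatch overhead on every subsequent scalar operation:

```python
import numpy as np

dens = np.array([0.1, 0.2, 0.3])

d_np = dens[1]         # a numpy scalar (numpy.float64): slow in tight loops
d_py = float(dens[1])  # a builtin float: cheap scalar arithmetic
```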

On sahp5351, the test suite took 81.8 sec using Numeric, and 74.5 sec
using numpy. Nice little speedup.

-----------2006-04-05----------------------------
Switched over optimize to use either numpy or Numeric via NumWrap.
This involved some minor changes to NumWrap.

Experimented with my own version of BFGS, which doesn't seem
to work as well as Travis's.

----------2006-04-06------------------------------
Included a simple Logger module. Replaced all output during runtime
to go into a file via this module. Is properly flushed, etc.

----------2006-09-01-------------------------------
Compiled and ran the code on a Macintel Mac Mini. Test suite ran in 99 sec. 
My mac laptop runs in 272 sec.

----------2007-02-06-------------------------------
Creating a new option NumWrap.test_numpy to allow eigenvectors to be
stored in columns (numpy style) rather than rows (Numeric style). Both
Numeric and numpy are supported in the new version, and the test
suites all run regardless of (numpy vecs)/(numeric vecs) or
(numpy)/(numeric) choices.
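A quick demonstration of the convention this switch papers over; in numpy the eigenvectors come back as the columns of v:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
w, v = np.linalg.eigh(A)
# numpy convention: eigenvector k is the *column* v[:, k], so that
# A @ v[:, k] == w[k] * v[:, k].  Numeric returned eigenvectors as rows.
```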

---------2007-11-14-------------------------------- 

Hatem Helal has implemented HF forces, and that code appears to be
working. I decided to celebrate by making a point release (1.6.0), to
satisfy users who don't like to rely on the SVN code.

Hatem's code is in PyQuante/forces.py.

Other new code in this release includes a better test suite
(Tests/UnitSweet, using the Python unittest module), better IO
handling (using the Python logging module), and a first pass at
object-oriented construction of Hamiltonians and the like (in
PyQuante/PyQuante2.py). 

--------2009-02-24------------------------------------
Releasing v1.6.2, which removes the shebangs (they make the FC
installation barf) and fixes a few more minor bugs along the way.

-----2010-09-02-----
* bfgrid is now a grid attribute; this is done to simplify the grid
  module in preparation for a rewrite.
* renamed allbfs() to make_bfgrid() in MolecularGrid and AtomicGrid
* put a bunch of fairly confusing code computing gradients for the
  dft grids into the MolecularGrid module.
* Created a new routine MG2 that contains the new molecular grids.

-----2011-01-11-----
Intended changes before the next PyQuante minor point release:
* Create a Defaults.py module that keeps program-wide default
  information; make sure that defaults can always be overridden by
  kwargs, but also make sure that anytime kwargs gets a default
  response, it's from Defaults and not something hardwired.
* Redo the integral code:
  - Separate one_e ints from two_e ints
  - Two different flags (C vs Python, and which package) control ints
    now. 
  - Make the libint call analogous to the cint or crys calls. The
    branch point may have to be moved to an earlier point in the code. 
  - Kill the fact in pyints, using math.factorial instead
  - Address Jussi's integral question in pyints/cints
  - Two e integral test suite
  - Code should all work even if Cython not installed
  - C-code compile without error messages
* Implement Jussi's ccbasis functions

Changes before next major point release:
* GGA:
  - BLYP
  - Consistent testing with libxc
  - LSD
* Numerical Frequencies & Thermochemistry
* Roothaan-type ROHF (+ GVB?)
* MINDO cleaning:
  - Remove verbosity from MINDO routines
  - Unify (or subclass) MINDO3 atoms with normal atoms
  - Use atom.get_nel_mindo() in Mindo routines rather than 
    atom.Z
  - Use molecule.get_nel() rather than MINDO3.get_nel() in mindo

Jussi wishlist:
* Extend the basis set definitions to handle shells of larger angular
 momentum. I would suggest doing this on-the-fly by using
 the libint-style algorithm

 for(int i=0; i<=am; i++) {
   int nx = am - i;
   for(int j=0; j<=i; j++) {
     int ny = i-j;
     int nz = j;
     add_function(nx,ny,nz);
   }
 }
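A direct Python transcription of that loop, which makes the generated (nx, ny, nz) ordering easy to check:

```python
def cartesian_powers(am):
    """Python version of the libint-style loop above: the (nx, ny, nz)
    cartesian powers for a shell of angular momentum am."""
    funcs = []
    for i in range(am + 1):
        nx = am - i
        for j in range(i + 1):
            funcs.append((nx, i - j, j))
    return funcs
```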

* Obara-Saika evaluation of 1-electron integrals.
 I did this for my own code that is under development; it sped up
 the evaluation by something like 25-30x. The most important bit is
 the nuclear attraction integrals: with big basis sets the THO routine
 is awfully slow. An OS routine will do it a *lot* faster.

* Implementation of a spherical harmonics basis.
 This is useful for basis sets with high angular momentum. The
 generation of the conversion matrices is a simple task (a couple of
 hundred lines of C++ code); one just has to convert the integrals to
 the spherical harmonics basis after they have been evaluated with
 cartesians. The memory requirements can be reduced by a significant
 amount, but the cost is the necessity of doing the 2-electron
 integral transforms.

* Direct formation of the Fock matrix
 I've done this in my own code that uses libint; however, the current
 implementation is a bit messy and a lot slower than using tabled
 integrals.
 Why is this important? For instance, in an aug-cc-pVTZ calculation of
 the water dimer (184 basis functions) the ERIs take 1 GB 158 MB of
 memory. Going to aug-cc-pVQZ the memory requirement is already
 14 GB 85 MB.

* Density fitting basis sets for pure DFT.
 This makes pure DFT calculations faster by at least an order of
 magnitude, since 2-electron integrals don't need to be calculated.

Someday/maybe
* Integrate Hatem's leapfrog.py with the stuff in Dynamics.py
* Divide-and-conquer linear scaling 
* DIIS in uhf
* Forces/Geometry Optimization
* Symmetry
* Hybrid functionals
* CI/MCSCF
* MPn/CCSD

