Order Details:

- The coursework should be written in C99, and the results should be put into a Word file.
- Answer all the questions on the question sheet (I will send it to you later), except for those that specifically state they must be done on the school's computers.
- Put all the functions and mains into separate .c files (i.e. each .c file contains only one function).
- Copy and paste the C code into a separate Word file.

M 3/4/5 SC, Exercise #3, Dan Moore. Due: 22h, 1 May 2015. (Exercise #3-I, 10 March 2014)

The results, discussion and programs should be submitted as a single Word or Text file by digitally signed email before 22:00h on Friday 1 May 2015. C99 Version!

Background and discussion

The discrete Complex Fourier series represents the complex data $x_j$ at $N$ discrete points as the sum of $N$ exponential functions of complex amplitude $y_k$:

$$x_j = \sum_{k=0}^{N-1} y_k \exp\!\left(\frac{2\pi I\, j k}{N}\right), \qquad 0 \le j < N \quad (\text{where } I \cdot I = -1). \tag{1}$$

This operation may be thought of as a complex matrix times a complex vector: $x = C_N\, y$, where $x$ and $y$ are complex vectors of size $N$ such that $x = [x_0, x_1, \dots, x_{N-1}]$ and $y = [y_0, y_1, \dots, y_{N-1}]$, and $C_N$ is an $N$ by $N$ complex matrix whose $(j, k)$th element is $W_N^{\,j \cdot k}$ ($0 \le j, k < N$), where

$$W_N = \exp\!\left(\frac{2\pi I}{N}\right) = \cos\!\left(\frac{2\pi}{N}\right) + \sin\!\left(\frac{2\pi}{N}\right) \cdot I. \tag{2}$$

As such, it will take $N^2$ complex multiplications and $N(N-1)$ complex additions to find $x$ given $y$.

However, if the $N/2$ 'even' data points are put into a size $N/2$ complex vector $y_e = [y_0, y_2, \dots, y_{N-2}]$ and the $N/2$ 'odd' data points are put into a size $N/2$ complex vector $y_o = [y_1, y_3, \dots, y_{N-1}]$, it can be shown that if $x_e = C_{N/2}\, y_e$ and $x_o = C_{N/2}\, y_o$ then

$$x_j = [x_e]_j + W_N^{\,j}\, [x_o]_j \quad \text{and} \quad x_{j+N/2} = [x_e]_j + W_N^{\,j+N/2}\, [x_o]_j \quad \text{for } 0 \le j < N/2. \tag{3}$$

Another way of stating this is to say that the $N$ by $N$ complex matrix $C_N$ can be factored into the product of three matrices:

$$C_N = \begin{bmatrix} I_{N/2} & D_{N/2} \\ I_{N/2} & E_{N/2} \end{bmatrix} \cdot \begin{bmatrix} C_{N/2} & 0 \\ 0 & C_{N/2} \end{bmatrix} \cdot P^{eo}_N, \tag{4}$$

where $I_{N/2}$ is the $N/2$ identity matrix, $D_{N/2}$ is the diagonal matrix with the diagonal entries $W_N^{\,j}$ for row $j$, $E_{N/2}$ is the diagonal matrix with the diagonal entries $W_N^{\,j+N/2}$ for row $j$, and $P^{eo}_N$ is the even-odd permutation matrix that takes $[y_0, y_1, \dots, y_{N-1}]$ to $[y_0, y_2, y_4, \dots, y_{N-2}, y_1, y_3, \dots, y_{N-1}]$! This factorization of the Discrete Complex Fourier Series matrix has been discovered several times and is known as the Danielson-Lanczos Lemma.
If we use this factorization, a single $N$ by $N$ complex matrix multiply is replaced by two $(N/2)$ by $(N/2)$ matrix multiplies and $N$ extra complex additions and multiplications. This reduces the total complex operation count by almost a factor of 2! Extracting the even and odd components of $y$ can be done 'at no cost' by clever indexing. Multiplying the results of the two $C_{N/2}$ matrix-vector products by the diagonal matrices takes just $N$ complex additions and complex multiplications. If $N/2$ is an even number, this algorithm can be applied recursively using $C_{N/4}$, $C_{N/8}$, $C_{N/16}$ until $N/2^m$ is an odd number. Note that $C_1 = 1$! So the recursive reduction can stop at this point. This algorithm is one of the Fast Fourier Transforms (FFTs) frequently mentioned in Numerical Analysis literature.

The Discrete Complex Fourier transform calculates the $N$ complex Fourier coefficients $y_k$ from the $N$ complex data points $x_j$. From equation (1) this process is another complex matrix-vector multiplication:

$$y = C_N^{-1}\, x. \tag{5}$$

It can be shown that $C_N \cdot (C_N)^{*} = N\, I_N$. That is, the complex conjugate of matrix $C_N$ is its inverse to within a numerical constant. So to calculate the Complex Fourier transform $y = C_N^{-1}\, x$, we only need to calculate $D_{N/2}^{*}$ and $E_{N/2}^{*}$ (why?). This is easy to do by exploiting the fact that

$$(W_N^{*})^{k} = (W_N)^{N-k}, \tag{6}$$

so it is easy to modify a Discrete Complex Fourier Series code to calculate the Discrete Fourier Transform.

The final M3SC Project

Implement the Danielson-Lanczos Algorithm as described above to calculate the Discrete Complex Fourier Series efficiently. This version assumes you are using C99 to work directly with the complex numbers and the complex functions as defined in the C99 header file <complex.h>:

1. Write a C function to return a pointer to a vector of type complex double of size N with the entries:
{ cos(0) + I sin(0), cos(2π/N) + I sin(2π/N), cos(4π/N) + I sin(4π/N), cos(6π/N) + I sin(6π/N), …, cos(2π(N-1)/N) + I sin(2π(N-1)/N) }
Its prototype should be: complex double *MakeWpowers(int). This vector should be useful when you come to write FastDFS and FastDFT, as it eliminates the need to calculate powers of $W_N$ when you implement equation (3).

2. Write a C function to calculate the complex matrix-vector product $x = C_N\, y$ (Discrete Fourier Series) recursively using the Danielson-Lanczos Algorithm described above. The prototype for this function should be:
void FastDFS(complex double *x, complex double *y, complex double *w, complex double *Wp, int N, int skip)
where:
- *x is a pointer to the type complex double array to hold x from x[0] to x[N-1];
- *y is a pointer to the type complex double array holding y from y[0] to y[N-1];
- *w is a pointer to the type complex double array w[0] to w[your choice!] that can be used as temporary vector storage for intermediate work arrays (think xe, yo etc.) to eliminate the need to call malloc() from inside FastDFS(…);
- *Wp is a pointer to the complex vector of powers of $W_N$ created in the function MakeWpowers(int) defined in section 1;
- N is the size of the desired transform;
- skip is the spacing between the transform points. Skips are useful for multi-dimensional transforms and in exploiting the recursive structure of the DCT to minimize the movement of data.
I suggest using the following strategy inside function FastDFS:
(i) If N is odd, calculate $x = C_N\, y$ directly. [N = 1, x0 = y0!]
(ii) If N is even, load the even elements and odd elements of y into separate parts of w and call FastDFS twice for transforms of size N/2. Then calculate x using equation (3).
[Hints:
1. Write a C function to calculate the matrix $C_N$ directly for any N.
2. Use this and a general matrix-vector multiply function to generate synthetic data to help you find any logic errors.
3. Debug FastDFS for N = 1, then debug FastDFS for N = 2, and then N = 4 (remember to call FastDFS from within FastDFS with N/2 as the size). (You can use skip at this point to reduce data movement!)
4. Use the direct $C_N$ function from Hint #1 to generate the y vector required to have $x_i = i$. Use these data to demonstrate that FastDFS works for N = 8, 16, 32, etc.]

3. Write a C function to calculate the complex matrix-vector product $y = C_N^{-1}\, x$ (the Discrete Fourier Transform) using the complex conjugate of the Danielson-Lanczos Algorithm described above. The prototype for this function should be:
void FastDFT(complex double *x, complex double *y, complex double *w, complex double *Wp, int N, int skip)

4. How many complex floating point arithmetic operations are required to calculate $x = C_2\, y$? How many real floating point operations? [Keep track of the additions and the multiplications separately!] Call these numbers $F^{+}_{c2}$, $F^{*}_{c2}$ (the complex operation count) and $F^{+}_{r2}$, $F^{*}_{r2}$ (the real operation count). If $F^{+}_{cN}$, $F^{*}_{cN}$, $F^{+}_{rN}$, $F^{*}_{rN}$ are known, what are $F^{+}_{c2N}$, $F^{*}_{c2N}$, $F^{+}_{r2N}$, $F^{*}_{r2N}$ when you use the Danielson-Lanczos algorithm? Hence or otherwise calculate these 4 numbers from $N = 2$ up to $N = 2^{20}$ and present them in a suitable table. {You may use C, Maple, Excel or any program for this part!} What is the best fit of these numbers to the four functions $F(N) = A_2\, N \log_2 N + B_2\, N$ for $F^{+}_{cN}$, $F^{*}_{cN}$, $F^{+}_{rN}$, $F^{*}_{rN}$ (four values of $A_2$ and $B_2$: $A^{+}_{2c}$, $A^{*}_{2c}$, $A^{+}_{2r}$, $A^{*}_{2r}$ etc.)? Can you prove or demonstrate what values $A_2$ and $B_2$ should take for any value of N for this algorithm?

5. Determine the time taken by your code to run FastDFS for $N = 2, 4, 8, 16, \dots, 2^{20}$. [You may use compiler flags and OpenMP to optimize your code.]
For what value of N is the fast transform twice as fast as a direct complex matrix-vector multiply? For what value of N is it 10 times faster? How much do the overheads of indexing addresses and of recursive function calls reduce the real speed of this algorithm compared to its theoretical maximum? What is the actual speed in megaflops of your code for $x = C_N\, y$ for each value of N? How does this figure compare with the speed achieved by a matrix-vector multiplication code from your last project? Plot the speed of your code in megaflops vs $\log_2 N$.

6. Add extra C code to your FastDFS(…) and FastDFT(…) functions to deal with the cases N = 3, 6, 12, 24, 48, …. Modify your complex double *MakeWpowers(int N) function (if necessary!) to generate the data for $(W_N)$ when N is of the form $N = 3 \cdot 2^m$. Calculate $F^{+}_{c3}$, $F^{*}_{c3}$, $F^{+}_{r3}$, $F^{*}_{r3}$. What are the floating point operation counts for the cases $N = 3, 6, 12, 24, 48, \dots, 3 \cdot 2^{19}$? What are the best values of $A^{+,*}_{3}$ and $B^{+,*}_{3}$ if equation (4) describes $F^{+,*}_{C/R}$? Plot the 4 $F^{+,*}_{N}$ values divided by N to get the average number of floating point operations per point vs $\log_2 N$ for the two series $N = (1\ \&\ 3) \cdot 2^m$.

7. Since $(W_N)^0 = 1$ and $(W_N)^{N/2} = -1$, the floating point operation count for implementing equation (3) can be reduced to $N-2$ complex multiplications and $N$ complex additions. The case $x = C_4\, y$ can be performed directly using only 16 real additions (with careful coding). How do these efficiencies reduce the total floating point operation counts from $N = 2$ up to $N = 2^{20}$ and change $A_2$ and $B_2$? Can you reduce the operation counts for $x = C_3\, y$ with careful coding? How does this affect the floating point operation counts for the cases $N = 3, 6, 12, 24, 48, \dots, 3 \cdot 2^{19}$ (show tables!)? Plot the speed in megaflops vs $\log_2 N$ for your improved code for all of these cases. Add the points for the 'improved algorithms' to your plot for Q6 to make another plot.

The discrete Fast Fourier Transform and Series of size 2N may be used directly to perform a Fast Discrete Sine transform of a vector S of size N-1 by:
a. Let Real Part($x_j$) = 0 for $0 \le j < 2N$;
b. Let Imag. Part($x_j$) = $[S]_j$ for $1 \le j \le N-1$; (Set Imag. Part($x_0$) = Imag. Part($x_N$) = 0!)
c. Let Imag. Part($x_{2N-j}$) = $-[S]_j$ for $0 < j < N$;
d. Run your FastDFT function: $y = C_N^{-1}\, x$;
e. Let $[T]_j$ = Real Part($y_j$) for $0 < j < N$.
It should be the case that the vector $T = 2\, S_N\, S$. Use your code from Exercise #2 to demonstrate this [and debug your Fast Sine transform function].

8. Write a void FastSINE(double *x, double *y, int N) function to implement the algorithm described above. Compare it for accuracy and speed with your best Matrix Multiply function from Exercise #2 for the values N = 16, 24, 32, 48, … until you run out of time or storage. [Stop whenever either approach takes more than 100 seconds to perform $S_N\, S$.] Do not include the time to calculate the matrix $S_N$ when comparing the direct matrix-vector multiply with your FastSINE() function. For what value of N is your FastSINE() function faster than the direct matrix-vector multiply? For what values of N is it 4 times faster? 10 times faster? (If any?)

9. Use your FastSINE() function in your code to solve the Poisson's equation set out in Exercise 2. What are the new limits for the size of the 1-D and 2-D problem you can solve using the FFT approach?

Mastery (worth up to 5% extra for M3SC): Add extra C code to your FastDFS(…) and FastDFT(…) functions to deal with the cases N = 5, 10, 20, 40, 80, …. Modify your complex double *MakeWpowers(int N) function (if necessary!) to generate the data for $(W_N)$ when N is of the form $N = 5 \cdot 2^m$. Calculate $F^{+}_{c5}$, $F^{*}_{c5}$, $F^{+}_{r5}$, $F^{*}_{r5}$. What are the floating point operation counts for the cases $N = 5, 10, 20, 40, 80, \dots, 5 \cdot 2^{18}$? What are the best values of $A^{+,*}_{3}$ and $B^{+,*}_{3}$ and $A^{+,*}_{5}$ and $B^{+,*}_{5}$ if equation (4) describes $F^{+,*}_{C/R}$? Plot the 4 $F^{+,*}_{N}$ values divided by N to get the average number of floating point operations per point vs $\log_2 N$ for the three series $N = (1,\ 3\ \&\ 5) \cdot 2^m$. Compare and discuss. Time your 2-D and 3-D Poisson Solvers for the new values of N = 5, 10, 20, 40, 80, …. What are the largest 2-D and 3-D cases you can run on the PCs in the MLC and still get a solution in 100 seconds?
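The operation-count tables requested for the series N = (1, 3 & 5) * 2^m all follow from the same doubling rule stated in the background: a transform of size 2N costs two transforms of size N plus 2N extra complex multiplications and 2N extra complex additions. The sketch below tabulates that recurrence. It is a starting point under stated assumptions, not the expected answer: the names `OpCount`, `direct_count`, `double_count` and `dl_count` are hypothetical, a complex multiply is assumed to cost 4 real multiplications plus 2 real additions, a complex add is assumed to cost 2 real additions, the N = 1 case is taken as free (x0 = y0), and none of the savings from question 7 are applied.

```c
#include <assert.h>

/* Operation counts: complex adds/multiplies and real adds/multiplies. */
typedef struct {
    long long cadd, cmul, radd, rmul;
} OpCount;

/* Cost of the direct product x = C_n y for the odd base size n:
   n*n complex multiplies and n*(n-1) complex adds; n = 1 is free. */
OpCount direct_count(long long n)
{
    assert(n >= 1);
    OpCount c = {0, 0, 0, 0};
    if (n == 1)
        return c;                      /* C_1 = 1, so x0 = y0 costs nothing */
    c.cmul = n * n;
    c.cadd = n * (n - 1);
    c.rmul = 4 * c.cmul;               /* assumption: 4 real mults per complex mult */
    c.radd = 2 * c.cadd + 2 * c.cmul;  /* 2 per complex add + 2 per complex mult    */
    return c;
}

/* One Danielson-Lanczos step: cost of size 2*n given the cost of size n. */
OpCount double_count(OpCount half, long long n)
{
    long long full = 2 * n;            /* extra complex mults and adds in equation (3) */
    OpCount c;
    c.cmul = 2 * half.cmul + full;
    c.cadd = 2 * half.cadd + full;
    c.rmul = 2 * half.rmul + 4 * full;
    c.radd = 2 * half.radd + 2 * full + 2 * full;
    return c;
}

/* Cost for N = base * 2^m, where base is the odd starting size (1, 3, 5, ...). */
OpCount dl_count(long long base, int m)
{
    assert(m >= 0);
    OpCount c = direct_count(base);
    long long n = base;
    for (int i = 0; i < m; i++) {
        c = double_count(c, n);
        n *= 2;
    }
    return c;
}
```

For example, `dl_count(1, 2)` gives the counts for N = 4 and `dl_count(3, 1)` those for N = 6. Looping m from 0 upward and printing the four fields, or the fields divided by N, produces the raw numbers for the tables and plots. Changing the assumed cost of a complex multiply only requires editing the two lines that set `rmul` and `radd`.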