Abstract:
Methods are disclosed for improving data processing performance in a processor using on-chip local memory in multiple processing units. According to an embodiment, a method of processing data elements in a processor using a plurality of processing units, includes: launching, in each of the processing units, a first wavefront having a first type of thread followed by a second wavefront having a second type of thread, where the first wavefront reads as input a portion of the data elements from an off-chip shared memory and generates a first output; writing the first output to an on-chip local memory of the respective processing unit; and writing to the on-chip local memory a second output generated by the second wavefront, where input to the second wavefront comprises a first plurality of data elements from the first output. Corresponding system and computer program product embodiments are also disclosed.