Sobel Vivado HLS Kernel using AXI Stream interface

In our previous post we designed a Sobel filter HLS kernel that used the full AXI4 interface for its data transfers. We wanted to find out whether the AXI4-Stream protocol improves the performance of our application. Using this protocol requires a DMA controller for every port that uses the stream interface.

The project is here.

We start from the HLS kernel. We used the ap_axiu data type for the streamed input and output transfers, together with the read and write methods of the stream class (hls_stream.h).
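To make the streaming pattern concrete, here is a minimal sketch that compiles with a plain C++ compiler. It does not use the real HLS headers: `AxiByte` is a hypothetical stand-in for an 8-bit `ap_axiu` beat (only the data and last fields are modelled), `Stream` is a stand-in for `hls::stream` that offers the same `read()`/`write()` style of access, and `stream_kernel` is an illustrative per-pixel pass-through, not our Sobel kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

// Stand-in for one AXI4-Stream beat (assumption: only data and last are modelled;
// the real ap_axiu type also carries keep, strb, user, id and dest).
struct AxiByte {
    std::uint8_t data;  // TDATA payload, one pixel
    bool last;          // TLAST, marks the final beat of a transfer
};

// Stand-in for hls::stream: a FIFO with read() and write() accessors.
template <typename T>
class Stream {
public:
    void write(const T& v) { fifo_.push(v); }
    T read() {
        T v = fifo_.front();
        fifo_.pop();
        return v;
    }
    bool empty() const { return fifo_.empty(); }

private:
    std::queue<T> fifo_;
};

// Hypothetical kernel: reads n beats, inverts each pixel (NOT the Sobel operator),
// and writes n beats, asserting TLAST only on the final one.
void stream_kernel(Stream<AxiByte>& in, Stream<AxiByte>& out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        AxiByte beat = in.read();
        AxiByte result;
        result.data = static_cast<std::uint8_t>(255 - beat.data);
        result.last = (i == n - 1);  // end-of-packet marker for the receiving side
        out.write(result);
    }
}
```

The point of the sketch is the shape of the loop: one read and one write per pixel, with the last flag set on the final output beat so that the receiving DMA channel can tell where the packet ends.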

We won’t get into much detail about the other parts of the code because it’s the same as before with some minor differences due to the different type of interface.

After we export the new IP, we have to connect it to the Zynq system.

[Figure: sobel_filter_system_stream]

The DMA controller now needs some tinkering in order to work properly.


  1. Disable the Scatter Gather Engine
  2. Change the Width of the Buffer Length Register to 23 so that a single call can transfer 2^22 bytes.
  3. Change the Stream Data Width to 8 because we use unsigned chars
  4. Change the Max Burst Size to 256
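The choice of 23 bits in step 2 can be checked with a little arithmetic. The sketch below assumes that a length register of w bits can describe at most 2^w - 1 bytes per transfer; the helper names are hypothetical and only illustrate the calculation.

```cpp
#include <cassert>
#include <cstdint>

// Largest byte count that fits in a length register of `width_bits` bits
// (assumption: the limit is 2^width_bits - 1).
constexpr std::uint64_t max_transfer_bytes(unsigned width_bits) {
    return (std::uint64_t{1} << width_bits) - 1;
}

// True when a transfer of `bytes` bytes can be issued in a single call.
constexpr bool fits_in_one_transfer(std::uint64_t bytes, unsigned width_bits) {
    return bytes <= max_transfer_bytes(width_bits);
}
```

Under that assumption a 22-bit register tops out one byte short of 2^22, which is why the width has to be 23 for a 2^22-byte transfer.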

After we generate the bitstream and export the hardware we have to write the code that uses this accelerator.

The main parts of the program are:

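int init_dma(){
    XAxiDma_Config *CfgPtr;
    int status;

    /* Look up the hardware configuration of the DMA instance */
    CfgPtr = XAxiDma_LookupConfig(XPAR_AXI_DMA_0_DEVICE_ID);
    if(!CfgPtr){
        print("Error looking for AXI DMA config\n\r");
        return XST_FAILURE;
    }
    status = XAxiDma_CfgInitialize(&AxiDma, CfgPtr);
    if(status != XST_SUCCESS){
        print("Error initializing DMA\n\r");
        return XST_FAILURE;
    }
    /* Check for scatter gather mode, which we disabled in the IP */
    if(XAxiDma_HasSg(&AxiDma)){
        print("Error DMA configured in SG mode\n\r");
        return XST_FAILURE;
    }
    /* Disable interrupts, we use polling mode */
    XAxiDma_IntrDisable(&AxiDma, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
    XAxiDma_IntrDisable(&AxiDma, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DMA_TO_DEVICE);

    return XST_SUCCESS;
}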


The initialization of the kernel is:

XSobel SBL;

status = XSobel_CfgInitialize(&SBL, &SBL_Config);
if(status != XST_SUCCESS){
    xil_printf("Error: example setup failed\r\n");
    return XST_FAILURE;
}

// The interrupts are not actually connected, so disable them.
XSobel_InterruptDisable(&SBL, 1);


To apply the filter to the whole image, we had to send it in chunks no larger than the Zedboard's BRAM, so we execute the kernel 8 times.

// Loop body, executed once per chunk index j

// start the accelerator
XSobel_Start(&SBL);

// transfer the input chunk to the Vivado HLS block
status = XAxiDma_SimpleTransfer(&AxiDma, (unsigned int) input + j*128*1024, dma_size_input, XAXIDMA_DMA_TO_DEVICE);
if (status != XST_SUCCESS) {
    xil_printf("Error: DMA transfer of the input to Vivado HLS block failed\n");
    return XST_FAILURE;
}
// poll until the send channel is idle
while (XAxiDma_Busy(&AxiDma, XAXIDMA_DMA_TO_DEVICE)) ;

xil_printf("\rSend input done\r\n");

// get results from the Vivado HLS block
status = XAxiDma_SimpleTransfer(&AxiDma, (unsigned int) output + j*128*SIZE + SIZE, dma_size_output, XAXIDMA_DEVICE_TO_DMA);
if (status != XST_SUCCESS) {
    xil_printf("Error: DMA transfer from Vivado HLS block failed\n");
    return XST_FAILURE;
}
// poll until the receive channel is idle
while (XAxiDma_Busy(&AxiDma, XAXIDMA_DEVICE_TO_DMA)) ;
xil_printf("\rReceive results done\r\n");

Note that we have to start the accelerator in every iteration.
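The buffer offsets in the transfer calls can be sanity-checked in isolation. The sketch below is plain C++ with hypothetical helper names, and it assumes that SIZE is the image width of 1024 bytes (one byte per pixel), that each chunk is 128 rows, and that there are 8 chunks; these values are read off the expressions in the snippet above rather than taken from the project sources.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t SIZE = 1024;           // assumed image width in bytes
constexpr std::size_t ROWS_PER_CHUNK = 128;  // from the j*128*1024 stride
constexpr std::size_t NUM_CHUNKS = 8;        // the kernel is executed 8 times

// Byte offset of chunk j inside the input buffer.
constexpr std::size_t input_offset(std::size_t j) {
    return j * ROWS_PER_CHUNK * SIZE;
}

// Byte offset of chunk j inside the output buffer; results start one row later,
// mirroring the expression j*128*SIZE + SIZE.
constexpr std::size_t output_offset(std::size_t j) {
    return j * ROWS_PER_CHUNK * SIZE + SIZE;
}

// Total number of input bytes covered by all chunks.
constexpr std::size_t image_bytes() {
    return NUM_CHUNKS * ROWS_PER_CHUNK * SIZE;
}
```

With these assumptions each chunk is 128 KiB and the 8 chunks cover a 1 MiB image. The extra SIZE in the output offset shifts every result down by one row, which we take to account for the border row that a 3x3 window cannot produce.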


As we can see, using the AXI4-Stream interface we achieve a 3x speedup over the -O3 version.

1.62s       |0.09s       |0.03s      |0.03s



Patsiatzis Nikolaos

Katsaros Nikolaos

