Sobel Vivado HLS Kernel using AXI Stream interface

In our previous post we designed a Sobel filter HLS kernel that used the full AXI4 interface for its data transfers. Here we explore whether the AXI4-Stream protocol improves the performance of the application. To use this protocol, every port with a stream interface has to be driven by a DMA controller.

The project is here.

We start from the HLS kernel. The streamed input and output ports use the ap_axiu data type, and the transfers are performed with the read and write methods of the stream class (hls_stream.h).

We won’t get into much detail about the other parts of the code because it’s the same as before with some minor differences due to the different type of interface.

After we export the new IP, we have to connect it to the Zynq system.

[Figure: sobel_filter_system_stream]

The DMA controller now needs some tinkering in order to work properly.


  1. Disable the Scatter Gather Engine
  2. Change the Width of the Buffer Length Register to 23. A 23-bit length field can describe up to 2^23 - 1 bytes, which is enough to move 2^22 bytes in a single call (a 22-bit field stops one byte short, at 2^22 - 1).
  3. Change the Stream Data Width to 8 because we use unsigned chars
  4. Change the Max Burst Size to 256
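The choice of 23 bits in step 2 can be checked with a little arithmetic. The sketch below is plain host-side C, not part of the project; the helper names max_transfer_bytes and fits_in_one_transfer are hypothetical and only illustrate that an n-bit length register holds at most 2^n - 1 bytes.

```c
#include <stdint.h>

/* Largest byte count that a length register of width_bits bits can hold.
 * Valid for widths from 1 to 31. Hypothetical helper, for illustration only. */
uint32_t max_transfer_bytes(unsigned width_bits)
{
    return (UINT32_C(1) << width_bits) - 1u;
}

/* Returns 1 if nbytes can be moved by a single simple-mode transfer, else 0. */
int fits_in_one_transfer(uint32_t nbytes, unsigned width_bits)
{
    return nbytes > 0 && nbytes <= max_transfer_bytes(width_bits);
}
```

With these helpers, a 2^22-byte transfer fits when the width is 23 and does not fit when the width is 22.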

After we generate the bitstream and export the hardware we have to write the code that uses this accelerator.

The main parts of the program are:

int init_dma(){
 XAxiDma_Config *CfgPtr;
 int status;

 CfgPtr = XAxiDma_LookupConfig(XPAR_AXI_DMA_0_DEVICE_ID);
 if(!CfgPtr){
  print("Error looking for AXI DMA config\n\r");
  return XST_FAILURE;
 }
 status = XAxiDma_CfgInitialize(&AxiDma,CfgPtr);
 if(status != XST_SUCCESS){
  print("Error initializing DMA\n\r");
  return XST_FAILURE;
 }
 //check for scatter gather mode
 if(XAxiDma_HasSg(&AxiDma)){
  print("Error DMA configured in SG mode\n\r");
  return XST_FAILURE;
 }
 /* Disable interrupts, we use polling mode */
 XAxiDma_IntrDisable(&AxiDma, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
 XAxiDma_IntrDisable(&AxiDma, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DMA_TO_DEVICE);

 return XST_SUCCESS;
}


The initialization of the kernel is

XSobel SBL;

status = XSobel_CfgInitialize(&SBL, &SBL_Config);
 if(status != XST_SUCCESS){
  xil_printf("Error: example setup failed\r\n");
  return XST_FAILURE;
 }

 // the interrupt lines are not actually connected, so keep interrupts disabled
 XSobel_InterruptDisable(&SBL, 1);


To apply the filter to the whole image we have to send it in chunks no larger than the BRAM available on the Zedboard, so we execute the kernel 8 times. The code below is the body of that loop, with j as the chunk index.

 //start the accelerator
 XSobel_Start(&SBL);

 //transfer the input chunk to the Vivado HLS block
 status = XAxiDma_SimpleTransfer(&AxiDma, (unsigned int) input + j*128*1024, dma_size_input, XAXIDMA_DMA_TO_DEVICE);
 if (status != XST_SUCCESS) {
  xil_printf("Error: DMA transfer of the input chunk to Vivado HLS block failed\n");
  return XST_FAILURE;
 }
 while (XAxiDma_Busy(&AxiDma, XAXIDMA_DMA_TO_DEVICE)) ;

 xil_printf("\rSend input done\r\n");

 //get results from the Vivado HLS block
 status = XAxiDma_SimpleTransfer(&AxiDma, (unsigned int) output+j*128*SIZE+SIZE, dma_size_output, XAXIDMA_DEVICE_TO_DMA);
 if (status != XST_SUCCESS) {
  xil_printf("Error: DMA transfer from Vivado HLS block failed\n");
  return XST_FAILURE;
 }
 while (XAxiDma_Busy(&AxiDma, XAXIDMA_DEVICE_TO_DMA)) ;
 xil_printf("\rReceive results done\r\n");

Note that we have to start the accelerator in every iteration.
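The address arithmetic in the two transfers can be sketched on its own. This is a minimal host-side illustration, assuming SIZE is 1024; that value is not stated in this post, but it is consistent with 8 chunks of 128*1024 bytes covering a 1024x1024 8-bit image. The names CHUNK_ROWS, NUM_CHUNKS, input_offset and output_offset are hypothetical.

```c
#include <stddef.h>

#define SIZE 1024        /* assumed image width and height in pixels */
#define CHUNK_ROWS 128   /* rows per chunk, the 128 in the offsets above */
#define NUM_CHUNKS 8     /* number of kernel runs */

/* Byte offset of chunk j inside the input image (input + j*128*1024). */
size_t input_offset(size_t j)
{
    return j * CHUNK_ROWS * 1024;
}

/* Byte offset where the results of chunk j are stored
 * (output + j*128*SIZE + SIZE): same chunk stride, shifted by one row. */
size_t output_offset(size_t j)
{
    return j * CHUNK_ROWS * SIZE + SIZE;
}
```

Under that assumption the eight input chunks tile the image exactly, and each result chunk lands one image row (SIZE bytes) later than its input chunk.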


As the timings below show, with the AXI4-Stream interface we achieve a 3x speedup over the -O3 software version.

1.62s       |0.09s       |0.03s      |0.03s
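The speedup figure is a simple ratio of run times. The sketch below assumes that the 0.09 s entry is the -O3 software time and that 0.03 s is the stream version, which is what the 3x claim implies; the function name speedup is hypothetical.

```c
/* Ratio of a baseline run time to an accelerated run time, both in seconds. */
double speedup(double baseline_s, double accelerated_s)
{
    return baseline_s / accelerated_s;
}
```

Under that reading, speedup(0.09, 0.03) is about 3, and the slowest entry gives speedup(1.62, 0.03), which is about 54.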



Patsiatzis Nikolaos

Katsaros Nikolaos


Sobel Vivado HLS Kernel using AXI full interface

In this post we walk through the steps from creating and exporting an HLS IP to integrating it in a Zynq design. Finally, we create an application in the SDK that uses this peripheral to apply a Sobel filter to an image read from an SD card connected to the board. We use the Zedboard development kit and the Vivado 2016.4 tools for this project.

The project is here.

First of all we create our Sobel filter as an HLS kernel. The implementation is basic, but we added some pragmas and techniques to achieve better performance in both memory transactions and computation. Bear in mind that the problem is memory bound, so we focused on that aspect.

To reduce the bottleneck created by the DDR accesses, we use block RAMs to store the input data, process it there, and then write the results back to the DDR. For the transactions we use memcpy, which lets the tool transfer the data in bursts. The size of the block RAM buffers is limited by the number of block RAM slices on the Zedboard. On a larger device we would have used bigger buffers, because longer bursts would have improved performance considerably.
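The copy-in, process, copy-out pattern can be sketched in plain C. This is not the kernel from the project: the image dimensions, the block size, the |Gx| + |Gy| magnitude and the function names are all assumptions chosen for illustration, and on a host the memcpy calls are ordinary copies rather than AXI bursts.

```c
#include <stdlib.h>
#include <string.h>

#define WIDTH 16        /* hypothetical image width in pixels */
#define HEIGHT 16       /* hypothetical image height in pixels */
#define BLOCK_ROWS 4    /* hypothetical number of output rows per block */

/* Sobel magnitude |Gx| + |Gy|, clamped to 255, at row r and column c. */
static unsigned char sobel_pixel(const unsigned char *buf, int r, int c)
{
    const unsigned char *p = buf + r * WIDTH + c;
    int gx = -p[-WIDTH - 1] + p[-WIDTH + 1]
             - 2 * p[-1] + 2 * p[1]
             - p[WIDTH - 1] + p[WIDTH + 1];
    int gy = -p[-WIDTH - 1] - 2 * p[-WIDTH] - p[-WIDTH + 1]
             + p[WIDTH - 1] + 2 * p[WIDTH] + p[WIDTH + 1];
    int mag = abs(gx) + abs(gy);
    return (unsigned char)(mag > 255 ? 255 : mag);
}

/* Filters rows 1..HEIGHT-2 block by block. The first and last rows of out
 * are left untouched; the first and last columns are written as 0. */
void sobel_blocked(const unsigned char *in, unsigned char *out)
{
    /* Local buffers stand in for the block RAMs: one block of rows plus
     * one extra row above and one below for the 3x3 window. */
    unsigned char local_in[(BLOCK_ROWS + 2) * WIDTH];
    unsigned char local_out[BLOCK_ROWS * WIDTH];

    for (int row0 = 1; row0 < HEIGHT - 1; row0 += BLOCK_ROWS) {
        int rows = HEIGHT - 1 - row0;
        if (rows > BLOCK_ROWS)
            rows = BLOCK_ROWS;

        /* One contiguous copy in: rows row0-1 up to row0+rows. */
        memcpy(local_in, in + (row0 - 1) * WIDTH, (size_t)(rows + 2) * WIDTH);

        for (int r = 0; r < rows; r++) {
            local_out[r * WIDTH] = 0;
            local_out[r * WIDTH + WIDTH - 1] = 0;
            for (int c = 1; c < WIDTH - 1; c++)
                local_out[r * WIDTH + c] = sobel_pixel(local_in, r + 1, c);
        }

        /* One contiguous copy out for the whole block of results. */
        memcpy(out + row0 * WIDTH, local_out, (size_t)rows * WIDTH);
    }
}
```

Each block costs two large contiguous copies instead of many scattered single-pixel accesses, which is the property that makes bursts effective. A larger BLOCK_ROWS means fewer and longer copies, at the price of bigger local buffers.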

After we simulate the kernel and verify that it works correctly, we export it as an IP.

To use the implemented IP in Vivado we have to add the HLS project to the repository manager. Then click the Add IP icon, type the name of your IP, and insert it into the block design.

Now it's time to build our system. First of all we add the processing system and enable the S_AXI_HP0 interface (set to a 64-bit data width). Then we add a concat module and a GPIO peripheral, to which we connect the ap_ctrl interface of the kernel.

The final block design screenshot is this.


We export the hardware and launch the SDK. We create a board support package and enable the xilffs library in order to have access to an SD card. We create a new application project and enable the -m flag in the gcc linker.

First of all we have to mount the SD card, which we do with this code:

static FATFS FS_instance;
 const char *Path = "0:/";
 FRESULT result;
 result = f_mount(&FS_instance,Path, 0);
 if (result != FR_OK) {
  printf("Cannot mount sd\n");
  return XST_FAILURE;
 }

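Then we must properly initialize the GPIO and the HLS kernel with these functions:

void sobel_init(unsigned char *input_addr,unsigned char *output_addr){
 //Kernel - Init
    XSobel_InterruptDisable(&Sbl, 1);
    // setter names are assumed to mirror the generated getters used below
    XSobel_Set_in_pointer(&Sbl, (u32)input_addr);
    XSobel_Set_out_pointer(&Sbl, (u32)output_addr);
    printf("Sobel kernel initialized with 0x%08x for input and 0x%08x for output\n",(unsigned int)XSobel_Get_in_pointer(&Sbl),(unsigned int)XSobel_Get_out_pointer(&Sbl));
}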

The data structures and the functions are in the header files generated for the peripherals.

To read an image from the SD card we use this set of commands:

FRESULT f_in, f_out, f_golden;

Log_File = (char *)INPUT_FILE;
 f_in = f_open(&file1, Log_File,FA_READ);
 if (f_in!= FR_OK) {
  printf("File INPUT_FILE not found\n");
  return XST_FAILURE;
 }



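and to write:

Log_File = (char *)OUTPUT_FILE;
 f_out = f_open(&file3, Log_File, FA_CREATE_ALWAYS | FA_WRITE);
 if (f_out!= FR_OK) {
  printf("Cannot create file OUTPUT_FILE\n");
  return XST_FAILURE;
 }

 // keep writing from the current offset until the whole image is on the card
 off = 0;
 UINT writtenBytes = 0;
 while(off != SIZE*SIZE) {
  f_out = f_write(&file3,&output[off],SIZE*SIZE - off,&writtenBytes);
  // a zero-byte write means the volume is full, so stop instead of spinning
  if (f_out != FR_OK || writtenBytes == 0) {
   xil_printf(" ERROR: f_write2 returned %d\r\n",f_out);
   return XST_FAILURE;
  }
  off += writtenBytes;
 }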



The most important part is the code that calls the init functions with the proper values, starts the kernel, and then polls until the kernel has finished processing.
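The monitoring step is essentially a bounded polling loop on a status word. The sketch below is a generic illustration, not the project's code: the function name wait_for_done, the done mask and the poll limit are hypothetical, and which GPIO bit carries the done signal depends on how the concat module is wired in the block design.

```c
#include <stdint.h>

/* Polls *status until any bit in done_mask is set.
 * Returns 0 when the done bit is seen, -1 if max_polls reads pass without it.
 * The pointer would be a memory-mapped register on the board; here it is
 * just an address supplied by the caller. */
int wait_for_done(volatile const uint32_t *status, uint32_t done_mask,
                  uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (*status & done_mask)
            return 0;
    }
    return -1;
}
```

Bounding the number of polls lets the application report a stalled kernel instead of hanging forever in a bare while loop.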



The process described above can be used for any peripheral that can read from and write to the DDR memory.


After some experiments the average time in seconds of each implementation is:

1.62s       |0.09s       |0.03s

As we can see this implementation achieves better performance than any software implementation.


Patsiatzis Nikolaos

Katsaros Nikolaos