EPOLLEXCLUSIVE Linux Kernel patch testing

Overview

For high availability and load balancing systems, it is normal practice to run multiple process copies (executable copies) over some event source, for example, Posix Queue. When one message arrives it is expected only one process to process the message.

Linux Kernel offers two features for this purpose, it have a Posix Queues and it have a Linux specific feature to allow doing epoll() over the Posix Queues. The second feature allows to create a binaries which are doing listen over the multiple Queues. So far it is good. Also it is expected, if we bring up multiple processes waiting on same set of queues via epoll(), to do some load balancing, we should get more parallel processing power. (At some systems it might be up to 250 or more processes listening on some set of queues. For example processes which are responsible for sending messages to network and waiting for reply back). For small amount of process copies on Linux system it looks to perform ok, but for those 250 for example, by doing tests, it have been noticed that systems gets slower doing a lot of kernel processing.

The problem & solution

By doing some research, I have learned, that current Linux kernel versions (at writting this 05.12.2015), when multiple processes or threads doing epoll() on same event source (e.g. Posix queue), and when one event arrives, all processes or threads are woken up. This makes a lot of overhead. It looks like we need some kind of mechanism that one event wakes up one process.

This functionality have been proposed by Jason Baron here in Linux mailing lists: https://lkml.org/lkml/2015/2/9/540

But seems this patch is still not accepted, thus in discussion with Jason, I have done some testing with reduced patch, only with EPOLLEXCLUSIVE flag:

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1e009ca..265fa7b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,7 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -1002,6 +1002,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1067,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
+		ewake = 1;
 		wake_up_locked(&ep->wq);
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1078,6 +1081,9 @@ out_unlock:
 	if (pwake)
 		ep_poll_safewake(&ep->poll_wait);
 
+	if (epi->event.events & EPOLLEXCLUSIVE)
+		return ewake;
+
 	return 1;
 }
 
@@ -1095,7 +1101,10 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (epi->event.events & EPOLLEXCLUSIVE)
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1861,6 +1870,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (f.file == tf.file || !is_file_epoll(f.file))
 		goto error_tgt_fput;
 
+	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
+		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
+		goto error_tgt_fput;
+
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
 	 * our own data structure.
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..925bbfb 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,9 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Add exclusively */
+#define EPOLLEXCLUSIVE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.

Patch testing

Testing is based on Enduro/X framework. Test case is very simple – Bank client application is doing requests to load balanced Bank Server executable, with system running 250 copies of the bank server binaries (this is normal executable doing epoll() on multiple Posix queues). The base code and configuration could be found here: Getting Started Tutorial. Conceptual diagram can be seen in featured post image.

System config

Testing is done on following machine:

Host system: Linux Mint Mate 17.2 64bit, kernel: 3.13.0-24-generic
CPU: Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz (two cores)
RAM: 16 GB
Visualization platform: Oracle Virtual Box 4.3.28
Guest OS: Gentoo Linux 2015.03, kernel 4.3.0-gentoo, 64 bit.
CPU for guest: Two cores
RAM for guest: 5GB
Enduro/X version: 2.3.2

Enduro/X patching for accepting EPOLLEXCLUSIVE flag

The Enduro/X ATMI server process polling is set in libatmisrv/svqdispatch.c. At process initialization, it is doing the opening of the listening queues and adding them to the epoll(). The change snippet is following


#define EPOLLEXCLUSIVE (1 << 28)

/**
* Open queues for listening.
* @return
*/
public int sv_open_queue(void)
{
int ret=SUCCEED;
int i;
svc_entry_fn_t *entry;
struct epoll_event ev;
int use_sem = FALSE;

...

for (i=0; i&lt;G_server_conf.adv_service_count; i++)
{

/* !!! Patched version: */
ev.events = EPOLLIN | EPOLLERR | EPOLLEXCLUSIVE;

/* !!!! Original version: */
/*ev.events = EPOLLIN | EPOLLERR ;*/
ev.data.fd = G_server_conf.service_array[i]-&gt;q_descr;
/*NDRX_LOG(log_debug, "fd %d == entry %d", ev.data.fd, ev.data.u64);*/
/*ev.data.u32 = i;*/
/*ev.data.ptr = G_server_conf.service_array;*/
if (FAIL==epoll_ctl(G_server_conf.epollfd, EPOLL_CTL_ADD,
G_server_conf.service_array[i]-&gt;q_descr, &amp;ev))
{
_TPset_error_fmt(TPEOS, "epoll_ctl failed: %s", strerror(errno));
ret=FAIL;
goto out;
}
}

out:
return ret;
}

Test client app

The client source code have been modified to run total of 1’000’000 requests from 10 Posix threads.

Client source code:

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <math.h>

#include <unistd.h>     /* Symbolic Constants */
#include <sys/types.h>  /* Primitive System Data Types */ 
#include <errno.h>      /* Errors */
#include <stdio.h>      /* Input/Output */
#include <stdlib.h>     /* General Utilities */
#include <pthread.h>    /* POSIX Threads */


#include <atmi.h>
#include <ubf.h>
#include <bank.fd.h>

#define SUCCEED		0
#define FAIL		-1

#define THREADCOUNT	10

void main_th(void *ptr);
int M_is_error = 0;

/**
 * Main entry
 */
int main(int argc, char** argv)
{

	pthread_t thread[THREADCOUNT];  /* thread variables */
	int i;

	for (i=0; i<THREADCOUNT; i++)
	{
		pthread_create (&thread[i], NULL, (void *) main_th, NULL);
	}

	/* wait for finish... */
	for (i=0; i<THREADCOUNT; i++)
	{
		pthread_join(thread[i], NULL);
	}

	return (M_is_error?FAIL:SUCCEED);

}

/**
 * Do the calls to balance server
 */
void main_th(void *ptr)
{
	int ret=SUCCEED;
	UBFH *p_ub;
	long rsplen;
	double balance;
	int i;
	

	/* allocate the call buffer */
	if (NULL== (p_ub = (UBFH *)tpalloc("UBF", NULL, 1024)))
	{
		fprintf(stderr, "Failed to realloc the UBF buffer - %s\n", 
			tpstrerror(tperrno));
		ret=FAIL;
		goto out;
	}
	
	/* Set the data */
	if (SUCCEED!=Badd(p_ub, T_ACCNUM, "ACC00000000001", 0) ||
		SUCCEED!=Badd(p_ub, T_ACCCUR, "USD", 0))
	{
		fprintf(stderr, "Failed to get T_ACCNUM[0]! -  %s\n", 
			Bstrerror(Berror));
		ret=FAIL;
		goto out;
	}
	
	for (i=0; i<(1000000/THREADCOUNT); i++)
	{
	/* Call the server */
	if (FAIL == tpcall("BALANCE", (char *)p_ub, 0L, (char **)&p_ub, &rsplen,0))
	{
		fprintf(stderr, "Failed to call BALANCE - %s\n", 
			tpstrerror(tperrno));
		
		ret=FAIL;
		goto out;
	}
	}
	
	/* Read the balance field */
	
	if (SUCCEED!=Bget(p_ub, T_AMTAVL, 0, (char *)&balance, 0L))
	{
		fprintf(stderr, "Failed to get T_AMTAVL[0]! -  %s\n", 
			Bstrerror(Berror));
		ret=FAIL;
		goto out;
	}
	
	printf("Account balance is: %.2lf USD\n", balance);
	
out:
	/* free the buffer */
	if (NULL!=p_ub)
	{
		tpfree((char *)p_ub);
	}
	
	/* Terminate ATMI session */
	tpterm();
	
	if (SUCCEED!=ret)
		M_is_error = 1;

	return;
}

Test server app

Bank server source code which are doing epoll() on Queues (it have been modified to remove the logging – commented out):


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Enduro/X includes: */
#include <atmi.h>
#include <ubf.h>
#include <bank.fd.h>

#define SUCCEED		0
#define FAIL		-1

/**
 * BALANCE service
 */
void BALANCE (TPSVCINFO *p_svc)
{
	int ret=SUCCEED;
	double balance;
	char account[28+1];
	char currency[3+1];
	BFLDLEN len;

	UBFH *p_ub = (UBFH *)p_svc->data;

/*	fprintf(stderr, "BALANCE got call\n"); */

	/* Resize the buffer to have some space in... */
	if (NULL==(p_ub = (UBFH *)tprealloc ((char *)p_ub, 1024)))
	{
		fprintf(stderr, "Failed to realloc the UBF buffer - %s\n", 
			tpstrerror(tperrno));
		ret=FAIL;
		goto out;
	}
	
	
	/* Read the account field */
	len = sizeof(account);
	if (SUCCEED!=Bget(p_ub, T_ACCNUM, 0, account, &len))
	{
		fprintf(stderr, "Failed to get T_ACCNUM[0]! -  %s\n", 
			Bstrerror(Berror));
		ret=FAIL;
		goto out;
	}
	
	/* Read the currency field */
	len = sizeof(currency);
	if (SUCCEED!=Bget(p_ub, T_ACCCUR, 0, currency, &len))
	{
		fprintf(stderr, "Failed to get T_ACCCUR[0]! -  %s\n", 
			Bstrerror(Berror));
		ret=FAIL;
		goto out;
	}
	
	/*
	fprintf(stderr, "Got request for account: [%s] currency [%s]\n",
			account, currency);
	*/

	srand(time(NULL));
	balance = (double)rand()/(double)RAND_MAX + rand();

	/* Return the value in T_AMTAVL field */
	
/*	fprintf(stderr, "Retruning balance %lf\n", balance); */
	

	if (SUCCEED!=Bchg(p_ub, T_AMTAVL, 0, (char *)&balance, 0L))
	{
		fprintf(stderr, "Failed to set T_AMTAVL! -  %s\n", 
			Bstrerror(Berror));
		ret=FAIL;
		goto out;
	}

out:
	tpreturn(  ret==SUCCEED?TPSUCCESS:TPFAIL,
		0L,
		(char *)p_ub,
		0L,
		0L);
}

/**
 * Do initialization
 */
int tpsvrinit(int argc, char **argv)
{
	if (SUCCEED!=tpadvertise("BALANCE", BALANCE))
	{
		fprintf(stderr, "Failed to initialize BALANCE - %s\n", 
			tpstrerror(tperrno));
	}
}

/**
 * Do de-initialization
 */
void tpsvrdone(void)
{
	fprintf(stderr, "tpsvrdone called\n");
}

Enduro/X runtime config

The Enduro/X Configuration for this test is following:


<?xml version="1.0" ?>
<endurox>
    <appconfig>
         <!-- ALL BELLOW ONES USES <sanity> periodical timer  -->
         <!-- Sanity check time, sec -->
         <sanity>5</sanity>
         <!--
             Seconds in which we should send service refresh to other node.
         -->
         <brrefresh>6</brrefresh>
         
         <!--  <sanity> timer, end -->
         
         <!-- ALL BELLOW ONES USES <respawn> periodical timer  -->
         <!-- Do dead process restart every X seconds 
         NOT USED ANYMORE, REPLACED WITH SANITY!
         <respawncheck>10</respawncheck>
         -->
         <!-- Do process reset after 1 sec -->
         <restart_min>1</restart_min>
         <!-- If restart fails, then boot after +5 sec of previous wait time -->
         <restart_step>1</restart_step>
         <!-- If still not started, then max boot time is a 30 sec. -->
         <restart_max>5</restart_max>
         <!--  <sanity> timer, end -->
         
         <!-- Time after attach when program will start do sanity & respawn checks,
              starts counting after configuration load -->
         <restart_to_check>20</restart_to_check>
         
         <!-- Setting for pq command, should ndrxd collect service 
              queue stats automatically
         If set to Y or y, then queue stats are on.
         Default is off.
         -->
         <gather_pq_stats>Y</gather_pq_stats>
         
	</appconfig>
    <defaults>
        <min>1</min>
        <max>2</max>
        <!-- Kill the process which have not started in <start_max> time -->
        <autokill>1</autokill>
        <!--
        <respawn>1<respawn>
        -->
        <!--
            <env></env> works here too!
        -->
         <!-- The maximum time while process can hang in 'starting' state i.e.
            have not completed initialization, sec
            X <= 0 = disabled 
        -->
         <start_max>2</start_max>
         <!--
            Ping server in every X seconds (step is <sanity>).
         -->
         <pingtime>1</pingtime>
         <!--
            Max time in seconds in which server must respond.
            The granularity is sanity time.
            X <= 0 = disabled 
         -->
         <ping_max>4</ping_max>
         <!--
            Max time to wait until process should exit on shutdown
            X <= 0 = disabled 
         -->
         <end_max>5</end_max>
         <!-- Interval, in seconds, by which signal sequence -2, -15, -9, -9.... will be sent
         to process until it have been terminated. -->
         <killtime>1</killtime>
         <!-- List of services (comma separated) for ndrxd to export services over bridges -->
    <!--     <exportsvcs>FOREX</exportsvcs> -->
	</defaults>
	<servers>
		<!-- This is binary we are about to build -->
		<server name="banksv">
			<srvid>1</srvid>
			<min>250</min>
			<max>250</max>
			<sysopt>-e /opt/app1/log/BANKSV -r</sysopt>
		</server>
	</servers>
</endurox>

We see that process copy count is set to 250 (min/max=250).

Running test with original version of Enduro/X (not EPOLLEXCLUSIVE) flag set


/opt/app1/src/bankcl $ time ./bankcl
Account balance is: 422450207.99 USD
Account balance is: 509124175.43 USD
Account balance is: 142217616.65 USD
Account balance is: 461038486.08 USD
Account balance is: 2053817447.15 USD
Account balance is: 1683239018.37 USD
Account balance is: 1499130446.73 USD
Account balance is: 241376717.59 USD
Account balance is: 241376717.59 USD
Account balance is: 51450553.95 USD

real 14m20.561s
user 0m21.823s
sys 10m49.821s

Running test with patched version of kernel and Enduro/X (not EPOLLEXCLUSIVE) flag set


/opt/app1/src/bankcl $ time ./bankcl
Account balance is: 385272191.30 USD
Account balance is: 385272191.30 USD
Account balance is: 1266595989.16 USD
Account balance is: 1266595989.16 USD
Account balance is: 1266595989.16 USD
Account balance is: 1266595989.16 USD
Account balance is: 1266595989.16 USD
Account balance is: 1266595989.16 USD
Account balance is: 1266595989.16 USD
Account balance is: 1266595989.16 USD

real 0m24.953s
user 0m17.497s
sys 0m4.445s

The last returned balance of “1266595989.16” for 8x threads is related with fact that those ‘banksv’ processes did run relatively synchronous, thus  srand(time(NULL)) gave the same results.

Conclusions

We see that original version of kernel and Enduro/X in this test configuration did run 14m20.561s which is: ~860 seconds. In patched version it did run: ~24  seconds.

Thus the benefit of the patch is 3500%!!!! This is must have patch!

 

Advertisements

One thought on “EPOLLEXCLUSIVE Linux Kernel patch testing

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s