The m3.medium is terrible

I’ve been doing some testing of various instance types in our staging environment, originally just to see if Amazon’s t2.* line of instances is usable in a real-world scenario. In the end, I found that not only are the t2.mediums viable for what I want them to do, but they’re far better suited than the m3.medium, which I wouldn’t use for anything that you ever expect to reach any load.

Here are the conditions for my test:

  • Rails application (unicorn) fronted by nginx.
  • The number of unicorn processes is controlled by chef, currently set to (CPU count * 2), so a 2 CPU instance has 4 unicorn workers.
  • All instances are running Ubuntu 14.04 LTS (AMI ami-864d84ee for HVM, ami-018c9568 for paravirtual) with kernel 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64.
  • The test used loader.io to simulate 65 concurrent clients hitting the API (adding products to cart) as fast as possible for 600 seconds (10 minutes).
  • The instances were all behind an Elastic Load Balancer, which routes traffic based on its own algorithm (supposedly the instances with the lowest CPU always gets the next request).

The below charts summarize the findings.

average nginx $request_time
average nginx $request_time

This chart shows each server’s performance as reported by nginx. The values are the average time to service each request and the standard deviation. While I expected the m3.large to outperform the m3.medium, I didn’t expect the difference to be so dramatic. The performance of the t2.medium is the real surprise, however.

#	_sourcehost	_avg	_stddev
1	m3.large	6.30324	3.84421
2	m3.medium	15.88136	9.29829
3	t2.medium	4.80078	2.71403

These charts show the CPU activity for each instance during the test (data as per CopperEgg).

m3.large
m3.large
t2.medium
t2.medium
m3.medium
m3.medium

The m3.medium has a huge amount of CPU steal, which I’m guessing accounts for its horrible performance. Anecdotally, in my own experience m3.medium far more prone to CPU steal than other instance types. Moving from m3.medium to c3.large (essentially the same instance with 2 cpus) eliminates the CPU steal issue. However, since the t2.medium performs as well as the c3.large or m3.large and costs half of the c3.large (or nearly 1/3 of the m3.large) I’m going to try running most of my backend fleet on t2.medium.

I haven’t mentioned the credits system the t2.* instances use for burstable performance, and that’s because my tests didn’t make much of a dent in the credit balance for these instances. The load test was 100x what I expect to see in normal traffic patterns, so the t2.medium with burstable performance seems like an ideal candidate. I might add a couple c3.large to the mix as a backstop in case the credits were depleted, but I don’t think that’s a major risk – especially not in our staging environment.

Edit: I didn’t include the numbers, but the performance seemed to be the consistent whether on hvm or paravirtual instances.

Using OpenSWAN to connect two VPCs in different AWS regions

Amazon has a pretty decent writeup on how to do this (a href=”https://aws.amazon.com/articles/5472675506466066″>here), but in trying to establish Postgres replication across regions, I found some weird behavior where I could connect to the port directly (telnet to 5432) but psql (or pg_basebackup) didn’t work. tcpdump showed this:

16:11:28.419642 IP 10.121.11.47.35039 > 10.1.11.254.postgresql: Flags [P.], seq 9:234, ack 2, win 211, options [nop,nop,TS val 11065893 ecr 1811434], length 225
16:11:28.419701 IP 10.121.11.47.35039 > 10.1.11.254.postgresql: Flags [P.], seq 9:234, ack 2, win 211, options [nop,nop,TS val 11065893 ecr 1811434], length 225
16:11:28.421186 IP 10.1.11.254.postgresql > 10.121.11.47.35039: Flags [.], ack 234, win 219, options [nop,nop,TS val 1811520 ecr 11065893,nop,nop,sack 1 {9:234}], length 0
16:11:28.425273 IP 10.1.11.254.postgresql > 10.121.11.47.35039: Flags [P.], seq 2:1377, ack 234, win 219, options [nop,nop,TS val 1811522 ecr 11065893], length 1375
16:11:28.425291 IP 10.1.96.20 > 10.1.11.254: ICMP 10.121.11.47 unreachable - need to frag (mtu 1422), length 556
16:11:28.697397 IP 10.1.11.254.postgresql > 10.121.11.47.35039: Flags [P.], seq 2:1377, ack 234, win 219, options [nop,nop,TS val 1811590 ecr 11065893], length 1375
16:11:28.697438 IP 10.1.96.20 > 10.1.11.254: ICMP 10.121.11.47 unreachable - need to frag (mtu 1422), length 556
16:11:29.241311 IP 10.1.11.254.postgresql > 10.121.11.47.35039: Flags [P.], seq 2:1377, ack 234, win 219, options [nop,nop,TS val 1811726 ecr 11065893], length 1375
16:11:29.241356 IP 10.1.96.20 > 10.1.11.254: ICMP 10.121.11.47 unreachable - need to frag (mtu 1422), length 556
16:11:30.333438 IP 10.1.11.254.postgresql > 10.121.11.47.35039: Flags [P.], seq 2:1377, ack 234, win 219, options [nop,nop,TS val 1811999 ecr 11065893], length 1375
16:11:30.333488 IP 10.1.96.20 > 10.1.11.254: ICMP 10.121.11.47 unreachable - need to frag (mtu 1422), length 556
16:11:32.513418 IP 10.1.11.254.postgresql > 10.121.11.47.35039: Flags [P.], seq 2:1377, ack 234, win 219, options [nop,nop,TS val 1812544 ecr 11065893], length 1375
16:11:32.513467 IP 10.1.96.20 > 10.1.11.254: ICMP 10.121.11.47 unreachable - need to frag (mtu 1422), length 556
16:11:36.881409 IP 10.1.11.254.postgresql > 10.121.11.47.35039: Flags [P.], seq 2:1377, ack 234, win 219, options [nop,nop,TS val 1813636 ecr 11065893], length 1375
16:11:36.881460 IP 10.1.96.20 > 10.1.11.254: ICMP 10.121.11.47 unreachable - need to frag (mtu 1422), length 556

After quite a bit of Google and mucking in network ACLs and security groups, the fix ended up being this:

iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1500

(The above two commands need to be run on both OpenSwan boxes.)

“You have to install development tools first.” – OSX Mavericks, ruby, chef, and nokogiri

I was trying to get knife ec2 working on my Mac, but even though my system Ruby was at 2.0.0, the embedded Ruby that chef/knife use (in /opt/chef/embedded/bin) was 1.9.1. Installing knife-ec2 should just be a matter of typing “gem install knife-ec2″ but due to some weird issues with nokogiri, I burned about 4 hours trying to make it work. I tried everything I could find – installing iconv, libxml2, and libxslt via brew and telling “gem install” to use the custom libs in /usr/local/Cellar was the most common suggestion on StackOverflow – but nothing worked. What ended up fixing it for me was reinstalling chef. :|

[evan@Evan ~] $ sudo /opt/chef/embedded/bin/gem install nokogiri
Building native extensions.  This could take a while...
Building nokogiri using packaged libraries.
ERROR:  Error installing nokogiri:
	ERROR: Failed to build gem native extension.
 
        /opt/chef/embedded/bin/ruby extconf.rb
Building nokogiri using packaged libraries.
checking for iconv.h... *** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of
necessary libraries and/or headers.  Check the mkmf.log file for more
details.  You may need configuration options.
 
Provided configuration options:
	--with-opt-dir
	--with-opt-include
	--without-opt-include=${opt-dir}/include
	--with-opt-lib
	--without-opt-lib=${opt-dir}/lib
	--with-make-prog
	--without-make-prog
	--srcdir=.
	--curdir
	--ruby=/opt/chef/embedded/bin/ruby
	--help
	--clean
	--use-system-libraries
	--enable-static
	--disable-static
	--with-zlib-dir
	--without-zlib-dir
	--with-zlib-include
	--without-zlib-include=${zlib-dir}/include
	--with-zlib-lib
	--without-zlib-lib=${zlib-dir}/lib
	--enable-cross-build
	--disable-cross-build
/opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:381:in `try_do': The compiler failed to generate an executable file. (RuntimeError)
You have to install development tools first.
	from /opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:506:in `try_cpp'
	from /opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:931:in `block in have_header'
	from /opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:790:in `block in checking_for'
	from /opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:284:in `block (2 levels) in postpone'
	from /opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:254:in `open'
	from /opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:284:in `block in postpone'
	from /opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:254:in `open'
	from /opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:280:in `postpone'
	from /opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:789:in `checking_for'
	from /opt/chef/embedded/lib/ruby/1.9.1/mkmf.rb:930:in `have_header'
	from extconf.rb:103:in `have_iconv?'
	from extconf.rb:148:in `block (2 levels) in iconv_prefix'
	from extconf.rb:90:in `preserving_globals'
	from extconf.rb:143:in `block in iconv_prefix'
	from extconf.rb:116:in `block in each_iconv_idir'
	from extconf.rb:113:in `each'
	from extconf.rb:113:in `each_iconv_idir'
	from extconf.rb:137:in `iconv_prefix'
	from extconf.rb:428:in `block in <main>'
	from extconf.rb:161:in `block in process_recipe'
	from extconf.rb:154:in `tap'
	from extconf.rb:154:in `process_recipe'
	from extconf.rb:423:in `</main><main>'
 
 
Gem files will remain installed in /opt/chef/embedded/lib/ruby/gems/1.9.1/gems/nokogiri-1.6.3.1 for inspection.
Results logged to /opt/chef/embedded/lib/ruby/gems/1.9.1/gems/nokogiri-1.6.3.1/ext/nokogiri/gem_make.out
[evan@Evan ~] $
</main>

Well, this is apparently an indication that you don’t have the command-line dev tools installed on your computer. However, in Mavericks, according to Apple:

If Xcode is installed on your machine, then there is no need to install them. Xcode comes bundled with all your command-line tools. OS X 10.9 includes shims or wrapper executables. These shims, installed in /usr/bin, can map any tool included in /usr/bin to the corresponding one inside Xcode. xcrun is one of such shims, which allows you to find or run any tool inside Xcode from the command line. Use it to invoke any tool within Xcode from the command line.

wat?

I spent several hours trawling through StackExchange, Googling for every combination of nokogiri, mavericks, chef, xcode. Here are some of my searches from today:

  • https://www.google.com/search?q=%22You+have+to+install+development+tools+first.%22&hl=en
  • https://www.google.com/search?q=osx+mavericks+command+line+tools&hl=en
  • https://www.google.com/search?q=%22The+compiler+failed+to+generate+an+executable+file.%22+%22You+have+to+install+development+tools+first.%22+nokogiri&hl=en
  • https://www.google.com/search?q=mavericks+nokogiri+%22iconv.h%22&hl=en
  • https://www.google.com/search?q=chef+embedded+nokogiri+iconv&hl=en
  • https://www.google.com/search?q=mavericks+nokogiri+%22iconv.h%22+chef+embedded&hl=en
  • https://www.google.com/search?q=nokogiri+%22The+compiler+failed+to+generate+an+executable+file.+(RuntimeError)%22&hl=en

How did I end up fixing it? Two things:

  1. In ~/.bashrc, add export PATH=/opt/chef/embedded/bin:$PATH
  2. Reinstall chef: curl -L https://www.getchef.com/chef/install.sh | sudo bash

After reinstalling chef (which installed an embedded Ruby 1.9.3 – my old version was 1.9.1), this command ran successfully:

$ sudo gem install -V --no-rdoc --no-ri nokogiri

Full output below:
Continue reading

OpenVPN CLI Cheat Sheet

Adding a regular user called testing

/usr/local/openvpn_as/scripts/sacli -u testing -k type -v user_connect UserPropPut

Add an autologin user called knock

/usr/local/openvpn_as/scripts/sacli -u knock -k prop_autologin -v true UserPropPut

Add an admin user called admin

/usr/local/openvpn_as/scripts/sacli -u admin -k prop_superuser -v true UserPropPut; /etc/init.d/openvpnas restart

Allow user testing to networks 192.168.0.0/24 and 10.0.0.0/16 via NAT

/usr/local/openvpn_as/scripts/sacli -u testing -k access_to.0 -v +NAT:192.168.0.0/24 UserPropPut; /usr/local/openvpn_as/scripts/sacli -u testing -k access_to.1 -v +NAT:192.168.0.0/16 UserPropPut; /usr/local/openvpn_as/scripts/sacli start

Allow user testing to networks 192.168.0.0/24 and 10.0.0.0/16 via ROUTE

/usr/local/openvpn_as/scripts/sacli -u testing -k access_to.0 -v +ROUTE:192.168.0.0/24 UserPropPut; /usr/local/openvpn_as/scripts/sacli -u testing -k access_to.1 -v +ROUTE:192.168.0.0/16 UserPropPut; /usr/local/openvpn_as/scripts/sacli start

Remove access to network entry 0 and 1 for user testing

/usr/local/openvpn_as/scripts/sacli -u testing -k access_to.0 UserPropDel; /usr/local/openvpn_as/scripts/sacli -u testing -k access_to.1 UserPropDel; /usr/local/openvpn_as/scripts/sacli start

Get installer with profile for user, in this case autologin

./sacli –user testing AutoGenerateOnBehalfOf
./sacli –user testing –key prop_autologin –value true UserPropPut
./sacli –itype msi –autologin -u testing -o installer_testing/ GetInstallerEx

Get separate certificate files for user, for open source applications

./sacli -o ./targetfolder –cn test Get5

Get unified (.ovpn file) for user, for Connect Client for example

./sacli -o ./targetfolder –-cn test Get1

Show all users in user database with all their properties

./confdba -u -s

Show only a specific user in user database with all properties

./confdba -u –prof testuser -s

Remove a user from the database, revoke his/her certificates, and then kick him/her off the server

./confdba -u –prof testing –rm
./sacli –user testing RevokeUser
./sacli –user testing DisconnectUser

Set a password on a user from the command line, when using LOCAL authentication mode:

./sacli –user testing –new_pass passwordgoeshere SetLocalPassword

Enable Google Authenticator for a user:

./sacli --key vpn.server.google_auth.enable --value true ConfigPut

 

Create CloudWatch alerts for all Elastic Load Balancers

I manage a bunch of ELBs but we were missing an alert on a pretty basic metric: how many errors the load balancer was returning.  Rather than wade through the UI to add these alerts I figured it would be easier to do it via the CLI.

Assuming aws-cli is installed and the ARN for your SNS topic (in my case, just an email alert) is $arn:

for i in `aws elb describe-load-balancers | grep LoadBalancerName | \
perl -ne 'chomp; my @a=split(/\s+/); $a[2] =~ s/[\"\,]//g ; print "$a[2] ";' ` ; \
do aws cloudwatch put-metric-alarm --alarm-name "$i ELB 5XX Errors" --alarm-description \
"High $i ELB 5XX error count" --metric-name HTTPCode_ELB_5XX --namespace AWS/ELB \
--statistic Sum --period 300 --evaluation-periods 1 --threshold 50 \
--comparison-operator GreaterThanThreshold --dimensions Name=LoadBalancerName,Value=$i \
--alarm-actions $arn --ok-actions $arn ; done

That huge one-liner creates a CloudWatch notification that sends an alarm when the number of 5XX errors returned by the ELB is greater than 50 over 5 minutes, and sends an “ok” message via the same SNS topic. The for loop creates/modifies the alarm for every ELB.

More info on put-metric-alarm available in the AWS docs.