HTTP GET and POST from scratch with Sockets

So I’ve done a bunch of networking research lately, and have actually decided I’m not going to continue writing my own HTTP GETter and POSTer. Instead, I’ll be using libcurl. That said, I am not that comfortable using 3rd party libraries and tools unless I have a solid understanding of how they work internally.

So this is a collection of notes on how to do HTTP GET and POST requests, in case I ever need to come back and do this from scratch (because I’m not going to remember all this). So let’s get started.

Functions Needed

Dealing with HTTP, we’re going to need to interpret data in a bunch of ways. Here is a list of things not handled by socket libraries:

  • Key/Value Line Reader. Data in an HTTP header is returned as “Key: Value\n”
  • Hexadecimal String to integer conversion. “5c\n” means 92 bytes of data follow.
  • Functions for extracting parts of a URL. Host Name (blah.domain.com), Path (/somedir/somefile.php).
  • Functions for encoding and decoding as: action=list&info=Hello+World+%28Woo%21%29&num=1
  • A linked list or similar structure to collect subsequent random sized parts of a TCP stream before final data assembly.
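As a rough sketch of the first two helpers in the list, here is one way to split a “Key: Value” line and to read a hexadecimal size. The function names (parse_header_line, parse_hex_size) are made up for these notes, and the code assumes the line is a writable, NUL-terminated string.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Strip leading and trailing whitespace (spaces, tabs, CR, LF) in place. */
static char *trim(char *s) {
    while (*s != '\0' && isspace((unsigned char)*s)) s++;
    char *end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1])) end--;
    *end = '\0';
    return s;
}

/* Split "Key: Value\r\n" in place at the first ':'.
   On success, *key and *value point into line. Returns -1 if there is no ':'. */
int parse_header_line(char *line, char **key, char **value) {
    char *colon = strchr(line, ':');
    if (colon == NULL) return -1;
    *colon = '\0';
    *key = trim(line);
    *value = trim(colon + 1);
    return 0;
}

/* Read a hexadecimal size such as "5c\r\n". Returns -1 if no hex digits are found. */
long parse_hex_size(const char *s) {
    char *end;
    long n = strtol(s, &end, 16);
    return (end == s || n < 0) ? -1 : n;
}
```

Note that the first line of a header (the request or status line) is not a key/value pair, so it should be handled before passing lines to something like this.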

HTTP in a nutshell

This is a good reference: http://www.jmarshall.com/easy/http/

Getting data over HTTP is a matter of exchanging HTTP headers. An HTTP header is a line-delimited set of key/value pairs. Notably, the first line is not a key/value pair, but the lines that follow are. Finally, the header ends with a blank line (just a newline CRLF).

Here’s how we ask for data. Open a TCP socket, and stuff a packet containing this data in to it.
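A minimal sketch of building such a request in C follows. The function name build_get_request is hypothetical, and the “Connection: close” line is my own addition (it asks the server to close the socket when it is done, which keeps the receiving side simple).

```c
#include <stdio.h>

/* Write a minimal HTTP/1.1 GET request into buf.
   Returns the request length, or -1 if buf is too small. */
int build_get_request(char *buf, size_t size, const char *host, const char *path) {
    int n = snprintf(buf, size,
        "GET %s HTTP/1.1\r\n"      /* request line: method, path, protocol */
        "Host: %s\r\n"             /* required by HTTP/1.1 */
        "Connection: close\r\n"    /* assumption: one request per connection */
        "\r\n",                    /* blank line ends the header */
        path, host);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

For example, build_get_request(buf, sizeof buf, "toonormal.com", "/_robots.txt") would produce a request for the file mentioned later in these notes.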

The above is called an HTTP Request Header. It’s our way of asking a webserver for data. It’s the same thing as punching the following in to a web browser:

The domain part is extracted (known as the Host Name), as is the path. The HTTP part is discarded, or rather, is what tells us that we will be exchanging data using HTTP Headers. Had the protocol been something else, like ftp://, we would be doing this a completely different way (not using HTTP headers).
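Splitting a URL into those two parts could look something like the sketch below. split_http_url is a made-up name, and the code deliberately ignores ports, user names, and anything that isn’t plain http://.

```c
#include <string.h>

/* Split "http://host/path" into host and path.
   A URL with no path gets "/". Returns 0 on success, -1 if the URL does not
   start with http:// or an output buffer is too small. */
int split_http_url(const char *url, char *host, size_t host_size,
                   char *path, size_t path_size) {
    const char *prefix = "http://";
    size_t prefix_len = strlen(prefix);
    if (strncmp(url, prefix, prefix_len) != 0) return -1;

    const char *start = url + prefix_len;
    const char *slash = strchr(start, '/');
    size_t host_len = slash ? (size_t)(slash - start) : strlen(start);
    const char *p = slash ? slash : "/";

    if (host_len == 0 || host_len >= host_size || strlen(p) >= path_size) return -1;
    memcpy(host, start, host_len);
    host[host_len] = '\0';
    strcpy(path, p);
    return 0;
}
```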

Following the packet, the webserver should respond by sending you the following packet. All you need to do is receive it (well, and interpret it too):

Long story short, the above is what sending a text file containing the following looks like.

Don’t be confused by its similarity, this is a real file: http://toonormal.com/_robots.txt

That roughly summarizes all internet browser traffic. There is certainly more to it, but nearly everything is just variations of the above.

Nuances of HTTP Headers

Here’s what I know:

  • Don’t use only “\n” in Headers!! HTTP wants every header line to end in CR LF (hex 0D 0A). On Windows, “\n” only becomes CR LF when it is written through a text-mode stream (printf to the console, or a file opened without “b”). sprintf and send() pass the bytes through untouched, so a bare “\n” goes over the wire as a lone LF. Explicitly write the CR LF instead, so your constructed HTTP Headers are portable. You can do this with “\r\n”, or with octal codes “\015\012”. The one catch is that printing such a header through a text-mode stream on Windows shows a double CR (CRCRLF, hex 0D 0D 0A), but that only affects your debug output, not what is sent. You should always be eliminating whitespace around key/value pairs anyway, and that includes any CR and LF codes alongside your tabs and spaces.
  • HTTP Headers always end with a double newline. That means CRLF CRLF (the final header line followed by a blank line). The double newline is the only way of knowing when the header ends, and the data (or chunks) begin.
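Finding where the header stops is then a search for the blank line. This sketch assumes the standard CR LF CR LF terminator and a made-up function name, find_body_offset.

```c
#include <string.h>

/* Return the offset of the first byte after the header terminator,
   or -1 if the terminator has not been received yet. */
long find_body_offset(const char *data, size_t len) {
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(data + i, "\r\n\r\n", 4) == 0) return (long)(i + 4);
    }
    return -1;
}
```

A return of -1 usually just means more data needs to be received before the header is complete.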

Nuances of the HTTP Request Header

Here’s some more:

  • The Request Line (the 1st one) takes one of several request types. GET, POST, HEAD, PUT, DELETE, OPTIONS, and TRACE. Generally speaking, the first 3 are the only ones we care about. GET’s are what almost all HTTP requests should be. HEAD’s are exactly like GET’s, except they only return the Header part (no data). POST’s are a variation of GET that sends data alongside the Request Header, much like how an HTTP Response Header sends you files and other data. More on this later.
  • Socket connections are with IP addresses, not hosts. The point of the “Host: ” field in an HTTP Request Header is to tell the webserver where to route the request. If you happen to host multiple domains and subdomains on a web host, the “Host: ” field is what actually differentiates between each and every domain. It is a requirement of the HTTP/1.1 protocol (so if you’re feeling stupid you can change to HTTP/1.0 and omit the Host field, but that would be pointless). Without a Host field, it’s the same thing as punching in a website by its IP address.
  • “Accept-Encoding: gzip” and “Accept-Encoding: gzip, deflate” can be used to tell the webserver that you understand compressed data. So for a slight CPU hit, the data will be compressed before sending. This may mean the data needs to finish generating first as well. If your returned data is less than 200 bytes, compression may not actually make it any smaller.
  • Google App Engine also requires that “gzip” be included in the User-Agent if it will be returning gzipped data.
  • Compression support across webhosts isn’t very reliable (at least cheap ones). You may be better off explicitly compressing, caching, and sending that instead of expecting automatic file compression to work its magic.
  • More information on Compression: http://www.http-compression.com/
  • User-Agent: http://en.wikipedia.org/wiki/User_agent

Nuances of HTTP Response Headers

And again:

  • “Transfer-Encoding: chunked” is a common encoding for text and JSON files. You will receive multiple chunks until the file finishes. Regrettably, you sometimes don’t actually know the length of file you are receiving (especially if it’s generated). If you’ve ever noticed that behavior in a web browser download (file download unknown length and time), this is what was going on. You were receiving the file and the header neglected to include a “Content-Length: “.
  • “Content-Encoding: gzip” is what is returned to tell you the data is coming in gzip compressed (“Transfer-Encoding: gzip” also exists in the spec, but servers answering an “Accept-Encoding: gzip” request reply with Content-Encoding). In the responses I’ve seen, it arrives non-chunked and a “Content-Length: ” is always included. The data is raw binary, and unlikely to be finished in the first packet. Subsequent packets (I assume) contain headerless, sizeless, raw data.
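To make the chunked format concrete, here is a sketch of a decoder for a chunked body that has already been fully received into memory. Each chunk is a hexadecimal size line, that many bytes of data, and a CR LF; a size of 0 ends the body. The name decode_chunked is hypothetical, and the sketch ignores trailers and doesn’t try to handle partially received input.

```c
#include <stdlib.h>
#include <string.h>

/* Decode a complete chunked body from in into out.
   Returns the decoded length, or -1 on malformed input or if out is too small. */
long decode_chunked(const char *in, size_t in_len, char *out, size_t out_size) {
    size_t pos = 0, written = 0;
    for (;;) {
        /* Locate the CR LF that ends the chunk-size line. */
        const char *line = in + pos;
        const char *eol = NULL;
        for (size_t i = pos; i + 1 < in_len; i++) {
            if (in[i] == '\r' && in[i + 1] == '\n') { eol = in + i; break; }
        }
        if (eol == NULL) return -1;

        /* The size is hexadecimal; anything after it (extensions) is ignored. */
        char *end;
        long size = strtol(line, &end, 16);
        if (end == line || end > eol || size < 0) return -1;
        pos = (size_t)(eol - in) + 2;

        if (size == 0) return (long)written;   /* last chunk */

        /* Need the chunk data plus its trailing CR LF. */
        if (pos + (size_t)size + 2 > in_len) return -1;
        if (written + (size_t)size > out_size) return -1;
        memcpy(out + written, in + pos, (size_t)size);
        written += (size_t)size;
        pos += (size_t)size + 2;
    }
}
```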

Code – BSD Sockets vs WinSock2

BSD Sockets are the long time standard way to do network communications. Practically all socket libraries are derivatives of the original BSD standard.

WinSock2 is actually an implementation of BSD Sockets. It’s 80-90% functionally compatible with BSD Sockets, so if you really wanted to, you could implement both in one codebase with just some occasional #ifdefs. That said, WinSock2 does have a plethora of other functions, but they all seem to map to BSD Socket features.

The one big difference between BSD Sockets and WinSock2 is that WinSock2 needs to be initialized before any socket code will work, and shut down once finished.
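A minimal sketch of that initialization, wrapped in an #ifdef so the same file builds on both platforms. The names net_init and net_shutdown are made up; requesting version 2.2 is an assumption on my part (it is the usual choice).

```c
#include <stdlib.h>

#ifdef _WIN32
#include <winsock2.h>

/* Registered with atexit() so WinSock2 is shut down when the program ends. */
static void net_shutdown(void) {
    WSACleanup();
}
#endif

/* Call once before any socket code. Returns 0 on success, -1 on failure.
   On BSD Sockets there is nothing to initialize, so this is a no-op. */
int net_init(void) {
#ifdef _WIN32
    WSADATA wsa_data;
    if (WSAStartup(MAKEWORD(2, 2), &wsa_data) != 0) return -1;
    atexit(net_shutdown);
#endif
    return 0;
}
```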

The above code uses an atexit() callback, but if you really wanted to, you could call WSACleanup() yourself at program end. More details on WSAStartup() can be found here:

http://msdn.microsoft.com/en-us/library/windows/desktop/ms742213%28v=vs.85%29.aspx

Host Name Lookup

Sockets are connections between computers known by IP addresses. The internet however heavily uses the idea of a host name. “google.com” is a hostname. So before we can talk to a website, we need to send a request to a DNS Nameserver to have them look up a domain name for us.

A side story on DNS nameservers, I recently set up a computer at my parents place that I could connect to remotely over the internet from my home. This required setting up specific ports on their internet router to be forwarded to the computer. Unfortunately, this meant assigning a static IP address to the computer. What sucks about this is you need to explicitly say where the internet gateway is (the router’s IP address), and the DNS Nameservers (typically assigned by your ISP, but Google shares, as do others). Normally this stuff is assigned automatically when you ask a DHCP server for an IP. If you neglect the DNS Nameserver, you can’t look up domains. Networking 101, but I’m a coder not a networking guy.

Anyways, I just wanted to talk about this to give context to its importance. It’s literally the backbone of what most people consider the internet. If you can’t type in a website name, then as far as you’re concerned the internet is down. I have run in to DNS issues many times in the past, even strange cases where the DNS server is down but IP traffic works fine. If you happen to have an IP cached/remembered somewhere, you can still connect. Of course, it’s not our responsibility as developers to handle DNS outages, but it’s an issue that always impresses me. The internet is “down”, yet it’s still working. Even if your own DNS server is down, but IP is still up, then the outside world can still connect to you. You can’t search for computers by domain name, but your own domain records still exist in other DNS databases.

Ahem! Now that we’re out of that rathole, this is a shockingly simple thing to do.
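Something along these lines does the lookup with gethostbyname(), the classic BSD call that returns a hostent structure. This is a sketch using POSIX headers (WinSock2 would include winsock2.h instead); lookup_ipv4 is a made-up name. Newer code generally prefers getaddrinfo(), but gethostbyname() matches the structure discussed in these notes.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>

/* Resolve host to its first IPv4 address (4 bytes, network order).
   Returns 0 on success, -1 on failure. */
int lookup_ipv4(const char *host, unsigned char out[4]) {
    struct hostent *he = gethostbyname(host);
    if (he == NULL) return -1;
    if (he->h_addrtype != AF_INET || he->h_length != 4) return -1;
    if (he->h_addr_list[0] == NULL) return -1;
    memcpy(out, he->h_addr_list[0], 4);
    return 0;
}
```

The returned hostent lives in static storage owned by the library, which is why the address is copied out immediately.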

The hostent structure is an interesting way of understanding the internet.

Here’s a more featured snippet:
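As a sketch (not the original snippet, which isn’t preserved here), this walks every field of the hostent: the official name, the alias list, and the address list. print_host_info is a hypothetical name, and only IPv4 results are printed.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Print everything gethostbyname() reports for host.
   Returns the number of IPv4 addresses printed, or -1 if the lookup fails. */
int print_host_info(const char *host) {
    struct hostent *he = gethostbyname(host);
    if (he == NULL) return -1;

    printf("Official name: %s\n", he->h_name);

    /* h_aliases is a NULL-terminated list of alternative names. */
    for (char **alias = he->h_aliases; *alias != NULL; alias++) {
        printf("Alias: %s\n", *alias);
    }

    /* h_addr_list is a NULL-terminated list of raw addresses. */
    int count = 0;
    if (he->h_addrtype == AF_INET) {
        for (char **addr = he->h_addr_list; *addr != NULL; addr++) {
            struct in_addr in;
            memcpy(&in, *addr, sizeof in);
            printf("Address: %s\n", inet_ntoa(in));
            count++;
        }
    }
    return count;
}
```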

Here’s a couple outputs. First, my blog here.

Fairly straightforward, nothing too weird going on.

However, let’s take a look at Google App Engine.

This one, what I thought was the actual host name is actually an alias. Visiting that appspot host uselessly returns me to the Google homepage, but the alias runs my Google App Engine application. I’m not going to pretend I totally understand what’s going on behind the scenes, but what I imagine is going on is that appspot host is the real app that, based on the host name given (the alias), executes the specific user’s app. Google chooses to handle the ownership of the app itself (by the given host name), whereas my web server (a shared host) is also the primary domain.

All that fun discovery aside, the only thing that matters to us is the h_addr_list. This is the IP address. It is stored as 4 bytes (unsigned char’s) in the case of IPv4. IPv6 is larger, but we don’t care about that right now. This is the data we need to open a socket.

Opening a Socket

Let’s open.

SOCK_STREAM is the TCP streaming protocol, and SOCK_DGRAM is the UDP datagram protocol. HTTP runs over TCP, so we open one of those. The 3rd argument to the socket() function can explicitly say “TCP” (IPPROTO_TCP) or “UDP” (IPPROTO_UDP), but pretty much all the code I’ve seen passes 0 there, which tells the system to pick the default protocol for the socket type. It’s weird, but there it is.
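Opening and connecting might look like the sketch below, using POSIX headers. connect_ipv4 is a made-up name; it takes the 4 address bytes from the host lookup and a port number (80 for plain HTTP).

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a TCP socket and connect it to ip:port.
   Returns the socket descriptor, or -1 on failure. */
int connect_ipv4(const unsigned char ip[4], unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* 0 = default protocol (TCP) */
    if (fd < 0) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);                /* port in network byte order */
    memcpy(&addr.sin_addr, ip, 4);              /* address is already in network order */

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```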

Closing a socket

…is Easy!
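The only wrinkle is the function name: BSD Sockets use close(), WinSock2 uses closesocket(). A small wrapper (close_socket is a made-up name) hides the difference.

```c
#ifdef _WIN32
#include <winsock2.h>
#else
#include <sys/socket.h>
#include <unistd.h>
#endif

/* Close a socket on either platform. Returns 0 on success. */
int close_socket(int fd) {
#ifdef _WIN32
    return closesocket((SOCKET)fd);
#else
    return close(fd);
#endif
}
```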

Sending the HTTP Request Header

Send me a website please!
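A sketch of the sending side, assuming a connected socket and a request string that has already been built. send() is allowed to accept only part of the buffer, so the hypothetical helper send_all loops until everything has gone out.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send exactly len bytes over fd. Returns 0 on success, -1 on error. */
int send_all(int fd, const char *data, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(fd, data + sent, len - sent, 0);
        if (n <= 0) return -1;      /* error, or nothing could be sent */
        sent += (size_t)n;
    }
    return 0;
}
```

Usage would be along the lines of send_all(fd, request, strlen(request)).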

Receiving the HTTP Response

Gimme!
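The simplest possible receive is a single recv() call into a fixed buffer. This is a sketch with a made-up name (recv_once); it NUL-terminates the data so it can be printed as text.

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Receive whatever one recv() call returns and NUL-terminate it.
   Returns the number of bytes received (0 means the peer closed), or -1 on error. */
long recv_once(int fd, char *buf, size_t size) {
    if (size == 0) return -1;
    ssize_t n = recv(fd, buf, size - 1, 0);   /* leave room for the terminator */
    if (n < 0) return -1;
    buf[n] = '\0';
    return (long)n;
}
```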

And the output looks something like this.

However, the above code is a cheat. It’s best case scenario, serving a small HTML webpage.

In practice, if there is any delay (a generated page), you will get several fragments of a page, one after the other.
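One way to cope with fragments is to keep calling recv() and append each piece to a buffer until the server closes the connection. The sketch below uses a growing buffer rather than the linked list mentioned earlier, and it assumes the server does close the socket when it is done (for example because the request said “Connection: close”). recv_until_close is a made-up name.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read from fd until the peer closes the connection.
   Returns a malloc'd, NUL-terminated buffer (caller frees) and stores its
   length in *out_len. Returns NULL on error. */
char *recv_until_close(int fd, size_t *out_len) {
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) return NULL;

    for (;;) {
        /* Grow the buffer when there is no room left for data plus terminator. */
        if (len + 1 >= cap) {
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) { free(buf); return NULL; }
            buf = bigger;
            cap *= 2;
        }
        ssize_t n = recv(fd, buf + len, cap - len - 1, 0);
        if (n < 0) { free(buf); return NULL; }
        if (n == 0) break;          /* peer closed: no more fragments */
        len += (size_t)n;
    }
    buf[len] = '\0';
    *out_len = len;
    return buf;
}
```

The result still contains the header and, possibly, chunk markers; it only solves the reassembly part.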

That is hardly a valid JSON file (it’s incomplete!).

This is one of the reasons why I’ve decided to use libcurl. You need code to extract information from an HTTP header, a decoder for HTTP chunked data, a decoder for data you know the size of, and so on. The work adds up, and it also needs to be well tested. Still, this is how you communicate via the HTTP protocol. It’s not hard. But I figure, I can spend a half day recording my findings, plus no more than an hour getting libcurl working. That, or a few days building and testing all the pieces necessary, then later discovering cases I don’t handle correctly. Not to mention, the above code also assumes a perfectly clean and working internet connection. It doesn’t handle errors at all (well, barely). That said, there may come a time I need to hack in/hack out a library I’m using, and I wanted to be much more aware of what actually is going on.

Sending Data via HTTP Get

You may be familiar with URL’s like the following.

Placing the strangely encoded data after a question mark in the URL is one way of passing data over HTTP. Rather, it’s the “GET” way. If you are familiar with PHP, a global variable $_GET is filled with these values. These are key/value pairs. A key “q” is equal to the value “chickens”. Additional arguments are separated by an “&”. Spaces are replaced by “+”. Other symbols are replaced with % codes (%20 = space [ascii 32], %28 %29 = parentheses [ascii 40,41], etc).
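The encoding rules can be sketched in a few lines of C. form_urlencode is a made-up name; the set of characters left untouched (letters, digits, “-”, “_”, “.”, “*”) is the one used by HTML form encoding, and everything else becomes a % code.

```c
#include <ctype.h>
#include <stddef.h>

/* Encode one key or value for use in a query string or form body.
   Returns the encoded length, or -1 if out is too small. */
int form_urlencode(const char *in, char *out, size_t out_size) {
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;
    if (out_size == 0) return -1;
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (isalnum(c) || c == '-' || c == '_' || c == '.' || c == '*') {
            if (o + 1 >= out_size) return -1;
            out[o++] = (char)c;             /* safe character: copy as-is */
        } else if (c == ' ') {
            if (o + 1 >= out_size) return -1;
            out[o++] = '+';                 /* space becomes '+' */
        } else {
            if (o + 3 >= out_size) return -1;
            out[o++] = '%';                 /* everything else: %XX */
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 15];
        }
    }
    out[o] = '\0';
    return (int)o;
}
```

Encoding “Hello World (Woo!)” this way gives “Hello+World+%28Woo%21%29”. Keys and values are encoded separately, then joined with “=” and “&”.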

HTML URL encoding reference: http://www.w3schools.com/tags/ref_urlencode.asp

Punching the following URL in to a browser sends 3 variables:

The equivalent HTTP Request header is:

And the result has fed the following 3 variables to the receiving webpage.

The amount of data you can send via HTTP GET is limited by the webservers themselves. If a header is too large, they will often raise an error. The average limit is about 8k, but it may be worth keeping the header under 4k because:

  • 4k – Nginx default
  • 8k – Apache default
  • 16k – IIS default

http://stackoverflow.com/questions/686217/maximum-on-http-header-values

If you need to send more than that, you can do so via an HTTP POST.

Sending Data via HTTP Post

The above is all about sending and receiving data as an HTTP GET request. Doing other requests is a simple matter of changing the word “GET” in the first line of the HTTP Request Header to something else (i.e. “POST”).

HTTP POST’s are nearly identical to HTTP GET’s, but they now include a data section. The data section follows the same rules as when you receive data via an HTTP Response. Another way to think about it: HTTP GET Request Headers are equivalent to HTTP HEAD Response Headers (i.e. headers only). HTTP POST Request Headers are equivalent to HTTP GET Response Headers (i.e. header+data). Hopefully that’s not confusing (Requests and Responses are different). If it is, just ignore it. Generally speaking, HTTP POST is what gives us full 2 way communication over the HTTP protocol.

Let’s dive in head first.

The HTTP POST Request header resembles the HTTP GET Response header more, in that we are now including a “Content-Type” of our own. The content type, “application/x-www-form-urlencoded”, just happens to be the name for the same encoding used by HTTP GET data passed after a ? in a URL. This is how HTML forms work. If the form is a GET form, it adds it to the URL. If the form is a POST form, it places it in the data section of the HTTP POST Request.

Notice the double newline (i.e. blank space). That means the header has ended. The rest of the file is pure data. Again, data is not limited in size like the header. Your “Content-Length” can be large if you want, and can be many many packets.
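A sketch of assembling a POST request in C follows. build_post_request is a made-up name, the body is assumed to be already URL encoded, and “Connection: close” is my own addition.

```c
#include <stdio.h>
#include <string.h>

/* Write an HTTP/1.1 POST request (header + form body) into buf.
   Returns the total length, or -1 if buf is too small. */
int build_post_request(char *buf, size_t size, const char *host,
                       const char *path, const char *body) {
    int n = snprintf(buf, size,
        "POST %s HTTP/1.1\r\n"
        "Host: %s\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        "Content-Length: %lu\r\n"   /* size of the data section in bytes */
        "Connection: close\r\n"     /* assumption: one request per connection */
        "\r\n"                      /* blank line: header ends here */
        "%s",                       /* the data section */
        path, host, (unsigned long)strlen(body), body);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

For a body too large for one buffer, the header and the data would be sent separately, with Content-Length still set to the full data size.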

So on that note, if one was to implement large transfers over HTTP POST, you would have to similarly break up packets in to smaller parts, giving the responsibility of reassembling them to the host (just like it’s your responsibility to assemble them on your side when you HTTP GET).

Phew!

Conclusion?

That about sums up my notes on doing HTTP requests from scratch…

…But like I said, I plan to use libcurl now.

libCurl

.. is easy. Heck, the library is called easy.

http://curl.haxx.se/libcurl/c/libcurl-tutorial.html

To Init, do:

If you’re playing nice with another networking library (eNet, for example), you may actually not want to do the above. There are details in the tutorial link above. I’m pretty sure the flag CURL_GLOBAL_WIN32 (part of CURL_GLOBAL_ALL) calls WSAStartup (which as mentioned in the beginning, needs to be called for sockets to work at all). So if eNet is going to do it for us, we don’t have to.

And then there are functions for adding things to the header, for encoding key/value pairs used in GET/POST data, and so on.

Sounds so much nicer than doing this entirely from scratch. 🙂

Real Conclusion

I do think a background in socket programming and HTTP requests makes everything about using libcurl make much more sense, and that is what makes this little research project of mine worthwhile. I now know what goes on inside the black box.