Theory matters

Notes from a student of Computer Science at the University of Pisa

Print all the strings made up of balanced parethesis and with lenght n

leave a comment »

From Antonio Gulli’s blog. This is Jamie‘s brilliant solution:

def par(n):
   assert (n >= 0)
   if n == 0:
       return ['']
   else:
       return ['(%s)%s' % (left,right) for i in range(0,n) for left in par(i) for right in par(n-i-1)]

Written by C.Santini

November 16, 2010 at 2:53 pm

Posted in Python, Quiz

Starting up with Django and 50$

with 17 comments


Yesterday I posted to proggit, hackernews and dzone my little weekend Python/Django project, hackurls.com, providing source code too, and I had unespected results.

I was bored of common feed readers layouts, and I couldn’t use anymore popurls.com that everytime shows billions of not Tech related news and flickering javascript tips.
So I decided to open my own version targeted to nerds and programmer, aggregating HN, proggit, hackaday and so on, to watch in one shot all the news I care. I used it for some days, adjusting little stuff here and there, and I found it very comfortable.

It’s quite amazing, in a day a hackurls got 15000 visits, 9000 unique visitors, 10 Euros of Adsense. I expected 10 visitors…
And people wrote me to tell they like it and yet use it as homepage, or to request new features they would like, or because they just needed help making my source run. A guy asked me if I could sell him the the whole piece of software. Opensource works.

Technically hackurls is just a python script running in crontab every X minutes on a Linode VPS (20$ for one month) that compiles an HTML page with a lots of news and thumbnails images got from feeds. The generated pages are served statically by a cherokee webserver. In this way a can manage a lots of users with a single vps. It costed me totally around 50$ with the domain registration.

Now I’ll try to go on improving the service, listening to users needs. Especially I’m actually trying to collect also programming jobs from many feeds; you can see a preview here: www.hackurls.com/jobs.html .

Soon I’ll include search functionality so you can find old news in all the news sources, and also find hacker jobs in tens of websites.

If you have any suggestion, write me to vuotomeccanico on gmail. Thank you all!

Written by C.Santini

March 18, 2010 at 2:09 pm

Posted in Uncategorized

Sniffing images in 350 lines

with 8 comments

For the Network Managment course, taught by professor Luca Deri (the author of the popular ntop), I wrote a little image sniffer with the pcap library.

The program listens to a network device and uses the Boyer-Moore algorithm Boyer-Moore-Horspool algorithm (thanks to Boneropolis for the correction) to retrive the starting/ending bytes of JPEG images, cutting off HTTP headers or similar from TCP packets payload, so you can have a rapid idea of what users are browsing on your network. The images are saved as N.jpg in the current directory, where N is a raising counter. The connections are stored and retrived with uthash, an open source C hash table.

Here the main C file, for the complete source download this tar.

/*
 * clasnif.c - a jpeg sniffer
 * Author: Claudio Santini (santinic@cli.di.unipi.it)
 */

#include <pcap.h>
#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netinet/if_ether.h>
#include <net/ethernet.h>
#include <netinet/ether.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <netinet/udp.h>
#include <netdb.h>

#include "uthash.h"
#include "clasnif.h"

/* #define DEBUG */
#define FILTER 	"tcp port 80"
#define MAX_CONNECTIONS 256
#define MAX_LEN 65535
#define TIMEOUT_CHECK 10 /* every TIMEOUT_CHECK packets we look for dead connections */
#define TIMEOUT 5 /* after TIMEOUT seconds a connection is deleted from the hashmap */

u_char jpeg_soi[] = { 0xFF, 0xD8 };						/* Jpeg image start */
u_char jpeg_eoi[] = { 0xFF, 0xD9 };						/* Jpeg image end */
u_char http_header_end[] = { 0x0d, 0x0a, 0x0d, 0x0a };	/* CRLF (\r\n\r\n) */

conn_t * connections = NULL;						/* The hash table, see uthash.h */
int packet_offset = sizeof(struct ether_header); 	/* ip packet offset */
int counter = 0;									/* number of found images */
int packets = 1;
int timeout_counter = 1;
int datalink;

void hex(char *buf, int size)
{
#ifdef DEBUG
	if(size < 4) return;
	fprintf(stderr, ">> %x %x ... %x %x\n", (u_char)buf[0], (u_char)buf[1],
			(u_char)buf[size-2], (u_char)buf[size-1]);
#endif
}

/* Print only if DEBUG is defined */
inline int mlog(char * fmt, ...)
{
#ifdef DEBUG
	int r;
	va_list l;
	va_start(l, fmt);
	r = vfprintf(stderr, fmt, l);
	va_end(l);
	return r;
#endif
}

void save_image_list(list_t * list)
{
	FILE *file;
	char filename[256];

	sprintf(filename, "%d.jpg", counter);
	file = fopen(filename, "w");
	if(file == NULL) {
		perror("Can't open file");
		exit(EXIT_FAILURE);
	}
	while(list != NULL) {
		fwrite(list->buf, 1, list->size, file);
		list = (list_t*)list->next;
	}
	fclose(file);
	counter++;
}

/* Deallocate conection buffers */
void conn_free(conn_t * c)
{
	list_t * list, * tmp;

	if(c == NULL || c->head == NULL) return;
	for(list = c->head; list != NULL; ) {
		tmp = list;
		list = (list_t*)list->next;
		free(tmp->start);
		free(tmp);
	}
	free(c);
}

/* Deallocate dead connections */
void check_dead_connections()
{
	conn_t * c, * tmp;
    time_t now;
    double diff;

	time(&now);

	if(connections == NULL) return;
	for(c = connections; c != NULL; ) {
		diff = (time_t)(now - c->time);
		if(diff >= TIMEOUT) {
			mlog("### now=%d c->time=%d diff=%f\n", (int)now, (int)c->time, diff);
			mlog("### deleted for timeout\n");
			tmp = c->hh.next;
			HASH_DEL(connections, c);
			conn_free(c);
			c = tmp;
		}
		else {
			c = c->hh.next;
		}
    }
}

/* Finds substring needle in haystack with the Boyer-Moore-Horspool algorithm */
static unsigned char * memstr(const unsigned char *haystack, const size_t hlen,
		const unsigned char *needle, const size_t nlen)
{
    int skip[256], k=0;
    if (nlen == 0) return (unsigned char*)haystack;

    for(k=0; k < 255; ++k) skip[k] = nlen;
    for(k=0; k < nlen - 1; ++k) skip[needle[k]] = nlen - k - 1;

    for(k = nlen - 1; k < hlen; k += skip[haystack[k]]) {
        int i, j;
        for(j = nlen - 1, i = k; j >= 0 && haystack[i] == needle[j]; j--) i--;
        if(j == -1) return (unsigned char*)(haystack + i + 1);
    }
    return NULL;
}

void process_packet(u_char *args, const struct pcap_pkthdr *header, const u_char *packet)
{
	struct ether_header *ether;			/* ethernet header */
	struct ip *ip;						/* ip header */
	struct tcphdr *tcp;					/* tcp header */
	int off;
	int hlen;							/* lenght of the header (eth + ip + tcp)*/
	int datalen;						/* length of packet data */
	int size;
	conn_t *conn;						/* generic connection struct */
	list_t *list, *new;
	unsigned char *start, *end, *data;	/* memstr return value */
	hashkey_t key;						/* key for the hasmap */

#ifdef DEBUG
	mlog("packet %d\n", packets);
	packets++;
#endif

	timeout_counter++;
	if(timeout_counter == TIMEOUT_CHECK) {
		check_dead_connections();
		timeout_counter = 1;
	}

	off = ((113 == datalink) ? 2 : 0);
	ether = (struct ether_header*)(packet+off);
	if(ntohs(ether->ether_type) == ETHERTYPE_IP) {
		hlen = sizeof(struct ether_header) + off;
		ip = (struct ip*)(packet + hlen);
		hlen += sizeof(struct ip);
		tcp = (struct tcphdr*) (packet + hlen);
		hlen = hlen + (tcp->doff << 2);
		datalen = header->len - hlen;

		if(datalen < 80) {
			/* This is not for sure an HTTP packet with interesting data */
			mlog("----------------------------------\n");
			return;
		}

		mlog("ether: "); hex((char*)ether, sizeof(struct ether_header));
		mlog("ip: "); 	hex((char*)ip, sizeof(struct ip));
		mlog("tcp: ");	hex((char*)tcp, sizeof(struct tcphdr));
		mlog("header->len=%d hlen=%d datalen=%d\n", header->len, hlen, datalen);

		if(datalen > 0) {
			data = (unsigned char*)malloc(datalen);
			if (data == NULL) {
				perror("Cannot alloc memory");
				exit(EXIT_FAILURE);
			}
			memcpy(data, packet + hlen, datalen);
			mlog("data: ");	hex((char*)data, datalen);
			/* Generate the key for the hashmap */
			key.src = ip->ip_src;
			key.dst = ip->ip_dst;
			key.src_port = tcp->source;
			key.dst_port = tcp->dest;
		    HASH_FIND(hh, connections, &key, sizeof(hashkey_t), conn);
			if(conn == NULL) {
				/* If we find jpeg starting bytes we create a new connection */
				start = memstr(data, datalen, jpeg_soi, 2);
				if(start != NULL) {
					conn = (conn_t*)malloc(sizeof(conn_t));

					if(data[datalen-2] == jpeg_eoi[0] && data[datalen-1] == jpeg_eoi[1]) {
						/* All the (little) image is contained in one segment */
						mlog("tutta l'immagine in un segmento\n");
						conn->head = (list_t*)malloc(sizeof(list_t));
						if(conn->head == NULL) {
							perror("Cannot alloc memory");
							exit(EXIT_FAILURE);
						}
						memcpy(&conn->key, &key, sizeof(hashkey_t));
						conn->head->size = data+datalen - start;
						conn->head->buf = start;
						conn->head->next = NULL;
						conn->head->start = data;
						time(&(conn->time));
						save_image_list(conn->head);
						fprintf(stderr, "Image saved in %d.jpg\n", counter-1);
						conn_free(conn);
					}
					else {
						memcpy(&conn->key, &key, sizeof(hashkey_t));
					    HASH_ADD(hh, connections, key, sizeof(hashkey_t), conn);
						conn->head = (list_t*)malloc(sizeof(list_t));
						if(conn->head == NULL) {
							perror("Cannot alloc memory");
							exit(EXIT_FAILURE);
						}
						conn->head->next = NULL;
						conn->head->buf = start;
						conn->head->start = data;
						conn->head->size = data+datalen - start;
						mlog( "salvo: ");
						hex((char*)conn->head->buf, conn->head->size);
						time(&(conn->time));
					}
				}
			}
			else {
				start = memstr(data, datalen, http_header_end, sizeof(http_header_end));
				if(start != NULL) {
					mlog("http header end found at %d\n", (int)start);
					start += sizeof(http_header_end);
				}
				else {
					start = data;
				}

				end = memstr(data, datalen, jpeg_eoi, 2);
				end += 2; /* jpeg_eoi bytes */

				/* If jpeg_eoi bytes are not the end of the segment, then
				 * "end" is not the real end of the image */
				if(end == NULL || data+datalen != end) {
					/* If we don't find the end of the image we continue storing
					 * untill MAX_IMAGE bytes */
					mlog("C'è la connesione ma non l'eoi size=%d\n", datalen);
					/* Go through the list untill the end */
					list = conn->head;
					while(list->next != NULL) {
						list = (list_t*)list->next;
					}
					new = (list_t*)malloc(sizeof(list_t));
					list->next = (struct list_t*)new;
					new->buf = start;
					new->start = data;
					new->size = data+datalen - start;
					mlog("salvo in lista: ");
					hex((char*)new->buf, new->size);
					time(&(conn->time));
					new->next = NULL;
				}
				else {
					mlog("fine dell'immagine data=%d end=%d data+datalen=%d\n", (int)data, (int)end, (int)(data+datalen));
					list = conn->head;
					while(list->next != NULL) {
						list = (list_t*)list->next;
					}
					new = (list_t*)malloc(sizeof(list_t));
					if(new == NULL) {
						perror("Cannot alloc memory");
						exit(EXIT_FAILURE);
					}
					size = end - start;
					new->buf = (u_char*)malloc(size);
					new->start = new->buf;
					if(new->buf == NULL) {
						perror("Cannot alloc memory");
						exit(EXIT_FAILURE);
					}
					memcpy(new->buf, data, size);
					free(data);
					new->size = size;
					new->next = NULL;
					list->next = (struct list_t*)new;
					save_image_list(conn->head);
					printf("Image saved in %d.jpg\n", counter-1);
					HASH_DEL(connections, conn);
					conn_free(conn);
				}
			}
		}
	}
	mlog("----------------------------------\n");
}

int main(int argc, char *argv[])
{
	char *dev;								/* Device to sniff on */
	char errbuf[PCAP_ERRBUF_SIZE];			/* Error string */
	pcap_t *handle;							/* Session handle */
	struct bpf_program filter;				/* The compiled filter */

	if(argc > 1) {
		dev = argv[1];
	}
	else {
		dev = pcap_lookupdev(errbuf);
		if (dev == NULL) {
			fprintf(stderr, "Couldn't find default device: %s\n", errbuf);
			return(2);
		}
	}
	printf("Device: %s\n", dev);
	handle = pcap_open_live(dev, MAX_LEN, 0, 10000, errbuf);
	if(handle == NULL) {
        if(getuid() != 0)
            fprintf(stderr, "Perhaps you need to be root?\n");
		else
			fprintf(stderr, "Couldn't open device %s: %s\n", dev, errbuf);
		return (2);
	}
	if(pcap_compile(handle, &filter, FILTER, 1, 0) == -1) {
		fprintf(stderr, "Couldn't parse filter %s: %s\n", FILTER,
				pcap_geterr(handle));
		return (2);
	}
	if(pcap_setfilter(handle, &filter) == -1) {
		fprintf(stderr, "Couldn't install filter %s: %s\n", FILTER,
				pcap_geterr(handle));
		return (2);
	}
	datalink = pcap_datalink(handle);
	pcap_loop(handle, -1, process_packet, NULL);
	return(0);
}

Written by C.Santini

January 23, 2010 at 11:25 am

Posted in C

Why is e^(pi*i) = -1 ?

leave a comment »

Written by C.Santini

October 30, 2009 at 5:55 pm

Posted in Uncategorized

Teachers matter, not shiny new buildings

leave a comment »

Dr. Tae — Building A New Culture Of Teaching And Learning from Dr. Tae on Vimeo.

Learning = Work your ass off until you figure it out

Written by C.Santini

July 11, 2009 at 12:55 pm

Posted in Uncategorized

HTML Pattern Matching in Python

with 9 comments

As if I would’t have enough to do, I started playing with Python, and I loved it: It’s kept simple, it’s powerful, and you don’t have to learn it. I started programming seriously in Python after two hours of command line tests. Everyone who has seen a programming language yet knows Python.I really felt the absence of serious functional programming support and a kind of OCaml-like matching but… it’s another story.

I often need to grasp data from html pages, and I often do that with Java and HTMLParser… but this time I tried with Python because one of my friend told me how beautiful the soup is.

When facing the problem of getting a lot of data from html pages, I always dream a sort of pattern matching for HTML: I want to be able to get “this kind of tag sequence” inside a page. So I created a library to do that for me based on BeautifulSoup, and it’s impressively effective and easy to use!

Theory

An HTML page has a tree of nodes, called DOM, where each node is a tag or a string (this is how BeautifulSoup see it). So we have the DOM of a page and the tree of a pattern, and we want to find all the occurrences of that pattern in the page.

HTML Pattern

Now the question is: how do we find that two nodes matches ? Well, for my purposes I defined that two nodes are similar if they have the same tag name and if the input node (from the page) matches AS STRING, at least all the attributes of the pattern node. Ok, what does mean that two strings matches ? I defined a simple syntax that worked good for all my targets, and you’ll see it in the following example.

Practice

Let’s suppose you have a page with a list of videos (page.html), and you want to get all the videos:

<html>
<head><title>Example</title></head>
<body>
<div class="video">
		<a href="watch?v=0001">Title first video</a><img src="preview1.jpg"/></div>
<div class="video">
		<a href="watch?v=0002">Title second video</a><img src="preview2.jpg"/></div>
<div class="video">
		<a href="watch?v=0003">Title third video</a><img src="preview3.jpg"/></div>
...
</body>
</html>

You just have to copy the tags structure of one video and give it to my python script as pattern, substituting the variables you want, like this (pattern.html):

<div class="video"><a href="watch?v=$code$">$title$</a><img src="$preview$"/></div>

So, just put $variable$ where you want (be as much restrictive as you can to avoid ambiguities). Now if you run the script you get:

claudio@laptop:~$ ./htmlmatch.py index.html pattern.html
code: 0001
title: The first video
preview: preview1.jpg

code: 0002
title: The second video
preview: preview2.jpg

code: 0003
title: The third video
preview: preview3.jpg

.

You can easily access all these filed using the library as a function in your python code and iterating the list (of dictionaries) it gives you back. For example:

page = urllib2.urlopen("http://www.your_video_website.com/")
pattern = open("pattern.html", "r")
matches = htmlmatch(page, pattern)
for map in matches:
	for k, v in map.iteritems():
		print k, v
	print

The Source

I suggest you to use BeautifulSoup version 3.0.7a, because it behaves better with real world HTML (newer versions are based upon the new python HTML parser which is not very improved).

Click here to download htmlmatch-0.1.py as file

#!/usr/bin/env python

import sys
import urllib2
from BeautifulSoup import BeautifulSoup, Tag, NavigableString
import HTMLParser

def htmlmatch(page, pattern):
	"""Finds all the occurrencies of the pattern tree into the given html page"""
	isoup = BeautifulSoup(page)
	psoup = BeautifulSoup(pattern)

	def untiltag(gen):
		node = gen.next()
		while True:
			if isinstance(node, Tag):
				break
			elif len(node.lstrip()) == 0:
				node = gen.next()
			else:
				break
		return node

	pgen = psoup.recursiveChildGenerator()
	pnode = untiltag(pgen)
	igen = isoup.recursiveChildGenerator()
	inode = untiltag(igen)

	variables = []
	lastvars = {}

	while True:
		newvars = nodematch(inode, pnode)
		if newvars != None:
			if len(newvars) > 0:
				lastvars.update(newvars)
			try:
				pnode = untiltag(pgen)
			except StopIteration:
				pgen = psoup.recursiveChildGenerator()
				pnode = untiltag(pgen)
				if len(lastvars) > 0:
					variables.append(lastvars)
					lastvars = {}
		else:
			pgen = psoup.recursiveChildGenerator()
			pnode = untiltag(pgen)
		try:
			inode = untiltag(igen)
		except StopIteration:
			if variables != None:
				return variables
			return None
	return variables

def nodematch(input, pattern):
	"""Matches two tags: returns True if the tags are of the same kind, and if
	the first tag has AT LEAST all the attributes of the second one
	(the pattern) and if these attributes match as strings, as defined in
	strmatch function."""
	if input.__class__ != pattern.__class__:
		return None
	if isinstance(input, NavigableString):
		return strmatch(input, pattern)
	if isinstance(input, Tag) and input.name != pattern.name:
		return None
	variables = {}
	for attr, value in pattern._getAttrMap().iteritems():
		if input.has_key(attr):
			newvars = strmatch(input.get(attr), value)
			if newvars != None:
				variables.update(newvars)
			else:
				return None
		else:
			return None
	return variables

def strmatch(input, pattern):
	"""Matches the input string with the pattern string. For example:

	input: 	 "python and ocaml are great languages"
	pattern: "$lang1$ and $lang2$ are great languages"

	gives as output the map:
	{"lang1": "python", "lang2": "ocaml"}

	The function returns None if the strings don't match."""
	var, value = None, None
	i, j = 0, 0
	map = {}
	input_len = len(input)
	pattern_len = len(pattern)
	while i < input_len:
		if var == None:
			if pattern[j] == '$':
				var = ""
				value = ""
				j += 1
			elif input[i] != pattern[j]:
				return None
			else:
				i += 1
				j += 1
		else:
			while pattern[j] != '$':
				var += pattern[j]
				j += 1;
			j += 1
			if j == pattern_len:
				while i < input_len:
					value += input[i]
					i += 1
			else:
				while i < input_len and input[i] != pattern[j]:
					value += input[i]
					i += 1
			i +=1
			j +=1
			map[var] = value
			var = None
	return map

def main(argv):
	if len(argv) < 2:
		print "example: ./htmlmatch.py input.html pattern.html"
		return
	page = open(argv[0], "r")
	pattern = open(argv[1], "r")
	l = htmlmatch(page, pattern)
	for m in l:
		for k, v in m.iteritems():
			print k, v
		print

if __name__ == "__main__":
    main(sys.argv[1:])

Written by C.Santini

June 4, 2009 at 4:19 pm

JavaFX Crayon Physics

with 4 comments

I definitely overestimated my free time :)

In these months, while studying for exams, I’ve started programming a JavaFX clone of Crayon Physics, an impressive game based on the 2D physical engine Box2D made by Erin Catto. The goal was to have a final project for the Graphical Interfaces course, and to participate the latest JavaFX contest.

Try Pastello!

I’ve uploaded the preview version of the game with some playable levels that I’ve submitted to the contest. I called it “Pastello” (the italian for crayon).

pastello

pastello3

The physics engine

I tried two different Java physics engines: JBox2d and Phys2D but, in the end, I used Phys2d because it’s more Java-like (JBox2d maintains the same Box2d C++ structure), and because have better convex polygon support (but it’s quite slower, as shown here: http://ciardhubh.de/node/15).

I toke inspiration from the JavaFX-Phys2D interaction shown in this post, but I extended the original idea of PhysicalObject/PhysicalScene, making it a bit more flexible with an object hierarchy of different physical shapes, which I think can be easily reused as it is. But the fundamental idea is just the same: there is a JavaFX timeline that periodically calls the step() function of the physical world model and updates the graphical rapresentation of all the objects.

The level editor

Yes, there is a level editor that lets you create levels and load/store them as JSON files (I’m nauseated by XML -_- ), for example this is the first level generated by the editor:

[{"name":"Text","x":220,"y":120,"content":"text..."},
{"x":120,"y":130,"name":"Ball", "isFake":true, "radius":40},
{"w":588,"isStatic":true,"name":"Box","h":67,"y":373.5,"x":392},
{"isStatic":false,"name":"Ball","radius":20,"isRedBall":true,"y":320.010009765625,"x":158},
{"name":"Flag","y":304,"x":621}]

To do that I used the json.org java library because it’s very easy to use (also from JavaFX) and I was not very satisfied by JavaFX PullParser.

About JavaFX

I’ve found some difficulties, but I’ve been also very surprised by elegant solution you can acheive with functional programming applied to graphics. An example:

var visualPolygon = Polygon {
	points: bind [
		firstPick.x, firstPick.y,
		curPick.x, curPick.y,
		if(curPick.y > firstPick.y) [firstPick.x, curPick.y]
		else [curPick.x, firstPick.y]
	]
}

This means that visualPolygon is a Polygon (JavaFX graphical node) whose points are defined by a Sequence (a sort of JavaFX mutable list) made like this: s = (x_0, y_0, x_1, y_1, ..., x_n, y_n). The funny fact is that the third couple of points is defined by an if…else statement (a function). The funniest fact is that the points are bound with the Sequence expression: every time any of the involved variables are changed (firstPick and curPick), the vertices of the polygon change. Binding is not efficent, and should be used as less as possible, maybe because it has not yet been improved in JavaFX, but it is extremely practical.

Credits

Written by C.Santini

May 28, 2009 at 4:14 pm

Posted in JavaFX

Tagged with , ,

Follow

Get every new post delivered to your Inbox.