During our spare time, we (Michael and I) are playing a bit with the potential "social" web (in other words, we are just trying to extract some useful information from a bunch of bloody web pages). In that scope, we have to collect, mangle, analyze and evaluate a lot of web pages. During the evaluation we may come up with something new, only to realize that we forgot to collect some important data when crawling the URLs.
We discussed the possibility of writing a small, lightweight framework that could operate partially like the big processing frameworks (e.g. MapReduce or Hadoop). Those frameworks often operate in the same way (as a lot of operations can be expressed like that): a large dataset is split into small datasets across multiple nodes, a map function is given to process the small datasets with user-specified operations, and afterwards the partial results are reduced and compiled into a uniform result. Such a framework is composed of multiple elements like a job scheduler system (to dispatch the tasks across the nodes), a distributed file system (to efficiently distribute the datasets and also… write the results), a communication system between the nodes… So developing such a framework is very complex and can be time consuming.
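Just to make the pattern concrete, here is a toy sketch of the split/map/reduce idea in plain Python (my own single-machine illustration of the pattern, not how Hadoop is implemented), counting words over chunks of lines:

def split(lines, chunksize):
    # split the large dataset into small datasets
    for i in range(0, len(lines), chunksize):
        yield lines[i:i + chunksize]

def map_chunk(chunk):
    # user-specified operation applied to one small dataset
    counts = {}
    for line in chunk:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_counts(partials):
    # compile the partial results into a uniform result
    total = {}
    for counts in partials:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

print(reduce_counts([map_chunk(c) for c in split(["a b a", "b c"], 1)]))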
In that scope, I was looking for a way to restart my crawling and processing efficiently using a very simple process. I made a very quick-and-dirty(tm) prototype to do that job. So it's a 2 hours experiment, but with the idea behind it of building a potential lightweight framework… Here are the basic steps:
I made a basic interface around GNU screen (the excellent terminal multiplexer) to handle the -X option via a simple Python class (svn). Like that, the whole job is managed inside a single screen session.
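The class is essentially a thin layer over the screen command line. Here is a minimal sketch of the idea (the real class is in the svn repository; this version only mirrors the two calls used by the master below, the constructor and addWindow):

import subprocess

class TermGnuScreen:
    # Sketch of a wrapper around "screen -X"; the actual svn class may differ.

    def __init__(self, sessionname):
        self.sessionname = sessionname
        # start a detached screen session that will host all the task windows
        subprocess.call(["screen", "-dmS", self.sessionname])

    def addWindow(self, title, cmdline):
        # ask the running session to open a new window executing cmdline
        subprocess.call(["screen", "-S", self.sessionname, "-X",
                         "screen", "-t", title] + cmdline.split())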
The second part is the webcrawler (the tasks) that will collect the URLs. The webcrawler is a very simple HTTP fetcher, but it includes the ability to retrieve the URLs from a URL list at a specific offset. The webcrawler (svn) can be called like that:
python crawl.py -s ./urlstore2/ -f all-url.txt.100000.sampled -b 300 -e 400
Where -b is the offset of the first URL to fetch and -e the offset of the last one, taken from the URL file specified with -f.
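The offset handling is the only non-trivial part. The fetch loop looks roughly like this sketch (Python 2, as in the original scripts; it ignores the -s store and the -i identifier that the real crawl.py handles):

import urllib2

def crawl_range(urlfile, begin, end):
    # read the URL list once and keep only the slice [begin, end] (1-indexed)
    urls = [line.strip() for line in open(urlfile)]
    for n, url in enumerate(urls[begin - 1:end], begin):
        try:
            data = urllib2.urlopen(url).read()
        except Exception:
            continue   # a simple fetcher: failed URLs are just skipped
        # ... here the page would be written into the URL store (see below)
        print("%d of %d" % (n, end))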
The last part is the master managing the tasks (the multiple crawl.py instances to be called):

import TGS   # the GNU screen wrapper module described above (name assumed from the TGS. prefix)

cmd = './crawl.py'
p_urllist = ' -f '
urllist = "all-url.txt.100000.sampled"
p_urlstore = ' -s '
p_start = ' -b '
p_end = ' -e '
p_dacid = ' -i '
urlstore = "./urlstore2"
tasknum = 25
jobname = 'crawlersample'

lineNum = numLine(urllist)           # numLine() (defined elsewhere in master.py) counts the URLs in the list
job = TGS.TermGnuScreen(jobname)     # a single screen session for the whole job
jobid = 0

for x in range(0, lineNum, lineNum/tasknum):
    end = x + (lineNum/tasknum)      # chunk size per task (Python 2 integer division)
    dacid = jobname + ",slave1," + str(jobid)
    if end > lineNum:
        end = lineNum
    cmdline = cmd + p_urllist + urllist + p_urlstore + urlstore \
            + p_start + str(x+1) + p_end + str(end) + p_dacid + dacid
    job.addWindow("jobname" + str(jobid), cmdline)
    jobid = jobid + 1
So if you run master.py (svn), the tasks will be launched inside a single screen session, and you can use screen to follow the progress of each task.
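For instance, with the default key bindings you can attach to the job's session and hop from one task window to the next:

screen -r crawlersample    # attach to the session created by master.py
# then Ctrl-a n / Ctrl-a p to cycle through the task windows,
# or Ctrl-a " to get the list of windows

Each crawler window reports its progress with a line like: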
dacid:crawlersample,slave1,24 2956 of 3999
That means that task 24 of the job crawlersample has already retrieved 2956 URLs out of 3999. In this experiment we are quite lucky, as we are using a simple hash-based store for the crawled pages. As the URLs are unique per task, there is no issue with all the tasks using the same local repository.
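The store is nothing fancy: the idea is simply to derive the file name from a hash of the URL, so tasks working on disjoint URL ranges can never write to the same file. A sketch of the idea (not the actual store used by crawl.py):

import hashlib
import os

def store_page(storedir, url, data):
    # the file name is the hash of the URL, so distinct URLs never collide
    # and concurrent tasks can safely share the same directory
    name = hashlib.sha1(url).hexdigest()
    out = open(os.path.join(storedir, name), "wb")
    out.write(data)
    out.close()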
It's clear that this experiment could be written as a multi-process or multi-threaded program, but the purpose is to be able to manage the distribution of the tasks across different nodes. Another advantage of the multiple-tasks approach is that we can stop a specific task without impacting the rest of the job and the other tasks.
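For example (assuming the window titles set by master.py above), killing a single task is just a matter of addressing its screen window:

screen -S crawlersample -p jobname5 -X kill   # stop task 5, the other tasks keep running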
It was just an experiment… I don't know if we will come out at the end of the exercise with a more useful framework ;-) But it was already useful: I discovered that GNU Screen has a hard-coded limit on the number of windows per session (40 windows maximum). You can easily change that in config.h and recompile it.