Keep in mind that this is only a very rough draft, open for discussion.
The protocol works in two steps. The initial step obtains a uid, either because the extension has none or because the user wants to regenerate a new one. When a uid is requested (confidentiality should be assured), the server returns the uid together with a shared secret linked to that uid. The second step submits ratings, each authenticated with an HMAC computed with that secret.
With the HMAC approach, the only problem we still face is the replay of the same rating for a website. The problem is mitigated when the scoring is aggregated ("lifted") per GUID, because an attacker can only replay a rating that already exists.
There are two approaches: a direct one and an indirect one.
The direct approach adds an additional security layer to uid generation, e.g. by requiring the user to solve a CAPTCHA when requesting a new uid, or by requiring the user to register.
The indirect approach blacklists IP addresses that generate many uids over time and whose uids are not used to submit ratings. This approach (fixing the symptom but not the problem) has several disadvantages, though: we still waste uids unnecessarily until the fake requester is blacklisted, and we end up in an arms race with the abusers.
A hybrid approach could be an automated challenge/response step (requiring no user intervention) that is computationally expensive enough to hinder attacks by uid abusers.
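To make the hybrid idea more concrete, here is a minimal sketch of a hashcash-style proof of work. Nothing in this draft fixes the actual scheme: the use of SHA-256, the `challenge:nonce` message layout, the difficulty expressed in leading zero bits, and all function names are assumptions made for illustration only.

```python
import hashlib
import itertools
import secrets


def make_challenge():
    """Server side: create a random challenge string (hypothetical format)."""
    return secrets.token_hex(16)


def leading_zero_bits(digest):
    """Count the number of leading zero bits in a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge, difficulty):
    """Client side: search for a nonce whose hash has enough leading zero bits."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge, nonce, difficulty):
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty
```

The point of such a design is the asymmetry: the client needs about 2^difficulty hash attempts on average, while the server verifies with one. A single uid request stays cheap for a real user, but mass-generating uids becomes expensive.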
First, our basic assumption is that human users will report web pages to the central service, and not automated applications. In particular, we don't want automated applications to file user reports.
Thus it's reasonable to assume that a human user cannot submit more than X ratings per Y time period (consider the time needed to download a web page, view and identify it, and click the button to rate it). We can use this as a starting point to identify rating bots* and to spot patterns of incorrect or irregular user behavior, e.g. mis-clicks such as a user wanting to rate a web page as A but clicking the rate-as-B button, and then immediately clicking the rate-as-A button to "correct" the previous error.
*Of course, intelligent rating bots could simply be patient and submit ratings slowly to the service. These subtle bots are a much bigger problem, just as small inaccuracies are much harder to identify on Wikipedia than large acts of vandalism.
See the question "How do you deal with large submissions of ratings from the same uid?" above. In this case, we assume that a mis-click happened and therefore count only the last rating, dropping all previous ratings of the URL from the same UID in the set of recent rating submissions.
Example: We receive four rating submissions for the same URL x by UID y in one minute. We keep only the last rating submission for URL x and drop the previous three. Ratings already stored in the service database are not affected (at least at this point).
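The "X ratings per Y time period" check could be sketched as a sliding window per uid. This is only an illustration: the class name `RateMonitor`, the in-memory storage, and passing timestamps in explicitly are assumptions, and the values of X and Y are left open as in the text above.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Track submission times per uid over a sliding time window."""

    def __init__(self, max_ratings, period_seconds):
        self.max_ratings = max_ratings   # X in the text above
        self.period = period_seconds     # Y in the text above
        self._times = defaultdict(deque)

    def record(self, uid, now):
        """Record a submission at time `now` (seconds).

        Returns True while the uid stays within X ratings per Y seconds,
        False once it exceeds the limit (a candidate rating bot).
        """
        times = self._times[uid]
        times.append(now)
        # Forget submissions that have fallen out of the window.
        while times and times[0] <= now - self.period:
            times.popleft()
        return len(times) <= self.max_ratings
```

As the footnote says, a patient bot that stays below the limit would pass this check, so it can only be a first filter.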
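The example above can be sketched as follows. The tuple layout `(timestamp, uid, url, vote)` and the function name are hypothetical; the sketch only covers the set of recent submissions and does not touch ratings already stored in the database.

```python
def keep_last_ratings(recent):
    """Keep only the most recent rating per (uid, url) pair.

    `recent` is a list of (timestamp, uid, url, vote) tuples taken from
    the set of recent rating submissions.
    """
    latest = {}
    for timestamp, uid, url, vote in sorted(recent, key=lambda r: r[0]):
        # A later submission overwrites any earlier one for the same pair.
        latest[(uid, url)] = (timestamp, uid, url, vote)
    return sorted(latest.values(), key=lambda r: r[0])


recent = [
    (10, "y", "x", "B"),
    (12, "y", "x", "A"),
    (30, "y", "x", "B"),
    (55, "y", "x", "A"),
]
kept = keep_last_ratings(recent)  # only the submission at t=55 survives
```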
Important: use SSL/TLS for the uid request.
Both POST and GET are valid.
./uid.pl?action=create
uid=B0470602-A64B-11DA-8632-93EBF1C0E05A; key=itMzPcvEJyLk5ZDfA3Ce2Tknsske6z4rsxy1axZmof0=
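On the server side, generating such a pair could look like the sketch below. The draft does not say how the uid and key are produced, so the choices here are assumptions: a random (version 4) UUID written in upper case, and a key of 32 random bytes encoded as Base64, which gives the same 44-character length as the key in the example response. The function names are hypothetical.

```python
import base64
import secrets
import uuid


def create_uid_and_key():
    """Return a new (uid, key) pair for a client."""
    # Assumption: a random UUID, upper-cased like the example above.
    uid = str(uuid.uuid4()).upper()
    # Assumption: 32 random bytes from a CSPRNG, Base64-encoded.
    key = base64.b64encode(secrets.token_bytes(32)).decode("ascii")
    return uid, key


def format_response(uid, key):
    """Format the pair the way the example response does."""
    return f"uid={uid}; key={key}"
```

The server would also have to store the pair, since it needs the key later to check the HMAC of each rating submitted under that uid.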
(shown as a GET for readability, but POST is used)
Version string: "1.0"
./add-rating.pl?uid=B0470602-A64B-11DA-8632-93EBF1C0E05A&url=urlsafeBase64(url)&class=foobar&vote=p&auth=HMAC(uid+urlsafeBase64(url)+class+vote)&protocol=1.0&client=firefox-extension-1.0
(the client parameter is recommended but optional)
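A minimal sketch of how the auth parameter could be computed and checked, following the formula HMAC(uid+urlsafeBase64(url)+class+vote) from the request above. Several details are not fixed by this draft and are assumptions here: SHA-256 as the hash function, decoding the Base64 key before use, and hex encoding of the result. The function names are hypothetical.

```python
import base64
import hashlib
import hmac


def sign_rating(key_b64, uid, url, rating_class, vote):
    """Client side: compute the auth value for a rating submission."""
    key = base64.b64decode(key_b64)  # assumption: key is used in decoded form
    url_b64 = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
    message = (uid + url_b64 + rating_class + vote).encode("utf-8")
    # Assumption: HMAC-SHA256, hex-encoded.
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_rating(key_b64, uid, url, rating_class, vote, auth):
    """Server side: recompute the HMAC with the stored key and compare."""
    expected = sign_rating(key_b64, uid, url, rating_class, vote)
    # Constant-time comparison avoids leaking how many characters match.
    return hmac.compare_digest(expected, auth)
```

Because the server looks up the key by uid, a rating whose vote, class, or URL was altered in transit fails verification, while an unmodified replay of a valid request still passes, which is the replay problem discussed earlier.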
We use a basic SQL table to store the "verified" ratings. The table currently has the following format (not yet space-efficient):
CREATE TABLE rating (
    class      TEXT,
    vote       TEXT,
    url        TEXT,
    uid        TEXT,
    ip         TEXT,
    client     TEXT,
    protocol   INTEGER,
    referer    TEXT,
    datesubmit TIMESTAMP,
    source     TEXT,
    state      INTEGER
);
The pre-rating stage uses temporary storage that feeds the "final" storage. One process queries entries from the temporary storage and sets their proc field to a specific value. Another process scans the table for records whose proc field has that value and deletes them. Nothing more.
→ If all the tests are successful, we allow the data to be pre-stored.