According to Christina Warren at mashable.com, the switch to allowing non-Latin alphabet characters in web domains could give scammers a brand new toolkit. That's because browsers can't render many non-Latin characters, and the approximations may be doppelgangers for trusted sites. Alternatively, an address in an alphabet like Cyrillic, which shares certain letterforms with the Latin alphabet, can appear indistinguishable from pre-existing Latin-alphabet addresses:
Uh-oh.
It's only fair that users of different alphabets get to register their own addresses, but clearly there needs to be some kind of fix here, either from ICANN or from the tech side. But Warren cites a TimesOnline article from two days ago in which it sounds like no one is really taking a hard look at solutions.
In fact, people have been working on this since at least as far back as 2001.
Also, take note of who exactly is being quoted in those articles; the TimesOnline article quotes a trademark lawyer and a representative of brand-protection agency MarkMonitor, rather than actual experts in "cyber-crime" (I hate that term). For example:
A registry can't preemptively deny a registration on trademark grounds because trademarks are not unique; several different companies can have the same trademark in different fields of business, or in different countries, or even in different parts of the same country; furthermore, FooBar Inc. has no right to interfere with, for example, my attempt to register foobar-sucks.com or whatever, and distinguishing that from a scammer's attempt to register foobar-inc.com would require actual human judgement.
The Wikipedia article "IDN homograph attack" seems to cover the basics of the issue reasonably well.
It's both a lot worse and a lot better than that.
And it's also not particularly new.
And the TimesOnline article is bad and misinformed. What else is new? (using a lawyer as a tech source? sheesh!)
First the bad.
The basic Latin and Greek alphabets occur several times in Unicode for math/physics purposes. Thus, U+1D400 is Mathematical Bold Capital A, U+1D434 is Mathematical Italic Capital A, U+1D468 is Mathematical Bold Italic Capital A, etc.
Here are some of the Mathematical Capital A's: 𝐀𝐴𝑨𝒜𝓐𝔄𝔸𝕬𝖠𝗔𝘈𝘼𝙰.
Here are some Mathematical Capital Alphas: 𝚨𝛢𝜜𝝖𝞐.
Here are various 0's: 𝟎𝟘𝟢𝟬𝟶.
And this is not a 'K' or a Kappa but the kelvin sign: K.
This is Latin Letter Small Capital A from the phonetic extensions: ᴀ.
And then there are the various accented versions of ordinary Latin characters (which are critical in some -- most -- European languages). Would you notice the dífference/difference? ;)
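To make the lookalike problem concrete, here is a minimal Python sketch that prints the code point and official Unicode name of a few of the characters mentioned above, plus what NFKC compatibility normalization turns each one into. The helper name `describe` is made up for illustration. Normalization folds the mathematical letters and the kelvin sign back to plain ASCII, but it leaves the small capital A alone, so it is not a complete defence on its own.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, NFKC-normalized form) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.normalize("NFKC", ch),
    )

# Characters taken from the comment above, written as escapes so the
# example does not depend on the page's own encoding.
samples = [
    "\U0001D400",  # Mathematical Bold Capital A
    "\U0001D434",  # Mathematical Italic Capital A
    "\U0001D468",  # Mathematical Bold Italic Capital A
    "\u212A",      # Kelvin sign
    "\u1D00",      # Latin Letter Small Capital A
]

for ch in samples:
    print(describe(ch))
```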
Then the good.
The good news is that it's not much of a problem in practice and it's going to become even less of a problem.
First of all, if the browser doesn't support IDN (Internationalized Domain Names) then such addresses will look like
http://xn--5cab8c.dk-hostmaster.dk/
instead of
http://æøå.dk-hostmaster.dk/
Secondly, not all top-level domains accept the full range of Unicode characters -- the Danish .dk only accepts æøåäöüé (which are necessary for our language + loan words from Sweden and Germany).
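The two points above can be sketched in a few lines of Python. The first helper uses Python's built-in `idna` codec (which implements the older IDNA 2003 rules) to produce the ASCII `xn--` form of a hostname. The second is a hypothetical registry-style check that accepts only ASCII letters, digits, hyphens and the seven extra letters listed above; the names `to_ace` and `allowed_dk_label` are invented for this example, and the real .dk rules may well differ in detail.

```python
# Letters the comment lists as accepted under .dk, plus the usual ASCII set.
DK_EXTRA = set("æøåäöüé")
ASCII_OK = set("abcdefghijklmnopqrstuvwxyz0123456789-")

def to_ace(hostname):
    """Convert a Unicode hostname to its ASCII-compatible (xn--) form."""
    return hostname.encode("idna").decode("ascii")

def allowed_dk_label(label):
    """Hypothetical whitelist check for one label under .dk."""
    return all(ch in ASCII_OK or ch in DK_EXTRA for ch in label.lower())

print(to_ace("æøå.dk-hostmaster.dk"))
print(allowed_dk_label("blåbærgrød"))     # only whitelisted letters
print(allowed_dk_label("p\u0430ypal"))    # contains a Cyrillic 'а'
```

A label containing a Cyrillic lookalike fails the whitelist check even though it renders almost identically to the Latin spelling.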
Thirdly, the problem has been known for years -- we've been able to use æøå etc. in Danish domains since 2004 -- and the browser writers are fully aware of the phishing possibilities. They tend to disallow the special interpretation of xn-- domains or severely filter them using whitelists if the registrar doesn't do it. The Danish ones do, only allowing the handful of letters that are both necessary for us and won't cause problems for Danes.
Fourthly, the better browsers already have other anti-phishing measures built-in. Microsoft wrote a lot about it on their blog during the development of IE7, for example. Script mixing in URLs is a major red flag for such systems.
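As a rough illustration of what "script mixing" means, the sketch below guesses each letter's script from the first word of its Unicode character name and flags labels that use more than one. This is a crude heuristic chosen to keep the example dependency-free, not what any particular browser does; real implementations use the Unicode Script property and more careful rules. The function names are made up.

```python
import unicodedata

def scripts_in(label):
    """Guess the scripts used in a label from the first word of each
    letter's Unicode name (e.g. 'LATIN', 'CYRILLIC', 'GREEK')."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }

def is_mixed_script(label):
    """True if the letters in the label appear to come from several scripts."""
    return len(scripts_in(label)) > 1

print(scripts_in("paypal"))             # all Latin
print(scripts_in("p\u0430ypal"))        # second letter is Cyrillic 'а'
print(is_mixed_script("blåbærgrød"))    # accented Latin is still Latin
```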
And the not new stuff:
Lots of domains have had this for years. The world has not ended. Browsers have protection built in. The registration authorities also try to protect against this. It has not caused major problems (not much compared to other kinds of phishing). For example, the URL can contain a username and password in front of the domain -- and if the domain is specified as an IP address then most people skip it and read the username/password part instead... which might look suspiciously like a domain name. This trick has worked since the nineties (but I think most browsers catch it now).
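The username/password trick is easy to see with Python's standard URL parser. In the made-up URL below, the part that looks like a bank's domain is actually the username, and the host the browser would contact is the IP address after the `@`. The URL and the helper name `real_host` are illustrative only; `192.0.2.10` is from a range reserved for documentation.

```python
from urllib.parse import urlsplit

def real_host(url):
    """Split a URL into (username, password, hostname actually contacted)."""
    parts = urlsplit(url)
    return parts.username, parts.password, parts.hostname

# Reads like "www.example-bank.com" at a glance, but that is the username.
url = "http://www.example-bank.com:login@192.0.2.10/account"
print(real_host(url))
```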
Wikipedia covers the background pretty well.
Peter, thanks for the detailed reply- looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least. Ironic?
Kevin, thanks also for the detailed reply - but I'm really not worried about trademark law as Mr. Bennett is. I'm more concerned because I think people tend to be amazingly credulous, which is why phishing works at all. So if the problem doesn't exist that's great. But if it does exist, I think it opens up more possibilities for people to be dumb.
"...looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least."
Nope. The problem is at the receiving end. The browser & OS need to support the character set and font.
I'm running Firefox and Crunchbang Linux, which have pretty good coverage. I could see almost all characters, only four Math A's are missing.
i think no absolutely no...it won't betray (:
What the heck is Cyrilliac? A digestive intolerance for Russian wheat?
Ah, then the problem is probably that I am approving comments on the go from my iPhone. But that tech-savvy people running better systems will have fewer problems doesn't comfort me. I'm not worried about YOU falling victim to phishing, Lassi, but people like my mom (sorry mom).
I think Cyrilliac is a character from Dragon Age, Jason.
Nope, it's a lack of fonts with sufficient coverage. 5 of the Math A/Alphas and 1 of the numbers are missing at my end, too. I have them covered by other fonts, though, so they displayed fine in the Character Map applet (Ubuntu 9.10).
What is ironic is that the scienceblogs.com software doesn't quite support UTF8 correctly and that most of its blogs don't use UTF8. The few that do were fixed manually because their bloggers knew what they were doing.
If a web page is using an 8-bit character set then someone has had to make a choice at some point of which tiny set of a few hundred characters to support directly. All the rest have to be supported by character entities (ampersand-hash-digits-semicolon or ampersand-name-semicolon thingies. For example, 'å' has the name 'aring' and the number 229).
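For anyone who wants to see the entity mechanism in action, here is a small Python sketch using only the standard library. It looks up a named entity, decodes both entity forms, and shows how characters outside an 8-bit character set can be written as numeric entities when encoding. The helper name `to_entities` is invented for this example.

```python
import html
from html.entities import name2codepoint

def to_entities(text, charset="latin-1"):
    """Encode text for a page in `charset`, replacing anything the
    charset cannot represent with a numeric character reference."""
    return text.encode(charset, errors="xmlcharrefreplace").decode(charset)

# Named entity -> code point, and both entity forms decoded back to text.
print(name2codepoint["aring"])
print(html.unescape("&aring; &#229;"))

# 'å' fits in Latin-1, the Cyrillic 'Ж' does not and becomes an entity.
print(to_entities("å Ж"))
# In pure ASCII both have to be written as entities.
print(to_entities("å Ж", charset="ascii"))
```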
It's great fun when mixing stuff from different sources that don't fit in the same 8-bit encoding, such as the "Most German" list on scienceblogs :/
If you are not using UTF8 then you are doing it wrong.
Here's the technical plenary presentation from the last Internet Engineering Task Force (IETF) meeting (November in Hiroshima); it's a PDF, and it's on this topic:
Internationalization in Names and Other Identifiers
Give Barry a white horse! :)