Even our fonts will betray us?

By bioephemera on January 1, 2010.

According to Christina Warren at mashable.com, the switch to allowing non-Latin alphabet characters in web domains could give scammers a brand new toolkit. That's because browsers can't render many non-Latin characters, and the approximations may be doppelgangers for trusted sites. Alternatively, an address in an alphabet like Cyrillic, which shares certain letterforms with the Latin alphabet, can appear indistinguishable from pre-existing Latin-alphabet addresses:

Uh-oh.

It's only fair that users of different alphabets get to register their own addresses, but clearly there needs to be some kind of fix here, either from ICANN or from the tech side. But Warren cites a TimesOnline article from two days ago in which it sounds like no one is really taking a hard look at solutions.

In fact, people have been working on this since at least as far back as 2001.

Also, take note of who exactly is being quoted in those articles; the TimesOnline article quotes a trademark lawyer and a representative of brand-protection agency MarkMonitor, rather than actual experts in "cyber-crime" (I hate that term). For example:

“They [Icann] seem to have started the process of allowing people to register domain names in non-Roman characters but don’t seem to have put in place anything that obligates any registry to safeguard trademark rights or the rights of legitimate businesses that use the same name,” Mr Bennett said.

A registry can't preemptively deny a registration on trademark grounds because trademarks are not unique: several different companies can hold the same trademark in different fields of business, in different countries, or even in different parts of the same country. Furthermore, FooBar Inc. has no right to interfere with, for example, my attempt to register foobar-sucks.com or whatever, and distinguishing that from a scammer's attempt to register foobar-inc.com would require actual human judgement.

The Wikipedia article "IDN homograph attack" seems to cover the basics of the issue reasonably well.

It's both a lot worse and a lot better than that.

And it's also not particularly new.

And the TimesOnline article is bad and misinformed. What else is new? (using a lawyer as a tech source? sheesh!)

First the bad.
The basic Latin and Greek alphabets occur several times in Unicode for math/physics purposes. Thus, U+1D400 is Mathematical Bold Capital A, U+1D434 is Mathematical Italic Capital A, U+1D468 is Mathematical Bold Italic Capital A, etc.
Here are some of the Mathematical Capital A's: 𝐀 𝑨𝒜𝓐𝔄𝔸𝕬𝖠𝗔𝘈𝘼𝙰.
Here are some Mathematical Capital Alphas: 𝚨𝛢𝜜𝝖𝞐.
Here are various 0's: 𝟎𝟘𝟢𝟬𝟶.
And this is not a 'K' or a Kappa but the kelvin sign (U+212A): K.
This is Latin Letter Small Capital A from the phonetic extensions: ᴀ.

And then there are the various accented versions of ordinary Latin characters (which are critical in some -- most -- European languages). Would you notice the dífference/difference? ;)
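To make the lookalike problem concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The `SAMPLES` list and the helper names `describe` and `fold` are made up for illustration. It prints the official name of each character and shows what NFKC normalization does to it. Note that NFKC is just one possible defence, not something the registries or browsers discussed here are known to rely on, and it does nothing for the small capital or the accented letter.

```python
import unicodedata

# Written as escapes so the example survives fonts and encodings that lack these glyphs.
SAMPLES = [
    "\U0001D400",  # MATHEMATICAL BOLD CAPITAL A
    "\U0001D434",  # MATHEMATICAL ITALIC CAPITAL A
    "\U0001D6A8",  # MATHEMATICAL BOLD CAPITAL ALPHA
    "\U0001D7CE",  # MATHEMATICAL BOLD DIGIT ZERO
    "\u212A",      # KELVIN SIGN
    "\u1D00",      # LATIN LETTER SMALL CAPITAL A
    "\u00ED",      # LATIN SMALL LETTER I WITH ACUTE
]


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def fold(text: str) -> str:
    """Apply NFKC normalization, which maps compatibility variants to plain forms."""
    return unicodedata.normalize("NFKC", text)


for ch in SAMPLES:
    folded = fold(ch)
    if folded != ch:
        status = "folds to " + describe(folded)
    else:
        status = "unchanged by NFKC"
    print(f"{describe(ch)}: {status}")
```

The mathematical letters, the mathematical zero and the kelvin sign fold back to ordinary characters, while the small capital A and the accented i pass through untouched, so normalization alone would not be enough.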

Then the good.
The good news is that it's not much of a problem in practice and it's going to become even less of a problem.

First of all, if the browser doesn't support IDN (Internationalized Domain Names) then they will look like http://xn--5cab8c.dk-hostmaster.dk/ instead of http://æøå.dk-hostmaster.dk/.
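The two spellings of that address are the same name in different clothes. As a small sketch, Python's built-in `idna` codec (which implements the older IDNA 2003 rules, so treat it as an approximation of what a current browser does) can convert between them. The function names `to_ascii` and `to_unicode` are just illustrative wrappers.

```python
def to_ascii(hostname: str) -> str:
    """Convert a Unicode hostname to its xn-- (ASCII) form, label by label."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Convert an xn-- (ASCII) hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


# "\u00e6\u00f8\u00e5" is the three Danish letters ae, o-slash, a-ring.
unicode_name = "\u00e6\u00f8\u00e5.dk-hostmaster.dk"

print(to_ascii(unicode_name))                      # xn--5cab8c.dk-hostmaster.dk
print(to_unicode("xn--5cab8c.dk-hostmaster.dk") == unicode_name)  # True
```

A browser that refuses to show the Unicode form is effectively displaying the output of `to_ascii`, which is ugly but impossible to confuse with a plain Latin-alphabet address.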

Secondly, not all top-level domains accept the full range of Unicode characters -- the Danish .dk only accepts æøåäöüé (which are necessary for our language + loan words from Sweden and Germany).

Thirdly, the problem has been known for years -- we've been able to use æøå etc. in Danish domains since 2004 -- and the browser writers are fully aware of the phishing possibilities. They tend to disallow the special interpretation of xn-- domains or severely filter them using whitelists if the registrar doesn't do it. The Danish ones do, only allowing the handful of letters that are both necessary for us and won't cause problems for Danes.
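A whitelist of this kind is simple to express. The following is only a sketch of the idea, not the actual .dk registration rules: it assumes the allowed set is ASCII letters, digits and the hyphen plus the seven extra letters listed above, and the names `DK_EXTRA`, `DK_ALLOWED` and `is_allowed_dk_label` are hypothetical.

```python
import string

# Assumption: the seven extra letters mentioned in the text (ae, o-slash, a-ring,
# a-umlaut, o-umlaut, u-umlaut, e-acute), written as escapes to be encoding-safe.
DK_EXTRA = "\u00e6\u00f8\u00e5\u00e4\u00f6\u00fc\u00e9"

# Assumption: ordinary hostname characters are allowed as well.
DK_ALLOWED = frozenset(string.ascii_lowercase + string.digits + "-" + DK_EXTRA)


def is_allowed_dk_label(label: str) -> bool:
    """Return True if every character of a non-empty label is on the whitelist."""
    label = label.lower()
    return bool(label) and all(ch in DK_ALLOWED for ch in label)


print(is_allowed_dk_label("r\u00f8dgr\u00f8d"))  # True: Latin letters plus o-slash
print(is_allowed_dk_label("p\u0430ypal"))        # False: contains CYRILLIC SMALL LETTER A
```

Because the Cyrillic and mathematical lookalikes are simply not in the set, a label that uses them is rejected before anyone has to judge whether it looks deceptive.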

Fourthly, the better browsers already have other anti-phishing measures built-in. Microsoft wrote a lot about it on their blog during the development of IE7, for example. Script mixing in URLs is a major red flag for such systems.
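For illustration, a crude script-mixing check can be sketched in Python. This is not how IE7 or any other browser actually does it. The standard library has no Unicode script property, so the sketch takes the first word of each character's Unicode name as a rough stand-in, which is an assumption that holds for Latin, Cyrillic and Greek letters. The function names are hypothetical.

```python
import unicodedata
from typing import Optional, Set


def rough_script(ch: str) -> Optional[str]:
    """Approximate a character's script by the first word of its Unicode name."""
    if ch.isascii() and not ch.isalpha():
        return None  # ASCII digits and hyphens are treated as script-neutral
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"  # unnamed code points (controls, unassigned)


def scripts_in_label(label: str) -> Set[str]:
    """Collect the rough script labels used in one domain label."""
    return {s for s in map(rough_script, label) if s is not None}


def is_mixed_script(label: str) -> bool:
    """Flag labels that draw letters from more than one script."""
    return len(scripts_in_label(label)) > 1


print(is_mixed_script("paypal"))          # False: all Latin
print(is_mixed_script("p\u0430ypal"))     # True: Latin plus one Cyrillic letter
```

A label written entirely in one non-Latin script is not flagged by this check, which is the right behaviour for legitimate Cyrillic or Greek names but also means whole-script lookalikes need a different defence.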

And the not new stuff:
Lots of domains have had this for years. The world has not ended. Browsers have protection built in. The registration authorities also try to protect against this. It has not caused major problems (not much compared to other kinds of phishing). For example, the URL can contain a username and password before the domain -- and if the domain is specified as an IP address then most people skip it and read the username instead... which might look suspiciously like a domain name. This trick has worked since the nineties (but I think most browsers catch it now).
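The credentials trick is easy to see once the URL is parsed rather than read by eye. A minimal sketch with Python's standard `urllib.parse` follows. The URL itself is invented for the example (192.0.2.10 is a documentation-only address), and `has_userinfo` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Invented example: the part before "@" is only a login name, not the site.
url = "http://www.example-bank.com:login@192.0.2.10/account"

parts = urlsplit(url)
print(parts.username)  # www.example-bank.com
print(parts.password)  # login
print(parts.hostname)  # 192.0.2.10


def has_userinfo(candidate: str) -> bool:
    """Return True if the URL carries credentials in front of the host."""
    return urlsplit(candidate).username is not None


print(has_userinfo(url))                        # True
print(has_userinfo("http://www.example.com/"))  # False
```

Everything up to the `@` is sent as credentials, and the browser actually connects to the host that comes after it.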

Wikipedia covers the background pretty well.

Peter, thanks for the detailed reply- looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least. Ironic?

Kevin, thanks also for the detailed reply - but I'm really not worried about trademark law as Mr. Bennett is. I'm more concerned because I think people tend to be amazingly credulous, which is why phishing works at all. So if the problem doesn't exist that's great. But if it does exist, I think it opens up more possibilities for people to be dumb.

"...looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least."

Nope. The problem is in the receiving end. The browser & OS need to support the character set and font.

I'm running Firefox and Crunchbang Linux, which have pretty good coverage. I could see almost all characters, only four Math A's are missing.

i think no absolutely no...it won't betray (:

What the heck is Cyrilliac? A digestive intolerance for Russian wheat?

Ah, then the problem is probably that I am approving comments on the go from my iPhone. But that tech-savvy people running better systems will have fewer problems doesn't comfort me. I'm not worried about YOU falling victim to phishing, Lassi, but people like my mom (sorry mom).

I think Cyrilliac is a character from Dragon Age, Jason.

"Peter, thanks for the detailed reply- looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least. Ironic?"

Nope, it's a lack of fonts with sufficient coverage. 5 of the Math A/Alphas and 1 of the numbers are missing at my end, too. I have them covered by other fonts, though, so they displayed fine in the Character Map applet (Ubuntu 9.10).

What is ironic is that the scienceblogs.com software doesn't quite support UTF8 correctly and that most of its blogs don't use UTF8. The few that do were fixed manually because their bloggers knew what they were doing.

If a web page is using an 8-bit character set then someone has had to make a choice at some point of which tiny set of a few hundred characters to support directly. All the rest have to be supported by character entities (ampersand-hash-digits-semicolon or ampersand-name-semicolon thingies). For example, 'å' has the name 'aring' and the number 229.
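As a quick sketch of the two entity forms, Python's standard `html` module can decode them, and the `xmlcharrefreplace` error handler produces the numeric form when text has to be squeezed into pure ASCII. The helper name `to_ascii_html` is made up for this example. The letter å is U+00E5, which is 229 in decimal.

```python
import html

A_RING = "\u00e5"  # LATIN SMALL LETTER A WITH RING ABOVE

# Both the named and the numeric entity decode to the same character.
print(html.unescape("&aring;") == A_RING)  # True
print(html.unescape("&#229;") == A_RING)   # True
print(ord(A_RING))                          # 229


def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# "bl\u00e5b\u00e6r" is the Danish word for blueberry.
print(to_ascii_html("bl\u00e5b\u00e6r"))   # bl&#229;b&#230;r
```

The numeric form works for any character, while the named form only exists for the limited set of names that HTML defines.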

It's great fun when mixing stuff from different sources that don't fit in the same 8-bit encoding, such as the "Most German" list on scienceblogs :/
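What goes wrong when encodings get mixed can be reproduced in a few lines. This is only an illustration of the general failure mode, assuming UTF-8 bytes that get read as Latin-1; it is not a diagnosis of what the scienceblogs.com software actually does. The names `garble` and `repair` are hypothetical.

```python
def garble(text: str) -> str:
    """Simulate UTF-8 bytes being misread as Latin-1 (ISO 8859-1)."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo garble(), provided no bytes were lost or altered along the way."""
    return garbled.encode("latin-1").decode("utf-8")


original = "\u00e6\u00f8\u00e5"  # the Danish letters ae, o-slash, a-ring

broken = garble(original)
print(len(original), len(broken))  # 3 6: each letter turns into two characters
print(repair(broken) == original)  # True
```

Each two-byte UTF-8 letter comes out as a capital A with tilde followed by a stray symbol, and the round trip only works as long as nothing in between drops or replaces the odd-looking bytes.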

If you are not using UTF8 then you are doing it wrong.

Here's the technical plenary presentation from the last Internet Engineering Task Force (IETF) meeting (November in Hiroshima); it's a PDF, and it's on this topic:
Internationalization in Names and Other Identifiers

Give Barry a white horse! :)
