Strategy for guessing encodings

Came across this in comp.lang.python. It is fairly useful for localization work, where you often have to guess what encoding something is in:
Re: character encoding conversion
Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be
   absolutely certain to not ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the
   range(128,160)
6. use cp1252
7. use Latin-1
 
Work through steps 1 to 7 in order and stop at the first one that
manages to decode the input. Notice that step 5 always decodes
successfully; consider it a failure if you get any control
characters (from range(128, 160)). Step 6 can still fail, because
cp1252 leaves a few byte values undefined; in that case step 7 uses
Latin-1 again, this time accepting the control characters.
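
The steps above could be sketched in Python 3 roughly as follows. This is only an illustration under some assumptions, not code from the original post: the names guess_decode, has_c1_controls, XML_DECL_RE and META_RE are made up here, the two regular expressions are deliberately simplified and only look at the first 1024 bytes, and the caller is assumed to pass in the charset from the HTTP header (if there was one) as http_charset.

```python
import re

# Hypothetical, simplified patterns for steps 2 and 3 (not a full parser).
XML_DECL_RE = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def has_c1_controls(text):
    """Return True if text contains any code point in range(128, 160)."""
    return any(128 <= ord(ch) < 160 for ch in text)


def guess_decode(data, http_charset=None):
    """Return (text, encoding) from the first step that decodes data."""
    candidates = []
    if http_charset:                          # step 1: HTTP header
        candidates.append(http_charset)
    head = data[:1024]
    match = XML_DECL_RE.search(head)          # step 2: XML declaration
    if match:
        candidates.append(match.group(1).decode("ascii"))
    match = META_RE.search(head)              # step 3: http-equiv meta
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")                # step 4

    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding, or a name Python does not know: try the next.
            continue

    text = data.decode("latin-1")             # step 5: never raises
    if not has_c1_controls(text):
        return text, "latin-1"
    try:
        return data.decode("cp1252"), "cp1252"    # step 6
    except UnicodeDecodeError:
        return text, "latin-1"                # step 7: accept the controls
```

For example, guess_decode(b"caf\xe9") is not valid UTF-8, so it falls through to step 5 and comes back as Latin-1, while input containing bytes such as 0x93 and 0x94 fails the control-character check and is decoded as cp1252 instead.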