Discussion:
Switching character sets Windows-to-DOS
peter juuls
2006-05-28 13:00:04 UTC
Hi vim.org,

I have used Vim since version 4.x and love it, because
I am a command-line guy. I just downloaded the brand
new vim70w32.zip and installed it on my Windows 2000 PC.
BUT it has always been a mystery to me how to control
the character sets used in Vim, especially the
Danish characters. I have read the FAQs, the
README_DOS.TXT file etc., with no luck. Could you
please help me, or give me a hint?

Files created in Notepad.exe and in DOS programs use
different character sets. When I run a TYPE command in
a command prompt on a Notepad file, the three extra
Danish characters come out as rubbish. And when I open a
DOS file in Notepad, the Danish characters are rubbish.

Can I switch character sets and have console Vim
always display Danish characters correctly, no matter
which editor created the file? That would be very
convenient.

My Windows has Regional Settings = Danish.
My _vimrc looks like this:
set nocompatible
source $VIMRUNTIME/mswin.vim
set helpfile=C:\UTIL\vim\vim70\doc\help.txt

Best regards
Peter Juuls




Juan Lanus
2006-05-28 15:32:45 UTC
Post by peter juuls
BUT it has always been a mystery to me how to control
character sets used in vim ..
Me too.
In Spanish we use several accented letters and the ñ, which is more or
less the same problem as yours.

I edit old DOS programs with the console Vim (vim.exe) and have the
following in my _vimrc file:

" Codepage IBM850
set encoding=cp850

This makes vim interpret the accented characters correctly.
In the _gvimrc file I have:

" encoding: latin1 = ISO8859
set encoding=latin1

which should be OK for you too.
When I open a DOS file with gvim I can do

:set encoding=cp850

and all gibberish becomes what it should be, on screen.
Upon saving, the file will be written with DOS encoding.
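
If you prefer to keep everything in a single file, the same split can be
written as one conditional. This is only a sketch of the idea, assuming
the same choices as above (cp850 for the console vim.exe, latin1 for
gvim); adjust the codepages to your own situation:

if has("gui_running")
  " gvim: Windows-style 8-bit encoding
  set encoding=latin1
else
  " console vim.exe: DOS codepage
  set encoding=cp850
endif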

Peter, this is about the extent of my knowledge. Maybe it's useful as a
starting point for your own research.
Good luck!
--
Juan Lanus
TECNOSOL
Argentina
peter juuls
2006-05-29 16:41:59 UTC
Post by Juan Lanus
Post by peter juuls
BUT it has always been a mystery to me how to control
character sets used in vim ..
Me too.
In Spanish we use several accented letters and the ñ, which is more or
less the same problem as yours.
I edit old DOS programs with the console Vim (vim.exe) and have the
following in my _vimrc file:
" Codepage IBM850
set encoding=cp850
Thank you, Juan!!! This is just what I needed to be
able to switch from the Windows to the DOS character set
and back. It works fine with Danish letters too, I guess
because they are part of ISO-8859-1.

So simple, and yet so hard to find in the
documentation....

Best regards
Peter




A.J.Mechelynck
2006-05-28 16:26:57 UTC
Post by peter juuls
Hi vim.org,
I have used Vim since version 4.x and love it, because
I am a command-line guy. I just downloaded the brand
new vim70w32.zip and installed it on my Windows 2000 PC.
BUT it has always been a mystery to me how to control
the character sets used in Vim, especially the
Danish characters. I have read the FAQs, the
README_DOS.TXT file etc., with no luck. Could you
please help me, or give me a hint?
Files created in Notepad.exe and in DOS programs use
different character sets. When I run a TYPE command in
a command prompt on a Notepad file, the three extra
Danish characters come out as rubbish. And when I open a
DOS file in Notepad, the Danish characters are rubbish.
Can I switch character sets and have console Vim
always display Danish characters correctly, no matter
which editor created the file? That would be very
convenient.
My Windows has Regional Settings = Danish.
My _vimrc looks like this:
set nocompatible
source $VIMRUNTIME/mswin.vim
set helpfile=C:\UTIL\vim\vim70\doc\help.txt
Best regards
Peter Juuls
[advertisement snipped]

If you have some files using a Dos charset, and other ones using a
Windows charset, the way to do it is file-by-file. Here are a few
sections you should read in the help:

" 'encoding' (global) defines the way Vim internally represents the data
:help 'encoding'
" 'fileencoding' (local to buffer) defines how the file's data is
represented on disk
:help 'fileencoding'
" 'fileencodings' (global, and with s at the end) defines the
heuristics used by Vim to guess the 'fileencoding' when reading a file
:help 'fileencodings'
" 'termencoding' (global) defines how your keyboard (and, in console
Vim, your display) represents the data
:help 'termencoding'
" Modelines allow setting local options on a file-by-file basis
:help modeline
" See also how Vim names the various charsets
:help encoding-names
" and how to set the 'fileencoding' manually when reading or writing
one particular file
:help ++opt
"etc.

I don't guarantee that setting the 'fileencoding' by means of a modeline
will work, however, because to read the modeline itself, it is necessary
to read the file: chicken-and-egg problem.
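
For what it's worth, such a modeline would look something like the line
below, placed within the first or last five lines of the file (a sketch
only, and, as said above, even if it is honoured it can at best affect
how the file is written back, not how it is read):

vim: set fenc=cp850 :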

Most of these options require that Vim be compiled with the +multi_byte
feature, even if you always set these options to single-byte (8-bit)
encodings. That may be strange but it is a design feature, and you
should be aware of it, or you may run into problems if you use a
-multi_byte version of Vim by mistake. To check it, use ":version" (the
answer should include +multi_byte or +multi_byte_ime, with or without
/dyn), or ":echo has('multi_byte')" which should return a nonzero value,
normally 1. For instance, in your vimrc, you could write:

if has("multi_byte")
" replace this comment by whatever is needed for Danish support
else
echoerr "This Vim version wasn't compiled with multiple-charset
support"
endif

The reason I mention 'termencoding' is that, by default, it is empty,
which means "use the value of 'encoding'". This is usually correct when
you start Vim, because the default value of 'encoding' is obtained from
your OS. But if you change 'encoding', for instance to set it to UTF-8,
which can represent any kind of text data known to man, the way your
keyboard represents your keystrokes doesn't change. Therefore, changing
'encoding' should be done using a construct similar to the following:

if &termencoding == ""
let &termencoding = &encoding
endif
set encoding=utf-8

The 'encoding' option, which is global, must be set to some value which
allows representation of all the characters used by all the files you
may be editing, either concurrently, or successively without changing
'encoding'. Depending in part on which "special" characters are included
in your Danish text, Latin1 may or may not be good enough; UTF-8 always
will be, at a slight cost in memory.

Now, the encoding names (for the buffer-local 'fileencoding' option).
IIUC, the names you need are probably the following:

- cp850 (the "international" Dos codepage)

- cp1252 (Windows's "Western Europe" charset)

- latin1 (aka ISO-8859-1), the ISO charset for Western Europe defined
  prior to the invention of the Euro currency

- iso-8859-15 (aka Latin9), a charset very similar to Latin1 but which
  includes the Euro sign

The latter two are "international standard" charsets, not a property of
Bill Gates. ;-)

You can check the Dos codepage by issuing the CHCP command (with no
arguments) at the prompt in a Dos box. I'm not sure how to check the
Windows charset.
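
You can also run the same check without leaving the editor; the exact
wording of the output depends on your Windows language version, but it
should be something along the lines of "Active code page: 850":

:!chcp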

Now here is how you tell Vim a file's encoding, once 'encoding' is
already set to some "compatible" value:

:e ++enc=cp850 filename.ext
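
The same ++enc argument also works with ":w", in case you ever need to
write a buffer back in a different charset from the one it was read
with. Just as an illustration (see ":help ++opt" for the details):

:w ++enc=cp1252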

Since cp850 and cp1252 are both 8-bit encodings, it's not possible to
set the 'fileencodings' heuristics to automagically detect them both
without a modeline, because neither will, for any file, return the
"wrong charset" signal to the heuristic. This means that if you have
them both in the 'fileencodings' option, Vim will never use whichever of
them comes last. If your "most used" 8-bit charset is Windows-1252, then
you would "typically" use:

if has("multi_byte")
if &termencoding == ""
let &termencoding = &encoding
endif
set encoding=utf-8
set fileencodings=ucs-bom,utf-8,cp1252
setglobal fileencoding=cp1252
else
echoerr "ERROR: Can't handle multiple encodings! You need to
recompile Vim!"
endif

(ucs-bom and utf-8 are Unicode heuristics, and _they_ can return a
"wrong charset" signal to the charset-detecting heuristic, which then
proceeds to check the file for the next charset in the list.) This will
detect 7-bit ASCII files (files which don't contain any character higher
than 127) as being in UTF-8. This is normal: the same data is
represented identically in 7-bit ASCII, in UTF-8, and indeed in the
first half of most ASCII-based 8-bit encodings, including the Latin1 and
Latin9 encodings mentioned above.
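
If you are ever in doubt about which charset the heuristic actually
picked for the file you just opened, you can simply ask Vim (this is
ordinary option querying, nothing specific to the snippet above):

:setlocal fileencoding?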

With the above settings, you should only need to use the ++enc argument
for files which are not in your "default" charset, meaning that

:e file1.txt

would open a file in Windows-1252; and

:new ++enc=cp850 file2.txt

would split the window to open a file in cp850.


Best regards,
Tony.
peter juuls
2006-05-29 17:28:46 UTC
Post by A.J.Mechelynck
If you have some files using a Dos charset, and other ones using a
Windows charset, the way to do it is file-by-file. Here are a few ..
Thanks, Tony, for a thorough walkthrough of the
character set encoding options in Vim, not only
regarding Windows-to-DOS switching, but in general.

My primary needs, for now, are to be able, on W32, to
open, display, edit and save files in 3 formats:
- DOS files with Danish letters (CHCP tells me cp850
is my current codepage, and :set encoding=cp850 solves
my switching problem)
- Notepad files with Danish letters (works out of the
box, as console Vim 7.0/W32 uses this as its default; I
guess it is the Windows-1252 character set - besides, I
can use :set encoding=latin1 or :set encoding=latin9 in
Vim if I need to switch back from some other encoding)
- Unicode files, like exports from the Registry Editor,
with or without Danish letters (works out of the box
in Vim 7.0/W32, which tells me the file has been
converted when it is opened, and Vim also saves the
modified file in Unicode)

Thanks for your comprehensive reply; I will save it
in case I run into problems with odd character sets
and file encodings.

Thanks
Peter







A.J.Mechelynck
2006-05-29 23:37:15 UTC
Post by peter juuls
Post by A.J.Mechelynck
If you have some files using a Dos charset, and other ones using a
Windows charset, the way to do it is file-by-file. Here are a few ..
Thanks, Tony, for a thorough walkthrough of the
character set encoding options in Vim, not only
regarding Windows-to-DOS switching, but in general.
My primary needs, for now, are to be able, on W32, to
open, display, edit and save files in 3 formats:
- DOS files with Danish letters (CHCP tells me cp850
is my current codepage, and :set encoding=cp850 solves
my switching problem)
- Notepad files with Danish letters (works out of the
box, as console Vim 7.0/W32 uses this as its default; I
guess it is the Windows-1252 character set - besides, I
can use :set encoding=latin1 or :set encoding=latin9 in
Vim if I need to switch back from some other encoding)
- Unicode files, like exports from the Registry Editor,
with or without Danish letters (works out of the box
in Vim 7.0/W32, which tells me the file has been
converted when it is opened, and Vim also saves the
modified file in Unicode)
Thanks for your comprehensive reply; I will save it
in case I run into problems with odd character sets
and file encodings.
Thanks
Peter
In addition to what I said in my earlier post, I might add that most
Unicode files produced by Windows are in UTF-16 little-endian with a BOM.
These files will be automagically recognised by Vim, and displayed
correctly, if your 'encoding' is set to UTF-8 and your 'fileencodings'
list starts with "ucs-bom" (as in the example code snippet in my
previous post). In that case, ":setlocal fileencoding? bomb?" on such a
file should answer " fileencoding=ucs-2le" and " bomb".

I have found it useful to display each file's encoding on its status
line. Here is how I set the 'statusline' option; you may use it as a
source of inspiration if you want (see ":help 'statusline'" to decipher
it). If you want to use it, start with a copy-paste into your vimrc and
then edit it to your heart's liking:

if has("statusline")
set statusline=%<%f\ %h%m%r%=%k[%{(&fenc\ ==\
\"\"?&enc:&fenc).(&bomb?\",BOM\":\"\")}]\ %-14.(%l,%c%V%)\ %P
endif

It's one long line, bracketed in an ":if" statement to avoid an error on
Vim versions which cannot set a user-defined status line. If your mailer
or mine "beautifies" the :set line by adding extra line breaks, it will
probably break the line (once or more) at a backslash-escaped space.
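
If the long :set line does get mangled in transit anyway, the same
status line can be written with :let instead (my own restatement of the
line above, not a different feature): assigning the value as an ordinary
quoted string needs far fewer backslashes and is, to my eye, easier to
repair if a mailer breaks it:

if has("statusline")
  let &statusline = '%<%f %h%m%r%=%k[%{(&fenc == "" ? &enc : &fenc) . (&bomb ? ",BOM" : "")}] %-14.(%l,%c%V%) %P'
endif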

Note that WordPad can read UTF-8 files if they have a BOM (if they have
":setlocal bomb") but it will write them as UTF-16le (aka ucs-2le) which
is a different Unicode encoding and, for Latin-alphabet text, usually
uses more disk space. The BOM (acronym of "byte order mark") is the
Unicode codepoint U+FEFF "zero-width no-break space" when at the very
start of a file. That codepoint has a different representation in each
of the 5 basic Unicode encodings, and its value in each of them is
"illegal" in all others (assuming that a little-endian UTF-16le file
won't start with a NULL). It is therefore used to discriminate between
Unicode encodings. See, among others, ":help Unicode" and
http://www.unicode.org/ for more info on Unicode.
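
So if you ever need to hand a file to WordPad (or to anything else that
wants a BOM), one way that should work from inside Vim, assuming the
UTF-8 'encoding' setup from my earlier snippet, is to mark the buffer
accordingly before writing:

:setlocal fileencoding=utf-8 bomb
:w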


Best regards,
Tony.
