Introduction
This is a class derived from CStdioFile that transparently handles the reading and writing of Unicode text files as well as ordinary multibyte text files.
The code compiles as both multibyte and Unicode. In Unicode compilations, multibyte files are read and their content is converted to Unicode using the current code page. In multibyte compilations, Unicode files are read and converted to multibyte text.
The identification of a Unicode text file depends entirely on the presence of the Unicode byte order mark (0xFEFF). Its absence is not an absolute guarantee that a file is not Unicode, but it's the only method I use here. Feel free to suggest improvements.
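The byte order mark check described above can be sketched in portable C++. This is only an illustration, not the code used inside the class: the helper name StartsWithUtf16LeBom is hypothetical, and it assumes the file is little-endian UTF-16 (the usual case on Windows), where U+FEFF appears on disk as the bytes 0xFF 0xFE.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of CStdioFileEx: returns true when the
// buffer starts with the UTF-16 little-endian byte order mark.
// Assumption: U+FEFF stored little-endian is the byte pair 0xFF 0xFE.
bool StartsWithUtf16LeBom(const std::string& bytes)
{
    return bytes.size() >= 2 &&
           static_cast<unsigned char>(bytes[0]) == 0xFF &&
           static_cast<unsigned char>(bytes[1]) == 0xFE;
}
```

A file without these two leading bytes is treated as multibyte by a check like this, which is why a BOM-less Unicode file would be misidentified.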
By default, the class writes multibyte files, but it can optionally write Unicode.
Background
The ability to transparently handle both multibyte and Unicode seems to be such a fundamental requirement that I was sure there would already be something similar on offer, and yet nothing turned up. Did I miss something?
I needed it for a translation tool I wrote, and knocked together an implementation that was good enough for my needs. This is little more than a cleaned-up version of that, so expect bugs and all manner of deficiencies. I have, though, tested the demo app with the basic combinations (Unicode files in a multibyte compilation, Unicode-Unicode, Multibyte-Unicode, and Multibyte-Multibyte), and they all seem to work.
Using the code
The use of the class is pretty simple. It overrides three functions of CStdioFile: Open(), ReadString() and WriteString().
To write a Unicode file, add the flag CStdioFileEx::modeWriteUnicode to the flags when calling the Open() function.
In other respects, usage is identical to CStdioFile.
To find out if a file you have opened is Unicode, you can call IsFileUnicodeText().
To get the number of characters in the file, you can call GetCharCount(). This is unreliable for multibyte/UTF-8, however.
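One plausible reason a character count is hard to get right for multibyte text is that a single character can occupy several bytes, so any count derived from the file size overstates it. The sketch below illustrates this for UTF-8 only. It is not a description of how GetCharCount() works internally; the helper name CountUtf8CodePoints is hypothetical, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumption: the input is well-formed UTF-8.
std::size_t CountUtf8CodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8)
    {
        if ((c & 0xC0) != 0x80) // lead byte or plain ASCII
            ++count;
    }
    return count;
}
```

For example, the UTF-8 string "caf\xC3\xA9" is five bytes long but contains only four code points.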
An example of writing in Unicode:
// Test writing: create the file and write it as Unicode
CStdioFileEx fileWriteUnicode;
if (fileWriteUnicode.Open(_T("c:\\testwrite_unicode.txt"),
    CFile::modeCreate | CFile::modeWrite | CStdioFileEx::modeWriteUnicode))
{
    fileWriteUnicode.WriteString(_T("Unicode test file\n"));
    fileWriteUnicode.WriteString(_T("Writing data\n"));
    fileWriteUnicode.Close();
}
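To make the on-disk result of a Unicode write more concrete, here is a rough sketch of the byte layout, assuming little-endian UTF-16 with the byte order mark written first. It is a standalone illustration rather than the class's own code: the helper name ToUtf16LeWithBom is hypothetical, and it does not model anything the class may do to line endings.

```cpp
#include <string>

// Hypothetical helper: lays out UTF-16 code units as little-endian
// bytes, preceded by the byte order mark (U+FEFF -> 0xFF 0xFE).
std::string ToUtf16LeWithBom(const std::u16string& text)
{
    std::string bytes;
    bytes.reserve(2 + text.size() * 2);
    bytes.push_back(static_cast<char>(0xFF));
    bytes.push_back(static_cast<char>(0xFE));
    for (char16_t unit : text)
    {
        bytes.push_back(static_cast<char>(unit & 0xFF));        // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF)); // then high byte
    }
    return bytes;
}
```

Under these assumptions, the two-character text "Hi" becomes six bytes: FF FE 48 00 69 00.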
You can now also specify the code page for multibyte file reading or writing. Simply call SetCodePage() before a read to tell CStdioFileEx which code page the file is encoded in, or before a write to tell it which code page you want the file written in. Specifying CP_UTF8 as the code page allows you to read or write UTF-8 files.
The demo app is a dialog which opens a file, tells you whether it's Unicode or not and how many characters it contains, and shows the first fifteen lines from it. In the last couple of iterations I've added the option to convert a Unicode file to multibyte or a multibyte file to Unicode, and a combo box to specify the code page when reading.
As of v1.6, there is no limitation on the length of the line that can be read in any mode (Multibyte/Unicode, Unicode/Multibyte, etc.).
I'd love to hear of people's experiences with it, as well as reports of bugs, problems, improvements, etc.
Oh, and if I've accidentally included something offensive in the demo dialog, let me know. My Arabic and Chinese are not all that good.
History
- v1.0 - Posted 14 May 2003
- v1.1 - 23 August 2003. Incorporated fixes from Dennis Jeryd
- v1.2 - 06 January 2005. Fixed garbage at end of file bug (Howard J Oh)
- v1.3 - 19 February 2005. Howard J Oh's fix mysteriously failed to make it into the last release. Improved the test program. Fixed miscellaneous bugs
  Very important: In this release, ANSI files written in ANSI are no longer written using WriteString. This means \n will no longer be "interpreted" as \r\n. What you write is what you get
- v1.4 - 26 February 2005. Fixed submission screw-up
- v1.5 - 18 November 2005. Code page can be specified for reading and writing (inc. UTF-8). Multibyte buffers properly calculated. Fix from Andy Goodwin
- v1.6 - 19 July 2007. Major rewrite: maximum line length restriction removed; use of strlen/lstrlen eliminated; conversion functions always used to calculate required buffers; \r or \n characters no longer lost; BOM writing now optional; UTF-8 reading and writing works properly; systematic tests are now included with the demo project