|
Text file encoding |
Posted on: 7/17 22:31
#1 |
---|---|---|
Just popping in
Joined:
2015/9/28 23:42 From Bettendorf, IA, USA
Posts: 215
|
Is there really a need anymore to convert a text file to/from ANSI to/from UTF8?
Going from UTF8 to ANSI would just strip out any extended characters, yes? Going from ANSI to UTF8 would do nothing until you add your new extended characters, yes? So just always stay in UTF8 encoding since it is a global Amiga scene, not just English. Yes? |
|
|
Re: Text file encoding |
Posted on: 7/17 22:53
#2 |
---|---|---|
Home away from home
Joined:
2006/12/4 23:15 Posts: 2145
|
For a development text editor I'm not sure UTF-8 is a good idea, especially as UTF-8 is so badly supported on AmigaOS anyway.
Some programming languages support UTF-8 strings, some don't. |
|
|
Re: Text file encoding |
Posted on: 7/18 0:47
#3 |
---|---|---|
Quite a regular
Joined:
2006/12/2 0:35 From Sydney
Posts: 685
|
@mritter0
UTF-8 is reasonably well supported at the DOS and file system level, but graphics.library can't display UTF-8 characters and needs an extensive rewrite in that area. Graphics also needs to support L-R and R-L text drawing. AFAIK no one has taken on the job yet. The question arises - who would you be working for? |
|
_________________
cheers tony |
||
|
Re: Text file encoding |
Posted on: 7/18 4:27
#4 |
---|---|---|
Quite a regular
Joined:
2006/12/6 19:36 Posts: 505
|
@mritter0
If you’re going to look into this, mind that there are sub variants such as UTF-8 NFD and UTF-8 NFC (the Unicode normalization forms). Why oh why this big mess!?!? :~| UTF-8 Text at www.ietf.org UTF-8 at Wikipedia |
|
|
Re: Text file encoding |
Posted on: 7/18 23:08
#5 |
---|---|---|
Just can't stay away
Joined:
2006/12/1 18:01 From Copenhagen, Denmark
Posts: 1112
|
@mritter0
Quote: Is there really a need anymore to convert a text file to/from ANSI to/from UTF8?
Quote: Going from UTF8 to ANSI would just strip out any extended characters, yes?
Also check the C:CharSetConvert command and its documentation, as well as the files Charsets.doc and Fonts.doc (and maybe Keyboards.doc and Keymaps.doc). They are all found in SYS:Documentation/. Oh, and they are plain text files, not some fancy Windoze Word format, just in case you thought so ...
Quote: So just always stay in UTF8 encoding
Quote: graphics.library can't display UTF-8 characters
Best regards, Niels |
|
|
Re: Text file encoding |
Posted on: 7/18 23:24
#6 |
---|---|---|
Just popping in
Joined:
2015/9/28 23:42 From Bettendorf, IA, USA
Posts: 215
|
OK, sounds like UTF-8 is out. I didn't know how (in)complete OS4's handling of it was.
I just had some code that does UTF-8 to Latin1 conversion on a string. But I don't have any save code, if it is any different. |
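For the save direction, a minimal sketch in plain C, assuming the editor's internal text is ISO-8859-1 (Latin-1); other system charsets would need a mapping table first. `latin1_to_utf8` is a made-up helper name for illustration, not an OS function. Every Latin-1 byte below 0x80 is written unchanged, and every byte from 0x80 to 0xFF becomes a two-byte UTF-8 sequence.

```c
#include <stddef.h>

/* Hypothetical helper (not an AmigaOS API): encode a NUL-terminated
 * ISO-8859-1 string as NUL-terminated UTF-8.
 * Returns the number of bytes written (excluding the NUL),
 * or -1 if dst is too small. */
int latin1_to_utf8(const unsigned char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (dst_size == 0)
        return -1;

    for (; *src != '\0'; src++) {
        if (*src < 0x80) {
            /* ASCII range: one byte, unchanged. */
            if (out + 1 >= dst_size)
                return -1;
            dst[out++] = (char)*src;
        } else {
            /* 0x80..0xFF: two bytes, 110000xx 10xxxxxx. */
            if (out + 2 >= dst_size)
                return -1;
            dst[out++] = (char)(0xC0 | (*src >> 6));
            dst[out++] = (char)(0x80 | (*src & 0x3F));
        }
    }
    dst[out] = '\0';
    return (int)out;
}
```

In the worst case the output is twice the input length plus the terminator, so a buffer of `2 * strlen(src) + 1` bytes is always enough.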
|
_________________
Workbench Explorer - A better way to browse drawers |
||
|
Re: Text file encoding |
Posted on: 7/19 22:49
#7 |
---|---|---|
Just can't stay away
Joined:
2006/12/1 18:01 From Copenhagen, Denmark
Posts: 1112
|
@tony, broadblues or whoever might know:
Are there functions in some library to perform the conversions on text strings that the CharsetConvert command makes? I thought there might be in locale.library, but I didn't see any in the autodoc. I just wouldn't have thought that CharsetConvert was "hand-coded" to do it internally.
If there are, mritter could still offer to save files as UTF-8 or import UTF-8 files, while sticking to ANSI/ISO (whichever variant matches the current system charset setting) as the internal and default representation.
Heck, if you wanted, mritter, you could even be forward thinking and perform everything internally as UTF-8, but just save and load ISO as default - with whatever limitations that would entail (e.g. having to fall back to U codes for characters outside the character set you're saving to, and something similar when trying to display such a character).
Best regards, Niels
Edit: If not, there is an iconv_lib on OS4Depot which might help? |
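The fallback to U codes could look roughly like this. It is only a sketch, assuming the target charset is ISO-8859-1 (whose 256 characters are exactly U+0000 to U+00FF) and assuming "U codes" means the usual `U+XXXX` notation. `put_latin1_or_ucode` is a hypothetical name, not part of any Amiga library.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: write one Unicode code point into dst either as
 * a single Latin-1 character, or as a "U+XXXX" escape when the code
 * point has no Latin-1 equivalent.
 * Returns the number of chars written (excluding the NUL), or -1 if
 * dst is too small. */
int put_latin1_or_ucode(uint32_t cp, char *dst, size_t dst_size)
{
    int n;

    if (cp <= 0xFF) {
        /* Latin-1 is the first 256 code points of Unicode. */
        if (dst_size < 2)
            return -1;
        dst[0] = (char)cp;
        dst[1] = '\0';
        return 1;
    }

    /* Outside the target charset: fall back to a readable escape. */
    n = snprintf(dst, dst_size, "U+%04lX", (unsigned long)cp);
    if (n < 0 || (size_t)n >= dst_size)
        return -1;
    return n;
}
```

The escape is lossy in one respect: a file that already contains the literal text `U+20AC` cannot be told apart from an escaped euro sign on reload, so a real editor would need its own convention for that.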
|
|
Re: Text file encoding |
Posted on: 7/20 9:19
#8 |
---|---|---|
Home away from home
Joined:
2006/12/4 23:15 Posts: 2145
|
@nbache
Quote:
Yes and no. There is an API in diskfont.library to access the charset information in L:Charsets (see IDiskFont->ObtainCharsetInfo()), so converting between a given charset and Unicode can be done via that function. I use this in SketchBlock to find glyphs in fonts for the text engine. There aren't any public OS functions to encode/decode UTF-8 (or UTF-7, 16, 32 or any other variant) that I'm aware of. These would need to be implemented by the coder. [edit] There is a bunch of stuff in recent betas of utility.library Quote:
Or for a much more Amiga-friendly way there is the de facto standard of codesets.library, which wraps the various OS variants' functions with a higher-level API. Used by YAM, AWeb and a few others. |
|
|
Re: Text file encoding |
Posted on: 7/20 17:26
#9 |
---|---|---|
Just popping in
Joined:
2006/12/5 22:55 From Vantaa,Finland
Posts: 32
|
@broadblues
I'm actually using utility.library functions for character conversion, and I have normal AOS4.1 Final with Update 1, not any betas. As my program internally uses UCS-4 as a storage format, these examples convert into UCS-4 and from UCS-4 to the current charset, but maybe one gets the clue from this.
...
Edit: added the typedefs. Marko |
|
|
Re: Text file encoding |
Posted on: 7/20 20:19
#10 |
---|---|---|
Just popping in
Joined:
2015/9/28 23:42 From Bettendorf, IA, USA
Posts: 215
|
@nbache
I was thinking about adding menu options to "Load as UTF-8" and "Paste as UTF-8". I don't know if saving involves anything extra. I have never used a UTF-8 text file. I wouldn't be looking to do anything overly complicated like finding a correct font (we don't have many). If the user's locale can display the characters then great. Like you said, looking ahead, but nothing complicated. |
|
|
Re: Text file encoding |
Posted on: 7/24 20:47
#11 |
---|---|---|
Home away from home
Joined:
2006/11/20 16:26 From Norway
Posts: 2740
|
@tonyw
I did: UTF8.library on OS4Depot.net, so I did the work. It's probably better to use 32-bit Unicode internally, as there is no need to convert to and from it while editing, which should be better for a text editor. Then there is no need to worry about inserting chars of different byte lengths into the middle of strings. Anyway, the work was partially unnecessary, as C++ has support for UTF-8 strings.
The issue is this: UTF-8 sucks because there are few text editors that support it, so if he writes a text editor that supports reading and writing UTF-8, that would be really nice. And no, saving as an 8-bit charset (using codesets encoding) is no replacement for UTF-8: the internet and XML files are UTF-8, and editing a web page on Amiga that way will just trash it. Anyway, I believe all Qt programs support UTF-8.
As for the general problem with supporting Unicode strings, fonts are limited to certain languages and do not include the glyphs of all languages. So if you're editing a file that has Chinese, Arabic and some Russian in it, you need to render different parts of the text with different fonts. Anyway, rendering the text is one problem; being able to type Chinese is a different problem. |
|
_________________
(NutsAboutAmiga) Basilisk II for AmigaOS4 AmigaInputAnywhere Excalibur and other tools and apps. |
||
|
Re: Text file encoding |
Posted on: 7/24 21:08
#12 |
---|---|---|
Home away from home
Joined:
2006/11/20 16:26 From Norway
Posts: 2740
|
@blmara
So there is some support in AmigaOS4.x, that's nice. If I had known, I wouldn't have spent time on my own library, but anyway I learned a lot from doing the work. So now you only need to render the glyphs one by one using the bullet API, or using a TrueType library. Each char in UCS-4 should map directly to a glyph in the font. |
|
_________________
(NutsAboutAmiga) Basilisk II for AmigaOS4 AmigaInputAnywhere Excalibur and other tools and apps. |
||
|
Re: Text file encoding |
Posted on: 7/24 21:11
#13 |
---|---|---|
Home away from home
Joined:
2006/11/20 16:26 From Norway
Posts: 2740
|
@broadblues
>> There is a bunch of stuff in recent betas of utility.library
Aha, so it's not available for normal people. |
|
_________________
(NutsAboutAmiga) Basilisk II for AmigaOS4 AmigaInputAnywhere Excalibur and other tools and apps. |
||
|
Re: Text file encoding |
Posted on: 7/24 21:28
#14 |
---|---|---|
Home away from home
Joined:
2006/11/20 16:26 From Norway
Posts: 2740
|
@mritter0
So if you'd like to work with UTF-8, one thing you should know is that one char can be 1, 2, 3 or 4 bytes long. The encoding is simple: look at the top bits of the first byte. If bit 7 is 0, it's a 7-bit (ASCII) char. If bit 7 is 1, the byte is part of a multi-byte char: if bits 7,6 are set (110xxxxx) it's a 2-byte char, if bits 7,6,5 are set (1110xxxx) it's a 3-byte char, and if bits 7,6,5,4 are set (11110xxx) it's a 4-byte char. The next bit after the run of set bits is always 0, to mark where the length prefix stops. The remaining bits are masked and shifted into place until you have a 32-bit value that can represent all possible glyphs. A byte with bit 7 set and bit 6 clear (10xxxxxx) is a continuation byte; if you find one where a char should start, you have an illegal char, and the text is probably in an 8-bit charset, not UTF-8.
Because chars don't have a fixed length in UTF-8, it's not easy to work with. Think of it as having to work with an RLE-encoded image: possible, but not really practical. Decoding is relatively fast, a lot faster than rendering the text, so a better option is storing strings in RAM as 32-bit values; then you have a fixed length, and it's more or less as easy as working with chars.
The other issue you need to work out is the alphabets of different languages, as you need to be able to convert between lower and upper case. In ASCII it's easy: subtract 32 from a lower-case letter to get upper case, add 32 to an upper-case letter to get lower case. But that does not work in general for UTF-8 text.
As for typing text into a UTF-8 string: when you get a key press, you need to convert the char into a glyph code (Unicode code point). This can be done with codesets.library; instead of storing the 8-bit value into the UTF-8 string you store the encoded glyph value, using the codepage to look it up.
Edited by LiveForIt on 2019/7/24 22:01:01 |
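The bit patterns described above translate fairly directly into C. This is a minimal sketch of a decoder for one char, following the rules in RFC 3629; `utf8_decode` is a hypothetical name, not a function from utility.library, codesets.library or UTF8.library. It also rejects overlong forms, surrogates and values above U+10FFFF, which the description above does not mention.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from s (at most len bytes).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is not valid UTF-8 at this position. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t c, min;

    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                      /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx: 2 bytes */
        need = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx: 3 bytes */
        need = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx: 4 bytes */
        need = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;   /* stray continuation byte (10xxxxxx) or invalid lead */
    }

    if (len < need)
        return 0;   /* sequence is cut off */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);       /* shift in 6 payload bits */
    }

    /* Reject overlong encodings, UTF-16 surrogates and out-of-range values. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return need;
}
```

Calling it in a loop and advancing by the return value turns a UTF-8 buffer into 32-bit values; a return of 0 is a reasonable hint that the file is in an 8-bit charset instead.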
|
_________________
(NutsAboutAmiga) Basilisk II for AmigaOS4 AmigaInputAnywhere Excalibur and other tools and apps. |
||
|
Re: Text file encoding |
Posted on: 7/24 22:48
#15 |
---|---|---|
Just can't stay away
Joined:
2006/12/1 18:01 From Copenhagen, Denmark
Posts: 1112
|
@LiveForIt
Quote: the other issue need to work out is the ABC of different languages, as need to be able convert from lower char to upper char, in ascii is easy just add value to char to convert to upper char, subtract value from char to get lower char, but that does not work in UTF8. Best regards, Niels |
|
|
Re: Text file encoding |
Posted on: 7/25 9:38
#16 |
---|---|---|
Home away from home
Joined:
2006/12/4 23:15 Posts: 2145
|
@LiveForIt
Quote:
If working in C, using UTF-8 internally allows you to work with existing string.h functionality, as null-terminated strings are still valid. With 4-byte Unicode values the strings are full of null bytes, so none of the string functions work. Conversion for rendering is slower though, of course, but I don't think it would be a huge bottleneck. Quote:
Saving out as UTF-8 is always useful, but you should use character references for all non-ASCII code points even in a UTF-8 encoded webpage. Quote:
Who cares. Quote:
If you've got to the point where you are worrying about that, you have solved a lot of problems! And the OP is making a programming text editor, which is unlikely to be using kanji, Cyrillic, Arabic or Chinese characters that often. Quote:
Quite. TBH you need a dedicated keyboard for that anyway, even on Windows (or an awful lot of obscure key sequences!). |
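To illustrate the string.h point: byte-oriented functions such as strlen() and strstr() keep working on UTF-8, they just count and compare bytes rather than characters. A small sketch, assuming the input is already valid UTF-8; `utf8_count` is a made-up helper, not a library function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8
 * string by skipping continuation bytes (10xxxxxx).
 * Assumes the string is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;    /* this byte starts a new code point */
    }
    return n;
}
```

For the six-byte UTF-8 string `"na\xC3\xAF" "ve"` (the word "naïve"), strlen() reports 6 bytes while `utf8_count` reports 5 characters, and strstr() still finds the substring "ve". Cursor movement and column counting are where an editor has to use the character count instead of the byte count.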
|
|
Re: Text file encoding |
Posted on: 7/25 9:40
#17 |
---|---|---|
Home away from home
Joined:
2006/12/4 23:15 Posts: 2145
|
@LiveForIt
Quote:
You quoted me and blmara out of order; it seems from his post that the functions are public. You need version 54 of utility.library. I mistakenly assumed that was still beta. I might be wrong, I haven't had a chance to verify. |
|
|
Re: Text file encoding |
Posted on: 7/25 14:51
#18 |
---|---|---|
Just popping in
Joined:
2015/9/28 23:42 From Bettendorf, IA, USA
Posts: 215
|
It looks like UTF8 is still far from a friendly, usable state. More work than I want to put in at this time. And if it will slow things down processing the extra bits, then no. I want it to be as fast/smooth as possible.
The syntax highlighting slows things down a little (depending on language), and it is not as accurate as I would like. I would rather spend my time fixing that. Thanks for the input. |
|
_________________
Workbench Explorer - A better way to browse drawers |
||
|
Re: Text file encoding |
Posted on: 7/25 16:46
#19 |
---|---|---|
Just popping in
Joined:
2012/10/17 19:42 Posts: 88
|
@mritter0
If it is for Struct then isn’t performance better using ASCII, at least for C/C++ ... Any plan to do hardware scrolling using the GPU, a bit like what Cygnus did with the Blitter? Are you optimizing your code to limit cache misses? |
|
|
Re: Text file encoding |
Posted on: 7/25 21:25
#20 |
---|---|---|
Just popping in
Joined:
2015/9/28 23:42 From Bettendorf, IA, USA
Posts: 215
|
@Kamelito
Performance first. My thought for UTF8 was for when doing locale strings, not so much for everyday programming. GPU scrolling, no idea how to do that. I have not looked into optimizing yet. Still working on some of the core functions. I am working on getting to a point where I can use it to edit its own code without too much hassle. Not too far to go....... |
|