Unicode encoding and decoding is commonplace nowadays. For a developer using an object-oriented language, it is easy to find built-in Unicode encoding and decoding methods; most modern high-level languages include them by default. But when I was working on a C project that required parsing binary data and showing the ASCII/Unicode values, I found it a bit challenging. We were parsing FAST-encoded messages, converting them to a binary string like "10001010010..." and partitioning it at the stop bit. All was going fine until we encountered Unicode characters, which, unlike ASCII, have a variable-length encoding.
So, to solve this issue, I studied an overview of Unicode and code points. Here is a YouTube video that also helped me understand the concept:
Characters in a computer - Unicode Tutorial UTF-8
I also asked a question on Stack Overflow:
Binary to UTF-8 in C
I found that I had to interpret the binary char array as follows, i.e. taking the leading and continuation bytes into account and converting the binary code point to a decimal code point:
"UTF-8 is a specific scheme for mapping a sequence of 1-4 bytes to a number from 0x000000 to 0x10FFFF:

00000000 -- 0000007F: 0xxxxxxx
00000080 -- 000007FF: 110xxxxx 10xxxxxx
00000800 -- 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 -- 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx"

--- Source: https://www.cprogramming.com/tutorial/unicode.html
So I wrote a function that takes the binary character array and a Unicode char array (which gets filled with the resulting Unicode character), calculates the decimal code point, and produces the UTF-8 value:
void processBinaryData(char *pBinaryData, int32_t n, char *unicodeString)
{
    /* pBinaryData: '0'/'1' string holding the n bytes of one UTF-8 sequence.
       n: number of bytes in the sequence.
       unicodeString: receives the resulting UTF-8 character. */
    char arr[n * 8];
    int i = 1, offset = 0, m = 0;

    if (n > 1) {
        size_t len = strlen(pBinaryData);
        while (len > 0) {
            len = len - 8;
            if (i == 1) {
                /* Leading byte: skip the n+1 marker bits (e.g. "110" for a
                   2-byte sequence) and keep the remaining payload bits. */
                offset = n + 1;
                memset(arr, 0, n * 8);
                for (m = 0; m < 8 - (n + 1); m++) {
                    arr[m] = pBinaryData[offset + m];
                }
                /* Advance past the rest of this byte and the next
                   continuation byte's "10" prefix. */
                offset = offset + 8 - (n + 1) + 2;
            } else {
                /* Continuation byte: the "10" prefix has already been
                   skipped, so copy the 6 payload bits. */
                int j;
                for (j = 0; j < 6; j++) {
                    arr[m] = pBinaryData[offset];
                    m++;
                    offset++;
                }
                offset = i * 8 + 2; /* skip the next byte's "10" prefix */
            }
            i++;
        }
        GetUnicodeChar(bin2dec(arr), unicodeString);
    } else {
        /* Single byte (ASCII): the whole byte is the code point. */
        GetUnicodeChar(bin2dec(pBinaryData), unicodeString);
    }
}
For the "GetUnicodeChar" function, I used code from a Stack Overflow answer that converts a decimal code point to a Unicode character.
The full source code is here: binaryToUtf8
Here I have tried to explain how I implemented Unicode decoding for FAST-encoded 7-bit stop-bit data; the code on GitHub is an initial draft. I expect 8-bit encoded data to admit a similar solution.
Other sources: unicode-utf-8-tutorial, c-windows-decimal-to-utf-8-character-conversion