Unicode encoding and decoding has become very common nowadays. For a developer working in an object-oriented language, it is easy to find built-in Unicode encoding/decoding methods; most modern high-level languages ship with them by default. But when I was working on a project in C that required parsing binary data and displaying the ASCII/Unicode values, I found it a bit challenging. We were parsing FAST-encoded messages, converting them into a binary string like "10001010010......" and partitioning them with the stop bit. All was going fine until we encountered Unicode characters, which, unlike ASCII, have a variable-length encoding.
To solve this issue, I studied an overview of Unicode and code points. Here is a YouTube video that also helped me understand the concept:
Characters in a computer - Unicode Tutorial UTF-8
I also asked a question on Stack Overflow:
Binary to UTF-8 in C
I found that I had to interpret the binary char array the following way: account for the leading and continuation bytes, then convert the binary code point to a decimal code point.
"
UTF-8 is a specific scheme for mapping a sequence of 1-4 bytes to a number from 0x000000 to 0x10FFFF:

00000000 -- 0000007F: 0xxxxxxx
00000080 -- 000007FF: 110xxxxx 10xxxxxx
00000800 -- 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 -- 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
" --- Source: https://www.cprogramming.com/tutorial/unicode.html
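To make the table concrete, here is a minimal standalone sketch (not from the original project) that decodes the cent sign '¢' (U+00A2) from its two UTF-8 bytes, 0xC2 0xA2 = 11000010 10100010, by masking off the 110 and 10 prefixes and concatenating the payload bits:

#include <stdio.h>

int main(void)
{
    /* '¢' (U+00A2) is encoded in UTF-8 as 11000010 10100010 (0xC2 0xA2). */
    unsigned char lead = 0xC2, cont = 0xA2;
    /* Keep the low 5 bits of the leading byte and the low 6 bits of the
       continuation byte, then concatenate: 00010 ++ 100010 = 10100010b = 0xA2. */
    unsigned int codepoint = (unsigned int)(lead & 0x1F) << 6 | (cont & 0x3F);
    printf("U+%04X\n", codepoint); /* prints U+00A2 */
    return 0;
}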
So I wrote a function like this: it takes the binary character array and a Unicode char array (which will be filled with the resulting Unicode character), calculates the decimal code point, and gets the UTF-8 value:
#include <stdint.h>
#include <string.h>

/* Assumed helper signatures (the helpers themselves come from the
   Stack Overflow answer and the cprogramming.com tutorial linked above):
   bin2dec() converts a string of '0'/'1' characters to its numeric value,
   and GetUnicodeChar() writes the UTF-8 bytes of a code point into a buffer. */
unsigned int bin2dec(const char *bin);
void GetUnicodeChar(unsigned int code, char chars[5]);

/* pBinaryData: the bits of an n-byte UTF-8 sequence as a string of '0'/'1'
   characters (its length must be a multiple of 8).
   unicodeString: output buffer filled with the resulting UTF-8 character. */
void processBinaryData(char *pBinaryData, int32_t n, char *unicodeString)
{
    char arr[n * 8]; /* collects only the code point's payload bits */
    int i = 1, offset = 0, m = 0;

    if (n > 1) {
        size_t len = strlen(pBinaryData);
        while (len > 0) {
            len -= 8; /* consume one byte (8 bit characters) per pass */
            if (i == 1) {
                /* Leading byte: skip the (n+1)-bit prefix (e.g. "110" for
                   n == 2) and copy the remaining 8-(n+1) payload bits. */
                offset = n + 1;
                memset(arr, 0, n * 8);
                for (m = 0; m < 8 - (n + 1); m++) {
                    arr[m] = pBinaryData[offset + m];
                }
                offset = 10; /* = 8 + 2: skip the "10" prefix of byte 2 */
            } else {
                /* Continuation byte: skip the 2-bit "10" prefix and copy
                   the 6 payload bits. */
                int j;
                for (j = 0; j < 6; j++) {
                    arr[m++] = pBinaryData[offset++];
                }
                offset = i * 8 + 2; /* payload start of the next byte */
            }
            i++;
        }
        /* arr now holds the code point in binary; convert and encode. */
        GetUnicodeChar(bin2dec(arr), unicodeString);
    } else {
        /* Single byte: plain ASCII, the whole byte is the code point. */
        GetUnicodeChar(bin2dec(pBinaryData), unicodeString);
    }
}
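Finally, a hypothetical usage sketch (it assumes processBinaryData and the two helpers above are defined in the same file): the cent sign '¢' (U+00A2) arrives as the 16-bit binary string "1100001010100010", i.e. the UTF-8 bytes 0xC2 0xA2 spelled out as text:

#include <stdio.h>

int main(void)
{
    char out[5] = {0}; /* up to 4 UTF-8 bytes plus a terminator */
    /* two-byte character, so n = 2 */
    processBinaryData("1100001010100010", 2, out);
    printf("%s\n", out); /* expected output: the cent sign */
    return 0;
}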