1. Crawl the Chinese characters from a Big5 encoded webpage which most likely has this HTML markup:
<meta http-equiv="Content-Type" content="text/html; charset=big5">
2. Convert them to UTF-8 characters and insert them into a MySQL database table with UTF-8 encoding on a remote machine.
3. Use Putty to log in to the remote machine and view the non-garbage records correctly in the MySQL database table.
And you want to read the Chinese characters as UTF8 encoded characters. You certainly don't want to see gibberish like this:
mysql> select * from stock_test;
+------+---------------+
| id | name |
+------+---------------+
| 1101 | �泥 |
| 1102 | 亞泥 |
| 1103 | 嘉泥 |
| 1104 | ç’°æ³¥ |
| 1108 | 幸� |
| 1109 | 信大 |
| 1110 | �泥 |
+------+---------------+
7 rows in set (0.00 sec)
+------+---------------+
| id | name |
+------+---------------+
| 1101 | �泥 |
| 1102 | 亞泥 |
| 1103 | 嘉泥 |
| 1104 | ç’°æ³¥ |
| 1108 | 幸� |
| 1109 | 信大 |
| 1110 | �泥 |
+------+---------------+
7 rows in set (0.00 sec)
Instead, you want to see correctly encoded Chinese characters like this:
mysql> select * from stock_test;
+------+--------+
| id | name |
+------+--------+
| 1101 | 台泥 |
| 1102 | 亞泥 |
| 1103 | 嘉泥 |
| 1104 | 環泥 |
| 1108 | 幸福 |
| 1109 | 信大 |
| 1110 | 東泥 |
+------+--------+
7 rows in set (0.00 sec)
+------+--------+
| id | name |
+------+--------+
| 1101 | 台泥 |
| 1102 | 亞泥 |
| 1103 | 嘉泥 |
| 1104 | 環泥 |
| 1108 | 幸福 |
| 1109 | 信大 |
| 1110 | 東泥 |
+------+--------+
7 rows in set (0.00 sec)
Exactly what should you do in order to see correct UTF-8 Chinese text instead of garbage characters or junk text when you query MySQL?
本文章會教你如何把 Big5 文字轉換成 UTF-8 文字,並正確的儲存到 UTF-8 MySQL 資料庫,你又可以用 Putty 讀到正確的中文字資料,而不是亂碼。
Create MySQL table
Here's how you would create a MySQL database table with UTF-8 encoding:
create table stock_test (
id varchar(16) not null,
name varchar(16) not null,
PRIMARY KEY (id)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
id varchar(16) not null,
name varchar(16) not null,
PRIMARY KEY (id)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
The following way is less recommended but it also works:
create table stock_test (
id varchar(16) not null,
name varchar(16) not null,
PRIMARY KEY (id)
) DEFAULT CHARSET=utf8;
id varchar(16) not null,
name varchar(16) not null,
PRIMARY KEY (id)
) DEFAULT CHARSET=utf8;
Create crawl script
I use PHP and Eclipse IDE to develop my crawl script to crawl data from a Big5 encoded webpage. I have to make sure the following is true:
- the script's file encoding is set to UTF-8 (right click on the file in Eclipse and select Properties)
- I execute the MySQL command "SET NAMES UTF8" after I connect to the database. The code looks like this:
$conn = new mysqli($servername, $username, $password, $dbname);
$conn->query("SET NAMES UTF8");
- I execute the MySQL command "SET NAMES UTF8" after I connect to the database. The code looks like this:
$conn = new mysqli($servername, $username, $password, $dbname);
$conn->query("SET NAMES UTF8");
Once you crawl the Big5 encoded characters from the webpage, call some PHP function to convert them to UTF-8 characters. Here's the function I use:
$converted_text = iconv($sourceCharset, $targetCharset, $text);
In our case, the function call would look like this:
$converted_text = iconv('big5', 'UTF-8', $text);
Insert $converted_text into your MySQL database table.
Use Putty to log in to remote server
In Window -> Translation, make sure the "Remote character set" in the "Character set translation" section is set to UTF-8.
Connect to the remote server with Putty. My remote box is running Ubuntu 14.04.5 LTS.
Log in to MySQL server
Run this command to use the MySQL command-line tool:
mysql --default-character-set=utf8 -u user_name -ppassword stock_test
Depending on your MySQL settings, you may not need the option --default-character-set=utf8 but it doesn't hurt.
Run the following usual MySQL command to select all data from table stock_test:
select * from stock_test;
And you should see the correctly encoded Chinese characters with no gibberish.
Questions? Let me know!