I Never Did Like Latin. From MySQL Latin-1 to Utf8mb4

I took Latin in High School. Never liked it much. The only thing I remember is stultus asinus, or “stupid donkey.” Well, the stupid donkey was me when I realized I didn’t have my MySQL table properly configured to handle Unicode Encoding.

We just put the wraps on a new app that migrates WordPress Blog data to NixMash Spring. The processing mostly takes place between databases, but we use the WordPress REST API to import rendered WordPress post content in the NixMash Spring posts.post_content field.

This should not normally be a problem, but WordPress likes to use non-Latin HTML 4.0 special characters like ’ for RIGHT SINGLE QUOTATION MARK instead of a simple ' quote symbol. (It also strips <P/> tags, which is why we’re using the REST API postContent().rendered() in the first place.) When that HTML content was saved to the database, this happened.

Keywords: MysqlDataTruncation, Hexadecimal

Latin-1 Legacy

We’ll start with the initial configuration of the MySQL posts table, database and server, where Latin1 Character Sets were causing problems.

First the database and server where we see a bit of a mix-mash between utf8 and latin1.

Now let’s look at the actual posts.post_content column. Yeah, I know. Let’s just say this table and database was created a long time ago.

For a more complete picture of the issue at hand, here’s the output of SHOW FULL COLUMNS.

UTF8MB4

We can change Character Set and Collation at the server level, the database, the table or even the column level. Here are some approaches for database, table or column.

We want to be thorough and want to not deal with Latin-1 Character Set handling in the future, so we’re updating the MariaDB Server Configuration in /etc/mysql/mariadb.conf.d/50-server.cnf as well.

Updated Character Set and Collation Settings

Let’s look at the final result. Here’s our SHOW FULL COLUMNS output. I should mention that the HTML migration worked with only the post_content Character Set changed to UTF8MB4, but we wanted to be thorough.

The updated Server and Database Character Set variables.

Takeaways

I did a lot of reading about Unicode, MySQL Character Sets and Collation since this HTML import exception happened. Here are some good takeaway bits and links.

An interesting post from Nick Johnson on the WordPress Apostrophe issue.

Wikipedia’s list of XML and HTML Character Entities.

“Unicode is certainly difficult, and the UTF-8 encoding has a couple of inconvenient properties. However, UTF-8 has become the defacto standard encoding on the web, surpassing ASCII, Latin-1, UCS-2 and UTF-16. Just use UTF-8 everywhere. The most important reason why you should support Unicode is that you shouldn’t make unnecessary assumptions about user input…So not supporting other scripts isn’t just a big f*ck you to other cultures, but sticking to Latin-1 doesn’t even allow you to write proper English.”

“Latin-1 is a character set, UTF-8 is a character encoding. Comparing the two doesn’t even make sense. Those are two completely different things. UTF-8 is in fact a valid encoding of Latin-1.”

–two responses to a Software Engineering StackExchange Question

“UTF-8 is prepared for world domination, Latin1 isn’t.” –a StackOverflow Answer

Getting out of MySQL Character Set Hell. A very long post from 2009 but very much still worth the read.

From the previous “Character Set Hell” post, what to do before screwing with your Server Character Set and Collation Settings.

# service mysql stop
# cd /var/lib
# tar zcvf mysql-backup.tar.gz mysql
# service mysql start

How do I see what character set a MySQL database / table / column is?
— StackOverflow

Differences between utf8_general_ci and utf8_unicode_ci. Includes links to their collation symbol charts. From a MySQL Forums response.

“The Unicode Standard includes characters from the Basic Multilingual Plane (BMP) and supplementary characters that lie outside the BMP. Utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length. For a supplementary character, utf8 cannot store the character at all, whereas utf8mb4 requires four bytes to store it.” —MySQL Unicode Documentation

I Never Did Like Latin. From MySQL Latin-1 to Utf8mb4

Latin-1 Legacy

UTF8MB4

Updated Character Set and Collation Settings

Takeaways

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112