Strange characters in sub


Dawg

The sub for CSI Miami - 08x23 - Time Bomb.LOL.English.org contains some strange characters at the following time refs.


00:00:42,632 --> 00:00:44,900
00:01:14,597 --> 00:01:16,865



¶ ¶?

These are some characters to suggest music :)



No, these are more like a capital T with an extended top bar and 2 lines right next to it.

[Attachment: post_727_1274289834_74c20bd1e5e81b373e0c]



Lol they don't look like that to me :))



OK, I'll bite: what do they look like to you?

BTW, thanks for recommending VLC; it's saving me from a lot of headaches caused by the DivX player.


¶ ¶ like this. I posted them above :D I wonder how you see this.



I see 'em as they're displayed here, and as they display in VLC. What I was looking at (the attachment above) was the "raw" (so to speak) character in the sub, so I was thinking it was an "unprintable" (again, so to speak) character, as I hadn't gotten around to watching that episode yet.

Now, having seen the episode, I see that it was indeed (as you said) what some of you folks use to denote music.

Question: why not use the actual music note symbol?

  • 2 weeks later...
  • 3 months later...

You'd have to ask the guys at the captioning companies :) We just sync the scripts.


This is an old thread but I was wondering about this myself. Watching the True Blood finale in VLC, I notice the songs all have odd characters. I know they're supposed to be the characters signifying music but they don't appear that way in VLC. I'm guessing if I made a DVD they might be fine. In VLC they look just like what's in the subtitle itself.

<i>â?ª And every shadow â?ª</i>


You can replace them with * before watching. The encoding does not always work.
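A minimal sketch of that replacement, assuming the subtitle text has already been read into a string. The marker list and the function name `replace_music_markers` are made up for illustration; the markers are just the forms that appear in this thread, so extend the list for whatever your file actually contains.

```python
# Forms of the music marker seen in this thread: the real eighth note,
# and two garbled renderings of it (assumed list, adjust as needed).
MUSIC_MARKERS = ["â™ª", "â?ª", "♪"]

def replace_music_markers(text, replacement="*"):
    """Return `text` with every known music marker swapped for `replacement`."""
    for marker in MUSIC_MARKERS:
        text = text.replace(marker, replacement)
    return text

line = "<i>â?ª And every shadow â?ª</i>"
print(replace_music_markers(line))
```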


(See also here: )

Guys, a little Unicode primer here.

ASCII characters (bytes 0-127) are stored the same way in a file with nearly all commonly used encodings (UTF-16 and UTF-32 being the notable exceptions). The issue begins when you want to store non-ASCII characters; everything written to a file MUST be converted to bytes, and those bytes can mean many things, depending on the encoding used.
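To make that concrete, here is a small sketch that encodes one ASCII-only string and one non-ASCII character under a few encodings and compares the bytes. The choice of encodings and the helper name `encoded_forms` are my own, picked only for illustration.

```python
# Encodings to compare (an arbitrary sample chosen for this illustration).
ENCODINGS = ["ascii", "cp1252", "utf-8", "gbk"]

def encoded_forms(text):
    """Map each encoding name to the bytes it yields for `text`,
    or None when that encoding cannot represent the text."""
    forms = {}
    for name in ENCODINGS:
        try:
            forms[name] = text.encode(name)
        except UnicodeEncodeError:
            forms[name] = None
    return forms

ascii_forms = encoded_forms("Time Bomb")   # ASCII only
accent_forms = encoded_forms("é")          # one non-ASCII character

for name in ENCODINGS:
    print(name, ascii_forms[name], accent_forms[name])
```

The ASCII-only string comes out as the same bytes every time, while "é" gets a different byte sequence depending on the encoding (and cannot be written as plain ASCII at all).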

Now, I understand the site stores the files internally as UTF8. That means that:
“é” (Unicode character 233) as UTF8 becomes 2 bytes with values [195, 169], which look like “Ã©” when read as CP1252. When uploading through the web interface, I've seen that there is no way to tell the form that the subtitle is already encoded as UTF8, so when you choose e.g. English as the language, the site interprets the incoming data as CP1252 (Windows Western). In CP1252, every character takes exactly one byte, so the site treats each byte as a character of its own: the two bytes are read as the two Unicode characters “Ã©”, which are then converted to UTF8 (4 bytes: [195, 131, 194, 169]) and stored like this. When you download the subtitle and use it, the player understands that this is a UTF8 encoded file, so it decodes UTF8 and the 4 bytes of my example become 2 Unicode characters: “Ã©”, which is what you see on your screen.
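The steps above can be replayed in a few lines of Python. This is only a sketch of the process as I described it; I am assuming the site does nothing more than "decode as CP1252, re-encode as UTF8", and the variable names are made up.

```python
original = "é"                          # Unicode character 233

uploaded = original.encode("utf-8")     # bytes in the uploader's file
misread = uploaded.decode("cp1252")     # site assumes CP1252: one char per byte
stored = misread.encode("utf-8")        # site stores the result as UTF8
displayed = stored.decode("utf-8")      # player decodes UTF8 once

print(list(uploaded))   # [195, 169]
print(list(stored))     # [195, 131, 194, 169]
print(displayed)        # Ã©
```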

In the example of the character “♪”:
this is the Unicode character 9834, named “EIGHTH NOTE”. Stored as UTF8, it becomes 3 bytes: [226, 153, 170], and quite possibly the uploader sees the subtitle correctly in their player. If the file were stored exactly like that on the site, everything would be fine; however, the process that I described occurs during the upload, so the site interprets the 3 bytes as CP1252, decodes them into 3 separate Unicode characters (“â™ª”) instead of 1 Unicode character (“♪”) and encodes them as UTF8 into 7 bytes: [195, 162, 226, 132, 162, 194, 170] (the “™” alone takes 3 bytes). You download the file, the media player understands it's a UTF8 encoded file, so it decodes UTF8 (and it does this decoding ONCE) and the 7 bytes become the 3 characters “â™ª”, which are the ones shown on your screen.
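Here is the same walk-through for the music note, wrapped in a small function so you can try other characters. The function name `simulate_upload` and the assumption that the site always guesses CP1252 are mine, for illustration only.

```python
def simulate_upload(text, assumed="cp1252"):
    """Replay the upload: UTF8 bytes are misread as `assumed`,
    then stored as UTF8. Returns (stored bytes, text the player shows)."""
    uploaded = text.encode("utf-8")
    misread = uploaded.decode(assumed)
    stored = misread.encode("utf-8")
    return stored, stored.decode("utf-8")

note = "\u266a"                          # EIGHTH NOTE, Unicode character 9834
stored, displayed = simulate_upload(note)

print(list(note.encode("utf-8")))        # [226, 153, 170]
print(list(stored))                      # [195, 162, 226, 132, 162, 194, 170]
print(len(stored), len(displayed))       # 7 3
```

Note that the stored form is 7 bytes long: the middle character is the trademark sign, which needs 3 bytes in UTF8 on its own.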

Now, sometimes the input files are not UTF8 encoded but GBK encoded (a different encoding, used for Chinese text), and this often happens with subtitles acquired from yyets.net (something like that). There, things become more troublesome, since during the upload the GBK-encoded bytes are decoded as CP1252 and then encoded and stored as UTF8. All hell breaks loose.
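A sketch of why the GBK case is harder, using a made-up two-character sample (not taken from any real subtitle). The garbling works the same way, but undoing it only gives the right text if you already know the original file was GBK; treating the recovered bytes as UTF8 fails for this sample.

```python
original = "音乐"                        # hypothetical sample text ("music")

uploaded = original.encode("gbk")        # bytes in the uploader's file
misread = uploaded.decode("cp1252")      # site assumes CP1252
stored = misread.encode("utf-8")         # site stores the result as UTF8
displayed = stored.decode("utf-8")       # what the player shows

# Undo the site's CP1252 step to get the uploader's raw bytes back.
raw = displayed.encode("cp1252")
recovered = raw.decode("gbk")            # correct only because we know it was GBK

# The UTF8 guess that works for the cases above does not work here.
try:
    raw.decode("utf-8")
    utf8_reversal_failed = False
except UnicodeDecodeError:
    utf8_reversal_failed = True

print(recovered == original, utf8_reversal_failed)
```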

Confusing? Sorry :)

I have a Python script that fixes these things automatically (95% of the time) and produces a correct UTF8 file; I can make that script available to uploaders and editors, who hopefully do not upload through the web interface.
However, a way to upload raw bytes to the site (without any encoding/decoding process) MUST be created for us lame uploaders, so that these issues can be solved much more easily, or never arise in the first place.
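For anyone who wants to experiment in the meantime, here is a minimal sketch of the basic repair idea. This is NOT the script mentioned above, only the simplest version of the reversal: it assumes the text was UTF8 misread as CP1252 exactly once, and the function name `repair_double_encoding` is hypothetical.

```python
def repair_double_encoding(text):
    """Try to undo one round of 'UTF8 bytes misread as CP1252'.
    Returns the text unchanged when the reversal is not possible,
    which usually means the text was not garbled this way."""
    try:
        # Map each character back to the single byte it was misread from,
        # then decode those bytes as the UTF8 they originally were.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(repair_double_encoding("â™ª And every shadow â™ª"))
print(repair_double_encoding("Ã©"))
print(repair_double_encoding("already fine: é ♪"))
```

It will not handle the GBK case, text garbled more than once, or files where a character was already lost and replaced with "?".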