
subtitle coding


sumquodsum



The file was created by my Python script; its name is "check_timing.py", but I don't know why you thought that specifying the program would help you. Yes, there is a BOM at the beginning of the file (for Windows users), although the line endings are *nix style (just LF, not CR+LF), because I never met a media player that fails to read the file because it has *nix-style line endings, so why store them with CR+LF on my Linux computer?

As for the dropbox.com folders: there are tons of subtitle sites, why not use one? Get the subtitle from here. (In case there is an issue with links to other subtitle sites, the previous link points to opensubtitles dot org, and the path to the page is en/subtitles/3828375/true-blood-en ).


dny was asking for the original subtitle, as it was before you ran your Python script on it, so he can run some tests.
If you have a good PHP script to detect encoding, it would be very welcome... external apps just don't do the trick.


I downloaded the srt in the zip from the opensubtitles link you included.
Then I turned around and uploaded that to Addic7ed. Checked it and it seems fine.

Upon exporting it, it was clear that addic7ed.com isn't including the BOM, but the file seemed to work ok for me. What do you have that's recognizing the file incorrectly?
D

What do you have that's recognizing the file incorrectly?


1. Did you re-download the file from addic7ed? I just did. It comes out as I described previously.

More technical: the last line of the downloaded file contains the following bytes (Unicode encoded as UTF-8, decoded as CP1252 and encoded as UTF-8):

"\xc3\xa2\xc2\xac\xc2\x8424000\xc3\x83\xc2\xb71001\xc3\xa2\xc2\xac\xc2\x84\r\n\r\n"

These bytes should be (Unicode encoded as UTF-8):

"\xe2\xac\x8424000\xc3\xb71001\xe2\xac\x84\r\n\r\n"

The bytes above seen as Unicode characters:

⬄24000÷1001⬄



2. Just “view & edit” the file online. Go to the last two pages. What do you see? Eighth notes “♪” or “â™ª”? I see the latter.
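The double encoding described in point 1 can be reproduced in a few lines of Python 3. This is only a sketch, and the variable names are made up for illustration. One detail worth noting: the garbled bytes quoted above contain \xc2\x84, which is what a Latin-1 (ISO-8859-1) decode of byte 0x84 produces; a strict CP1252 decode would map 0x84 to „ (U+201E) instead. The sketch therefore uses latin-1 so that it matches the quoted bytes exactly.

```python
# Last line of the subtitle as intended: U+2B04, "24000", U+00F7, "1001", U+2B04.
original = "\u2b0424000\u00f71001\u2b04\r\n\r\n"

# What the file should contain: plain UTF-8.
correct = original.encode("utf-8")

# What the download contains: the UTF-8 bytes were read as a single-byte
# encoding and then encoded as UTF-8 a second time.
garbled = correct.decode("latin-1").encode("utf-8")

# Undoing the damage is the same round trip in reverse.
repaired = garbled.decode("utf-8").encode("latin-1")

# The eighth note from point 2, misread as CP1252.
note_misread = "\u266a".encode("utf-8").decode("cp1252")

print(correct)
print(garbled)
print(repaired == correct)
print(note_misread)
```

As long as the second encoding step did not drop any bytes, the reverse round trip recovers the original file, so already-uploaded subtitles could in principle be repaired rather than re-uploaded.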


It seems to me that the site code ties the encoding to the language of the subtitle; I strongly believe it shouldn't. I could read enough PHP documentation to port my Python encoding-detection module, but there is really no need; you can transfer the responsibility to the uploader.

What I propose is this: on the upload page, right next to the Language listbox, add an Encoding one, with the “Default” option pre-selected and a note to the uploader: “If you don't know what the encoding of your subtitle is, leave it as "Default".” Add a “UTF-8” option below the "Default" choice. Use the uploader's choice if it is not “Default”; otherwise proceed as usual. That's all.

PS I know very little PHP; from what I've seen browsing questions on stackoverflow.com, only recent versions of PHP have decent support for Unicode handling. So I'll ask the following: is there some way in the PHP you use for the site to attempt conversion FROM UTF-8 and, if that fails, use the default encoding for the chosen language? I mean, UTF-8 is very recognizable, and it's very, very uncommon for a non-UTF-8 encoded file to be mistaken for UTF-8. If I may give an example in Python of what I mean:

try:
    unicode_text = input_text.decode("UTF-8")
except UnicodeDecodeError:
    # or CP1253 if the chosen Language is Greek, etc.
    unicode_text = input_text.decode("CP1252")


Is something like that possible during the upload process?
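To make the question concrete, here is a slightly fuller sketch in Python 3 that combines the fallback above with the proposed Encoding listbox. Everything named here is hypothetical: `decode_subtitle`, the `LANGUAGE_DEFAULTS` table and the language names are illustration only, not the site's actual code. It uses Python's "utf-8-sig" codec so that a leading BOM, if present, is dropped.

```python
# Hypothetical per-language fallback encodings; the real site would have its own table.
LANGUAGE_DEFAULTS = {
    "English": "cp1252",
    "Greek": "cp1253",
}


def decode_subtitle(raw, language, uploader_choice="Default"):
    """Return the subtitle text as a str.

    raw             -- the uploaded file contents as bytes
    language        -- value of the Language listbox
    uploader_choice -- value of the proposed Encoding listbox
    """
    if uploader_choice != "Default":
        # The uploader said what the encoding is: trust that.
        if uploader_choice.upper() == "UTF-8":
            return raw.decode("utf-8-sig")  # also strips a leading BOM
        return raw.decode(uploader_choice)
    try:
        # Try UTF-8 first; non-UTF-8 text almost never decodes cleanly.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Otherwise fall back to the usual codepage for the language.
        return raw.decode(LANGUAGE_DEFAULTS.get(language, "cp1252"))


greek = "Τι σκαρώνει;"
print(decode_subtitle(greek.encode("cp1253"), "Greek") == greek)
print(decode_subtitle(b"\xef\xbb\xbf" + greek.encode("utf-8"), "Greek") == greek)
```

The order matters: trying the language's codepage first would never fail for most inputs, since single-byte codepages accept nearly any byte sequence, so UTF-8 has to be attempted before the fallback.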


I made some edits to the code. Can you retest it sometime and let me know if it's fixed? Thanks, D

Yes! It worked: http://www.addic7ed.com/list.php?id=31333&fversion=0?=1 You can delete this subtitle after you verify.

I'm also gonna try to upload a Greek UTF-8 subtitle there. Thank you.

PS It doesn't work at the moment for Greek: http://www.addic7ed.com/list.php?id=31333&fversion=1&lang=27 . It tries to do the conversion as I described before, only using CP1253 instead of CP1252. I assume it currently doesn't work for any other language either.


The site was originally geared to work for loads of Cyrillic-based languages. What encoding is the source srt for Greek? (apparently not UTF-8 eh?)
Can you give me a URL to the Greek one so I can test with it?
Thanks,
D

Hm. I'll try to be as explicit as I can and provide as much information as possible in order to help you.

I attempted to upload a UTF-8 encoded Greek subtitle; you can view & edit the subtitle here: http://www.addic7ed....rsion=1?=27
Note that the subtitle data I uploaded was interpreted as CP1253 (the Windows Greek codepage); how do I know that? Because of the very prominent "Ξ" character; this is byte 206, which is the first byte of most UTF-8 encoded Greek characters. The fact that it shows on the subtitle edit page as "Ξ" and not "Î" (206 decoded as CP1252) tells me that someone at the site has already made the connection that CP1253 is the typical Greek codepage.
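The "Ξ" versus "Î" reasoning can be checked with a short Python 3 sketch. The variable names are made up, and the sample text is simply the first words of the test subtitle; `errors="replace"` is there only because a few byte values are undefined in these codepages.

```python
sample = "Στα προηγούμενα"
utf8_bytes = sample.encode("utf-8")

# First byte of the UTF-8 form of the first Greek letter.
first_byte = utf8_bytes[0]

# The same bytes misread as the Greek and the Western Windows codepages.
as_cp1253 = utf8_bytes.decode("cp1253", errors="replace")
as_cp1252 = utf8_bytes.decode("cp1252", errors="replace")

print(first_byte)
print(as_cp1253)
print(as_cp1252)
```

If the edit page shows text dominated by "Ξ", the bytes were read as CP1253; if it is dominated by "Î", they were read as CP1252. In both cases the underlying bytes are valid UTF-8.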

Now, I'm attaching test-greek-subtitle.zip, a very small sample subtitle for your tests; the zip file contains two subtitle files: "test-greek-subtitle-cp1253.srt" and "test-greek-subtitle-utf8.srt". Both of them, after uploading, should have the same contents:

1
00:00:00,800 --> 00:00:03,050
<i>Στα προηγούμενα…</i>

2
00:00:03,334 --> 00:00:06,200
–Ο Τόμας είναι ο καλύτερος φίλος μου.
–Τι σκαρώνει;

3
00:00:06,240 --> 00:00:09,161
Αυτός που σε ενδιαφέρει
είναι ο Ρόι, όχι ο Μπλουμ.

4
00:00:09,201 --> 00:00:12,788
Ο Μάιλς εντόπισε
κάποια Τανάζ Σαχάρ.

5
00:00:13,087 --> 00:00:15,614
Ο Σπάγκλερ έχει ένα πρες-παπιέ
της Άτλας ΜακΝτάουελ.

PS Damn, this forum should be internationalized, really.


The UTF-8 file in the zip worked OK for me, from what I can tell ("It's all Greek to me" :)).

This is the conversion function in PHP that's called in code.

$texto = iconv($charset, "UTF-8", $texto);

Interestingly enough the system detects the encoding is iso-8859-7 (not CP1253) in part because mb_detect_encoding crashes if you put in the CP versions.
If I remove the mb_detect_encoding and use CP1253, then all is good.

Can you confirm that you had problems with the UTF8? I'm wondering if my earlier fixes to UTF-8 have corrected the problem sufficiently.
Also, can you test your Greek and UTF-8 files with http://www.addic7ed.com/newsub.dny.php as the upload page?

It seems like I need to make a version of the page which is capable of dealing with the iso and the cp code tables since neither is a superset of the other.
http://www.cs.tut.fi/~jkorpela/unicode/greek.html

D

You really don't need to cater for ISO-8859-7. That was the official Greek codepage, but the users (Unix/Linux) who used it don't use it anymore, since we've moved on to UTF-8 like the rest of the sane world. Any non-UTF-8 Greek subtitle you'll find on the net will be CP1253 (and for subtitle purposes the two are the same, except for the accented Greek capital alpha, which Microsoft moved in CP1253 because it clashed with the end-of-paragraph character in MS Word).
So, any Greek subtitle you get will be either CP1253 or UTF-8.
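For anyone who wants to verify the difference, here is a small Python 3 sketch using Python's own "cp1253" and "iso8859_7" codecs; the helper `encodable` and the variable names are made up for illustration. It also shows a second practical reason to prefer CP1253: as far as Python's codec tables go, the en dash used in the sample subtitle exists in CP1253 but not in ISO-8859-7.

```python
def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


alpha_tonos = "\u0386"  # accented Greek capital alpha

in_cp1253 = alpha_tonos.encode("cp1253")   # b"\xa2"
in_iso = alpha_tonos.encode("iso8859_7")   # b"\xb6"

# The ISO byte read as CP1253 is the pilcrow (end-of-paragraph mark).
misread = in_iso.decode("cp1253")

# A line from the sample subtitle, starting with an en dash.
line = "\u2013Τι σκαρώνει;"

print(in_cp1253, in_iso, misread)
print(encodable(line, "cp1253"), encodable(line, "iso8859_7"))
```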
The upload page you created: is it only for testing/uploading greek subtitles?


Good to know.

The uploads to that page are legit. But I created the page for testing.
If there aren't issues then I will move the changes live for everyone.


OK, because I tried to re-upload as UTF-8 the (GBK-encoded) House 7x01 subtitle by elderman, but it still didn't understand the subtitle was already in UTF-8. I also tried to upload it from your test page, but since the episode existed, it didn't allow me to do it.


Do I have a zip of that one yet?
D