Nikse Posted November 20, 2010 Author

Quote: Is there any way to improve the word/OCR detection? I want to use "OCR via image compare", because with the other two options there are way too many OCR errors afterwards in German subtitles...

Hi extreme! You really should use Tesseract. If you can make it work and do a bit of "training", I really think it will be the best OCR tool available today!

1. First download the latest German Hunspell dictionary (Spell check -> Get dictionaries).
2. Download the German Tesseract 3 language file: http://tesseract-ocr.googlecode.com/files/deu.traineddata.gz - Unpack it to Tesseract\tessdata
3. Download the attachment to this post, "deu_OCRFixReplaceList.xml", and save it in the "Dictionaries" folder.
4. Run Tesseract on some sub/idx files. When you click "Change all" in the OCR spell check dialog, the replacement will be saved in the "deu_OCRFixReplaceList.xml" file.

If you could email me one or two German sub/idx files I can perhaps fix the most common errors!

Attachment: deu_OCRFixReplaceList.xml
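For anyone without an archive tool at hand, step 2 (unpacking the .gz language file into the tessdata folder) can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `unpack_traineddata` and the example paths are assumptions, so adjust them to wherever Subtitle Edit and Tesseract live on your machine.

```python
import gzip
import shutil
from pathlib import Path


def unpack_traineddata(gz_path, tessdata_dir):
    """Decompress a .gz language file into the given tessdata folder.

    Hypothetical helper for illustration; it is not part of Subtitle Edit
    or Tesseract.
    """
    gz_path = Path(gz_path)
    tessdata_dir = Path(tessdata_dir)
    # Create the target folder if it does not exist yet.
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    # "deu.traineddata.gz" -> "deu.traineddata"
    target = tessdata_dir / gz_path.stem
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Example call (paths are assumptions):
# unpack_traineddata("deu.traineddata.gz", r"Tesseract\tessdata")
```

The result should be a single file named deu.traineddata inside Tesseract\tessdata.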
extreme Posted November 20, 2010

Hi. Thanks a lot for the tip. I think it's actually working better now, but I guess I still need to test and train it some more to get even better results. I've sent you an e-mail with two German sub/idx files. Hope you can use them. :)
DrJackson Posted November 21, 2010

I have to tell you something (OH NO, not again, Nikse is saying right now). I noticed something while trying to edit some lines. If you take a close look, what the right mouse click shows seems to have changed in between, because I need the italic, bold and so on options in the field where I edit the line. (I really hope you understand what I want to say.)

doc

The greatest pleasure in life is doing what people say you cannot do!
arnozet Posted November 25, 2010

Quote: Hi arnozet! I'm sorry, but I don't know what causes this error. NHunspell is a third-party component which is used for spell checking. Also, Subtitle Edit 3.0 final is out: http://code.google.com/p/subtitleedit/downloads/list

Great!! I'll download it. Thanks!!
Nikse Posted November 27, 2010 Author

Quote: If you take a close look, what the right mouse click shows seems to have changed in between, because I need the italic, bold and so on options in the field where I edit the line. (I really hope you understand what I want to say.)

Hi doc! I rarely understand anything at all, so you have high expectations. I've added a custom context menu to the text field (the previous one was auto-generated by Windows). It's exe only: http://www.nikse.dk/se/SubtitleEdit.rar (it also has a new "Auto-backup" feature in settings).

And where did all that snow come from!?
honeybunny Posted November 27, 2010

Hey. I don't know if it's a bug or if I'm doing something wrong. When I change case and have to review the words that will be capitalized, I can uncheck one word so that it will not be capitalized. But in the lower section, where the text is, unchecking one line with a word will uncheck all lines with that word, not only that line. But maybe "agent", for example, needs to be capitalized in one line and not in another.
Nikse Posted November 27, 2010 Author

Quote: Hey. I don't know if it's a bug or if I'm doing something wrong. When I change case and have to review the words that will be capitalized, I can uncheck one word so that it will not be capitalized. But in the lower section, where the text is, unchecking one line with a word will uncheck all lines with that word, not only that line. But maybe "agent", for example, needs to be capitalized in one line and not in another.

"Change casing - names" contains two list views. The upper list view contains the names found, and the lower list view contains the lines where these names are used. If you un-check a name in the upper list view, then lines with this name will no longer appear in the lower list view (unless they contain another name) and will not have their casing changed. So to un-check some lines only, use the check-boxes in the lower list view. Perhaps it would be better to run "Change casing" separately for troublesome names, so lines with multiple names will still be correct.

I hope I understood what you meant, and vice versa ;)
honeybunny Posted November 28, 2010

Quote: "Change casing - names" contains two list views. The upper list view contains the names found, and the lower list view contains the lines where these names are used. If you un-check a name in the upper list view, then lines with this name will no longer appear in the lower list view (unless they contain another name) and will not have their casing changed. So to un-check some lines only, use the check-boxes in the lower list view. Perhaps it would be better to run "Change casing" separately for troublesome names, so lines with multiple names will still be correct. I hope I understood what you meant, and vice versa

Never mind, it seems it's solved. It used to be different. [If I unchecked just one of the lines in the lower section, all the lines containing the word found in the line I unchecked were automatically unchecked.] Thanks :)
DrJackson Posted November 28, 2010

Quote: [...] I rarely understand anything at all, so you have high expectations [...] And where did all that snow come from!?

You rarely understand, I can't make myself understood, what a team.

Which snow?

doc.
Nikse Posted December 24, 2010 Author

Quote: You rarely understand, I can't make myself understood, what a team

Unbeatable

Quote: Which snow?

I don't remember the last time I've seen so much snow here in Denmark! But at least it's not so dark here...

Quote: I was working on a file when suddenly got an error...

It might work better in the 3.1 beta 1 update (with a newer version of NHunspell).

SE 3.1 beta 1 update: http://www.nikse.dk/se/SE31Beta1Update.rar

Change log:

* NEW:
  * Collaboration via the internet ("Networking", also has chat)
  * Auto-backup (never, every minute, every 5th minute, or every 15th minute)
  * Ability to remember the last selected line when re-opening subtitles
  * Support for the subtitle format "Quicktime text" (two variations)
  * Added "Chars/sec" info to textbox in main window
  * Options to choose font color and background color (for list view/text-boxes)
  * Can now import VobSub subtitles embedded in Matroska (.mkv) files
* IMPROVED:
  * Context menu for subtitle textbox now has italic, bold, underline, font name, and color
  * Updated NHunspell (spell check component) to latest version (0.9.6)
  * Synchronization "Show earlier/later" changed a bit, also added shortcut (Ctrl+Shift+A)
  * Main window: Video player will now automatically move up beside subtitle if waveform is displayed + some re-sizing of controls allowed
* FIXED:
  * OCR Fix Engine: Lines after "..." will no longer be changed to start with an uppercase letter
  * Fixed missing line break in Sony DVD Architect (w/ line numbers) - thx Rosa
  * Fixed a minor bug in initialization of waveform - thx Frederic!
  * Fixed a minor bug in Visual Sync, if end scene was after video length

Oh, and merry Christmas!
Kerensky Posted December 24, 2010

Great improvements, Nik! And I'm sure there are a lot more of those to come. Merry Xmas!

[Kerensky] Transcript Annotations Cleaner v26-12-2010
[Kerensky] Automatic Subtitle Synchronizer v12-01-2010
DrJackson Posted January 19, 2011

I think someone missed me. First things first.

1. SE has some very useful features, e.g. "Insert new subtitle at the video position". After marveling at how many things SE knows how to do, I started to think about how it does them, especially that feature. I thought it might work by allocating display time to the line based on the number of characters in it. I could be wrong, but that would be the logical way to do it. Good...

Now, as the lazy subtitle synchronizer that I am, I encountered a problem. In many transcripts I meet lines that need to be split to fulfil certain requirements, such as maximum length (maximum number of characters per line). The program does that, and does it very well: it takes the line and splits it, giving equal time to both resulting lines, taking punctuation marks and other program settings into account. BUT...

Special case: imagine a line like:

-How to split? - Splitting long line is made considering punctuation, line is split in two, and displaying time is properly allocated.

Using the split option in the program, the result will be something like:

-How to split?
Splitting long line is made considering punctuation, line is split in two, and displayed time is properly allocated.

or (this is the most common):

-How to divide?
-Splitting long line is made considering punctuation, line is split in two, and displaying time is properly allocated.

and the time is split equally between the two resulting lines. Here's the problem! The first line needs less time on screen, right? So, is it possible that when I use the split option, the output lines get display times corresponding to the number of characters displayed?

2. In so many cases, transcripts are full of "Awe", "Oh", "Whew", "Pfui" and so many other very expressive words. Is it possible to add those words to the "Remove text for HI" option? (I guess we need to talk more about this.)

doc.
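To make doc's request concrete, here is a rough Python sketch of splitting a subtitle's display time in proportion to the character count of each half instead of 50/50. It is only an illustration of the idea in the post, not how Subtitle Edit actually implements its split; the function name `split_duration`, the millisecond units, and the optional `gap_ms` parameter are all assumptions.

```python
def split_duration(start_ms, end_ms, first_text, second_text, gap_ms=0):
    """Divide one subtitle's time span between two lines by character count.

    Hypothetical helper: returns ((start1, end1), (start2, end2)) in ms.
    """
    # Time actually available for display once the optional gap is reserved.
    total = end_ms - start_ms - gap_ms
    n1 = len(first_text.strip())
    n2 = len(second_text.strip())
    # Fall back to an equal split if both lines are empty.
    share = 0.5 if n1 + n2 == 0 else n1 / (n1 + n2)
    first_end = start_ms + round(total * share)
    second_start = first_end + gap_ms
    return (start_ms, first_end), (second_start, end_ms)


# A 4-character line followed by a 12-character line, sharing 8 seconds:
print(split_duration(0, 8000, "aaaa", "aaaaaaaaaaaa"))
# -> ((0, 2000), (2000, 8000))
```

With this rule a short first line such as "-How to split?" gets only a small share of the time, and the long second line gets the rest. A real tool would probably also enforce a minimum display time per line.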
Nikse Posted January 21, 2011 Author

Quote: I think someone missed me.

Always. Especially now that you are the only one from Stargate not cancelled...

Quote: 1) Now, as the lazy subtitle synchronizer that I am, I encountered a problem.

Yes, lazy ppl use good software. In SE 3.1 beta 9 the splitting should work better: http://www.nikse.dk/se/se31beta9.zip

Other SE news:
- Use MS Word for spell check (EDIT: It's an option)
- Edit original sub at the same time as new translation (EDIT: It's an option)

2) I will look at it later.
DrJackson Posted January 22, 2011

Quote: Yes, lazy ppl use good software

That's for sure, without any doubt! Thank you, Nik, and keep up the good work.

doc.
Alex1969 Posted January 23, 2011

Quote: In so many cases, transcripts are full of "Awe", "Oh", "Whew", "Pfui" and so many other very expressive words. Is it possible to add those words to the "Remove text for HI" option? (I guess we need to talk more about this.)

These kinds of words are not considered HI annotations, IMHO.

It's not the years in the life, but the life in the years
Nikse Posted January 23, 2011 Author

Quote: These kinds of words are not considered HI annotations, IMHO.

Heh, what would you call them then? I just found a couple of similar examples:

W-wait! l-impossible!
Aah! Here it comes!
DrJackson Posted January 23, 2011

Quote: These kinds of words are not considered HI annotations, IMHO.

Quote: Heh, what would you call them then? [...]

It seems Nikse got my point. In a non-HI subtitle version, those types of interjections are useless. Especially since, for a translator, these interjections are nothing more than things that give him big headaches, like:

What to do? Keep a line with Ahh!, Ohh, Uhh!, Ehh! and translate it, or remove it?
Which is easier? Removing them line by line or word by word, or using a small dictionary of interjections and removing them all with one click?

Ohh! We can discuss where to put that option in SE, that's correct! But the option under TOOLS - Remove text for HI seems to be the most appropriate.

doc.

P.S. That split option, right now, is just... PERFECT!
Nikse Posted January 23, 2011 Author

Quote: In a non-HI subtitle version, those types of interjections are useless.

OK, so they are called "interjections" - English is obviously not my native language. Should Subtitle Edit have this feature in "Remove text for Hearing Impaired" or in "Fix common errors"?

Quote: P.S. That split option, right now, is just... PERFECT!

Super :)
DrJackson Posted January 23, 2011

Quote: Should Subtitle Edit have this feature in "Remove text for Hearing Impaired" or in "Fix common errors"?

I don't think those interjections are errors. What's more, "Remove..." has a feature that allows the user to choose what to do, right?

doc.
Alex1969 Posted January 25, 2011

I was referring to the sounds/words that a character uses to verbally express various feelings (surprise, disgust, happiness, pain, etc.). Since the actor is actually saying them out loud, they shouldn't be considered part of the HI text. Anyway, doc is right: they could be added to the list for "Remove text for hearing impaired", and if necessary the corrector can choose to keep them in the text, since there's already that option ;)
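As a rough illustration of the "small interjection dictionary" idea discussed above, here is a Python sketch that strips listed interjections from a subtitle line. It is not Subtitle Edit's implementation; the word list, the function name `remove_interjections`, and the capitalization rule are all assumptions made for the example.

```python
import re

# Hypothetical word list, taken from the examples mentioned in this thread.
INTERJECTIONS = ["Awe", "Oh", "Whew", "Pfui", "Aah", "Ahh", "Ohh", "Uhh", "Ehh"]


def remove_interjections(line, words=INTERJECTIONS):
    """Remove whole-word interjections plus the punctuation that follows them."""
    # Longest words first, so "Ohh" is tried before "Oh".
    alternatives = "|".join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)
    )
    pattern = re.compile(
        r"\b(?:" + alternatives + r")\b[!?.,]*\s*", re.IGNORECASE
    )
    cleaned = pattern.sub("", line).strip()
    # If the original line started with a capital, keep it that way.
    if cleaned and line.strip()[:1].isupper():
        cleaned = cleaned[0].upper() + cleaned[1:]
    return cleaned


print(remove_interjections("Aah! Here it comes!"))  # Here it comes!
print(remove_interjections("Oh, what to do?"))      # What to do?
```

This is deliberately naive: it matches whole words only, so "Ohio" or "Awesome" are left alone, but it cannot tell when an interjection should stay. That is why the per-line review option mentioned above would still be needed.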