Cleaning up html files saved as text files

JoeS
Just Arrived


Joined: 08 Apr 2005
Posts: 1



Posted: Tue Jun 03, 2008 3:04 pm    Post subject: Cleaning up html files saved as text files

Web browsers used: Firefox, Konqueror

There are files I save from the web that I would prefer to keep as text files.

When I save a page as a text file, or copy and paste it, some of the punctuation (such as " ' and -) is converted to ?. In some files there are a lot of these.

Maybe there is another web browser or program I could use.
I would appreciate any advice on cleaning up an HTML file after it is saved as a text file.

Thanks.
Elderan
Just Arrived


Joined: 08 Jun 2007
Posts: 0



Posted: Sun Jun 08, 2008 1:28 pm    Post subject:

Hi,
The problem is not your web browser; it is the encoding of the text. Many pages use UTF-8, but when you save them in ASCII mode, some special characters are displayed as ?.

Save the data in UTF-8, or just use the Save-As function of your browser.
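
To illustrate the effect, here is a minimal Python sketch. It assumes the text was converted to ASCII with a "replace" policy, which is one plausible way the ? characters appear; I don't know what the browser actually does internally. The sample string is made up for the example.

```python
# Made-up sample: curly double quotes, an en dash, a curly apostrophe, a Euro sign.
text = "\u201cHello\u201d \u2013 it\u2019s 5\u20ac"

# errors="replace" substitutes "?" for every character ASCII cannot represent.
as_ascii = text.encode("ascii", errors="replace").decode("ascii")
print(as_ascii)

# UTF-8 can represent all of these characters, so the round trip is lossless.
as_utf8 = text.encode("utf-8").decode("utf-8")
print(as_utf8 == text)
```

Every non-ASCII character becomes a single ?, which matches the symptom described in the first post.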
capi
SF Senior Mod


Joined: 21 Sep 2003
Posts: 16777097
Location: Portugal


Posted: Sun Jun 08, 2008 5:00 pm    Post subject:

Exactly. The problem here is that the original HTML contains characters that don't exist in the 128-character ASCII set, such as curved quotes (“) or the Euro sign (€).

When saving to text, the browser is probably saving either to strict 7-bit ASCII (or maybe ISO-8859-1, also known as Latin1), or to the encoding specified by your locale settings. The problem is that whichever encoding it's using seems to not include some of the original characters.
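
If you want to check which encodings can hold a given character, something like the following Python sketch may help. The helper name can_encode is my own invention, and the locale line only reports what Python considers the default on your system, which may or may not be what the browser uses.

```python
import locale


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The encoding Python derives from the locale settings (system dependent).
print("locale default:", locale.getpreferredencoding(False))

# Test a left curly quote and the Euro sign against three encodings.
for name in ("ascii", "iso-8859-1", "utf-8"):
    print(name, can_encode("\u201c", name), can_encode("\u20ac", name))
```

Neither ASCII nor ISO-8859-1 contains the curved quote or the Euro sign, so only the UTF-8 line reports True for both.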

The solution would be to normalize the characters so that fancy stuff like curved quotes and so on is transformed to more standard characters like ". This, however, may not be easy to accomplish from the browser.
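
Outside the browser, this kind of normalization can be sketched in a few lines of Python. The mapping below is only an illustrative starting point that I picked; extend it with whatever characters actually show up in your files. Note that it has to run on text that was saved correctly (for example as UTF-8), because once a character has become ? the original is lost.

```python
# Illustrative mapping from typographic punctuation to plain ASCII.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
})


def to_plain_punctuation(text: str) -> str:
    """Replace common typographic punctuation with ASCII equivalents."""
    return text.translate(PUNCTUATION_MAP)


sample = "\u201cIt\u2019s here\u201d \u2013 finally\u2026"
print(to_plain_punctuation(sample))
```

Characters that are not in the table pass through unchanged, so the result is only guaranteed to be pure ASCII if the table covers everything in the input.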

As Elderan pointed out, saving as UTF-8 would be another solution, since UTF-8 can by definition encode all Unicode characters. This would mean, however, that you'd need a text editor that understands UTF-8 to read the fancy characters in the text file, but that shouldn't be a problem with virtually any modern text editor.
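
As a rough sketch of the UTF-8 route in Python: the point is to name the encoding explicitly both when writing and when reading, instead of relying on the locale default. The file name page.txt and the sample text are placeholders for this example.

```python
import os
import tempfile

# Placeholder text with curly quotes and a Euro sign.
text = "\u201cquoted\u201d costs 5\u20ac\n"

# Write to a temporary directory; "page.txt" is a placeholder name.
path = os.path.join(tempfile.mkdtemp(), "page.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read it back with the same explicit encoding.
with open(path, "r", encoding="utf-8") as f:
    restored = f.read()

print(restored == text)
```

If the file is later opened with a different encoding, the fancy characters will show up as garbage, so the reading program has to be told (or has to detect) that the file is UTF-8.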

Unfortunately I don't really know if there's a way to choose the encoding used when you save text in Firefox, or which encoding it uses to save in the first place. Perhaps asking in the Mozilla forums might help.
All times are GMT + 2 Hours