Maker Pro

OT: Copying text from a PDF

Terry Pinnell

Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?
 
Leon Heller

Terry Pinnell said:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

I just tried it and it worked OK for me when I pasted the text into the PFE
editor. Here are a couple of lines:

Drain to Source Breakdown Voltage (Note 1) . . . . . . . . . . . . . . . . .
.. . . . . . . . . . . . . . . . . . . . . .V
DS
50 V
Drain to Gate Voltage (R
GS
= 20k
Ù
) (Note 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.. . . . V
DGR
50 V
Continuous Drain Current T
C

It's not perfect, but I haven't got a CR after every character.

I often extract text from PDFs when creating PCB parts, and don't have many
problems.

Leon
 
Glenn Gundlach

Leon said:
I just tried it and it worked OK for me when I pasted the text into the PFE
editor. Here are a couple of lines:

Drain to Source Breakdown Voltage (Note 1) . . . . . . . . . . . . . . . .. .
. . . . . . . . . . . . . . . . . . . . . .V
DS
50 V
Drain to Gate Voltage (R
GS
= 20k
Ù
) (Note 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. .
. . . . V
DGR
50 V
Continuous Drain Current T
C

It's not perfect, but I haven't got a CR after every character.

I often extract text from PDFs when creating PCB parts, and don't have many
problems.

Leon

I was just doing exactly that from a Motorola (Freescale) PDF for a
software simulation. It doesn't handle tabs (are there any in a PDF?)
and deletes 'whitespace'. BUT, it didn't take very long to restore the
spacing. Not great but easier than typing the whole deal.
GG
 
Terry Pinnell said:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.

What's your text editor? Assuming you're under Windows, perhaps the
problem is trying to paste Unicode into an editor that can't handle it.
You might try pasting the text into Word or Wordpad to see what happens.

You might also look at xpdf, http://www.foolabs.com/xpdf/ . I don't
think you can run the PDF viewer under Windows, but the command-line
utilities, including a PDF-to-text converter, will work.

Matt Roberds
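
For anyone who would rather script the xpdf route suggested above, here is a minimal sketch in Python. It assumes the xpdf `pdftotext` utility is installed and on the PATH; the helper names and the file names in the usage note are made up for illustration. The `-layout` option asks pdftotext to keep the physical page layout, which may help with tabular sections such as a ratings list, though results will still depend on how the PDF was produced.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Return the argument list for xpdf's pdftotext converter."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout
        command.append("-layout")
    command += [pdf_path, txt_path]
    return command


def convert(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext; raises if the tool is missing or exits non-zero."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    subprocess.run(
        build_pdftotext_command(pdf_path, txt_path, keep_layout),
        check=True,
    )
```

With a local copy of the datasheet, `convert("BUZ11.pdf", "BUZ11.txt")` would write the extracted text next to it.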
 
Mike Monett

Leon Heller wrote:
[...]
It's not perfect, but I haven't got a CR after every character.

I often extract text from PDFs when creating PCB parts, and don't have many
problems.

Leon

Don'cha love it when the author turns off the "Text Copy" tool on the
document so you can't copy and paste? Why they do that is beyond me. You
could print as many copies as you wish, or make infinite copies on a
Xerox machine. Why make it difficult to copy a couple of lines of text?

Another moan is when the author uses some weird font that produces
garbage characters when you paste into a text editor. I often end up
shrinking the editor to a small window that overlays the pdf file, and do
a manual copy.

Then there's the text in a scanned image format. No copying, no searches,
and it takes a lot of room on the disk.

Hopefully, in 50 years or so, paper will be found only in museums, and
everyone will have flexible electronic displays. Since there will be no
need to print anything, searches will be easy, and there won't be a need
to use special fonts or lock the document for any reason. Life will be
easy for engineers.

Sure...

Mike Monett
 
Spehro Pefhany

Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

One thing I notice that's amiss is that there is a carriage return
before and after subscripted text. So:

V 50 V
DS

Comes out as V<CR>DS<CR> 50 V

The symbol characters (degrees and ohms) also tend to get
translated/screwed up, depending on where you're pasting to. There are
also some lines screwed up, so the ends of some lines end up together
on later lines.

Problems in extracting text are mostly a function of the application
that created the PDF (Framemaker 5.5 for the Power PC set to
LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
case). In this case, if you open the document in Illustrator you can
see many individual blocks of text, some of which the copy operation
strings together, and others which it misses.

This stuff is fairly easily fixed by a bit of editing-- those dot
leaders are irritating to fix. I tried pasting into a text-only
application (Ultraedit), Excel, the Open Office text editor and into
MS Word, and all came out pretty much the same except for the symbols.
It might even be faster than re-typing everything.

Extracting text using GSView in "normal" mode is only slightly better.


Best regards,
Spehro Pefhany
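
The hand editing described above (stray returns around subscripts, dot leaders) can be partly automated. What follows is only a rough sketch in Python, not anything the posters used: the function names are invented, and the subscript rule is a guess that any line holding nothing but a short run of capitals, digits or parentheses is a stray subscript. That guess will misfire on genuine short lines (a heading such as "MOSFET" would be glued to the line before it), so check the result by eye.

```python
import re

# A dot leader: three or more dots, optionally separated by spaces,
# together with any whitespace in front of the first dot.
DOT_LEADER = re.compile(r"\s*(?:\.\s*){3,}")
# Heuristic (an assumption): a line holding only a short run of capitals,
# digits or parentheses, e.g. "DS" or "DS(ON)", is a stray subscript.
SUBSCRIPT = re.compile(r"[A-Z0-9()]{1,6}")


def collapse_dot_leaders(text):
    """Replace each dot leader with a single space."""
    return DOT_LEADER.sub(" ", text)


def rejoin_subscripts(text):
    """Glue stray subscript lines back onto the line before them."""
    out = []
    glue_next = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            out.append("")
            glue_next = False
        elif out and out[-1] and SUBSCRIPT.fullmatch(line):
            out[-1] += line          # "V" + "DS" becomes "VDS"
            glue_next = True
        elif glue_next:
            out[-1] += " " + line    # carry on with the interrupted line
            glue_next = False
        else:
            out.append(line)
    return "\n".join(out)
```

For the example above, `rejoin_subscripts("V\nDS\n 50 V")` gives `VDS 50 V`. Running `collapse_dot_leaders` first, while each dot leader is still on its own lines, is probably the safer order.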
 
Terry Pinnell

Spehro Pefhany said:
One thing I notice that's amiss is that there is a carriage return
before and after subscripted text. So:

V 50 V
DS

Comes out as V<CR>DS<CR> 50 V

The symbol characters (degrees and ohms) also tend to get
translated/screwed up, depending on where you're pasting to. There are
also some lines screwed up, so the ends of some lines end up together
on later lines.

Problems in extracting text are mostly a function of the application
that created the PDF (Framemaker 5.5 for the Power PC set to
LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
case). In this case, if you open the document in Illustrator you can
see many individual blocks of text, some of which the copy operation
strings together, and others which it misses.

This stuff is fairly easily fixed by a bit of editing-- those dot
leaders are irritating to fix. I tried pasting into a text-only
application (Ultraedit), Excel, the Open Office text editor and into
MS Word, and all came out pretty much the same except for the symbols.
It might even be faster than re-typing everything.

Extracting text using GSView in "normal" mode is only slightly better.


Best regards,
Spehro Pefhany

Thanks for all those prompt responses. I'll follow up the suggestions.

Using TextPad here - great editor.

Same result when pasting into various other apps. I shouldn't have
said returns after *every* character, but still pretty bad:
http://www.terrypin.dial.pipex.com/Images/PDFText1.gif
 
Chris

Terry Pinnell said:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

Couple of options.

Under Adobe Reader 6 use the snapshot tool to copy and paste into Word or
Excel.

or 2.

download an alternative and quicker to open pdf reader from
www.foxitsiftware.com and use the text tool and paste into Excel. This will
give you a more coherent display but still not perfect.

Cheers
 
Chris

Chris said:
Couple of options.

Under Adobe Reader 6 use the snapshot tool to copy and paste into Word or
Excel.

or 2.

download an alternative and quicker to open pdf reader from
www.foxitsiftware.com and use the text tool and paste into Excel. This will
give you a more coherent display but still not perfect.

Cheers

ooops
www.foxitsoftware.com
 
Terry Pinnell

Chris said:
Couple of options.

Under Adobe Reader 6 use the snapshot tool to copy and paste into Word or
Excel.

or 2.

download an alternative and quicker to open pdf reader from
www.foxitsiftware.com and use the text tool and paste into Excel. This will
give you a more coherent display but still not perfect.

Cheers

Thanks. Yes, that is arguably an improvement:
http://www.terrypin.dial.pipex.com/Images/PDFText2.gif
compared to Adobe Acrobat Reader (5 in my case; each version seems to
get worse to me!):
http://www.terrypin.dial.pipex.com/Images/PDFText1.gif
but I see PDF Reader has pasted a fixed size font rather than the
original proportional?
 
Terry Pinnell

Thanks. Yes, that is arguably an improvement:
http://www.terrypin.dial.pipex.com/Images/PDFText2.gif
compared to Adobe Acrobat Reader (5 in my case; each version seems to
get worse to me!):
http://www.terrypin.dial.pipex.com/Images/PDFText1.gif
but I see PDF Reader has pasted a fixed size font rather than the
original proportional?

....but guess I must have used WordPad for the first! Don't recall
doing so - but can't think of any other explanation. So that makes pdf
reader definitely an improvement.
 
Jim Thompson

One thing I notice that's amiss is that there is a carriage return
before and after subscripted text. So:

V 50 V
DS

Comes out as V<CR>DS<CR> 50 V

The symbol characters (degrees and ohms) also tend to get
translated/screwed up, depending on where you're pasting to. There are
also some lines screwed up, so the ends of some lines end up together
on later lines.

Problems in extracting text are mostly a function of the application
that created the PDF (Framemaker 5.5 for the Power PC set to
LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
case). In this case, if you open the document in Illustrator you can
see many individual blocks of text, some of which the copy operation
strings together, and others which it misses.

This stuff is fairly easily fixed by a bit of editing-- those dot
leaders are irritating to fix. I tried pasting into a text-only
application (Ultraedit), Excel, the Open Office text editor and into
MS Word, and all came out pretty much the same except for the symbols.
It might even be faster than re-typing everything.

Extracting text using GSView in "normal" mode is only slightly better.


Best regards,
Spehro Pefhany

I'm using Adobe Acrobat 4... I have version 5, but it's been screwed
over by zealot programmers, so I only use it to read some stuff that
version 4 lacks font capability for.

With version 4 I get spaces with subscripted text, no <CR>; otherwise
looks OK.

...Jim Thompson
 
Boris Mohar

One thing I notice that's amiss is that there is a carriage return
before and after subscripted text. So:

V 50 V
DS

Comes out as V<CR>DS<CR> 50 V

The symbol characters (degrees and ohms) also tend to get
translated/screwed up, depending on where you're pasting to. There are
also some lines screwed up, so the ends of some lines end up together
on later lines.

Problems in extracting text are mostly a function of the application
that created the PDF (Framemaker 5.5 for the Power PC set to
LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
case). In this case, if you open the document in Illustrator you can
see many individual blocks of text, some of which the copy operation
strings together, and others which it misses.

This stuff is fairly easily fixed by a bit of editing-- those dot
leaders are irritating to fix. I tried pasting into a text-only
application (Ultraedit), Excel, the Open Office text editor and into
MS Word, and all came out pretty much the same except for the symbols.
It might even be faster than re-typing everything.

Extracting text using GSView in "normal" mode is only slightly better.


Best regards,
Spehro Pefhany

I use Clipmate http://www.thornsoft.com/ which has nice text cleanup.
Apparently it was not necessary for:

30A, 50V, 0.040 Ohm, N-Channel Power
MOSFET

It showed up as WYSIWYG
 
Ted Edwards

You might also look at xpdf, http://www.foolabs.com/xpdf/ . I don't
think you can run the PDF viewer under Windows,

Just downloaded it. Thanks. Wouldn't want to run it under 'doze
anyway. :)

BTW, Ghost Script/Ghost View extracts it with no problem. So does
Acrobat but it's easier with Ghost.

Ted
 
Ted Edwards

Terry said:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

Three suggestions:
Get PMView and use the screen capture => convert to 16 color => Save as
a .PNG. The file size for the max ratings is <6KB.
Install a virtual PostScript printer set to print to file.

You can grab anything with these tools.

Ted
 
Terry said:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

Using the Column Select tool in my Adobe Reader, I get:

Features
• 30A, 50V
• r
DS(ON)
= 0.040
Ω
• SOA is Power Dissipation Limited
• Nanosecond Switching Speeds
• Linear Transfer Characteristics
• High Input Impedance
• Majority Carrier Device
• Related Literature
- TB334 “Guidelines for Soldering Surface Mount
Components to PC Boards”

Which is close. Apparently when characters are in the symbol font, a
carriage return is inserted. My reader is version 5.0.5.

Doug
 
Terry Pinnell

Ted Edwards said:
Three suggestions:
Get PMView and use the screen capture => convert to 16 color => Save as
a .PNG. The file size for the max ratings is <6KB.
Install a virtual PostScript printer set to print to file.

You can grab anything with these tools.

Ted

Thanks. I took a look at PMView but it seems to be just a (versatile)
image viewer, rather like several others (e.g. IrfanView), which can
also Print to File. Maybe I should explore the second part of your
recommendation; what 'virtual PostScript printer' do you use please?

BTW, I have Snagit, which can also capture *text* from many windows,
although it fails in the PDF example under discussion.
 
Ted Edwards

Terry said:
Thanks. I took a look at PMView but it seems to be just a (versatile)
image viewer, rather like several others (e.g. IrfanView), which can
also Print to File.

It is that but it also has a capture facility that allows capturing the
whole screen, a selected area of the screen, a window or the interior of
a window.

Maybe I should explore the second part of your
recommendation; what 'virtual PostScript printer' do you use please?

From your headers, I guess you are running 'doze. I'm not so I can
only give you general guidelines for what I did. Since I am printing to
file the physical printer does not need to be present at all. I picked
a high end colour laser printer and downloaded the postscript driver for
it. I installed it but checked the box that says "Print to file". I
also have a real Canon i850 on my system so whenever I send something
to the printer, I am given the choice of which of the two printers is to
be used. If I want real hard copy, I select the i850. If I want a file
I select the PostScript printer. With the latter, I'm then asked for a
file, e.g. G:\downloads\glurp.ps. I can then convert that to PDF, PNG
or a choice of several other formats including "extract text" with Ghost
View.

Perhaps someone here who is a 'doze user can clarify this for you.

Ted
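
The last step above (turning the printed .ps file into text or PDF) was done through the Ghost View front end. As a hedged alternative, the same conversion can probably be scripted against Ghostscript directly. The sketch below is Python and rests on some assumptions: that a Ghostscript executable is on the PATH (named `gs` on Unix-like systems, typically `gswin32c` or `gswin64c` on Windows) and that the installed release includes the `txtwrite` output device. The helper names are invented for illustration.

```python
import subprocess


def build_gs_command(ps_path, out_path, device="txtwrite", gs_exe="gs"):
    """Return a Ghostscript argument list that converts ps_path to out_path."""
    return [
        gs_exe,
        "-q",                        # suppress the startup banner
        "-dNOPAUSE",                 # do not pause between pages
        "-dBATCH",                   # exit once the input is finished
        f"-sDEVICE={device}",        # txtwrite = text, pdfwrite = PDF
        f"-sOutputFile={out_path}",
        ps_path,
    ]


def convert(ps_path, out_path, device="txtwrite", gs_exe="gs"):
    """Run Ghostscript; raises CalledProcessError on a non-zero exit."""
    subprocess.run(
        build_gs_command(ps_path, out_path, device, gs_exe),
        check=True,
    )
```

Using the file name from the post, `convert("glurp.ps", "glurp.txt")` would attempt the text extraction, and passing `device="pdfwrite"` with a `.pdf` output name would produce a PDF instead.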
 
Rich Grise

What's your text editor? Assuming you're under Windows, perhaps the
problem is trying to paste Unicode into an editor that can't handle it.
You might try pasting the text into Word or Wordpad to see what happens.

You might also look at xpdf, http://www.foolabs.com/xpdf/ . I don't
think you can run the PDF viewer under Windows, but the command-line
utilities, including a PDF-to-text converter, will work.

Barely a day goes by that Slackware doesn't pleasantly surprise me!
It seems I got xpdf along with it, and lo and behold:
------------------------
30A, 50V, 0.040 Ohm, N-Channel Power
MOSFET
This is an N-Channel enhancement mode silicon gate power field effect
transistor designed for applications such as switching regulators,
switching converters, motor drivers, relay drivers and drivers for high
power bipolar switching transistors requiring high speed and low gate
drive power. This type can be operated directly from integrated circuits.
Formerly developmental type TA9771.
Ordering Information
PART NUMBER PACKAGE BRAND
BUZ11 TO-220AB BUZ11
NOTE: When ordering, use the entire part number.

Features
· 30A, 50V
· rDS(ON) = 0.040
· SOA is Power Dissipation Limited
· Nanosecond Switching Speeds
· Linear Transfer Characteristics
· High Input Impedance
· Majority Carrier Device
· Related Literature
- TB334 "Guidelines for Soldering Surface Mount
Components to PC Boards"
Symbol
D
G
S
 